In part 5 of my latest series showcasing the six pillars of the AWS Well-Architected Framework, we take a look at the Operational Excellence pillar. The Operational Excellence pillar underpins how teams implement good practices across an organisation and covers the organisation, operational models, preparation and evolution of services. This pillar is one of my personal favourites.
If you’d like to learn more about the other pillars of the Well-Architected Framework, check out the other blogs in this series via the links below. Otherwise, let’s get stuck in!
What we will be covering
- What is Operational Excellence?
- Common pitfalls of operations
- Implementing good operational practices
- Improving operational excellence in your organisation
Why we are learning this
- To help others navigate and understand how to improve operational practices across an organisation
- Using the AWS Well-Architected: Operational Excellence Pillar for guidance to operate efficiently and effectively
How this will help me
- Understand good practices for operations covering people, processes and technology
- Be able to help champion operational excellence across your organisation
- Understand the intersection between operations and deployment of new services
What is Operational Excellence?
Operational Excellence covers the people, processes and technology that underpin the operation and supply of services to your customers. Those customers could be internal or external to your organisation. The goal of operational excellence is to increase the velocity of deploying new services and to support existing services through well-implemented processes covering observability, patching, automation, deployment and maintenance of solutions and their supporting elements. It also includes having a common mindset and clear goals so that everyone knows their role and function across the organisation. We will cover these points in greater detail in the next sections.
Common Pitfalls of Operations
In many organisations, solutions are implemented to suit a single goal or address a single problem. The creation of services is commonly seen as separate from the ongoing operation of that component or service. Even organisations that have good practices often see deployment and operations as two separate entities. This is often driven by an operational model where responsibilities are divided amongst teams that each focus on a specific area (e.g. DevOps, Projects, Security, Operations, Architecture, Systems and Networks). As a result, decisions are commonly made in isolation and there is a lack of cohesion between teams, who rarely interact unless they need each other or an incident occurs.
Implementing Good Practices
Implementing good practices can seem daunting at first. However, the guide below makes this much easier and often naturally results in better outcomes for organisations.
Understand the Gaps and Create a “North Star”
Understanding the gaps between the current way of operating and what good looks like is an effective way of creating a “north star” to aspire to. Using a framework such as Cevo’s DevOps Maturity Assessment (DOMA) on AWS Marketplace is a great exercise in learning where your organisation is on the journey to “good”. It paves the way for improving the current operational model through a DevSecOps lens, leading to the implementation of good practices targeted at operational excellence. This helps you understand priorities for internal and external customers, governance and compliance requirements, the threat landscape and trade-offs.
Establish a Cloud Centre of Excellence
A lot of conjecture exists when discussing a Cloud Centre of Excellence (CCoE) as a way of bridging the gap between teams and projects and creating a cohesive view. A CCoE is most effective when it has a diverse set of skills and Subject Matter Expert (SME) representation from multiple teams. The idea of a CCoE is to establish a governing body that focuses on building a common, opinionated and cohesive view of what the “cloud” should look like for the organisation. This inspires the team to do what is in the best interest of everyone involved. In my career I have helped establish and improve CCoEs to operate efficiently and effectively, with outstanding results when done well.
Building and maintaining a cloud operating model that establishes clear roles and responsibilities across an organisation is not a trivial task. It is extremely important to have an effective and concise operating model in place to succeed with operating the cloud in the most efficient and secure way possible. Establishing the CCoE using a diverse set of skills and attributes encourages cross-team collaboration and ensures decisions made are evaluated against risk and with an informed opinion.
Automate, Automate, Automate
Consider where processes can be automated and work on implementing automation into the deployment and ongoing maintenance of solutions. A lot of operational maintenance (such as inventory reporting, Common Vulnerabilities and Exposures (CVE) reporting, patching and remote access) can be achieved using an effective implementation of native AWS services. These services include:
- AWS Systems Manager
- Elastic Container Registry (ECR) with Enhanced Scanning turned on
- Leveraging ECR continuous scanning
- Elastic Kubernetes Service (EKS)
- Elastic Container Service (ECS)
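As an illustration of the patching side of this, here is a minimal sketch (not a definitive implementation) of a report listing instances with missing or failed patches. It assumes an AWS Systems Manager client, such as the one returned by `boto3.client("ssm")`, is passed in, and that the instances are already managed by Systems Manager. The function name `instances_missing_patches` and the shape of the returned dictionaries are my own choices for the example.

```python
from typing import Any, Dict, Iterable, List


def instances_missing_patches(
    ssm_client: Any, instance_ids: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return the instances that report missing or failed patches.

    `ssm_client` is assumed to behave like boto3.client("ssm").
    """
    ids = list(instance_ids)
    flagged: List[Dict[str, Any]] = []

    # Query in batches of 50 instance IDs to stay within the per-call limit.
    for start in range(0, len(ids), 50):
        kwargs: Dict[str, Any] = {"InstanceIds": ids[start:start + 50]}

        # Follow NextToken until every page for this batch has been read.
        while True:
            response = ssm_client.describe_instance_patch_states(**kwargs)
            for state in response.get("InstancePatchStates", []):
                missing = state.get("MissingCount", 0)
                failed = state.get("FailedCount", 0)
                if missing or failed:
                    flagged.append(
                        {
                            "InstanceId": state["InstanceId"],
                            "Missing": missing,
                            "Failed": failed,
                        }
                    )
            token = response.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token

    return flagged
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to test with a stand-in object and lets you run the same report across accounts or regions by supplying different clients.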
Automation in deployment requires well-implemented integration and deployment pipelines. Your pipelines are your biggest asset in automation, so investment in improving them will yield very high value. Consider using solutions such as pre-commit hooks in your Continuous Integration (CI) pipelines. Git hook scripts are useful for identifying simple issues before submission to code review. Hooks run on every commit and automatically point out issues in code such as missing semicolons, trailing whitespace and debug statements. Pointing these issues out before code review allows a code reviewer to focus on the architecture of a change rather than wasting time on trivial style nitpicks.
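To make the hook idea concrete, here is a small sketch of a check that flags trailing whitespace and leftover Python debug statements. It assumes the hook runner passes the file names to check as arguments (the pre-commit framework does this for staged files). The names `check_text` and `main` and the list of debug patterns are illustrative only, not taken from any particular tool.

```python
import re
from typing import List

# Illustrative patterns for leftover Python debugging statements.
DEBUG_PATTERN = re.compile(r"^\s*(import pdb\b|pdb\.set_trace\(|breakpoint\()")


def check_text(filename: str, text: str) -> List[str]:
    """Return one message per problem found in the given file contents."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            problems.append(f"{filename}:{number}: trailing whitespace")
        if DEBUG_PATTERN.search(line):
            problems.append(f"{filename}:{number}: debug statement")
    return problems


def main(filenames: List[str]) -> int:
    """Check each file and return a non-zero exit code if anything is found."""
    problems: List[str] = []
    for filename in filenames:
        with open(filename, encoding="utf-8") as handle:
            problems.extend(check_text(filename, handle.read()))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

To use this as a hook script, you would call `sys.exit(main(sys.argv[1:]))` at the bottom of the file so that a non-zero exit code blocks the commit. In practice, well-maintained community hooks already cover these checks, so a custom script like this is mostly useful for rules specific to your organisation.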
Running security scanning tools as part of your pipelines ensures good governance and compliance are employed as part of moving security “left” in your stack, aiming for SecDevOps as opposed to DevSecOps. Using scanning tools such as AquaSec and kube-bench as part of your release stages allows you to evaluate, and be assured, that deployed solutions are compliant with industry standards. This is a shift towards “security by default”.
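One way to wire a scanner into a release stage is a simple gate that fails the pipeline when findings reach a chosen severity. The sketch below assumes a hypothetical, simplified report format (a JSON object with a `findings` list whose items carry an `id` and a `severity`). Real tools each have their own output schema, so the parsing would need to be adapted to whichever scanner you use.

```python
import json
from typing import Dict, List

# Severity ranking used by this sketch; higher numbers are more severe.
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def blocking_findings(report_json: str, threshold: str = "HIGH") -> List[Dict[str, str]]:
    """Return findings at or above the threshold severity.

    The report format is hypothetical: {"findings": [{"id": ..., "severity": ...}]}.
    Unknown severities are treated as blocking so the gate fails safe.
    """
    minimum = SEVERITY_RANK[threshold]
    findings = json.loads(report_json).get("findings", [])
    highest = max(SEVERITY_RANK.values())
    return [
        finding
        for finding in findings
        if SEVERITY_RANK.get(finding["severity"], highest) >= minimum
    ]


def gate(report_json: str, threshold: str = "HIGH") -> int:
    """Print blocking findings and return an exit code for the pipeline stage."""
    blocked = blocking_findings(report_json, threshold)
    for finding in blocked:
        print(f"{finding['severity']}: {finding['id']}")
    return 1 if blocked else 0
```

Returning an exit code rather than raising an exception keeps the gate easy to call from any pipeline runner, since most treat a non-zero exit as a failed stage. Making the threshold a parameter lets teams start lenient and tighten the gate as their maturity improves.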
The cultural aspects of effectively operating cloud infrastructure and services are crucial. In traditional on-premises models, teams are often risk averse due to cost and resource constraints. Moving to a cloud operating model changes the rhetoric by allowing teams to “fail fast” through experimentation, pay-by-the-hour/minute/second pricing and easier automation through Infrastructure as Code (IaC) with consolidated and comprehensive APIs. Cloud adoption (when compared to traditional models) also introduces new challenges at scale, with constant change and new services being added and deprecated at a rapid pace.
With the adoption of IaC, the pace of change is difficult to keep up with, and this challenges teams to also increase the pace at which their own changes occur. From third-party solutions like Terraform, Ansible, Chef, Puppet and Terragrunt through to more cloud-native solutions like CDK, there is a challenge in updating those tools regularly to take advantage of the new services, and new capabilities in existing services, needed by the business and customers. To keep pace, there is a need to adopt a “fail-forward” approach where IaC solutions are regularly upgraded and, where breakages occur, there is a preference for rolling forward as opposed to back.
For risk averse organisations this can be profoundly difficult as this is seen as a high risk and existing processes typically won’t support this without change. Without a “fail-forward” culture, those organisations slow down their ability to innovate and often face lengthy and complex delays when upgrades do occur. The key to success is frequent, small and reversible changes where breaking changes are easier to contain and action when they occur. This exemplifies a bias for action and commitment to change – increasing velocity of innovation and adoption of new services.
Design for Operations
Implementing good observability through robust monitoring, logging, visualisation and alerting is key to supporting cloud resources. Version control (such as Git), configuration management, build management and maintenance are all crucial in effectively operating the cloud. Embracing change as part of an organisation’s culture, with robust testing and environment separation, helps teams give thoughtful consideration to the deployment of services. Training teams in the Well-Architected Framework helps to support a healthier operating model with less resistance to controls. Automation takes the human element out of the equation and improves both compliance and governance to lower risk.
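On the logging side, emitting structured (JSON) log lines makes it much easier for monitoring and alerting tools to filter and aggregate events. The following is a minimal sketch using only the Python standard library. The field names `service` and `request_id` and the helper `build_logger` are assumptions made for this example rather than any standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("service", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.propagate = False
    return logger
```

A call such as `build_logger("orders").info("order placed", extra={"service": "checkout", "request_id": "abc"})` then produces one JSON object per line. A timestamp is deliberately left out of the sketch to keep it short; most log collectors add one on ingestion, but you may prefer to include it yourself.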
The key to supporting these changes is to adopt an effective change management implementation that encourages calculated risk taking to prevent stifling innovation. In addition, the operating model should recognise and encourage support models to help teams adapt to ever-changing service landscapes safely. Using patterns developed specifically for automated and complete monitoring such as the following helps ensure this is covered effectively:
Learn, Share and Improve
Encourage teams to be malleable, as sitting still does not increase innovation or encourage experimentation. The key to improving is to reflect on ways of working, learn continuously and share ideas. Again, this harks back to establishing a culture where individuals and teams can openly share ideas without judgement and spark conversations. AWS has a culture of it always being “Day 1”. That is not to say your organisation needs to be like AWS to the depth at which they operate; however, encourage your teams to consider how they can aim for continuous improvement.
Improving Operational Excellence in Organisations
To improve operational excellence in your organisation, it is important to gain help from the right resources.
Lean on Partners
Using partners such as Cevo (who regularly work with and help to improve operational practices across many organisations) is a great way to start improving operational excellence. Our insights, skills and experience are highly valuable in addition to having frameworks and tools to expedite maturity in good practices. We can also help you rapidly establish an effective and efficient CCoE.
Creating and fostering a culture of innovation, open sharing and open learning is a fantastic way of challenging the classic rhetoric of traditional models. Embrace change as a normal part of cloud operations. Your customers will thank you for it through better customer experience.
Where something can be automated – do it. Not only will this save time, cost and resources but it will allow resources to focus more on innovation and less on day-to-day running operations.
Evolve and Adapt
Embracing a new way of working and a modern operating model will vastly improve your cloud operation effectiveness and lead to better outcomes. Where something isn’t working – change it and challenge current ways of working to continually improve.