In our previous post about the Well-Architected Framework (WAF), we explored its use through the lens of a Well-Architected Review (WAR).
In this post, we explore how the WAF can go beyond the bounds of being a pure application architecture assessment to drive a discussion around improved operational insight and observability of system health and performance.
One of the five pillars of the WAF is, of course, “Operational Excellence”; in this pillar we normally assess how the platform is operated, as well as how the business of operations is conducted.
A strong operational approach to a solution balances the needs of both functional and non-functional requirements (NFRs). The questions and discussions around NFRs can be quite extensive, and can often involve political discussions around whose are bigger!
It’s easy to get caught in the weeds, digging into specific latency questions or error rates for individual components and forgetting the bigger picture in the process. The five pillars of the WAF give you a great scaffold for considering every dimension.
If we look at the pillars of the WAF from an operational and observability perspective, we start to see some interesting features emerge.
During a WAR, the security pillar focuses on identity and traceability, as well as operational practices to reduce risk. Operational awareness of security is getting better, but a number of organisations still believe that a “Secure Architecture” is sufficient, and “security observability” remains a forgotten aspect of operations.
In modern system design this is just not enough – not only do we need to have a secure architecture, but we also need to have continual visibility of our security context.
When we look through the security lens for observability and operability, we start to see things like monitoring of denial rates and failure rates on APIs, which may indicate some sort of brute-force attack. We also need to look at data ingress/egress volumes, and overlay anomaly detection to identify possible data exfiltration.
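As an illustration of the idea (not a production detector), a simple rolling z-score over hourly egress volumes can surface the kind of spike that managed anomaly-detection features are designed to catch. The series, window, and threshold below are invented for the sketch:

```python
from statistics import mean, stdev

def egress_anomalies(hourly_bytes, window=24, threshold=3.0):
    """Flag hours whose egress volume deviates sharply from the trailing
    window: a crude stand-in for a managed anomaly-detection service."""
    anomalies = []
    for i in range(window, len(hourly_bytes)):
        baseline = hourly_bytes[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (hourly_bytes[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# 48 hours of steady ~1 GB egress with mild daily variation,
# then a sudden 50 GB spike that should stand out.
series = [1_000_000_000 + (i % 7) * 10_000_000 for i in range(48)]
series.append(50_000_000_000)
print(egress_anomalies(series))  # → [48]
```

The real value comes from wiring a detector like this to an alert, so a sudden egress surge is investigated while it is happening rather than discovered in next month’s bill or breach report.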
Designing with security in mind is critical, but ensuring the design is coupled to a strong security focus on observation and monitoring is key to protecting your workload.
When we look at the reliability of our architecture, we are essentially talking about designing things to survive failure, and using automation to do so. But, as with security, having a strong architecture here is not sufficient to deliver reliability.
Many cloud services implement default models that recover from major failures, such as an entire server, or even an availability zone, going down. These failures are easy for the platform to detect and remediate, and a good architecture gets this mostly for free.
The difficulty lies in partial failures: errors introduced through bad builds or malformed data that trigger edge conditions in our systems, producing a degraded customer experience but never tripping the platform-level remediation. In the worst case, an edge condition may trip the remediation process, leaving you in an everlasting boot loop of new services or serverless function retries that never get closer to resolution.
Ensuring there’s a strong operational view into the reliability aspects is key to ensuring the automation activities are doing what is expected.
We need to look for things like unexpected auto-scaling events, or error rates through our load balancers. Latency on requests and multiple retry events may hide an underlying application error while our resilient solution works hard to route around the issue. Having the right visibility into these metrics is critical, not only for knowing that customers are getting a good experience, but for ensuring there are no lurking issues in the system.
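To make those signals concrete, here is a minimal sketch of the checks an operational dashboard or alarm might encode. The thresholds and metric names are assumptions for illustration, not recommended values; real alarms would come from your load balancer and application metrics:

```python
def reliability_signals(requests, errors_5xx, retries, p95_latency_ms,
                        error_rate_limit=0.01,    # assumed: 1% 5xx budget
                        retry_ratio_limit=0.05,   # assumed: 5% retry budget
                        latency_limit_ms=500):    # assumed: p95 target
    """Return warning strings for reliability signals that a resilient
    architecture can quietly mask: error rates, retry storms, latency."""
    warnings = []
    if requests and errors_5xx / requests > error_rate_limit:
        warnings.append("5xx error rate above budget")
    if requests and retries / requests > retry_ratio_limit:
        warnings.append("retry ratio suggests a hidden upstream fault")
    if p95_latency_ms > latency_limit_ms:
        warnings.append("p95 latency degraded")
    return warnings

# A healthy hour, then one where retries are masking an application error.
print(reliability_signals(10_000, 5, 30, 120))   # → []
print(reliability_signals(10_000, 5, 900, 120))  # retry warning only
```

The second case is the one that matters: the customer-facing error rate looks fine precisely because the retries are working, and only the retry ratio gives the underlying fault away.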
Performance efficiency is becoming one of the more complicated aspects of modern system design. As we have moved from monoliths to more distributed systems, the number of data points available to us has exploded.
No longer can we simply look in a single access log for all requests to a system – our solutions are now a mix of different technologies and integration patterns.
The ability to observe end-to-end performance across these solutions has become increasingly more difficult to achieve. Tools such as AWS X-Ray and other distributed tracing solutions help to close this gap, but require a structured approach to monitoring and development.
Designing with observability and tracing in mind is something that needs to be considered early in the piece, and is often expensive and time consuming to retrofit. It’s important to take a layered approach to observability – making sure to cover both the micro and the macro performance aspects.
The micro aspects cover individual components, such as the performance of a single Lambda function or a specific database. At the macro scale, we focus on how this information is aggregated to provide an end-to-end view of the solution.
Performance really comes into its own when we are able to call out key dependencies and visualise the flow of information through our solutions with service maps.
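The mechanics behind those service maps can be sketched in a few lines: every hop records a timed segment under a shared trace ID, which is what lets tools such as AWS X-Ray stitch individual calls into an end-to-end view. The services and calls below are hypothetical stand-ins:

```python
import time
import uuid

def new_trace():
    """One trace per request; the shared trace_id ties the hops together."""
    return {"trace_id": str(uuid.uuid4()), "segments": []}

def traced(trace, service, fn, *args, **kwargs):
    """Run fn and record a timed segment for it, even if it raises."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        trace["segments"].append({
            "service": service,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
        })

# Two hypothetical hops in one request: an API layer calling a data layer.
def fetch_record(trace, key):
    return traced(trace, "datastore", dict.get, {"order-1": "shipped"}, key)

trace = new_trace()
status = traced(trace, "api", fetch_record, trace, "order-1")
# Both segments carry the same trace_id, so a tool can reconstruct the
# call graph and show where the time in this request was really spent.
```

Real tracing SDKs do far more (sampling, context propagation across process boundaries, annotations), but the macro view is built from exactly this kind of per-hop, per-trace timing data.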
The last pillar, cost optimisation, is another critical aspect, and one that places increasingly important demands on operational teams.
No longer do we have to lean on “ahead of time” architectural decisions that lock us into a fixed platform and a fixed cost. In the cloud we are able to flex with demand and failure – and while this is regularly sold as “turning off what you don’t use”, it can soon become “why do we have so much running?”
A poorly managed operational view of cost can make your move to the cloud the most expensive data centre you’ll ever run – even more so if you bring your existing “fixed infrastructure” operational playbook with you. Treating the cloud like a traditional data centre is the most expensive mistake you can make.
Management of operational cost is not as simple as turning things off when they’re not used – it starts with understanding why something exists in the first place, and being able to assess in real time not only whether a resource is being used, but whether it is over- or under-scaled.
Sometimes, smaller is not the best cost optimisation. Take AWS Lambda for example: you can control the amount of CPU and RAM available to your Lambda functions and, depending on your implementation, these can have a direct impact on the execution time. As you reduce the capacity, the cost per millisecond goes down, but the duration of execution goes up. There is a sweet spot – and this is different for each workload deployed. Ensuring you have a solid view into the real world execution of your functions is critical to cost tuning your solution.
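A quick back-of-the-envelope sketch shows the sweet spot. The benchmark durations below are hypothetical measurements for one workload, and the GB-second rate is illustrative only (check current AWS Lambda pricing for your region); the point is that the cheapest configuration is neither the smallest nor the largest:

```python
# Illustrative on-demand rate per GB-second; not guaranteed current pricing.
PRICE_PER_GB_SECOND = 0.0000166667

# Assumed benchmark results: memory setting (MB) → measured duration (ms).
measurements = {
    128: 2400,
    256: 1150,
    512: 560,
    1024: 290,
    2048: 270,   # more memory, but the function is no longer CPU-bound
}

def invocation_cost(memory_mb, duration_ms):
    """Cost of one invocation: billed GB-seconds times the rate."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

costs = {mb: invocation_cost(mb, ms) for mb, ms in measurements.items()}
sweet_spot = min(costs, key=costs.get)
print(sweet_spot)  # → 512, for these assumed numbers
```

For these invented figures, 512 MB is cheaper per invocation than either 128 MB (which runs far longer) or 2048 MB (which runs barely faster) – which is exactly why real-world execution data, not intuition, should drive the tuning.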
I hope this post has helped you see the Well-Architected Framework as more than a tool for assessing whether a given workload has a “best practice” architecture – it goes much further, and should stay in your toolkit throughout the lifecycle of any system.
There is obviously a lot of experience and finesse required in designing an operational view of a solution – something our team would love to talk to you about. Want to explore how you can use the Well-Architected Framework to guide the operational view of your systems? Contact us for more details.