AWS Well-Architected: Reliability and Availability

In part 1 of 6 in my latest series showcasing the six pillars of the AWS Well-Architected Framework, we take a look at the Reliability pillar. If you’d like to learn more about the other pillars of the Well-Architected Framework, check out the other blogs in this series via the links below.

What we will be covering

  1. AWS Well-Architected overview
  2. AWS Well-Architected: Reliability pillar deep dive
  3. Defining recovery objectives
  4. Defining availability terms
  5. Intersection of availability terms and recovery objectives
  6. Evaluating reliability requirements
  7. Architecture examples
  8. Architecting to requirements

Why we are learning this

  1. To help others understand how to design and implement architecture that aligns with availability and reliability requirements
  2. To use the AWS Well-Architected Reliability pillar as guidance for building reliable infrastructure

How this will help me

You will:

  1. Be able to successfully define availability requirements
  2. Have a better understanding of the intersection of reliability and availability
  3. Have a better understanding of correct use of availability terms
  4. Be able to build effective solutions aligned to the AWS Well-Architected: Reliability pillar

AWS Well-Architected Overview

AWS Well-Architected is a framework comprising six “pillars” for building infrastructure with good practice in mind. 

By aligning architecture and infrastructure to the AWS Well-Architected Framework, customers can ensure their solutions are secure, reliable, efficient, sustainable, cost-effective and operationally supportable.

Aligning to the framework is not mandatory; however, doing so promotes good architecture and can help with critical decisions and risk mitigation.

The framework comprises six pillars of excellence:

  1. Security
  2. Reliability
  3. Operational excellence
  4. Performance efficiency
  5. Cost optimisation
  6. Sustainability

The AWS Well-Architected Wheel

The following infographic depicts the components and focus areas for each of the six pillars:

Well-Architected: Security

The Security pillar covers:

  1. Incident response
  2. Data protection
    • Encryption in flight
    • Encryption at rest
    • Data classification
  3. Infrastructure protection
    • Compute protection
    • Network protection
  4. Detective controls
    • Event management
    • Security awareness
  5. Identity and Access Management (IAM)
    • Who should/can access the solution
    • How they access the solution

Well-Architected: Reliability

The Reliability pillar covers:

  1. Change management
  2. Monitoring
  3. Alerting
  4. Failure management*
    • Disaster recovery
    • Data backup
    • Resiliency
    • Resiliency testing
  5. Scaling to demand

*This is what we will cover in depth today

Well-Architected: Operational Excellence

The Operational Excellence pillar covers:

  1. Operational readiness
  2. Deployment methods
    • Automation at core
    • Integration methods
    • Workload insights
  3. Operate
    • Workload health
    • Operation health
    • Event response
    • Patching and upgrades
    • Inventory management
  4. Evolve
    • Operational evolution

Well-Architected: Performance Efficiency

The Performance Efficiency pillar covers:

  1. Selection of resources
    • Architecture choice (server/serverless)
    • Compute
    • Storage
    • Database
    • Networking
  2. Performance monitoring
  3. Performance tradeoffs
  4. Reviewing architecture choices over time

Well-Architected: Cost Optimisation

The Cost Optimisation pillar covers:

  1. Expenditure awareness
    • Cost tracking
    • Total Cost of Ownership (TCO)
    • Usage governance
    • Resource decommissioning
  2. Cost effective resources
    • Service selection
    • Right-sizing workloads
    • Data transfer optimisation
    • Licensing cost reduction
  3. Matching supply with demand
    • Scaling and efficiency
  4. Reviewing architecture choices over time

Well-Architected: Sustainability

The Sustainability pillar covers:

  1. Resource selection
    • Right-sizing workloads
    • Effective use of infrastructure architecture
  2. Reducing impact to environmental factors
    • Power and cooling efficiency
    • Elastic resources
    • Awareness of and reduction to the carbon footprint
  3. Often supported by cost reduction

Reliability: Common Terms

Recovery Objective Terms

Often, we don’t have a clear understanding of what the business or its customers require of applications and systems. It is helpful to understand the objectives we want to achieve:

  1. Recovery Point Objective (RPO)
    • Defines how much data you are willing to lose
    • Influences how often data needs to be backed up / snapshot intervals
  2. Recovery Time Objective (RTO)
    • Defines how long an outage can be tolerated from failure to recovery
    • Influences the method used for recovery of resources in an outage
  3. Together, the RPO and RTO effectively define what an architecture solution should look like and, if correctly implemented, significantly influence architectural choices
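
As a minimal sketch of how these two objectives translate into checks, the Python below compares a backup interval against an RPO and an estimated recovery duration against an RTO. The names `RecoveryObjectives`, `meets_rpo` and `meets_rto` are hypothetical, and the intervals are illustrative values rather than recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Target recovery objectives for a workload (hypothetical helper type)."""
    rpo: timedelta  # maximum tolerable data loss, measured in time
    rto: timedelta  # maximum tolerable time from failure to recovery


def meets_rpo(objectives: RecoveryObjectives, backup_interval: timedelta) -> bool:
    # Worst case, a failure happens just before the next backup,
    # so up to one full backup interval of data is lost.
    return backup_interval <= objectives.rpo


def meets_rto(objectives: RecoveryObjectives, estimated_recovery: timedelta) -> bool:
    # The chosen recovery method must finish within the tolerated outage window.
    return estimated_recovery <= objectives.rto


targets = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=2))
print(meets_rpo(targets, timedelta(hours=24)))    # daily backups -> False
print(meets_rpo(targets, timedelta(minutes=15)))  # 15-minute snapshots -> True
print(meets_rto(targets, timedelta(minutes=90)))  # 90-minute restore -> True
```

The RPO check constrains how often data is captured, while the RTO check constrains how recovery is performed, which is why the two are evaluated separately.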

Recovery Terminology

Disaster Recovery (DR)

  • Often used interchangeably / confused with “HA”
  • Longest recovery time
  • Cheapest to implement
  • Doesn’t have to have an immediately available resource

High-Availability (HA)

  • Often confused with “FT”
  • Shorter recovery time than “DR”
  • Can be cheap to implement; more expensive than “DR”
  • Doesn’t mean no outage – just faster to recover

Fault-Tolerance (FT)

  • May still have diminished performance
  • Most expensive to implement
  • Highest availability target

Recovery Terminology Analogies

Sometimes it can be helpful to remember these terms using real-world analogies. These are some of the analogies I have used to help understand the differences between each term in an effort to be more conscious about using the correct terminology:

  1. Fault Tolerance (FT):
    • A guitar with many strings
    • Continue operating with reduced performance for a period of time
  2. High Availability (HA):
    • A spare tube on a bicycle
    • Some downtime while the new tube is fitted
  3. Disaster Recovery (DR):
    • A puncture repair kit
    • Find the leak, patch the leak and reinflate
    • Continue on your journey some time later

Intersection of RPO/RTO and Recovery Strategy

Reliability and availability are closely related, as discussed further here. The right balance between benefit and cost depends on your specific requirements.

There is alignment between recovery strategy and availability at a high-level as shown below:

  1. Disaster Recovery (DR)
    • Backup and restore
  2. High-Availability (HA)
    • Pilot light
    • Warm standby**
  3. Fault-Tolerance (FT)
    • Warm standby**
    • Active/Active

**A warm standby environment may allow for continued operation with reduced performance but may require a brief outage to switch over.
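
The alignment above can be written down as a simple lookup. This is only a sketch: the dictionary transcribes the list as given, and the helper name `terms_for_strategy` is hypothetical.

```python
# Alignment between availability terms and recovery strategies,
# transcribed from the list above.
STRATEGIES_BY_TERM = {
    "DR": ["Backup and restore"],
    "HA": ["Pilot light", "Warm standby"],
    "FT": ["Warm standby", "Active/Active"],
}


def terms_for_strategy(strategy: str) -> list[str]:
    """Return every availability term a recovery strategy aligns with."""
    return [
        term
        for term, strategies in STRATEGIES_BY_TERM.items()
        if strategy in strategies
    ]


print(terms_for_strategy("Warm standby"))        # ['HA', 'FT']
print(terms_for_strategy("Backup and restore"))  # ['DR']
```

Warm standby is the only strategy that maps to two terms, which reflects the footnote: it can keep operating at reduced performance but may still need a brief outage to switch over.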

Evaluating Reliability Requirements

Despite terms being used interchangeably, there is one hard and fast rule: High availability is not disaster recovery.

  1. Availability focuses on components of the workload
  2. Disaster recovery focuses on discrete copies of the entire workload
  3. There is a need to meet availability objectives for availability related events
  4. Disaster recovery objectives are built around workload failover during a disaster
  5. Data and workload may have different requirements to each other
  6. S3 is highly available across AZs – but how do you recover your objects?
  7. An EC2 instance may be ephemeral but the database is likely not
  8. Failure scenarios impact your availability and disaster recovery implementation

Defining Requirements

An RPO and RTO are a target set of recovery objectives, but they need to be relevant to the scenarios you are anticipating. Failure scenarios might be:

  1. Infrastructure: Loss of AZ, loss of instances, data corruption / exhaustion
  2. Security Event: crypto, DDoS, injection attacks, application-specific attacks
  3. Frontend: Web or application (instances, containers)
  4. Backend: Application and/or database (instances, containers, databases)


Understand the impacts each of these has on your application:

  1. Customers (internal or external)
  2. Financial
  3. Reputation
  4. Business continuity
  5. The impact should justify the spend on reliability
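
One rough way to reason about the last point is to compare the impact a reliability investment avoids with what it costs. The sketch below covers only the financial impact, and every figure and function name in it is a hypothetical placeholder; customer, reputational and continuity impacts would need their own treatment.

```python
def expected_annual_impact(outages_per_year: float,
                           hours_per_outage: float,
                           cost_per_hour: float) -> float:
    """Expected yearly cost of outages for one failure scenario."""
    return outages_per_year * hours_per_outage * cost_per_hour


def spend_is_justified(impact_before: float,
                       impact_after: float,
                       annual_spend: float) -> bool:
    # Spend is justified when the impact it avoids is at least what it costs.
    return (impact_before - impact_after) >= annual_spend


# Hypothetical figures for illustration only.
before = expected_annual_impact(outages_per_year=2, hours_per_outage=8, cost_per_hour=5_000)
after = expected_annual_impact(outages_per_year=2, hours_per_outage=1, cost_per_hour=5_000)

print(before, after)                                             # 80000 10000
print(spend_is_justified(before, after, annual_spend=40_000))    # True
print(spend_is_justified(before, after, annual_spend=100_000))   # False
```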

Example of Poor Implementation

Objective: A target RPO of 1 hour and target RTO of 2 hours

Failure scenarios: Data corruption, loss of AZ

  1. EC2 writes to persistent storage using distributed storage (EFS)
  2. EFS across multiple AZs
  3. EFS backed up once per day
  4. Multiple EC2 instances attached to ALB
  5. ALB added to multiple AZs
  6. Instances in multi-AZ

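Note: This example meets the AZ failure scenario but not the data corruption scenario. With EFS backed up only once per day, restoring from backup could lose up to 24 hours of data, well beyond the 1 hour RPO.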
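
To make the gap concrete, here is a small sketch that evaluates this design against each failure scenario separately. The data-loss figures follow from the design as listed; the recovery times are assumed placeholders, not measurements, and the `evaluate` helper is hypothetical.

```python
from datetime import timedelta

RPO = timedelta(hours=1)
RTO = timedelta(hours=2)

# Worst-case outcomes per failure scenario for the design above.
scenarios = {
    "loss of AZ": {
        "data_loss": timedelta(0),              # EFS and instances span multiple AZs
        "recovery_time": timedelta(minutes=5),  # assumed: ALB routes to a healthy AZ
    },
    "data corruption": {
        "data_loss": timedelta(hours=24),       # restore from the daily EFS backup
        "recovery_time": timedelta(hours=1),    # assumed restore duration
    },
}


def evaluate(scenarios, rpo, rto):
    """Return, per scenario, whether the RPO and RTO targets are met."""
    return {
        name: {
            "rpo_met": outcome["data_loss"] <= rpo,
            "rto_met": outcome["recovery_time"] <= rto,
        }
        for name, outcome in scenarios.items()
    }


results = evaluate(scenarios, RPO, RTO)
for name, outcome in results.items():
    print(name, outcome)
```

Evaluating per scenario is the point of the exercise: the same architecture passes for loss of an AZ and fails the RPO for data corruption.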

Example of Achievable Implementation

Objective: A target RPO of 1 hour and target RTO of 2 hours

Failure scenarios: Data corruption, loss of AZ

  1. EC2 is ephemeral built with an AMI
  2. Multiple EC2 instances attached to ALB
  3. ALB added to multiple AZs
  4. Instances in multi-AZ
  5. Persistent data written to RDS database
  6. RDS database in single AZ
  7. Database backed up once per day

Note: This example utilises multiple recovery strategies

Note: RDS uploads transaction logs to S3 every 5 minutes, which allows point-in-time recovery to within about 5 minutes of a failure rather than only to the last daily backup. Because backups are stored on S3, which spans a minimum of 3 AZs, a backup remains available if an AZ is lost and can be used to restore the RDS database in another AZ.

Note: Assumes corruption was caught within an hour of the event occurring.
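
The detection assumption matters because a point-in-time restore to just before the corruption discards every write made afterwards. The sketch below uses a simplified model with hypothetical timestamps and function names: data loss for an AZ failure is bounded by the 5 minute log interval, while data loss for corruption grows with the time taken to notice it.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=1)
LOG_UPLOAD_INTERVAL = timedelta(minutes=5)  # interval stated in the note above


def az_loss_data_loss() -> timedelta:
    # Restoring to the latest restorable point loses at most one log interval.
    return LOG_UPLOAD_INTERVAL


def corruption_data_loss(corrupted_at: datetime, detected_at: datetime) -> timedelta:
    # Restoring to just before the corruption discards every write made
    # between the corruption and its detection (simplifying assumption).
    return detected_at - corrupted_at


corrupted = datetime(2024, 1, 1, 9, 0)  # hypothetical timestamps
caught_early = datetime(2024, 1, 1, 9, 40)
caught_late = datetime(2024, 1, 1, 11, 30)

print(az_loss_data_loss() <= RPO)                            # True
print(corruption_data_loss(corrupted, caught_early) <= RPO)  # True
print(corruption_data_loss(corrupted, caught_late) <= RPO)   # False
```

Under this model, monitoring that shortens the time to detect corruption is as much a part of meeting the RPO as the backup mechanism itself.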

Example for Thought

Objective: A target RPO of 1 hour and target RTO of 2 hours

Failure scenarios: Data corruption, loss of AZ

  1. EC2 is ephemeral built with an AMI
  2. Multiple EC2 instances attached to ALB
  3. ALB added to multiple AZs
  4. Instances in multi-AZ
  5. Persistent data written to custom database
  6. Custom database on EC2 in single AZ
  7. Database backed up once per day


See the problem?

A custom database running on EC2 instead of a managed RDS offering (such as RDS Custom) can’t take advantage of the RDS log shipping and point-in-time recovery capabilities. This requires deeper consideration of both availability and recoverability to meet the recovery objectives and protect against the failure scenarios in question.

Architect to Requirements

  1. It is important to architect your solution to meet your own individual needs
  2. Validate your requirements against business objectives – they aren’t always aligned
  3. Validate your business continuity plan through testing
  4. Validate restoration and evaluate if the process meets your RTO/RPO targets
  5. Use tools to validate a mean time to recovery – then adjust accordingly
  6. Be flexible with your objectives – they can change over time
  7. The most successful implementations are those where requirements are known, challenged and adopted with flexibility
  8. Objectives are targets but shouldn’t be considered hard and fast
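
As a sketch of validating a mean time to recovery against an RTO, the Python below averages recovery durations recorded during restoration tests. The durations are hypothetical sample values and `mean_time_to_recovery` is an illustrative helper, not a specific tool.

```python
from datetime import timedelta
from statistics import mean

RTO = timedelta(hours=2)

# Hypothetical recovery durations recorded during restoration tests.
drill_durations = [
    timedelta(minutes=95),
    timedelta(minutes=110),
    timedelta(minutes=140),
]


def mean_time_to_recovery(durations):
    """Average recovery duration across test runs."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


mttr = mean_time_to_recovery(drill_durations)
worst = max(drill_durations)

print(mttr, mttr <= RTO)    # 1:55:00 True
print(worst, worst <= RTO)  # 2:20:00 False
```

In this sample the mean sits inside the RTO while the slowest test does not, which is the kind of finding that should prompt either an adjustment to the process or a conversation about the objective.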
