A Comprehensive Guide to Resilient Architecture in AWS

The purpose of this blog is to provide insight and guidance on building resilient solutions for business-critical workloads: accounting for failure scenarios, and understanding your objectives and your exposure to an unexpected outage.

Background

A recent outage of a prominent cloud provider highlighted the fact that any cloud is susceptible to a disruptive outage at any time. Werner Vogels (AWS CTO) famously stated that “everything fails all the time”. This isn’t to be taken literally; the idea is that stuff happens, things break, we aren’t always in control, and sometimes this causes frustration. Many will use the opportunity to highlight the shortcomings of public cloud. However, such events also highlight that there is a lot of confusion about how services operate in the cloud.

Understanding Risk

Taking a look at the definition of risk is interesting in itself. Whether used as a noun or a verb, the Oxford Dictionary frames risk in terms of exposure to the possibility of loss, harm or danger.

We understand that risk is attributed to something bad or a trade-off against a reward. In IT, risks include hardware and software failure, human error, spam, viruses and malicious attacks, as well as natural disasters such as fires, cyclones or floods. By looking at how your business uses IT, you can understand and identify the types of IT risks.

We tend to define risk in terms of the likelihood of something happening, its cost as a monetary value, or its impact. The reality is that it is all three of those things combined; however, the conversation typically gets watered down into how much resilience will cost to implement. Seldom is there a discussion of the true cost of not having resilience. Understanding the scenarios you need to protect against is crucial, as is understanding their scope in terms of likelihood and impact. To help quantify that, let’s take a look at how AWS provides resilience so we can better understand how to take advantage of it to mitigate risk.

How AWS Provides Resilience

In a VPC (Virtual Private Cloud) there are a few concepts to understand. First there are Availability Zones (AZs), which are per-region; not all regions have the same number of AZs. Most regions have three AZs, while others have four (Seoul, Tokyo, Oregon) or six (Northern Virginia). An AZ is an independent failure domain consisting of one or more data centres. Each AZ is isolated from the fault domains of the other AZs: separate physical location, power, cooling and exposure to fire, flood and other natural disasters. While it may be possible for multiple AZs to be affected by a single event, it is unlikely that all will be affected simultaneously in a way that takes down an entire region. Regional outages have occurred in the past, but they are highly uncommon.


When we place an EC2 instance, container, EBS volume, a VPC-attached Lambda function, a single RDS instance or other infrastructure in a given subnet, we are deploying into a single AZ. Some services have you declare which subnets their resources will be deployed to so they can span multiple AZs, such as RDS Multi-AZ, load-balancers (Classic, Gateway, Application or Network), VPC Endpoints, DocumentDB, Amazon MQ, API Gateway and other solutions. During placement, we are making conscious decisions about where these resources sit across AZs. Placement is important, as is the traffic flow between AZs, because cross-AZ traffic is charged. If you use the S3 One Zone-Infrequent Access storage class, you are placing objects in a single AZ; objects are not distributed across AZs like the S3 Standard or S3 Glacier tiers.

Did you know?

The mapping of AZ names is unique per account. What this means is that ap-southeast-2a (Sydney region) in one account may not be the same physical AZ as ap-southeast-2a in another account! To align resources into the same physical zone across accounts, AWS provides absolute references known as Availability Zone IDs (for example, apse2-az1), which identify the same AZ in every account.

If you have traffic crossing from one VPC to another (for example, across accounts), you can use the AZ IDs to keep that traffic within the same physical zone and avoid cross-AZ charges. The AZ IDs are the same for every account, even though the Availability Zone names can map differently for each account.
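
To see how the AZ names in your account map to the underlying AZ IDs, you can call the EC2 DescribeAvailabilityZones API. Here is a minimal sketch using boto3 (the region is illustrative):

    import boto3

    # List each AZ name alongside its account-independent AZ ID,
    # e.g. ap-southeast-2a -> apse2-az3 (the exact mapping varies per account).
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])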

Managed Services

Not all AWS managed services are created equal in terms of their availability across AZs. Some services require you to deploy them across multiple AZs to provide redundancy.

NAT Gateways

A NAT gateway is a Network Address Translation (NAT) service. You can use a NAT gateway so that instances in a private subnet can connect to services outside your VPC, while external services cannot initiate a connection to those instances. Each NAT gateway is created in a specific AZ and implemented with redundancy within that zone. If resources in multiple AZs share one NAT gateway and the NAT gateway’s AZ goes down, resources in the other AZs lose internet access.

To provide resiliency for applications, a NAT Gateway must be deployed in each AZ, with the default route in each private subnet pointing to a NAT Gateway inside a public subnet within its own AZ.
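
As a rough sketch of that layout, the snippet below creates one NAT Gateway per AZ and points each private subnet’s default route at the NAT Gateway in the same AZ. The subnet, Elastic IP allocation and route table IDs are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # One entry per AZ: the public subnet hosting the NAT Gateway, the Elastic IP
    # to attach to it, and the route table used by the private subnet in that AZ.
    az_layout = [
        {"public_subnet": "subnet-aaa", "eip_allocation": "eipalloc-aaa", "private_route_table": "rtb-aaa"},
        {"public_subnet": "subnet-bbb", "eip_allocation": "eipalloc-bbb", "private_route_table": "rtb-bbb"},
    ]

    for az in az_layout:
        nat = ec2.create_nat_gateway(SubnetId=az["public_subnet"], AllocationId=az["eip_allocation"])
        nat_id = nat["NatGateway"]["NatGatewayId"]
        ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
        # Default route for the private subnet in this AZ stays within the AZ.
        ec2.create_route(
            RouteTableId=az["private_route_table"],
            DestinationCidrBlock="0.0.0.0/0",
            NatGatewayId=nat_id,
        )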

Elastic Load-Balancers

Elastic Load Balancing automatically distributes your incoming traffic across multiple targets, such as EC2 instances, containers, and IP addresses, in one or more Availability Zones. When deployed across multiple AZ’s, this increases the fault tolerance of your applications. Elastic Load Balancing detects unhealthy instances and routes traffic only to healthy instances. There are four types of load-balancers in AWS. These are Network Load-Balancers (TCP/UDP/SSL), Application Load-Balancers (HTTP/HTTPS), Classic Load-Balancers (TCP/HTTP/SSL) and Gateway Load-Balancers (IP).

Network Load-Balancer (NLB)

A Network Load Balancer functions at the fourth layer of the Open Systems Interconnection (OSI) model. It can handle millions of requests per second. After the load balancer receives a connection request, it selects a target from the target group for the default rule. It attempts to open a TCP connection to the selected target on the port specified in the listener configuration. 

When you enable an Availability Zone for the load balancer, Elastic Load Balancing creates a load balancer node in the Availability Zone. By default, each load balancer node distributes traffic across the registered targets in its Availability Zone only. If you enable cross-zone load balancing, each load balancer node distributes traffic across the registered targets in all enabled Availability Zones.
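
If you want each NLB node to be able to serve targets in every enabled AZ, cross-zone load balancing can be switched on as a load balancer attribute. A hedged sketch with boto3 (the ARN is a placeholder):

    import boto3

    elbv2 = boto3.client("elbv2")
    # Enable cross-zone load balancing so each NLB node can reach targets in all enabled AZs.
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn="arn:aws:elasticloadbalancing:ap-southeast-2:123456789012:loadbalancer/net/my-nlb/abc123",
        Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
    )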

Application Load-Balancer (ALB)

An Application Load Balancer functions at the application layer, the seventh layer of the Open Systems Interconnection (OSI) model. After the load balancer receives a request, it evaluates the listener rules in priority order to determine which rule to apply, and then selects a target from the target group for the rule action. You can configure listener rules to route requests to different target groups based on the content of the application traffic. 

When you create an Application Load Balancer, you must enable the zones that contain your targets. To enable a zone, specify a subnet in the zone. Elastic Load Balancing creates a load balancer node in each zone that you specify, and you must specify subnets in at least two Availability Zones. If you register targets in a zone but do not enable the zone, those registered targets do not receive traffic from the load balancer; if you need targets served in every AZ, you must enable all of those AZs when you provision your ALB.
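
As a simple sketch, provisioning an ALB with subnets in two AZs looks like the following (names and IDs are placeholders):

    import boto3

    elbv2 = boto3.client("elbv2")
    # Enabling a subnet in each of two AZs means targets in both zones can receive traffic.
    response = elbv2.create_load_balancer(
        Name="my-resilient-alb",
        Type="application",
        Scheme="internet-facing",
        Subnets=["subnet-az-a", "subnet-az-b"],
        SecurityGroups=["sg-0123456789abcdef0"],
    )
    print(response["LoadBalancers"][0]["DNSName"])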

Gateway Load-Balancer (GLB)

Gateway Load Balancers enable you to deploy, scale, and manage virtual appliances, such as firewalls, intrusion detection and prevention systems, and deep packet inspection systems. It combines a transparent network gateway (that is, a single entry and exit point for all traffic) and distributes traffic while scaling your virtual appliances with the demand.

A Gateway Load Balancer operates at the third layer of the Open Systems Interconnection (OSI) model, the network layer. It listens for all IP packets across all ports and forwards traffic to the target group that’s specified in the listener rule. It maintains stickiness of flows to a specific target appliance using 5-tuple (for TCP/UDP flows) or 3-tuple (for non-TCP/UDP flows). The Gateway Load Balancer and its registered virtual appliance instances exchange application traffic using the GENEVE protocol on port 6081.

Gateway Load Balancers use Gateway Load Balancer endpoints to securely exchange traffic across VPC boundaries. A Gateway Load Balancer endpoint is a VPC endpoint that provides private connectivity between virtual appliances in the service provider VPC and application servers in the service consumer VPC. You deploy the Gateway Load Balancer in the same VPC as the virtual appliances. You register the virtual appliances with a target group for the Gateway Load Balancer.

Traffic to and from a Gateway Load Balancer endpoint is configured using route tables. Traffic flows from the service consumer VPC over the Gateway Load Balancer endpoint to the Gateway Load Balancer in the service provider VPC, and then returns to the service consumer VPC. You must create the Gateway Load Balancer endpoint and the application servers in different subnets. This enables you to configure the Gateway Load Balancer endpoint as the next hop in the route table for the application subnet.

The concept is that each virtual appliance resides within a subnet within a VPC. If there is no redundancy of a virtual appliance across AZs, you risk an outage to the service during an AZ outage event. It is therefore important to consider the placement of your virtual appliances and of the GLBs serving traffic to them.

Lambda

Lambda runs your function in multiple Availability Zones to ensure that it is available to process events in case of a service interruption in a single zone. If you configure your function to connect to a virtual private cloud (VPC) in your account, specify subnets in multiple Availability Zones to ensure high availability. If you specify only a single AZ, your function could become unavailable during an AZ failure.
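
A minimal sketch of attaching a function to subnets in multiple AZs (the function name, subnet and security group IDs are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")
    # Spreading the VPC configuration across three AZs keeps the function
    # invocable if any single AZ is impaired.
    lambda_client.update_function_configuration(
        FunctionName="my-function",
        VpcConfig={
            "SubnetIds": ["subnet-az-a", "subnet-az-b", "subnet-az-c"],
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )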

API Gateway

You can use Route 53 health checks to control DNS failover from an API Gateway API in a primary region to an API Gateway API in a secondary region. This can help mitigate impacts in the event of a Regional issue. If you use a custom domain, you can perform failover without requiring clients to change API endpoints.

When you choose Evaluate Target Health for an alias record, the record fails over only when the API Gateway service is unavailable in the Region. In some cases, your own API Gateway APIs can experience interruption before that point. To control DNS failover directly, configure custom Route 53 health checks for your API Gateway APIs; for example, a health check based on a CloudWatch alarm gives operators direct control over when failover occurs.
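
As one way to sketch this, the snippet below creates a simple HTTPS health check against the primary Region’s API and a pair of PRIMARY/SECONDARY failover records on a custom domain (it uses an endpoint health check rather than a CloudWatch alarm based one; the domain names, hosted zone ID and execute-api endpoints are placeholders):

    import uuid
    import boto3

    route53 = boto3.client("route53")

    # Health check against the primary Region's API endpoint.
    check = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "abc123.execute-api.ap-southeast-2.amazonaws.com",
            "ResourcePath": "/prod/health",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # Failover record pair: traffic goes to the primary while it is healthy,
    # and to the secondary Region's API otherwise.
    route53.change_resource_record_sets(
        HostedZoneId="Z0EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": check["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "abc123.execute-api.ap-southeast-2.amazonaws.com"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "def456.execute-api.us-west-2.amazonaws.com"}]}},
        ]},
    )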

Custom Failover Health Checks for Private VPC Resources

You may want to monitor your resources with private IP addresses or private domain names in VPCs. You could associate that health check with a record in a Route 53 hosted zone (public or private) to achieve a failover scenario when the primary record is unhealthy.

You can use an AWS CloudFormation template to perform TCP, HTTP, and HTTPS health checks for private resources in a VPC. The term private resource refers to any resource in a VPC not accessible over the internet.

Elastic Container Service (ECS)

ECS clusters span multiple AZs; however, when you create services, tasks are deployed to one or more subnets. A task definition describes each container that is required to run as part of a service. You must select at least two subnets in different AZs for your services in order to have multi-AZ resiliency.
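
A hedged sketch of creating a Fargate service spread across subnets in two AZs (cluster, task definition and network IDs are placeholders):

    import boto3

    ecs = boto3.client("ecs")
    # Two subnets in different AZs let ECS place tasks in either zone.
    ecs.create_service(
        cluster="my-cluster",
        serviceName="my-service",
        taskDefinition="my-task:1",
        desiredCount=2,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-az-a", "subnet-az-b"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
    )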

Elastic Kubernetes Service (EKS)

Amazon EKS runs and scales the Kubernetes control plane across multiple AWS Availability Zones to ensure high availability. Amazon EKS automatically scales control plane instances based on load, detects and replaces unhealthy control plane instances, and automatically patches the control plane. After you initiate a version update, Amazon EKS updates your control plane for you, maintaining high availability of the control plane during the update.

When you create a cluster, you specify a VPC and at least two subnets that are in different Availability Zones. When configuring the pods that run your containers, you’ll also need to consider your load-balancing strategy for running your application in your EKS cluster.
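
A minimal sketch of creating a cluster with subnets across three AZs (the role ARN, subnet and security group IDs are placeholders):

    import boto3

    eks = boto3.client("eks")
    # Control-plane ENIs and worker nodes can then be spread across all three AZs.
    eks.create_cluster(
        name="my-cluster",
        roleArn="arn:aws:iam::123456789012:role/eks-cluster-role",
        resourcesVpcConfig={
            "subnetIds": ["subnet-az-a", "subnet-az-b", "subnet-az-c"],
            "securityGroupIds": ["sg-0123456789abcdef0"],
        },
    )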

In versions 2.5 and newer, the AWS Load Balancer Controller becomes the default controller for Kubernetes service resources with the type: LoadBalancer and makes an AWS Network Load Balancer (NLB) for each service. It does this by making a mutating webhook for services, which sets the spec.loadBalancerClass field to service.k8s.aws/nlb for new services of type: LoadBalancer. You can turn off this feature and revert to using the legacy Cloud Provider as the default controller, by setting the helm chart value enableServiceMutatorWebhook to false. The cluster won’t provision new Classic Load Balancers for your services unless you turn off this feature. Existing Classic Load Balancers will continue to work.

The AWS Load Balancer Controller manages AWS Elastic Load Balancers for a Kubernetes cluster. The controller provisions the following resources:

Kubernetes Ingress

The AWS Load Balancer Controller creates an AWS Application Load Balancer (ALB) when you create a Kubernetes Ingress.

Kubernetes service of the LoadBalancer type

The AWS Load Balancer Controller creates an AWS Network Load Balancer (NLB) when you create a Kubernetes service of type LoadBalancer. In the past, the Kubernetes network load balancer was used for instance targets, while the AWS Load Balancer Controller was used for IP targets. With the AWS Load Balancer Controller version 2.3.0 or later, you can create NLBs using either target type. For more information about NLB target types, see Target type in the User Guide for Network Load Balancers.

Horizontal Pod Autoscaler

It’s easy to confuse this with the concept of auto-scaling across AZs; however, that isn’t what it provides unless nodes are spread across multiple AZs. The Horizontal Pod Autoscaler is designed to scale your application across multiple nodes, but those nodes could be part of a node group that exists in only one AZ. It’s important to understand the node configuration so you know whether your application can run across AZs when this is required.

The Kubernetes Horizontal Pod Autoscaler automatically scales the number of Pods in a deployment, replication controller, or replica set based on that resource’s CPU utilisation. This can help your applications scale out to meet increased demand or scale in when resources are not needed, thus freeing up your nodes for other applications. When you set a target CPU utilisation percentage, the Horizontal Pod Autoscaler scales your application in or out to try to meet that target.

The Horizontal Pod Autoscaler is a standard API resource in Kubernetes that simply requires that a metrics source (such as the Kubernetes metrics server) is installed on your Amazon EKS cluster to work. You do not need to deploy or install the Horizontal Pod Autoscaler on your cluster to begin scaling your applications. For more information, see Horizontal Pod Autoscaler in the Kubernetes documentation.

Relational Database Service (RDS)

Amazon RDS uses the MariaDB, MySQL, Oracle, and PostgreSQL DB engines’ built-in replication functionality to create a special type of DB instance called a read replica from a source DB instance. Updates made to the source DB instance are asynchronously copied to the read replica. You can reduce the load on your source DB instance by routing read queries from your applications to the read replica. Using read replicas, you can elastically scale out beyond the capacity constraints of a single DB instance for read-heavy database workloads. You can promote a read replica to a standalone instance as a disaster recovery solution if the source DB instance fails. For some DB engines, Amazon RDS also supports other replication options.

Read Replicas

Read replicas can be deployed across multiple AZs to allow database reads to continue if there is an outage in one AZ. Because a read replica can be promoted, write capability can also be rapidly restored in another AZ in the event of an AZ failure. Amazon RDS provides high availability and failover support for DB instances using Multi-AZ deployments. Amazon RDS uses several different technologies to provide failover support. Multi-AZ deployments for Oracle, PostgreSQL, MySQL, and MariaDB DB instances use Amazon’s failover technology. SQL Server DB instances use SQL Server Database Mirroring (DBM).
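
A hedged sketch of creating a replica in a different AZ and promoting it during a prolonged outage (instance identifiers are placeholders):

    import boto3

    rds = boto3.client("rds")

    # Place the read replica in a different AZ from the primary.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="orders-db-replica",
        SourceDBInstanceIdentifier="orders-db",
        AvailabilityZone="ap-southeast-2b",
    )

    # If the primary's AZ suffers a prolonged outage, promote the replica
    # to a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier="orders-db-replica")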

Backups

Amazon RDS creates and saves automated backups of your DB instance. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases.

Amazon RDS creates automated backups of your DB instance during the backup window of your DB instance. Amazon RDS saves the automated backups of your DB instance according to the backup retention period that you specify. If necessary, you can recover your database to any point in time during the backup retention period. You can also back up your DB instance manually, by manually creating a DB snapshot.

You can create a DB instance by restoring from this DB snapshot as a disaster recovery solution if the source DB instance fails. The backups themselves are stored in S3 to provide resilience against an AZ outage, allowing you to restore to another RDS instance in another AZ.

RDS Multi-AZ Deployment Options

RDS Multi-AZ uses standby instances in another AZ to allow for graceful takeover in the event of the loss of an AZ. The standby instances are not written to directly; data is synchronously replicated from the primary to the standby. There are two types of standby architecture using native Multi-AZ: single standby and multiple (readable) standbys.

Single Standby Multi-AZ Deployment:

In an Amazon RDS Multi-AZ deployment, Amazon RDS automatically creates a primary database (DB) instance and synchronously replicates the data to an instance in a different AZ. When it detects a failure, Amazon RDS automatically fails over to a standby instance without manual intervention.
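
A minimal sketch of requesting a Multi-AZ (single standby) instance; the identifiers, credentials, engine and sizing are placeholders:

    import boto3

    rds = boto3.client("rds")
    # MultiAZ=True asks RDS to maintain a synchronous standby in another AZ
    # and to fail over to it automatically.
    rds.create_db_instance(
        DBInstanceIdentifier="orders-db",
        Engine="postgres",
        DBInstanceClass="db.m6g.large",
        AllocatedStorage=100,
        MasterUsername="dbadmin",
        MasterUserPassword="change-me-please",  # placeholder only
        MultiAZ=True,
    )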

Multi Standby Multi-AZ Deployment:

This option deploys highly available, durable MySQL or PostgreSQL databases across three AZs using Amazon RDS Multi-AZ with two readable standbys. You gain automatic failover in typically under 35 seconds, up to 2x faster transaction commit latency compared to Amazon RDS Multi-AZ with one standby, additional read capacity, and a choice of AWS Graviton2 or Intel based instances for compute.

Comparison of single and multiple standby deployments:

Available engines

  • Single-AZ and Multi-AZ with one standby: Amazon RDS for MariaDB, MySQL, PostgreSQL, Oracle and SQL Server
  • Multi-AZ with two readable standbys: Amazon RDS for MySQL and PostgreSQL

Additional read capacity

  • Single-AZ: none; read capacity is limited to your primary
  • Multi-AZ with one standby: none; your standby DB instance is only a passive failover target for high availability
  • Multi-AZ with two readable standbys: two standby DB instances act as failover targets and serve read traffic; read capacity is determined by the overhead of write transactions replicated from the primary

Lower latency (higher throughput) for transaction commits

  • Single-AZ and Multi-AZ with one standby: no improvement
  • Multi-AZ with two readable standbys: up to 2x faster transaction commits compared to Amazon RDS Multi-AZ with one standby

Automatic failover duration

  • Single-AZ: not available; a user-initiated point-in-time restore operation is required. This operation can take several hours to complete, and any data updates that occurred after the latest restorable time (typically within the last 5 minutes) will not be available
  • Multi-AZ with one standby: a new primary is available to serve your workload in as quickly as 60 seconds; failover time is independent of write throughput
  • Multi-AZ with two readable standbys: a new primary is available to serve your workload typically in under 35 seconds; failover time depends on the length of replica lag

Higher resiliency to AZ outage

  • Single-AZ: none; in the event of an AZ failure, you risk data loss and hours of failover time
  • Multi-AZ with one standby: in the event of an AZ failure, your workload automatically fails over to the up-to-date standby
  • Multi-AZ with two readable standbys: in the event of a failure, one of the two remaining standbys takes over and serves the workload (writes) from the primary

Lower jitter for transaction commits

  • Single-AZ: no optimisation for jitter
  • Multi-AZ with one standby: sensitive to impairments on the write path
  • Multi-AZ with two readable standbys: uses a 2-of-3 write quorum and is insensitive to up to one impaired write path


DynamoDB

DynamoDB automatically spreads the data and traffic for your tables over a sufficient number of servers to handle your throughput and storage requirements, while maintaining consistent and fast performance. All of your data is stored on solid-state disks (SSDs) and is automatically replicated across multiple Availability Zones in an AWS Region, providing built-in high availability and data durability. In this way, DynamoDB is already highly resilient to an AZ failure. If you need to protect against a regional outage and require your DynamoDB data available across multiple regions, you can deploy Global Tables.
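
As a sketch, adding a replica Region to an existing table (global tables version 2019.11.21) can be done with a single UpdateTable call; the table name and Regions are placeholders, and the table needs DynamoDB Streams enabled with new and old images:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="ap-southeast-2")
    # Creates a replica of the table in us-west-2, turning it into a global table.
    dynamodb.update_table(
        TableName="orders",
        ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
    )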

DynamoDB Accelerator (DAX)

Importantly, if you are deploying DynamoDB Accelerator (DAX) to cache read-heavy DynamoDB tables, the DAX cluster is deployed into your VPC and therefore exists in a single Region. A single DAX cluster cannot serve a DynamoDB global table across Regions; each replica Region needs its own cluster.

A DAX cluster in an AWS Region can only interact with DynamoDB tables that are in the same Region. For this reason, ensure that you launch your DAX cluster in the correct Region. If you have DynamoDB tables in other Regions, you must launch DAX clusters in those Regions too. 

Don’t place all of your cluster’s nodes in a single Availability Zone. In this configuration, your DAX cluster becomes unavailable if there is an Availability Zone failure.

For production usage, we strongly recommend using DAX with at least three nodes, with each node placed in a different Availability Zone. Three nodes are required for a DAX cluster to be fault-tolerant.

A DAX cluster can be deployed with one or two nodes for development or test workloads. One- and two-node clusters are not fault-tolerant, and we don’t recommend using fewer than three nodes for production use. If a one- or two-node cluster encounters software or hardware errors, the cluster can become unavailable or lose cached data.
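
A hedged sketch of a three-node DAX cluster; with a subnet group that spans three AZs, the nodes are spread across zones (the cluster name, node type, role ARN and network IDs are placeholders):

    import boto3

    dax = boto3.client("dax")
    # ReplicationFactor=3 gives one primary and two read replicas, which is
    # the minimum for a fault-tolerant cluster.
    dax.create_cluster(
        ClusterName="orders-dax",
        NodeType="dax.r5.large",
        ReplicationFactor=3,
        IamRoleArn="arn:aws:iam::123456789012:role/DAXServiceRole",
        SubnetGroupName="dax-three-az-subnet-group",
        SecurityGroupIds=["sg-0123456789abcdef0"],
    )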

CloudFront

Amazon CloudFront is a Content Delivery Network (CDN) web service that speeds up distribution of your static and dynamic web content, such as .html, .css, .js, and image files, to your users. CloudFront delivers your content through a worldwide network of data centres called edge locations. When a user requests content that you’re serving with CloudFront, the request is routed to the edge location that provides the lowest latency (time delay), so that content is delivered with the best possible performance.

If the content is already in the edge location with the lowest latency, CloudFront delivers it immediately. If the content is not in that edge location, CloudFront retrieves it from an origin that you’ve defined—such as an Amazon S3 bucket, a MediaPackage channel, or an HTTP server (for example, a web server) that you have identified as the source for the definitive version of your content.

CloudFront speeds up the distribution of your content by routing each user request through the AWS backbone network to the edge location that can best serve your content. Typically, this is a CloudFront edge server that provides the fastest delivery to the viewer. Using the AWS network dramatically reduces the number of networks that your users’ requests must pass through, which improves performance. Users get lower latency—the time it takes to load the first byte of the file—and higher data transfer rates.

You also get increased reliability and availability because copies of your files (also known as objects) are now held (or cached) in multiple edge locations around the world.

Edge locations are backed by larger regional edge caches, which sit between the edge locations and your origin.

CloudFront has edge locations all over the world. AWS’s cost for each edge location varies and, as a result, the price charged varies depending on which edge location serves the requests. Using price classes, you can choose which edge locations serve your distribution, balancing cost against where your target audience is located.

One of the purposes of using CloudFront is to reduce the number of requests that your origin server must respond to directly. With CloudFront caching, more objects are served from CloudFront edge locations, which are closer to your users. This reduces the load on your origin server and reduces latency.

The more requests that CloudFront can serve from edge caches, the fewer viewer requests that CloudFront must forward to your origin to get the latest version or a unique version of an object. To optimise CloudFront to make as few requests to your origin as possible, consider using CloudFront Origin Shield.

The proportion of requests that are served directly from the CloudFront cache compared to all requests is called the cache hit ratio. You can view the percentage of viewer requests that are hits, misses, and errors in the CloudFront console.

CloudFront Origin Shield is an additional layer in the CloudFront caching infrastructure that helps to minimise your origin’s load, improve its availability, and reduce its operating costs. With CloudFront Origin Shield, you get the following benefits:

Better cache hit ratio

Origin Shield can help improve the cache hit ratio of your CloudFront distribution because it provides an additional layer of caching in front of your origin. When you use Origin Shield, all requests from all of CloudFront’s caching layers to your origin go through Origin Shield, increasing the likelihood of a cache hit. CloudFront can retrieve each object with a single origin request from Origin Shield to your origin, and all other layers of the CloudFront cache (edge locations and regional edge caches) can retrieve the object from Origin Shield.

Reduced origin load

Origin Shield can further reduce the number of simultaneous requests that are sent to your origin for the same object. Requests for content that is not in Origin Shield’s cache are consolidated with other requests for the same object, resulting in as few as one request going to your origin. Handling fewer requests at your origin can preserve your origin’s availability during peak loads or unexpected traffic spikes, and can reduce costs for things like just-in-time packaging, image transformations, and data transfer out (DTO).

Better network performance

When you enable Origin Shield in the AWS Region that has the lowest latency to your origin, you can get better network performance. For origins in an AWS Region, CloudFront network traffic remains on the high throughput CloudFront network all the way to your origin. For origins outside of AWS, CloudFront network traffic remains on the CloudFront network all the way to Origin Shield, which has a low latency connection to your origin.

You incur additional charges for using Origin Shield; however, these may be outweighed by an improved end-user experience, and partially offset by being able to run fewer and/or smaller instances behind your origin.
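
Origin Shield is enabled per origin inside the distribution configuration. A hedged sketch of just the origin fragment (bucket name and Regions are placeholders); in practice this sits inside the full DistributionConfig passed to create_distribution or update_distribution:

    # Origin entry with Origin Shield enabled in the Region closest to the origin.
    origin = {
        "Id": "my-s3-origin",
        "DomainName": "my-bucket.s3.ap-southeast-2.amazonaws.com",
        "S3OriginConfig": {"OriginAccessIdentity": ""},
        "OriginShield": {
            "Enabled": True,
            "OriginShieldRegion": "ap-southeast-2",
        },
    }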

Conclusion

The cloud is not immune to outages. For those uninitiated in the ways of the cloud, without a fundamental understanding of how it can fail, it is tempting to base decisions about infrastructure redundancy purely on financial cost; this is seldom a good idea. To make an informed decision, one needs to understand the true cost of not having resiliency against a failure.

We can’t predict when a failure will occur; however, we can predict the impact an outage will have on business continuity and on customers (be they internal or external). Be open to the concept that everything fails all the time, and understand the scenarios you wish to protect against. Build your critical infrastructure around what your business needs, and accept that there is a cost to recover from an outage. The risk is up to you.

In my next post, I’ll discuss how to quantify your risks and understand your goals to make smart decisions around infrastructure and architecture.
