AWS Resilience Hub – A Primer

Have you heard of AWS Resilience Hub? It was a recent discovery of mine, even though the service has been available to the public since 2021. With the outages caused by the recent CrowdStrike incident still fresh in our minds, now is a timely moment to revisit the resiliency of your applications and services. Read on if you would like to learn more about Resilience Hub and how it could help you improve the resilience of your applications.

Resilience Hub: Explained in the time it takes to make a cup of tea

AWS Resilience Hub offers a single place where you can track, assess, improve, and test the resiliency of your applications on AWS. Once applications are registered in Resilience Hub, you can perform the following activities: 

  • Define resilience targets – RPO (Recovery Point Objective) and RTO (Recovery Time Objective) 
  • Perform gap analysis to determine whether applications are meeting their resilience targets 
  • Receive recommendations for improvement based on the AWS Well-Architected Framework
  • Test changes by simulating a range of failures
     

This makes Resilience Hub a good pairing with the Resilience lifecycle framework for improving the overall resilience of your applications.

What is the Resilience Lifecycle Framework?

The Resilience lifecycle framework is a five-stage process that helps organisations design, build, deploy and operate resilient applications on AWS. You can read more about the Resilience lifecycle framework in AWS Prescriptive Guidance. The framework complements the Well-Architected Framework and closely aligns with the guidance provided in its Reliability Pillar.

A note about establishing RTO and RPO targets for workloads in Resilience Hub

Before we continue, let us recap some terms we will encounter in this blog:

  • Service Level Objectives (SLO): SLOs are specific, measurable targets that an organisation sets for its applications and services. They define the level of service that the organisation commits to delivering to its customers and/or users. Some common examples of high-level measurable targets are Availability, Latency and Throughput. For example, an SLO could be “99.99% availability for the CatPix website”. More specific targets like RTO and RPO can also be defined within an SLO. 
  • Recovery Time Objective (RTO): RTO is the maximum acceptable time for restoring a service or application after a disruption. It defines how quickly the application needs to be restored to normal operation. For example, an RTO of 1 hour means the application should be restored within 1 hour of a disruption. 
  • Recovery Point Objective (RPO): RPO is the maximum acceptable amount of data loss in a disruption, expressed as a period of time. It defines how much data the organisation can afford to lose. For example, an RPO of 1 hour means the application should be able to recover data to a point no more than 1 hour before the disruption occurred.

The high-level SLO metrics are closely tied to the RTO and RPO targets. The SLOs set the overall service level targets, while the RTO and RPO define the specific resiliency requirements for recovering from disruptions. 
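To make the two definitions concrete, here is a minimal sketch in Python that checks one disruption against an RTO and an RPO. The function name, parameters and timestamps are all hypothetical and chosen only for illustration; this is not how Resilience Hub itself performs its estimates.

```python
from datetime import datetime, timedelta


def meets_targets(last_backup, disruption, restored, rto, rpo):
    """Return (rto_met, rpo_met) for a single disruption."""
    # RTO: how long the service was down before it was restored.
    downtime = restored - disruption
    # RPO: how much data (measured in time) was written after the
    # last recoverable point and is therefore lost.
    data_loss_window = disruption - last_backup
    return downtime <= rto, data_loss_window <= rpo


# Illustrative values only: a backup 45 minutes before the disruption,
# and service restored 90 minutes after it.
disruption = datetime(2024, 7, 19, 9, 0)
result = meets_targets(
    last_backup=disruption - timedelta(minutes=45),
    disruption=disruption,
    restored=disruption + timedelta(minutes=90),
    rto=timedelta(hours=1),
    rpo=timedelta(hours=1),
)
print(result)  # (False, True)
```

In this made-up scenario the 1 hour RPO is met (45 minutes of data lost) but the 1 hour RTO is not (90 minutes to restore).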

How much delay and data loss is acceptable between service interruption and service restoration depends on an organisation’s priorities and risk appetite. Organisations often conduct a business impact analysis to understand the consequences of service interruptions and how they weigh on business priorities. Those findings are used to formulate Service Level Objectives (SLOs); the RPO and RTO targets are defined as part of the SLO.

It is recommended to refer to your application’s SLO to obtain or derive your RPO and RTO targets. 

Once you have a clear understanding of your application’s RPO and RTO targets, you can enter those values into Resilience Hub during the registration process and begin assessing and testing your application’s current resilience posture against its objectives. 

If you would like more guidance setting resilience objectives for your application, please refer to Stage 1 of the Resilience lifecycle framework for more details. 

What is a common workflow for using AWS Resilience Hub?

Here is a common workflow for using AWS Resilience Hub:

Step 1: Add an application into Resilience Hub

There are several options available to import an application: 

  • Reference the application’s CloudFormation stack 
  • Select application from AppRegistry 
  • Reference the application’s Terraform state files 
  • Reference application resources in an EKS cluster 

As with Security Hub, you can add applications and resources from multiple accounts.
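The same registration can also be scripted. The sketch below assumes the boto3 `resiliencehub` client and uses its `create_app`, `import_resources_to_draft_app_version`, `describe_draft_app_version_resources_import_status` and `publish_app_version` operations as I understand them; the helper name `register_application`, the polling values and the placeholder ARN are hypothetical, so please check the current API reference before relying on it.

```python
import time


def register_application(client, name, source_arns, poll_seconds=5, max_polls=60):
    """Create an app, import its resources, then publish the app version.

    `client` is expected to behave like boto3.client("resiliencehub").
    `source_arns` could be, for example, CloudFormation stack ARNs.
    """
    app_arn = client.create_app(name=name)["app"]["appArn"]
    client.import_resources_to_draft_app_version(appArn=app_arn, sourceArns=source_arns)

    # The import runs asynchronously, so wait for it before publishing.
    for _ in range(max_polls):
        status = client.describe_draft_app_version_resources_import_status(
            appArn=app_arn
        )["status"]
        if status == "Success":
            break
        if status == "Failed":
            raise RuntimeError("resource import failed")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("resource import did not finish in time")

    client.publish_app_version(appArn=app_arn)
    return app_arn


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   arn = register_application(
#       boto3.client("resiliencehub"), "catpix", ["<cloudformation-stack-arn>"]
#   )
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real account.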

Step 2: Set your application's resilience policy in terms of RTO and RPO targets, configure permissions, scheduled runs, and drift notifications

  • This is where you set RPO and RTO targets that are aligned with business priorities and SLOs
  • Resilience Hub helpfully suggests a few resiliency policies based on 5 resiliency tiers:
    • Foundational IT core services 
    • Mission critical 
    • Critical 
    • Important 
    • Non-critical 

While these suggestions provide a reasonable set of RPO and RTO targets to work with, it is recommended to refer to your SLO documentation for your resiliency targets where available.

  • By default, Resilience Hub assumes a single-region application. There is an option to set Region RTO/RPO targets for multi-region applications
  • You will be asked to configure an IAM role to grant permissions to discover and assess your applications’ resources. Attaching the AWS managed `AWSResilienceHubAsssessmentExecutionPolicy` IAM policy to your role should be sufficient to get started
  • Configure whether to schedule a resiliency assessment daily and receive notifications of assessment results via SNS. Things to note here:
    • There is not currently an option via the console to choose something other than a daily recurring schedule.
    • If a daily recurring schedule is not what you are looking for, there are other options available.
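The RTO and RPO targets from this step can also be expressed in code. The following is a rough sketch, assuming the boto3 `resiliencehub` client's `create_resiliency_policy` operation with its `policyName`, `tier` and `policy` parameters as I recall them; the helper names and the example targets are hypothetical. For simplicity it applies the same targets to every disruption type, which is an assumption of this sketch rather than a requirement of the service.

```python
def build_policy(rto_secs, rpo_secs, region_rto_secs=None, region_rpo_secs=None):
    """Build the `policy` payload, using one RTO/RPO pair for all disruption types."""
    target = {"rtoInSecs": rto_secs, "rpoInSecs": rpo_secs}
    policy = {
        "Software": dict(target),
        "Hardware": dict(target),
        "AZ": dict(target),
    }
    # Region targets are only added for multi-region applications.
    if region_rto_secs is not None and region_rpo_secs is not None:
        policy["Region"] = {"rtoInSecs": region_rto_secs, "rpoInSecs": region_rpo_secs}
    return policy


def create_policy(client, name, tier, policy):
    """Create the resiliency policy and return its ARN.

    `client` is expected to behave like boto3.client("resiliencehub");
    `tier` is a tier name such as "Critical" or "MissionCritical".
    """
    response = client.create_resiliency_policy(policyName=name, tier=tier, policy=policy)
    return response["policy"]["policyArn"]


# Example targets only: 1 hour RTO, 15 minute RPO, single region.
example = build_policy(rto_secs=3600, rpo_secs=900)
print(sorted(example))  # ['AZ', 'Hardware', 'Software']
```

Whatever numbers you use here, they should come from your SLO documentation rather than from this example.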

Step 3: Perform a point-in-time resilience assessment of your application

  • Once your application has been successfully added to Resilience Hub, we can initiate an assessment. The assessment process could take a few minutes to complete
  • Upon completion of the assessment, you should see an assessment report summary like the one below, giving a broad overview of where policy breaches have been identified:
  • The report provides two types of recommendations: Resiliency recommendations, and Operational recommendations
  • Resiliency recommendations drill into the application components that did not meet resilience goals and provide up to three recommendations for each. Each of the suggested recommendations provides guidance on how to optimise for:
    • Cost – what is the lowest cost change that still meets resiliency targets?
    • Minimal changes – what is the least effort change that still meets resiliency targets?
    • Availability Zone RTO/RPO – what change provides the lowest estimated workload recovery times during an AZ disruption?
  • Operational recommendations provide suggestions to set up CloudWatch alarms, SOPs and AWS FIS experiments via CloudFormation templates. There will be many selections to choose from; take your time to choose the ones that fit your use case. See the screenshots below to get an idea of the kinds of recommendations provided.
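An assessment like the one in this step can also be started outside the console, for instance from your own scheduler. Here is a minimal sketch, assuming the boto3 `resiliencehub` client's `start_app_assessment` and `describe_app_assessment` operations as I understand them; the helper name, the `"release"` app version and the polling values are assumptions made for illustration.

```python
import time


def run_assessment(client, app_arn, name, poll_seconds=15, max_polls=40):
    """Start an assessment and wait until it succeeds or fails.

    `client` is expected to behave like boto3.client("resiliencehub").
    Returns the final assessment description as a dict.
    """
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumed: assess the published version
        assessmentName=name,
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        if assessment["assessmentStatus"] in ("Success", "Failed"):
            return assessment
        time.sleep(poll_seconds)
    raise TimeoutError("assessment did not finish in time")


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   result = run_assessment(boto3.client("resiliencehub"), "<app-arn>", "adhoc-check")
#   print(result["complianceStatus"])
```

The returned `complianceStatus` field indicates whether the resiliency policy was met or breached; the detailed recommendations are still easiest to read in the console report.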

Step 4: Implement the policy recommendations

  • To address the policy breaches in the Resiliency recommendations, update the affected resources with the suggested changes detailed in the recommended options, then reassess the application to ensure that the policy breaches have been fixed.
  • To apply the suggestions from the Operational recommendations, follow the instructions to select the suggested alarms, SOPs and AWS FIS experiments you want to create and deploy a CloudFormation stack.
  • You may find yourself iterating a few times between steps 3 and 4, going back to step 3 to re-assess each change you have made to your application until all policy breaches and recommendations have been addressed. This is expected, and a desired part of the process.

Step 5: Test application resilience with Fault Injection Service (FIS)

  • After making improvements to your application’s resilience, you can ‘stress-test’ your application to validate that it behaves as expected during certain failure scenarios. AWS FIS is a managed service that enables you to design and perform failure scenarios against your application. There is much to say about FIS, which we will dive into in a separate blog coming soon!

Step 6: Track resilience posture of application

  • Monitor the ongoing resilience posture of your application using Resilience Hub and make further improvements as needed. Resilience Hub can also be integrated into your CI/CD pipeline to continuously validate the resilience of your application as changes are made.
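One possible shape for the CI/CD integration is a small gate that turns an assessment result into a pipeline exit code. The sketch below is only an illustration: the function name is hypothetical, and the `assessmentStatus` and `complianceStatus` fields and their values are assumed to match what the Resilience Hub API returns when describing an assessment.

```python
def resilience_gate(assessment):
    """Return a process exit code: 0 when the policy is met, 1 otherwise."""
    # An assessment that did not complete cannot be trusted as a pass.
    if assessment.get("assessmentStatus") != "Success":
        return 1
    return 0 if assessment.get("complianceStatus") == "PolicyMet" else 1


# Hand-written sample results, not real API output.
passing = {"assessmentStatus": "Success", "complianceStatus": "PolicyMet"}
breached = {"assessmentStatus": "Success", "complianceStatus": "PolicyBreached"}

print(resilience_gate(passing))   # 0
print(resilience_gate(breached))  # 1
```

A pipeline step could pass the returned value to `sys.exit` so that a breached policy blocks the deployment; whether to block or merely warn is a team decision.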

SOPs were provided as part of the Assessment Report's Operational Recommendations. What are they?

An SOP (Standard Operating Procedure) is a set of commands used to perform an action in your AWS account (not to be confused with the more common use of the term, which refers to documentation giving detailed instructions for carrying out routine operations). In this case, SOPs can be used to efficiently recover your application in the event of an outage or alarm. SOPs are based on SSM Automation documents. While Resilience Hub provides a wide selection of standard SOPs to go with your operational recommendations, they can also be customised to perform other actions.

SOPs are a useful tool in themselves; using them to codify common systems operations will lead to higher operational efficiency and fewer human errors.
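Because SOPs are SSM Automation documents, they can in principle be triggered like any other automation. Below is a minimal sketch, assuming the boto3 `ssm` client's `start_automation_execution` operation; the helper name, the document name and the parameter names are hypothetical, since the real ones depend on the SOP that Resilience Hub generates for your application.

```python
def run_sop(ssm_client, document_name, parameters):
    """Start an SSM Automation execution for an SOP and return its execution ID.

    `ssm_client` is expected to behave like boto3.client("ssm").
    `parameters` maps each parameter name to a list of string values.
    """
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]


# Usage (needs boto3 and AWS credentials, so it is left commented out).
# The document and parameter names below are placeholders, not real SOP names.
#   import boto3
#   execution_id = run_sop(
#       boto3.client("ssm"),
#       "<sop-document-name>",
#       {"<ParameterName>": ["<value>"]},
#   )
```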

What are the costs associated with using AWS Resilience Hub?

There is a flat monthly cost of $15 for each application added to Resilience Hub, regardless of how often you generate assessment reports for your applications. Remove the application from Resilience Hub to stop incurring charges.

Of course, you will still need to pay for the underlying resources used by your applications.
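As a quick back-of-the-envelope check, the flat pricing above can be turned into a simple estimate. The function name and the application counts are made up for illustration, and the figure excludes the underlying resource costs.

```python
MONTHLY_PRICE_PER_APP_USD = 15  # flat price per registered application, as quoted above


def resilience_hub_cost(app_count, months=1):
    """Estimated Resilience Hub charge in USD, excluding underlying resources."""
    return MONTHLY_PRICE_PER_APP_USD * app_count * months


print(resilience_hub_cost(1))              # 15
print(resilience_hub_cost(40, months=12))  # 7200
```

So a hypothetical organisation keeping 40 applications registered for a full year would pay 7,200 USD for Resilience Hub itself.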

Conclusion

We have briefly covered the key benefits of Resilience Hub, namely how it:

  • Provides a central location to manage and determine the resilience of your applications across multiple accounts
  • Runs an assessment against defined resiliency targets to uncover resilience weaknesses
  • Provides resiliency recommendations for resources that have breached resiliency targets, with at most 3 options to choose from, optimised for cost, minimal changes, or AZ/Region RTO/RPO
  • Provides operational recommendations on SOPs and alarms, and deploys them using CloudFormation stacks, making the changes manageable via IaC
  • Tests application resilience using FIS

We briefly touched on the managed FIS service, and ideas for integrating Resilience Hub into CI/CD pipelines – these are topics that will be further discussed in subsequent blogs. Stay tuned!

Some parting thoughts on Resilience Hub

Having experimented with Resilience Hub over the past few weeks, I want to point out that you do not need to use all of its features together. It can be adopted incrementally, as the need arises or as the team’s or organisation’s awareness of its resilience posture grows. For example:

  • One could initially use Resilience Hub as an assessment tool, as part of an extension of a Well-Architected Review for an app with a focus on the Reliability pillar. This could be an activity that is manually done periodically (quarterly, biannually, annually).
  • One could utilise Resilience Hub as a tool to provide recommendations on improving the resilience posture of an app, as an activity that is manually done periodically (quarterly, biannually, annually).
  • One could also integrate Resilience Hub into their CI/CD, with resilience testing built in, to provide a more rigorous treatment of their application’s resiliency.

Cost is a key factor to consider. Unlike Security Hub, the pricing for Resilience Hub is not as fine-grained: it charges on a per app per month basis, as opposed to Security Hub’s finer-grained per check per month basis. The costs could quickly stack up for organisations with many small to medium-sized applications and services. Things that we could explore to keep costs down are:

  • How we could ‘group’ similar applications together, and register them as one ‘application’ in Resilience Hub, to save on costs
  • For applications that do not need to be monitored monthly, we could investigate a process to efficiently ‘save’ the resilience assessment of an application, so that we could re-register the application to perform another assessment at a later point, without going through the whole setup process for the application again

Like Security Hub, the Resilience Hub can also be used to provide a ‘single dashboard’ to track resilience targets for all applications spread across multiple AWS accounts. That said, it can be just as effectively used at the ‘team’ level to help a team track the resilience targets of the applications they manage. It will be interesting to see how different organisations configure Resilience Hub for their use case.

 

The inclusion of some template resilience policies based on Resilience Hub’s five resiliency tiers is a great start too. But that is all it should be – a starting point for discussing your application’s resiliency. One might be tempted to settle for the default values provided by Resilience Hub – but I think we would then miss the big picture, and the opportunity to ask ourselves curiosity-driven questions like: ‘How do the RTO/RPO targets align with our business objectives?’, ‘Where can I find my SLO?’, ‘Is it enough?’, ‘Is it too much?’, ‘Why is that so?’, and many more. It is more work, but it will be more interesting, and you are guaranteed to learn more about how your applications support business operations.

These are some of the ideas I look forward to exploring in more detail soon; feel free to reach out to us if you are interested!

