Disaster recovery is one aspect of business continuity: a method by which an organization regains access to, and functionality of, its IT infrastructure after events like a natural disaster, a cyber attack, or other business disruptions. A variety of disaster recovery (DR) methods together form a disaster recovery plan.
Disaster recovery relies upon the replication of data in a different region not affected by the disaster. When the system is down because of a natural disaster, equipment failure, cyber attack or account compromise, a business needs to recover lost data from the location where the data is backed up. Ideally, an organization can completely migrate its workloads to that remote location as well in order to continue operations.
Two key metrics of disaster recovery are:
Recovery Time Objective (RTO) is defined by the organization as the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
Recovery Point Objective (RPO) is defined by the organization as the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
In this use case, we will be dealing with the AWS Pilot Light DR strategy, which replicates data from one region to another and provisions a copy of the core workload infrastructure. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configuration but are switched off, and are only used during testing or when a disaster recovery failover is invoked.
Solution Overview
The scope of this solution is an automated environment that stands up the required workloads and restores business continuity within the RPO and RTO accepted in the disaster recovery plan, under the disaster scenarios of account compromise and data change/corruption. Let's assume our solution caters for a Recovery Point Objective (RPO) of 1 hour and a Recovery Time Objective (RTO) of 3 hours.
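To make these two objectives concrete, here is a small sketch of how an incident would be checked against them. The timestamps are hypothetical and only illustrate the arithmetic, assuming the 1-hour RPO and 3-hour RTO above:

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=1)   # maximum tolerable data loss
RTO = timedelta(hours=3)   # maximum tolerable downtime

# Hypothetical incident timeline
last_recovery_point = datetime(2021, 6, 1, 9, 30)   # last successful snapshot
interruption = datetime(2021, 6, 1, 10, 15)         # service went down
restoration = datetime(2021, 6, 1, 12, 45)          # service restored in DR

data_loss = interruption - last_recovery_point      # 45 minutes of lost data
downtime = restoration - interruption               # 2.5 hours of downtime

print(data_loss <= RPO, downtime <= RTO)  # True True
```

With hourly snapshots, the worst-case data loss is just under one hour, which is what keeps this design inside the stated RPO.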
Automation is in place to have RDS instances in the source AWS account (the production account) share their latest manual snapshots with the destination AWS account (the DR account), so that the RDS snapshots in the destination account stay in sync with the production RDS instance until the time of DR failover. To achieve the RPO of 1 hour, a snapshot is taken every hour on the production RDS instance.
PREREQUISITES FOR RDS SNAPSHOTS AUTOMATION
1. AWS KMS customer managed keys (CMKs) should exist in both the source and destination accounts, with appropriate permissions assigned to both keys. Make sure the destination account has access to the KMS key in the source account: the copy-snapshot operation needs the source KMS key in order to copy the shared snapshot (which was encrypted with that same key) locally.
2. For cross-account sharing of RDS snapshots to work, the RDS instance must be encrypted with a customer managed KMS key, and that same key must be used to encrypt the snapshots shared with the destination account. Cross-account sharing of RDS snapshots is not supported when an RDS instance is encrypted with an AWS managed KMS key.
Note: This solution doesn't provide a CloudFormation template for deploying RDS instances. It is recommended to include a template within this solution that deploys an RDS instance encrypted with the same AWS CMK created as part of this solution. When an AWS CMK is provided externally to this solution, a CFN template to deploy encrypted RDS instances (using the externally provided CMK) is not required.
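To illustrate the cross-account KMS prerequisite, the sketch below shows the shape of a key-policy statement on the source CMK that lets the DR account use the key when copying the shared, encrypted snapshot. The account ID is hypothetical, and the exact action list may need adjusting for your setup:

```python
import json

DEST_ACCOUNT_ID = "222222222222"  # hypothetical DR account ID

# Sketch of a key-policy statement on the source CMK granting the
# DR account the permissions typically needed to copy a shared,
# encrypted snapshot locally.
cross_account_statement = {
    "Sid": "AllowDRAccountUseOfTheKey",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{DEST_ACCOUNT_ID}:root"},
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:CreateGrant",
    ],
    "Resource": "*",
}

print(json.dumps(cross_account_statement, indent=2))
```

This statement would be appended to the source key's policy; the destination account can then reference the source key ARN during the copy operation.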
3. The manual RDS snapshot limit should not be hit at the account level. By default, 100 manual RDS snapshots are allowed per region per AWS account. This can be increased by raising an AWS Support limit-increase request, up to a maximum of 700. This constraint has to be dealt with in both AWS accounts.
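A quick back-of-the-envelope check shows why this limit matters with hourly snapshots. The instance count below is hypothetical:

```python
# Worst-case manual snapshot count vs. the 100-snapshot default limit.
snapshots_per_day = 24          # one snapshot per hour (RPO = 1 hour)
retention_days = 1              # RetentionDays used in this solution
rds_instances = 3               # hypothetical number of matched instances

# Before the nightly delete runs, up to two days' worth can exist at once.
worst_case = snapshots_per_day * (retention_days + 1) * rds_instances
print(worst_case)  # 144 -> above 100, so a limit increase would be needed
```

This is why the solution keeps RetentionDays at 1: with longer retention or more instances, the default limit is exhausted quickly.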
Tools involved to support the solution
The solution is completely defined as infrastructure-as-code (IaC) leveraging AWS CloudFormation and Cloudreach Sceptre.
Sceptre is a Python-based tool that invokes the CloudFormation API. Sceptre is used to pass parameters to the respective CloudFormation templates (mapped inside the Sceptre configs) and to chain stacks as dependencies. Parameters are stored in the configs directory, and CloudFormation templates in the templates directory; both directories should be located within the sceptre directory.
The AWS CloudFormation templates have been developed to support multiple environments through input parameters and outputs; Sceptre retrieves outputs from stacks that depend on each other. Sceptre can be considered throw-away, as AWS CloudFormation provides the primary interaction with the AWS platform. Information on getting started with Sceptre can be found here. A few of the AWS CloudFormation templates are generated from Jinja templates to support some of Sceptre's functionality. More info on Jinja template integration in Sceptre is here.
Solution Architecture
Solution Workflow
In the Source Account
1. A CloudWatch Events rule is scheduled to trigger the Take Snapshots Step Functions state machine every hour. That state machine invokes a Lambda function that takes a snapshot of the source RDS instance and applies some standard tags. It matches RDS instances using a regular expression on their names, which makes it possible to target multiple RDS instances that share a naming pattern. The TakeSnapshot step definition looks as below.
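For readers unfamiliar with Step Functions, a minimal Amazon States Language sketch of such a state machine might look like the following. The function ARN, state name, and retry values are hypothetical, not taken from the solution's templates:

```python
import json

# Hypothetical ARN; substitute the real TakeSnapshots function ARN.
take_snapshots_lambda_arn = (
    "arn:aws:lambda:us-east-1:111111111111:function:take-snapshots-rds"
)

# Minimal sketch: a single Task state that invokes the Lambda function,
# retrying on any error before the execution is marked failed.
state_machine_definition = {
    "Comment": "Sketch of a TakeSnapshot state machine",
    "StartAt": "TakeSnapshots",
    "States": {
        "TakeSnapshots": {
            "Type": "Task",
            "Resource": take_snapshots_lambda_arn,
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 300,
                    "MaxAttempts": 3,
                }
            ],
            "End": True,
        }
    },
}

print(json.dumps(state_machine_definition, indent=2))
```

The Share and Delete state machines described below follow the same single-task shape, each wrapping its own Lambda function.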
2. There are two other state machines and Lambda functions. The Share Snapshots state machine looks for new snapshots created by the TakeSnapshots Lambda function and, when it finds one that is intended to be shared, shares it with the destination account. This state machine is triggered every 90 minutes by another CloudWatch Events rule. The ShareSnapshot step definition looks as below.
3. The other state machine deletes old snapshots: it calls its dedicated Lambda function to delete snapshots according to the RetentionDays parameter supplied when the stack is launched. This state machine runs at a set time each night; if it finds a snapshot older than the retention period, it deletes it. RetentionDays is currently set to 1 day to avoid hitting the AWS manual snapshot limit and to limit the running cost of RDS snapshots. The DeleteSnapshot step definition looks as below.
4. Monitoring is set on the Step Functions to track the states of the running state machines; any failed states are notified via SNS. The above three Step Functions have dedicated CloudFormation templates, which are provisioned in the source account targeting the source RDS instances.
In the Destination Account
1. There are two state machines and corresponding Lambda functions. The Copy Snapshots state machine looks for new snapshots that have been shared but not yet copied. When it finds them, it creates a copy in the destination account, encrypted with the KMS key that has been stipulated. This state machine is triggered every 100 minutes by another CloudWatch Events rule. The RetentionDays parameter is currently set to 2 days because the copy Lambda function only copies snapshots newer than the retention period; with a retention of 1 day or less, the latest snapshots could be missed. The Copy Snapshot step definition on the destination account looks as below.
2. The other state machine is just like its counterpart in the source account: the Delete Snapshots state machine calls its corresponding Lambda function to delete snapshots according to the RetentionDays parameter supplied when the stack is launched. It runs at a set time each night; if it finds a snapshot older than the retention period, it deletes it. RetentionDays is currently set to 1 day to avoid hitting the AWS manual snapshot limit and to limit the running cost of RDS snapshots. The Delete Snapshot step definition on the destination account looks as below.
3. Monitoring is set on the Step Functions to track the states of the running state machines; any failed states are notified via SNS. The above two automations have dedicated CloudFormation templates, which are provisioned in the destination account targeting the destination RDS instances.
Solution Walkthrough
The main moving parts in the solution, which do the heavy lifting for us, are the Lambda functions. The sceptre/lambda_code/ directory holds all the Lambda functions required for this solution, deployed in both the source and destination accounts. The Lambda folders within this directory are named after the functionality performed by the corresponding function. We will go through each Lambda function in detail in this section.
The code base also includes implementations that support running the automation end to end within a single account, where no cross-account environment is involved. The directories sceptre/lambda_code/delete_old_snapshots_rds_no_x_account/ and sceptre/lambda_code/copy_snapshots_rds_no_x_account/ can be enhanced into a single-account, multi-region solution rather than a cross-account one.
sceptre/lambda_code/take_snapshot_rds/
This Lambda function takes snapshots of RDS instances (in the source AWS account) according to the environment variables PATTERN and INTERVAL. Set PATTERN to a regex that matches your RDS instance identifiers, and INTERVAL to the number of hours between backups. The function lists the available manual snapshots and only triggers a new one if the latest is older than INTERVAL hours. Set FILTERINSTANCE to True to only take snapshots of RDS instances whose CopyDBSnapshot tag is set to True.
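The PATTERN matching is an unanchored regex search against instance identifiers, which is what lets one deployment cover several instances. A small sketch with hypothetical identifiers:

```python
import re

PATTERN = "prod-app-.*"   # hypothetical regex for instance identifiers
instances = ["prod-app-orders", "prod-app-billing", "staging-app-orders"]

# re.search matches anywhere in the identifier, so every instance whose
# name contains the pattern is selected for snapshotting.
matched = [i for i in instances if re.search(PATTERN, i)]
print(matched)  # ['prod-app-orders', 'prod-app-billing']
```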
This Lambda calls the get_latest_snapshot_ts function, imported from lambda_code/take_snapshot_rds/snapshots_tool_utils.py, which returns the timestamp of the latest snapshot for a specific DBInstanceIdentifier so that the snapshot's age can be determined. If no snapshot is found, a new one is triggered by calling the create_db_snapshot API of the RDS client.
def get_latest_snapshot_ts(instance_identifier, filtered_snapshots):
    timestamps = []
    for snapshot, snapshot_object in filtered_snapshots.items():
        if snapshot_object['DBInstanceIdentifier'] == instance_identifier:
            timestamp = get_timestamp_no_minute(snapshot, filtered_snapshots)
            if timestamp is not None:
                timestamps.append(timestamp)
    if len(timestamps) > 0:
        return max(timestamps)
    else:
        return None
def lambda_handler(event, context):
    client = boto3.client('rds', region_name=REGION)
    # Excerpt: snapshot_identifier, db_instance and timestamp_format are
    # derived earlier in the full handler.
    response = client.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_identifier,
        DBInstanceIdentifier=db_instance['DBInstanceIdentifier'],
        Tags=[{'Key': 'CreatedBy', 'Value': 'Snapshot Tool for RDS'},
              {'Key': 'CreatedOn', 'Value': timestamp_format},
              {'Key': 'shareAndCopy', 'Value': 'YES'},
              {'Key': 'Platform', 'Value': PLATFORM}])
sceptre/lambda_code/share_snapshot_rds/
This Lambda function shares snapshots (in the source AWS account) created by take_snapshots_rds with the destination AWS account number set in the environment variable DEST_ACCOUNT. It only shares snapshots tagged with shareAndCopy set to YES: it calls the search_tag_shared function, imported from lambda_code/share_snapshot_rds/snapshots_tool_utils.py, which takes a describe_db_snapshots response and searches for our CreatedBy tag, and then issues a modify_db_snapshot_attribute call on the RDS client to share the snapshot. Only snapshots in the Available state are shared; any snapshot in the InProgress state is skipped until it becomes available.
def search_tag_shared(response):
    for tag in response['TagList']:
        if tag['Key'] == 'shareAndCopy' and tag['Value'] == 'YES':
            for tag2 in response['TagList']:
                if tag2['Key'] == 'CreatedBy' and tag2['Value'] == 'Snapshot Tool for RDS':
                    return True

def lambda_handler(event, context):
    client = boto3.client('rds', region_name=REGION)
    response_modify = client.modify_db_snapshot_attribute(
        DBSnapshotIdentifier=snapshot_identifier,
        AttributeName='restore',
        ValuesToAdd=[DEST_ACCOUNTID])
sceptre/lambda_code/delete_old_snapshots_source_rds/
This Lambda function deletes snapshots (in the source AWS account) that have expired and whose names match the regex set in the PATTERN environment variable. It also looks for a matching timestamp in the format YYYY-MM-DD-HH-mm. Set PATTERN to a regex that matches the RDS instance identifiers.
The Lambda function filters the list of snapshots and calculates the age of each snapshot by subtracting its creation date from today's date; if the difference is greater than the RETENTION_DAYS parameter, the snapshot is deleted by issuing a delete_db_snapshot call on the RDS client. The creation date of a snapshot is obtained by calling the get_timestamp function, imported from lambda_code/delete_old_snapshots_source_rds/snapshots_tool_utils.py, which searches for a timestamp in the snapshot name.
def get_timestamp(snapshot_identifier, snapshot_list):
    pattern = '%s-(.+)' % snapshot_list[snapshot_identifier]['DBInstanceIdentifier']
    date_time = re.search(pattern, snapshot_identifier)
    if date_time is not None:
        return datetime.strptime(date_time.group(1), _TIMESTAMP_FORMAT)

def lambda_handler(event, context):
    client = boto3.client('rds', region_name=REGION)
    filtered_list = get_own_snapshots_source(PATTERN, response)
    for snapshot in filtered_list.keys():
        creation_date = get_timestamp(snapshot, filtered_list)
        if creation_date:
            difference = datetime.now() - creation_date
            days_difference = difference.total_seconds() / 3600 / 24
            logger.debug('%s created %s days ago' %
                         (snapshot, days_difference))
            # if we are past RETENTION_DAYS
            if days_difference > RETENTION_DAYS:
                # delete it
                logger.info('Deleting %s' % snapshot)
                client.delete_db_snapshot(DBSnapshotIdentifier=snapshot)
sceptre/lambda_code/copy_snapshots_dest_rds/
This Lambda function copies shared RDS snapshots that match the regex specified in the environment variable SNAPSHOT_PATTERN into the account and region where it runs (the destination AWS account). If the snapshot is shared and exists in the local region, it copies it to the region specified in the environment variable DEST_REGION.
The Lambda function calls the copy_local and copy_remote functions from lambda_code/copy_snapshots_dest_rds/snapshots_tool_utils.py to copy the appropriate snapshots to their destined region. If it finds that the snapshots are shared and exist in both the local and destination regions, it deletes them from the local region. Copying snapshots cross-account and cross-region need to be separate operations, so this function needs to run as many times as necessary for the workflow to complete. Set SNAPSHOT_PATTERN to a regex that matches your RDS instance identifiers, and DEST_REGION to the destination AWS region.
def copy_local(snapshot_identifier, snapshot_object):
    client = boto3.client('rds', region_name=_REGION)
    tags = [{'Key': 'CopiedBy',
             'Value': 'Snapshot Tool for RDS'},
            {'Key': 'Platform',
             'Value': PLATFORM}]
    if snapshot_object['Encrypted']:
        logger.info('Copying encrypted snapshot %s locally' % snapshot_identifier)
        response = client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=snapshot_object['Arn'],
            TargetDBSnapshotIdentifier=snapshot_identifier,
            KmsKeyId=_KMS_KEY_SOURCE_REGION,
            Tags=tags)
    else:
        logger.info('Copying snapshot %s locally' % snapshot_identifier)
        response = client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=snapshot_object['Arn'],
            TargetDBSnapshotIdentifier=snapshot_identifier,
            Tags=tags)
    return response

def copy_remote(snapshot_identifier, snapshot_object):
    client = boto3.client('rds', region_name=_DESTINATION_REGION)
    if snapshot_object['Encrypted']:
        logger.info('Copying encrypted snapshot %s to remote region %s' %
                    (snapshot_object['Arn'], _DESTINATION_REGION))
        response = client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=snapshot_object['Arn'],
            TargetDBSnapshotIdentifier=snapshot_identifier,
            KmsKeyId=_KMS_KEY_DEST_REGION,
            SourceRegion=_REGION,
            CopyTags=True)
    else:
        logger.info('Copying snapshot %s to remote region %s' %
                    (snapshot_object['Arn'], _DESTINATION_REGION))
        response = client.copy_db_snapshot(
            SourceDBSnapshotIdentifier=snapshot_object['Arn'],
            TargetDBSnapshotIdentifier=snapshot_identifier,
            SourceRegion=_REGION,
            CopyTags=True)
    return response
sceptre/lambda_code/delete_old_snapshots_dest_rds/
This Lambda function deletes manual RDS snapshots that have expired in the region specified in the environment variable DEST_REGION (in the destination AWS account), according to the environment variables SNAPSHOT_PATTERN and RETENTION_DAYS. Set SNAPSHOT_PATTERN to a regex that matches your RDS instance identifiers, DEST_REGION to the destination AWS region, and RETENTION_DAYS to the number of days snapshots should be kept before deletion.
The Lambda function filters the list of snapshots and calculates the age of each snapshot by subtracting its creation date from today's date; if the difference is greater than the RETENTION_DAYS parameter, the snapshot is deleted by issuing a delete_db_snapshot call on the RDS client. The creation date of a snapshot is obtained by calling the get_timestamp function, which searches for a timestamp in the snapshot name, and the snapshot list is filtered by calling search_tag_copied, which looks for a tag indicating that we copied the snapshot. Both functions are imported from lambda_code/delete_old_snapshots_dest_rds/snapshots_tool_utils.py.
def search_tag_copied(response):
    for tag in response['TagList']:
        if tag['Key'] == 'CopiedBy' and tag['Value'] == 'Snapshot Tool for RDS':
            return True

def lambda_handler(event, context):
    client = boto3.client('rds', region_name=DEST_REGION)
    filtered_list = get_own_snapshots_dest(PATTERN, response)
    for snapshot in filtered_list.keys():
        creation_date = get_timestamp(snapshot, filtered_list)
        if creation_date:
            snapshot_arn = filtered_list[snapshot]['Arn']
            response_tags = client.list_tags_for_resource(
                ResourceName=snapshot_arn)
            if search_tag_copied(response_tags):
                difference = datetime.now() - creation_date
                days_difference = difference.total_seconds() / 3600 / 24
                # if we are past RETENTION_DAYS
                if days_difference > RETENTION_DAYS:
                    # delete it
                    logger.info('Deleting %s. %s days old' %
                                (snapshot, days_difference))
                    client.delete_db_snapshot(DBSnapshotIdentifier=snapshot)
Solution Implementation
The code base is in this public GitHub repository. Clone the repository as below
$ git clone https://github.com/sathakatheef/rds-snapshot-automation.git
The sceptre/ directory is the Sceptre project, which contains the config/ directory used to pass parameters to the corresponding CloudFormation templates under the templates/ directory.
A Sceptre custom resolver (sceptre/custom_resolvers/sceptre-ssm-resolver/) is integrated for retrieving SSM parameter values, if any, by using the !ssm prefix in front of the parameter name, as in the example below. More info on Sceptre resolvers here.
parameters:
InstanceNamePattern: !ssm /aws/dev/rds/master/username
In order to use the SSM custom resolver, it needs to be installed as below.
$ cd sceptre/custom_resolvers/sceptre-ssm-resolver && pip install . --user
Sceptre custom hooks (sceptre/custom_hooks/sceptre_s3_packager/) are integrated for zipping the Lambda code package and uploading it to the S3 bucket from which it is deployed to the Lambda function. The S3 upload task is executed in two scenarios, called hooks: before_create runs before the stack (in which the hook is declared) is created, and before_update runs before that same stack is updated. The !sceptre_s3_upload prefix needs to be followed by the path to the Lambda code to package. Example usage is below. More info on Sceptre hooks here.
hooks:
  before_create:
    - !sceptre_s3_upload lambda_code/take_snapshots_rds
  before_update:
    - !sceptre_s3_upload lambda_code/take_snapshots_rds
S3Key: !sceptre_s3_key lambda_code/take_snapshots_rds
In order to use the S3 upload and S3 key custom hooks, they need to be installed as below.
$ cd sceptre/custom_hooks/sceptre_s3_packager && pip install . --user
Provisioning in the Source Account
Follow the below steps to provision the AWS infrastructure in the source account:
1. Configure the AWS profile locally with a profile name equal to the profile parameter in the config.yaml file under sceptre/config/prod/. The profile parameter locks the AWS profile name in order to avoid provisioning in the wrong account.
2. Make sure all required parameters are passed in all the config files within the sceptre/config/prod/ directory, and that all the prerequisites mentioned above are applied before provisioning.
3. Traverse to the sceptre/ directory and issue the sceptre launch command as below to start provisioning the infrastructure for the solution. With Sceptre, it isn't necessary to be inside, or to target, the config/ directory in the command; Sceptre targets the config directory by itself. The sceptre launch command can provision multiple stacks and can also update existing stacks if anything has changed.
$ sceptre launch prod --yes
4. Targeting the prod directory (without the trailing slash) provisions all the required resources with the sceptre launch command, and Sceptre handles the dependencies itself in terms of which resource needs to be provisioned first.
5. (Optional) To provision individual resources, it is best to target individual config files as below to control the provisioning.
$ sceptre launch prod/rds-snapshot-automation/take-snapshot.yaml --yes
(or)
$ sceptre create prod/rds-snapshot-automation/take-snapshot.yaml --yes
6. Note: Other AWS resources that will be provisioned to support the environment are:
S3 Bucket: To package lambda code and deploy from it.
KMS Key: Source KMS key to encrypt the snapshot.
SNS Topic: To alert the failed status of step functions.
If the details of these resources can be provided externally, their provisioning can be skipped by targeting just the rds-snapshot-automation/ directory under sceptre/config/prod/, as below.
$ sceptre launch prod/rds-snapshot-automation --yes
7. We now have the below infrastructure provisioned in the source account.
Provisioning in the Destination Account
Follow the below steps to provision the AWS infrastructure in the destination account:
1. Configure the AWS profile locally with a profile name equal to the profile parameter in the config.yaml file under sceptre/config/drprod/. The profile parameter locks the AWS profile name in order to avoid provisioning in the wrong account.
2. Make sure all required parameters are passed in all the config files within the sceptre/config/drprod/ directory, and that all the prerequisites mentioned above are applied before provisioning.
3. Traverse to the sceptre/ directory and issue the sceptre launch command as below to start provisioning the infrastructure for the solution. With Sceptre, it isn't necessary to be inside, or to target, the config/ directory in the command; Sceptre targets the config directory by itself. The sceptre launch command can provision multiple stacks and can also update existing stacks if anything has changed.
$ sceptre launch drprod --yes
4. Targeting the drprod directory (without the trailing slash) provisions all the required resources with the sceptre launch command, and Sceptre handles the dependencies itself in terms of which resource needs to be provisioned first.
5. (Optional) To provision individual resources, it is best to target individual config files as below to control the provisioning.
$ sceptre launch drprod/rds-snapshot-automation/copy-snapshot.yaml --yes
(or)
$ sceptre create drprod/rds-snapshot-automation/copy-snapshot.yaml --yes
6. Note: Other AWS resources that will be provisioned to support the environment are:
S3 Bucket: To package lambda code and deploy from it.
KMS Key: Destination KMS key to encrypt the copied snapshot.
SNS Topic: To alert the failed status of step functions.
If the details of these resources can be provided externally, their provisioning can be skipped by targeting just the rds-snapshot-automation/ directory under sceptre/config/drprod/, as below.
$ sceptre launch drprod/rds-snapshot-automation --yes
7. We now have the below infrastructure provisioned in the destination account.
Clean Up
In the Source Account
1. Configure the AWS profile locally with a profile name equal to the profile parameter in the config.yaml file under sceptre/config/prod/. The profile parameter locks the AWS profile name in order to avoid operating on the wrong account.
2. To destroy the environment cleanly, issue Sceptre's delete command for the stacks created, as below. More info on the delete command here.
$ sceptre delete prod --yes
The above command destroys all the CloudFormation stacks created under the prod directory; Sceptre is smart enough to handle the dependencies while deleting stacks as well. To control the destroy process, issue the delete command for each stack. For example,
$ sceptre delete prod/rds-snapshot-automation/take-snapshot.yaml --yes
$ sceptre delete prod/rds-snapshot-automation/share-snapshot.yaml --yes
$ sceptre delete prod/rds-snapshot-automation/delete-snapshot.yaml --yes
$ sceptre delete prod/kms/cross-account-key.yaml --yes
$ sceptre delete prod/s3/sceptre-function-code-bucket.yaml --yes
$ sceptre delete prod/s3/snstopics.yaml --yes
Note: When destroying individual stacks, make sure to maintain the dependency.
In the Destination Account
$ sceptre delete drprod --yes
The above command destroys all the CloudFormation stacks created under the drprod directory; Sceptre is smart enough to handle the dependencies while deleting stacks as well. To control the destroy process, issue the delete command for each stack. For example,
$ sceptre delete drprod/rds-snapshot-automation/copy-snapshot.yaml --yes
$ sceptre delete drprod/rds-snapshot-automation/delete-snapshot.yaml --yes
$ sceptre delete drprod/kms/cross-account-key.yaml --yes
$ sceptre delete drprod/s3/sceptre-function-code-bucket.yaml --yes
$ sceptre delete drprod/s3/snstopics.yaml --yes
Note: When destroying individual stacks, make sure to maintain the dependency.
Conclusion
Currently the solution is configured to support two AWS accounts. Further improvements could support a many-to-one relationship, where multiple source accounts (such as pilot, non-prod and prod) share snapshots with a single destination AWS account (the DR account) so that the data is managed in one central location, and a many-to-many relationship, where multiple source accounts share snapshots with multiple destination accounts (dr-prod, dr-non-prod, dr-pilot, etc.) so that the data is managed in its own dedicated DR account.