Zero downtime RDS Migration



I recently learned about some amazing AWS tech that solved one of our customer’s biggest problems quickly, cost-efficiently and space-efficiently. I wanted to make a small contribution back to the community by sharing it with all you lovely people. I hope it helps someone, somewhere in the universe.

 

Background

While reviewing the AWS stack of one of our customers, we noticed that one of the RDS databases was publicly accessible. Our first reaction was OMG. As consultants, it was our duty to raise the alarm. The customer understood the gravity of the situation: an RDS database reachable from the internet is a security threat, and anyone who got in could misuse the data. Thankfully the database was still password protected, so no one outside the organisation could get in without credentials.

To fix the issue urgently, we tried making the RDS database private, but that broke the application’s connectivity to it. On further analysis, we realised that the ECS cluster and RDS were deployed in different VPCs, so the ECS resources were talking to RDS over the internet. Once RDS was private, they could no longer reach it.

As the diagram below shows, the ECS cluster and services were deployed in the ECS cluster VPC, while RDS was deployed in a separate RDS VPC.

[Diagram: ECS cluster and services in the ECS cluster VPC, RDS in a separate RDS VPC]


So the issue could not be fixed simply by switching RDS from public to private. We needed to migrate RDS into the ECS VPC, so that both sets of resources sat in the same VPC and ECS could connect to RDS over internal endpoints rather than over the internet.

[Diagram: RDS moved into the ECS VPC alongside the ECS cluster]

The question then becomes how to move RDS to a new VPC. There are three ways to do it:

  • Create a clone in a different VPC.
  • Take a snapshot and then restore the snapshot in a different VPC.
  • Set up replication using binary logging (MySQL only).

Let’s talk about these options in a bit more detail.


Create a clone in a different VPC

Here comes the AWS magic. Fortunately, the customer was running an Amazon Aurora DB cluster compatible with MySQL 5.6, and AWS provides some pretty cool tech for cloning an Aurora DB cluster.

 

How Aurora Cloning works

Aurora cloning works at the storage layer of an Aurora DB cluster. It uses a copy-on-write protocol that’s both fast and space-efficient in terms of the underlying durable media supporting the Aurora storage volume.

 

Understanding the copy-on-write protocol

An Aurora DB cluster stores data in pages in the underlying Aurora storage volume.

For example, in the following diagram you can find an Aurora DB cluster (A) that has four data pages, 1, 2, 3, and 4. Imagine that a clone, B, is created from the Aurora DB cluster. When the clone is created, no data is copied. Rather, the clone points to the same set of pages as the source Aurora DB cluster.

[Diagram: source cluster A and clone B both pointing to pages 1, 2, 3, and 4]

When the clone is created, no additional storage is usually needed. The copy-on-write protocol uses the same segment on the physical storage media as the source segment. Additional storage is required only if the capacity of the source segment isn’t sufficient for the entire clone segment. If that’s the case, the source segment is copied to another physical device.

The following diagrams show the copy-on-write protocol in action, using the same cluster A and its clone B as above. Let’s say that you make a change to your Aurora DB cluster (A) that results in a change to data held on page 1. Instead of writing to the original page 1, Aurora creates a new page 1[A]. The Aurora DB cluster volume for cluster (A) now points to pages 1[A], 2, 3, and 4, while the clone (B) still references the original pages.

[Diagram: cluster A pointing to pages 1[A], 2, 3, and 4; clone B pointing to pages 1, 2, 3, and 4]

On the clone, a change is made to page 4 on the storage volume. Instead of writing to the original page 4, Aurora creates a new page, 4[B]. The clone now points to pages 1, 2, 3, and to page 4[B], while the cluster (A) continues pointing to 1[A], 2, 3, and 4.

[Diagram: cluster A pointing to pages 1[A], 2, 3, and 4; clone B pointing to pages 1, 2, 3, and 4[B]]

As more changes occur over time in both the source Aurora DB cluster volume and the clone, more storage is needed to capture and store the changes.
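To make the pointer behaviour concrete, here is a toy Python model of the example above. It is only a sketch: the `Storage` and `Cluster` classes are made up for illustration, and it simplifies by allocating a new page on every write, which is not necessarily how Aurora behaves internally.

```python
class Storage:
    """Shared page store: maps a physical page id to its contents."""

    def __init__(self):
        self.pages = {}
        self._next_id = 0

    def allocate(self, data):
        page_id = self._next_id
        self._next_id += 1
        self.pages[page_id] = data
        return page_id


class Cluster:
    """A cluster volume modelled as a table of pointers into shared storage."""

    def __init__(self, storage, page_table):
        self.storage = storage
        # logical page number -> physical page id
        self.page_table = dict(page_table)

    def clone(self):
        # No data is copied: the clone starts with the same pointers.
        return Cluster(self.storage, self.page_table)

    def read(self, page_no):
        return self.storage.pages[self.page_table[page_no]]

    def write(self, page_no, data):
        # Copy-on-write (simplified): never overwrite the existing page,
        # allocate a new one and repoint only this cluster's table.
        self.page_table[page_no] = self.storage.allocate(data)


storage = Storage()
a = Cluster(storage, {n: storage.allocate(f"page {n}") for n in (1, 2, 3, 4)})
b = a.clone()
pages_after_clone = len(storage.pages)  # cloning allocated nothing new

a.write(1, "page 1[A]")  # source changes page 1
b.write(4, "page 4[B]")  # clone changes page 4
pages_after_writes = len(storage.pages)
```

After the two writes, A and B still share pages 2 and 3, and only the two changed pages take up extra storage.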

Using the above method we were able to migrate the RDS to a new VPC with just a few clicks. 
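We did this through the console, but the same clone can be requested from code. The sketch below assumes boto3 and uses the RDS `restore_db_cluster_to_point_in_time` call with `RestoreType` set to `copy-on-write`, followed by `create_db_instance`. All identifiers, the subnet group, the security group and the instance class are placeholders, and `"aurora"` is assumed as the engine name for a MySQL 5.6-compatible cluster.

```python
def build_clone_request(source_cluster_id, clone_cluster_id,
                        subnet_group, security_group_ids):
    """Keyword arguments asking RDS for a copy-on-write clone."""
    return {
        "SourceDBClusterIdentifier": source_cluster_id,
        "DBClusterIdentifier": clone_cluster_id,
        "RestoreType": "copy-on-write",     # clone, not a full copy
        "UseLatestRestorableTime": True,
        "DBSubnetGroupName": subnet_group,  # subnet group in the target VPC
        "VpcSecurityGroupIds": list(security_group_ids),
    }


def build_instance_request(clone_cluster_id, instance_id, instance_class,
                           engine="aurora"):
    """Keyword arguments for the first DB instance in the cloned cluster."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBClusterIdentifier": clone_cluster_id,
        "DBInstanceClass": instance_class,
        "Engine": engine,             # assumption: MySQL 5.6-compatible Aurora
        "PubliclyAccessible": False,  # the whole point of the move
    }


def clone_cluster(rds, clone_request, instance_request):
    """Create the clone, add an instance, and wait until it is available."""
    rds.restore_db_cluster_to_point_in_time(**clone_request)
    rds.create_db_instance(**instance_request)
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=instance_request["DBInstanceIdentifier"])


# Example usage (placeholder names, requires boto3 and AWS credentials):
#   import boto3
#   rds = boto3.client("rds")
#   clone_cluster(
#       rds,
#       build_clone_request("source-cluster", "clone-cluster",
#                           "ecs-vpc-subnet-group", ["sg-0123456789abcdef0"]),
#       build_instance_request("clone-cluster", "clone-instance-1",
#                              "db.r5.large"),
#   )
```

The clone is a new cluster with its own endpoint, so the application still has to be pointed at it once it is available.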

Cloning is super cool tech, and it let us move RDS with very little effort. However, it does have some limitations.


 

Limitations of Aurora cloning

Aurora cloning currently has the following limitations:

  • You can’t create a clone in a different AWS Region than the source Aurora DB cluster.
  • You can’t create an Aurora Serverless v1 clone from an unencrypted provisioned Aurora DB cluster.
  • You can’t create an Aurora Serverless v1 clone from a MySQL 5.6-compatible provisioned cluster, or a provisioned clone of a MySQL 5.6-compatible Aurora Serverless v1 cluster.
  • You can’t create more than 15 clones based on a copy or based on another clone. After creating 15 clones, you can create copies only. However, you can create up to 15 clones of each copy.
  • You can’t create a clone from an Aurora DB cluster without the parallel query feature to a cluster that uses parallel query. To bring data into a cluster that uses parallel query, create a snapshot of the original cluster and restore it to the cluster that’s using the parallel query feature.
  • You can’t create a clone from an Aurora DB cluster that has no DB instances. You can only clone Aurora DB clusters that have at least one DB instance.
  • You can create a clone in a different virtual private cloud (VPC) than that of the Aurora DB cluster. If you do, the subnets of the VPCs must map to the same Availability Zones.
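The Availability Zone requirement in the last point is easy to check before you attempt a cross-VPC clone. The sketch below works on entries shaped like the `DBSubnetGroups` items returned by boto3's `describe_db_subnet_groups`. Treating "map to the same Availability Zones" as "the two subnet groups cover exactly the same set of zones" is my assumption, and the function names are made up for illustration.

```python
def subnet_group_zones(subnet_group):
    """Availability Zone names covered by one DB subnet group description."""
    return {
        subnet["SubnetAvailabilityZone"]["Name"]
        for subnet in subnet_group["Subnets"]
    }


def zone_mismatch(source_group, target_group):
    """Return (missing_in_target, extra_in_target); both empty means a match."""
    source = subnet_group_zones(source_group)
    target = subnet_group_zones(target_group)
    return source - target, target - source


# Hand-written sample data in the shape boto3 returns (placeholder values).
source_group = {"Subnets": [
    {"SubnetAvailabilityZone": {"Name": "eu-west-1a"}},
    {"SubnetAvailabilityZone": {"Name": "eu-west-1b"}},
]}
target_group = {"Subnets": [
    {"SubnetAvailabilityZone": {"Name": "eu-west-1a"}},
    {"SubnetAvailabilityZone": {"Name": "eu-west-1c"}},
]}

missing, extra = zone_mismatch(source_group, target_group)
```

With this sample data the target subnet group lacks `eu-west-1b` and adds `eu-west-1c`, so it would need adjusting before cloning.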


Restoring from a DB cluster snapshot

If any of the limitations above apply to you, you can instead move RDS to the new VPC by taking a DB cluster snapshot and restoring it there.

Amazon RDS creates a storage volume snapshot of your DB cluster, backing up the entire DB instance and not just individual databases. You can create a DB cluster by restoring from this DB cluster snapshot. When you restore the DB cluster, you provide the name of the DB cluster snapshot to restore from and then provide a name for the new DB cluster that is created from the restore. You can’t restore from a DB cluster snapshot to an existing DB cluster; a new DB cluster is created when you restore.
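As a rough sketch of the snapshot route, assuming boto3: take a cluster snapshot with `create_db_cluster_snapshot`, wait for it, then call `restore_db_cluster_from_snapshot` with a subnet group and security groups from the target VPC. The identifiers are placeholders, and `"aurora"` is again assumed as the engine name for a MySQL 5.6-compatible cluster.

```python
def build_snapshot_request(source_cluster_id, snapshot_id):
    """Keyword arguments for taking a manual DB cluster snapshot."""
    return {
        "DBClusterIdentifier": source_cluster_id,
        "DBClusterSnapshotIdentifier": snapshot_id,
    }


def build_restore_request(snapshot_id, new_cluster_id, subnet_group,
                          security_group_ids, engine="aurora"):
    """Keyword arguments for restoring the snapshot as a NEW cluster."""
    return {
        "SnapshotIdentifier": snapshot_id,
        "DBClusterIdentifier": new_cluster_id,  # must not already exist
        "Engine": engine,                       # assumption: MySQL 5.6-compatible
        "DBSubnetGroupName": subnet_group,      # subnet group in the target VPC
        "VpcSecurityGroupIds": list(security_group_ids),
    }


def restore_into_vpc(rds, snapshot_request, restore_request):
    """Snapshot the source cluster, wait, then restore into the target VPC."""
    rds.create_db_cluster_snapshot(**snapshot_request)
    waiter = rds.get_waiter("db_cluster_snapshot_available")
    waiter.wait(
        DBClusterSnapshotIdentifier=snapshot_request["DBClusterSnapshotIdentifier"]
    )
    rds.restore_db_cluster_from_snapshot(**restore_request)


# Example usage (placeholder names, requires boto3 and AWS credentials):
#   import boto3
#   restore_into_vpc(
#       boto3.client("rds"),
#       build_snapshot_request("source-cluster", "pre-move-snapshot"),
#       build_restore_request("pre-move-snapshot", "restored-cluster",
#                             "ecs-vpc-subnet-group", ["sg-0123456789abcdef0"]),
#   )
```

The restored cluster has no DB instances yet, so you still need to add at least one before the application can connect. Unlike a clone, the restore copies the data, so the time it takes depends on the size of the database.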

 

Set up replication using binary logging (MySQL only)

  1. Create a new Aurora cluster in the target VPC.
  2. Set up manual MySQL replication between the two Aurora clusters.
  3. Promote the replica to be a standalone cluster.
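Steps 2 and 3 come down to a handful of stored procedure calls on the new cluster. The sketch below only builds the SQL text. It assumes the RDS-provided procedures `mysql.rds_set_external_master`, `mysql.rds_start_replication`, `mysql.rds_stop_replication` and `mysql.rds_reset_external_master` are available on your engine version, that binary logging is already enabled on the source, and that you know the binlog file and position to start from. The host, user, password and binlog values are placeholders.

```python
def sql_literal(value):
    """Quote a value as a MySQL string literal (minimal escaping, sketch only)."""
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"


def start_replication_sql(source_host, repl_user, repl_password,
                          binlog_file, binlog_pos, port=3306, ssl=0):
    """Statements to run on the NEW cluster so it replicates from the source."""
    args = ", ".join([
        sql_literal(source_host),
        str(int(port)),
        sql_literal(repl_user),
        sql_literal(repl_password),
        sql_literal(binlog_file),
        str(int(binlog_pos)),
        str(int(ssl)),
    ])
    return [
        f"CALL mysql.rds_set_external_master({args});",
        "CALL mysql.rds_start_replication;",
    ]


def promote_sql():
    """Statements to run on the NEW cluster once it has caught up."""
    return [
        "CALL mysql.rds_stop_replication;",
        "CALL mysql.rds_reset_external_master;",
    ]


statements = start_replication_sql(
    "source-cluster.example.internal",  # placeholder source endpoint
    "repl_user", "repl_password",       # placeholder replication credentials
    "mysql-bin-changelog.000002", 154,  # placeholder binlog coordinates
)
```

Stop writes to the source and let the replica catch up before running the promotion statements, otherwise the two clusters will diverge.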


     

The cloning method

After evaluating the options above, we decided to go with cloning. Aurora’s distributed storage engine allows us to do things that are normally not feasible or cost-effective with a traditional database engine. By creating pointers to individual pages of data, the storage engine enables fast database cloning. Then, when you make changes to the data in the source or the clone, a copy-on-write protocol creates a new copy of that page and updates the pointers. This means my 2 TB snapshot restore job that used to take an hour is now ready in about 5 minutes, and most of that time is spent provisioning a new RDS instance.

The time it takes to create the clone is independent of the size of the database since we’re pointing at the same storage. It also makes cloning a very cost-effective operation since users only have to pay storage costs for the changed pages instead of an entire copy. The database clone is still a regular Aurora Database Cluster with all the same durability guarantees.

 

Conclusion

Aurora cloning is a cost-effective, quick and space-efficient way of moving an Aurora DB cluster from one VPC to another with just a few clicks. The other options, such as restoring a snapshot or using DMS, involve manual steps to set up the new cluster. Because cloning uses the same storage in the backend and simply points the new cluster at it, it reduces downtime and cost and avoids those manual steps.