Reliability: Control Plane vs Data Plane

Following on from a previous post on reliability to meet business objectives, this blog focuses on the control plane and data plane in the context of reliability. Previously, we discussed Availability Zones (AZs) and utilising services across multiple AZs for maximum availability against unforeseen or unexpected outages. Understanding how AWS can fail helps in building solutions that are reliable and directly or indirectly improve the end-user experience of the services you provide. Control planes and data planes together contribute to overall reliability. Here we will focus on the areas of compute, storage and networking.

What are control and data planes?

Both of these terms originated from networking concepts. The data plane is where packets flow, allowing data to be dispersed over the network. The control plane refers to the policies, such as routing tables and security rules, that govern how data is transmitted over the network. With this in mind, modern use of the terms extends to other infrastructure, such as the distinction between storing data and the controls that govern it. Databases, for example, have to run on top of physical hosts (control plane), while the data itself must reside on a storage device (data plane). Using EC2 instances as an example, the workload runs on physical hosts, and the network interfaces have security groups and perhaps network ACLs attached (control plane), while data is stored on EBS, EFS, FSx or S3, or in database backends (data plane).
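To make the distinction concrete, here is a minimal sketch using boto3 (the AWS SDK for Python): the first call is a control-plane operation that changes what infrastructure exists, while the second is a data-plane operation that stores actual data. The AMI ID and bucket name are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")
s3 = boto3.client("s3", region_name="ap-southeast-2")

# Control plane: an API call that changes what infrastructure exists.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

# Data plane: the actual storage and movement of data.
s3.put_object(
    Bucket="example-data-bucket",     # hypothetical bucket
    Key="reports/daily.csv",
    Body=b"date,value\n2024-01-01,42\n",
)
```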

Control Plane

The control plane itself needs to be generally available. It's how we interact with services: APIs, CLIs, SSH and RDP, for example. The AWS console is itself a form of control plane, as are CloudFormation and services like EKS, where Kubernetes provides the control plane. When we think about services such as web services and applications, it's important to consider the control plane and the front end together: it's what allows customers and consumers to access the service or interact with it. To provide this reliably, we have to think about outages that may affect access and what events might cause those outages. This becomes your guide to the solutions you put in place to mitigate the scenarios you want to protect against. Examples might be tolerating an AZ outage, or allowing resources to scale so that demand spikes don't cause an unexpected outage. Think of the control plane as the compute resources.

Data Plane

The data plane is equally important, as it focuses on the storage of the data. It must be reliable, able to recover from corruption and deletion, and able to scale to meet the demands placed upon it. We must also consider performance so the data plane does not become a bottleneck. Designing and implementing a reliable data plane requires consideration of the placement of the storage medium, its availability, its performance in terms of IOPS, scaling for volume, and the method and frequency of backups. This extends not just to block, object and file storage but also to managed databases and how records are stored, maintained and accessed. All of this is supported by the network.
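As a small illustration of those considerations, the sketch below uses boto3 to provision an io2 volume with explicit IOPS and take a point-in-time snapshot of it. The AZ, size and IOPS figures are arbitrary examples, not recommendations.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Provision an io2 volume with explicit IOPS so the storage layer
# does not become the bottleneck for the workload placed on it.
volume = ec2.create_volume(
    AvailabilityZone="ap-southeast-2a",
    Size=500,        # GiB
    VolumeType="io2",
    Iops=10000,
)

# Take a point-in-time snapshot; snapshots are stored regionally,
# so they remain available even if the volume's AZ is lost.
ec2.create_snapshot(
    VolumeId=volume["VolumeId"],
    Description="Nightly snapshot for recovery testing",
)
```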

Network Design is Critical

Network design is crucial to ensure access is possible as data is moved and accessed and, in the event of a failure, that data can be restored and accessed if it is not available in its original location. A lot of emphasis is placed on inbound connectivity for web-based services. CloudFront, as an example, provides a caching layer that extends across multiple regions, extending the data plane beyond the origin region. This is supported by the AWS backbone network, which extends globally between regions. For inbound connectivity, it is important to recognise single points of failure or single-AZ availability. This extends to Direct Connect (a single connection is not physically redundant) and NAT Gateways (highly available within an AZ, but not across AZs).

For VPN connectivity, there are multiple options, each with its own considerations for availability. While most solutions provide redundant connections, third-party solutions are often built on EC2 instances and are therefore bound by the limitations of single-AZ availability (requiring the support of other infrastructure, such as load balancers, for high availability).

The main VPN connectivity options, and what each provides, are:

AWS Site-to-Site VPN: You can create an IPsec VPN connection between your VPC and your remote network. On the AWS side of the Site-to-Site VPN connection, a virtual private gateway or transit gateway provides two VPN endpoints (tunnels) for automatic failover. You configure your customer gateway device on the remote side of the Site-to-Site VPN connection.

AWS Client VPN: AWS Client VPN is a managed client-based VPN service that enables you to securely access your AWS resources or your on-premises network. With AWS Client VPN, you configure an endpoint to which your users can connect to establish a secure TLS VPN session. This enables clients to access resources in AWS or on-premises from any location using an OpenVPN-based VPN client.

AWS VPN CloudHub: If you have more than one remote network (for example, multiple branch offices), you can create multiple AWS Site-to-Site VPN connections via your virtual private gateway to enable communication between these networks.

Third-party software VPN appliance: You can create a VPN connection to your remote network by using an Amazon EC2 instance in your VPC that's running a third-party software VPN appliance. AWS does not provide or maintain third-party software VPN appliances; however, you can choose from a range of products provided by partners and open-source communities.
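As a rough illustration of the first option, the boto3 sketch below registers an on-premises customer gateway and creates a Site-to-Site VPN connection; AWS provisions two tunnel endpoints, and configuring both on the customer gateway device is what gives you automatic failover. The public IP and transit gateway ID are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Register the on-premises router as the customer gateway.
cgw = ec2.create_customer_gateway(
    BgpAsn=65000,
    PublicIp="203.0.113.10",  # hypothetical on-premises public IP
    Type="ipsec.1",
)

# Create the Site-to-Site VPN. AWS provisions two tunnel endpoints;
# configuring both on the customer gateway device gives automatic failover.
vpn = ec2.create_vpn_connection(
    CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
    Type="ipsec.1",
    TransitGatewayId="tgw-0123456789abcdef0",  # hypothetical transit gateway
)
print(vpn["VpnConnection"]["VpnConnectionId"])
```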

Cross-VPC communication via VPC peering connections is not bound by the limitations of traditional networking constructs. AWS uses the existing infrastructure of a VPC to create a VPC peering connection. A VPC peering connection is not a gateway or an AWS Site-to-Site VPN connection, and it does not rely on a separate piece of physical hardware. There is no single point of failure for communication or a bandwidth bottleneck.
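A minimal boto3 sketch of setting up a peering connection follows (with hypothetical VPC IDs); note that routes and security group rules still need to be added on both sides before traffic will flow.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Request a peering connection between two VPCs (same account and region here).
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-11111111111111111",      # hypothetical requester VPC
    PeerVpcId="vpc-22222222222222222",  # hypothetical accepter VPC
)

# The accepter side approves the request; routes and security group
# rules must still be added before traffic will flow.
ec2.accept_vpc_peering_connection(
    VpcPeeringConnectionId=peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]
)
```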

Transit Gateways are highly available within a region: when creating a transit gateway attachment to a VPC, you nominate a subnet in each AZ you want to use, and the attachment remains available across those AZs. It is important to note that transit gateways are regional, so if there is a need to fail over to another region for a given solution, multiple transit gateways (and peering between them) will be required.
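The sketch below shows what this looks like in boto3: the attachment is created with one subnet per AZ, so it keeps forwarding traffic if a single AZ fails. All IDs are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Attach a VPC to the transit gateway with one subnet per AZ, so the
# attachment keeps forwarding traffic if a single AZ fails.
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId="tgw-0123456789abcdef0",  # hypothetical IDs
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=[
        "subnet-aaaa1111",  # ap-southeast-2a
        "subnet-bbbb2222",  # ap-southeast-2b
        "subnet-cccc3333",  # ap-southeast-2c
    ],
)
```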

Control Plane Reliability

Most managed services have an option for Multi-AZ or can be fronted by Elastic Load Balancing via Network Load Balancers (NLBs) or Application Load Balancers (ALBs). An NLB is not burdened with application-layer inspection the way an ALB is; it relies on TCP/UDP for routing decisions and health checks, which makes it scale better than an ALB. The power of an ALB is being able to make deeper routing decisions and health checks, ensuring traffic is routed only to targets that are fully healthy. Load balancing is critical for frontend services like web services, allowing them to tolerate an AZ failure or other AZ-dependent failures. PrivateLink services are themselves built on NLBs.
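As an illustrative sketch (hypothetical names and IDs throughout), the boto3 calls below create an ALB spanning two AZs and a target group with an HTTP health check, so traffic is only routed to targets that respond as healthy.

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="ap-southeast-2")

# An ALB must span subnets in at least two AZs, so the frontend
# survives the loss of a single AZ.
elbv2.create_load_balancer(
    Name="web-frontend",  # hypothetical names and IDs throughout
    Type="application",
    Scheme="internet-facing",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
)

# Layer-7 health checks route traffic only to targets that return a
# healthy HTTP response, not just an open TCP port.
elbv2.create_target_group(
    Name="web-targets",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=15,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)
```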

To avoid overwhelming NAT gateways when accessing AWS service APIs from within or via your VPC, VPC Endpoints (Gateway and Interface) can be valuable, as they reach services directly over the AWS backbone. A NAT gateway overloads a single Elastic IP address by tracking each session with a unique source port, so every outbound connection to a public service API consumes a port. Once those ports are exhausted (AWS documents roughly 55,000 simultaneous connections to each unique destination), an additional Elastic IP address or NAT gateway is needed to handle the traffic. Offloading heavy AWS API workloads to VPC endpoints avoids this. Interface VPC endpoints should be placed across multiple AZs to provide redundancy.
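For example, the following boto3 sketch creates an interface endpoint for the Systems Manager API with a subnet in each of two AZs, keeping that API traffic off the NAT gateway entirely. The VPC, subnet and security group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# An interface endpoint for the Systems Manager API keeps this traffic
# off the NAT gateway; one subnet per AZ gives redundant endpoint ENIs.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",  # hypothetical IDs
    ServiceName="com.amazonaws.ap-southeast-2.ssm",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```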

API Gateways help scale incoming traffic and allow appropriate steering of traffic across backend targets. Supported by backend load-balancers, API Gateways can help to enhance the security and reliability of services by scaling for peak load and filtering out unwanted traffic before it hits the backend. 

The role of Route 53 in providing a reliable control plane is substantial. By checking the health of resources and redirecting traffic across services and environments, Route 53 provides the mechanisms to automatically recover from failures, provided there are healthy resources available for it to route traffic to. It is important to tune the Time-To-Live (TTL) of critical records and the frequency of health checks so that redirection to available resources aligns with your expected Recovery Time Objectives (RTO).
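A hedged sketch of what this looks like with boto3: a health check against a hypothetical primary endpoint and a PRIMARY failover record with a short TTL. A matching SECONDARY record pointing at the standby environment (not shown) completes the failover pair.

```python
import boto3

r53 = boto3.client("route53")

# Health check against the primary endpoint; keep the interval (and the
# record TTL below) short enough to meet the RTO for this service.
hc = r53.create_health_check(
    CallerReference="primary-web-hc-001",  # hypothetical values throughout
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "ResourcePath": "/healthz",
        "RequestInterval": 10,
        "FailureThreshold": 2,
    },
)

# PRIMARY failover record with a short TTL; a matching SECONDARY record
# (not shown) points at the standby environment.
r53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE12345",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "HealthCheckId": hc["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "primary.example.com"}],
            },
        }]
    },
)
```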

Utilising services such as AWS Shield, GuardDuty, CloudFront, and AWS Web Application Firewall (WAF) can help protect the reliability of web-facing applications and your AWS accounts from malicious activity. Protecting these services from common and uncommon exploits (via heuristics and anomaly detection) is a key factor in increasing their reliability. This has a flow-on effect to your customers as they reliably consume your services. Security of the control plane is a large topic, and this blog does not go into detail about using these services at scale; rather, it offers general guidance that these services can be used to secure and assist in the reliability of your control plane.

Data Plane Reliability

Creating a reliable data plane requires a deep understanding of the flow of data and ensuring the availability of data during an outage event. AWS Backup can be really useful here, as snapshots of data volumes and database transaction logs are durably stored on S3 across multiple AZs in a given region. This makes access to your snapshots highly available and, during an AZ outage, allows you to restore to a new volume in another AZ for recovery. To avoid the need for restoration, consider using EFS or FSx for persistent file storage that spans several AZs. It is important to note that when restoring an EFS file system from a backup, the contents are written to a recovery directory within the file system rather than overwriting the original file locations.
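As a recovery sketch, the boto3 snippet below finds the most recent snapshot of a (hypothetical) volume and restores it into a different AZ; this is the kind of step you would automate as part of an AZ-failure runbook.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Find the most recent snapshot of the affected (hypothetical) volume.
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "volume-id", "Values": ["vol-0123456789abcdef0"]}],
)
latest = max(snapshots["Snapshots"], key=lambda s: s["StartTime"])

# Restore it into a surviving AZ; the new volume can then be attached
# to a replacement instance in that AZ.
ec2.create_volume(
    SnapshotId=latest["SnapshotId"],
    AvailabilityZone="ap-southeast-2b",
    VolumeType="gp3",
)
```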

Ensuring the availability of your data plane requires creating resources across AZs. A common misunderstanding when learning about EBS Multi-Attach is that it permits multiple instances to mount a shared EBS volume across AZs. This is not the case, as an EBS volume is isolated to a single AZ. EBS Multi-Attach is intended for clustered workloads such as High Performance Computing (HPC), where a compute cluster crunching large datasets needs shared access to a single volume within one AZ (using a cluster-aware file system to coordinate writes). When using EBS and FSx volumes, it is important to understand that these are provisioned storage solutions (that is, they are created at a given size). EBS volumes can now be scaled up in size without creating a new volume (which wasn't always the case). With FSx, depending on the file system type, increasing capacity may mean updating the storage capacity or restoring a backup to a new, larger file system mounted at the same mount point. A benefit of EFS is that it scales on demand without the need to pre-provision the size of the file system.
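The in-place resize looks roughly like this in boto3 (hypothetical volume ID); after the modification completes you still extend the file system from within the operating system.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Elastic Volumes: grow an attached EBS volume in place; no new volume
# or restore is required.
ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
    Size=1000,                         # new size in GiB
)

# Wait for the modification to reach "optimizing" or "completed",
# then extend the file system from within the operating system.
mods = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(mods["VolumesModifications"][0]["ModificationState"])
```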

Considering databases, it is important to use read replicas, which can be promoted to read/write capability during an AZ outage, or to utilise Multi-AZ clusters on those database engines that support them. As there could be scenarios where the frontend and backend are independently affected by an outage, implementing health checks, DNS and networking failover is crucial to allowing continued service delivery if individual components of the stack fail. Event-based automation can be supported through Lambda and API integrations into services like Systems Manager Run Command and Automation documents to help with rapid recovery; however, this relies on organisations actually using those capabilities (many do not, or are not aware of them).
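A simplified sketch of both ideas using boto3: promoting a (hypothetical) read replica and then using Systems Manager Run Command to repoint the application tier. The tag, document and script names are illustrative assumptions, not a prescribed pattern.

```python
import boto3

rds = boto3.client("rds", region_name="ap-southeast-2")
ssm = boto3.client("ssm", region_name="ap-southeast-2")

# If the primary's AZ is lost, promote the replica to read/write.
rds.promote_read_replica(
    DBInstanceIdentifier="app-db-replica"  # hypothetical replica name
)

# Event-driven recovery: run a Systems Manager command against the app
# tier, e.g. to repoint connection strings at the promoted database.
ssm.send_command(
    Targets=[{"Key": "tag:Role", "Values": ["app-server"]}],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["/opt/app/bin/update-db-endpoint.sh"]},  # hypothetical script
)
```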

Conclusion

Consider the failure scenarios that could impact your applications, and utilise tools such as Application Discovery Service or AWS Migration Hub to better understand your applications and their upstream/downstream dependencies. Consider the application stack, the outages that may affect individual components, and how to recover from individual component failures. Implement robust security and take advantage of services with Multi-AZ capability to protect the control plane. Build a reliable data plane for persistent storage, and consider replication or native consistency models to provide robust connectivity across AZs to critical data. Protect against data corruption with regular snapshots and/or the use of version control where supported. Consider log shipping, or the use of modern cloud-native managed databases and robust backup scheduling, to allow for rapid point-in-time recovery of systems.

This article has dived deep into the differences between the control plane and the data plane to provide perspective and guidance on creating reliable cloud solutions. For further information, please reach out to the Cevo Australia team so we can build reliable solutions together.
