re:Invent 2020 Wrap-Up: Infrastructure Keynote with Peter DeSantis


In past years this keynote has been the “Monday night special”, where we get to delve into some of the really interesting “under the covers” bits of AWS, to see more about how the sausage is made and what shiny new sausage parts are available. If you’re interested in the lower levels of the cloud, this is the keynote for you, and I’m here to sum it up.

Let’s get stuck in!

We begin with a review of how AWS has enabled customers to move faster (increases in speed of 10, 20, 100 times for regular business processes) before the dancefloor opens and Peter DeSantis and his mighty beard appear.

 

How AWS Operates

Can you be like AWS? Can you learn from how they work? Well, Peter hopes so — but he wants you to be aware that there’s no compression algorithm for experience (a phrase that those of you who’ve seen re:Invent talks before will have heard often).

AWS spends a lot of time anticipating failures, and building systems and services to manage those inevitable events — this is because, at the scale of AWS, “everything fails, all the time”.

Peter describes two axes for understanding failure. The first is the “blast radius”: when something fails, how many other things depend on it and will be affected? The second is the complexity of the component, which affects how likely it is to fail, how difficult the problem is to discover, and how long it then takes to fix.

Therefore, adding redundancy to a given design doesn’t always make a system more reliable if the added components bring additional complexity. As an example, AWS powers its compute hardware with two completely independent and fairly simple power systems; if one fails, it can’t affect the other, and both have to fail concurrently to take down an environment. With this approach, the availability of your average AWS data centre sits at 99.99997% (which means roughly 9 seconds of downtime per year).
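If you want to sanity-check that downtime figure, here is a minimal Python sketch of the arithmetic. The function names are my own, and the 0.999 single-system availability in the second call is a made-up number purely for illustration (the keynote gave no per-system figure). The second function also assumes the two systems fail completely independently, which is the whole point of the design described above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def parallel_availability(single_availability: float, copies: int = 2) -> float:
    """Availability of independent redundant systems that must ALL fail at once.

    Takes and returns a fraction between 0 and 1, not a percentage.
    Assumes failures are statistically independent.
    """
    return 1.0 - (1.0 - single_availability) ** copies


# The figure quoted in the keynote: a little under ten seconds a year.
print(round(downtime_seconds_per_year(99.99997), 1))

# Hypothetical: two independent systems that are each 99.9% available.
print(parallel_availability(0.999))
```

The independence assumption is doing the heavy lifting here: if the two systems share a complex component, the squared term no longer applies, which is exactly why added redundancy can fail to add reliability.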

Next, AWS takes action to reduce complexity in system components. They can do this partly because of the scale they have — it makes sense to design a lot of their own hardware components, because they don’t have to cater for “all options”. The example is replacing large, complex power-handling UPSes for entire data centres with multiple, simple redundant rack-level batteries.

 

Availability Zones and Regions

If you’ve done any reading about the AWS infrastructure, you know that Availability Zones are a key part of how they operate: completely separated, redundant data centre environments that are “close” to each other in speed-of-light and latency terms, to allow for synchronous replication. A Region comprises at least 3 Availability Zones (and it may interest you to know that an Availability Zone is not a single data centre — for example, the Sydney region has at least 8 actual physical data centre environments). Availability Zones need to be within hundreds of microseconds of each other, leading to the design that we all use and take for granted with AWS today.
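To get a feel for what “hundreds of microseconds” means physically, here is a rough Python sketch that turns a one-way latency budget into an upper bound on fibre length. The refractive index of 1.47 is my assumption for typical silica fibre, the function name is hypothetical, and the latency budgets are examples only. Real paths are shorter than this bound because switching and routing equipment also eat into the budget.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in a vacuum
FIBRE_REFRACTIVE_INDEX = 1.47          # assumption: typical silica fibre


def max_fibre_km(one_way_latency_us: float) -> float:
    """Upper bound on fibre path length for a one-way latency budget (microseconds)."""
    speed_in_fibre = SPEED_OF_LIGHT_KM_PER_S / FIBRE_REFRACTIVE_INDEX  # km/s
    return speed_in_fibre * one_way_latency_us / 1_000_000


# Example budgets only; these are not AWS-published numbers.
for budget_us in (100, 300, 500):
    print(f"{budget_us} us -> at most {max_fibre_km(budget_us):.0f} km of fibre")
```

Light covers roughly 20 km of fibre per 100 microseconds, so a budget of a few hundred microseconds keeps zones within tens of kilometres of each other by cable.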

Contrast this with Google and Azure, whose zones “usually” have isolated control plane, power, cooling and networking, or who offer separate availability zones only in “select regions” (and whose availability zones may actually sit in the same physical data centre). AWS designs and builds AZs the same way, in every region, for every customer.

At the Region level, each region is completely isolated from the others (with the exception of global services like IAM).

 

Melbourne Region!

In case you hadn’t heard, Australia gets a second region (sometime in 2022) — Melbourne! This will really assist customers who want multi-region availability but have to consider data tenancy to ensure that their customer data remains on-shore. 

 

Supply Chain Redundancy

AWS takes care to ensure that they not only have physical redundancy in their existing environments, but that they can continue to source and deploy new capacity so that customers can continue to scale and grow: since 2015, they’ve nearly tripled the number of suppliers for critical components to reduce the chance of supply bottlenecks.

Contrast this with the Azure region in Sydney where, earlier this year, I went to sign up for an account in order to launch some test workloads for a customer. I was told that, unless I could commit to a certain level of spend for a year, there was no capacity for me. That’s just not a great customer experience at all.

 

Chip Design and Custom Silicon

 

Nitro

AWS works at such scale that they can (and do) design their own silicon that meets the exact needs of their (and our) workloads. Because of the scale, they can also afford to invest across the entire stack from the silicon up through the hypervisor to the operating layer on top of that. One example is the Nitro chip, which not only enables strong segregation of customer workloads on existing x86 silicon, but has also been used for the same purpose on the new Apple Mac instances with zero changes to the Mac Mini hardware.

Here’s a shot of the Mac Mini in its tray in an AWS data centre. If you look at the bottom of the photo, you can just see the edge of the board with the Nitro controller on it, which connects to the Mac via Thunderbolt 3 and enables all the usual features of an EC2 instance: access to EBS volumes, multiple ENIs, and all other standard AWS services, with zero cost to the workloads running on the Mini.

Nitro 4 chips enable things like the c6gn EC2 instance with up to 100 Gbps network bandwidth, and up to 38 Gbps EBS bandwidth at utterly magnificently low levels of latency and jitter.

 

Machine Learning

Inferentia is the custom AWS silicon for machine learning inference, and allows up to 45% lower cost per inference compared to off-the-shelf GPUs. Then there is the somewhat dubiously named “Trainium”, which will deliver lower-cost training of new machine learning models when it’s actually available.

 

Graviton2

ARM is the new hotness, and is posing a real threat to the incumbent x86 architecture. You’re probably familiar with the Raspberry Pi, and the new Apple M1 chip also uses an ARM architecture, as do the original Graviton and the new Graviton2 processors.

Well-architected cloud-based workloads take advantage of the scale-out paradigm, rather than scale-up: you do more work by working with multiple small units in parallel, rather than gigantic big-bang single tasks. This informed the design of Graviton2, which steers away from the “hyperthreading” approach common on x86 CPUs. Instead, it provides a more secure, predictable workload experience by giving each core on the die its own completely independent local caches (64KB of L1 cache and 1MB of L2 cache per core).

By comparison, C5 instance hardware has about 25 actual cores available, while a C6g has all 64 of its cores fully available for workloads. On price-performance, the largest size m6g.16xlarge instances are nearly 60% better on cost against the closest m5.24xlarge, with 20% better absolute performance.

 

Sustainability

AWS recently announced the Climate Pledge, which commits companies to achieving net-zero carbon emissions by 2040, 10 years ahead of the Paris Agreement. This isn’t just them: 30 more companies have signed the Climate Pledge as well, including some big ones like Verizon, Siemens, and Unilever.

You can’t just sign a pledge though (that doesn’t do anything), so AWS is focussing on two approaches to reduce energy consumption: reduce waste, and use renewable resources. By removing centralised UPSes from their data centre design, they’ve saved 35% in energy conversion losses, and by adopting Graviton2 you can get 2-3.5 times better performance-per-watt than any other CPU available on AWS (I can tell you, I’m moving as many of my workloads to Graviton2 as I can). Overall, the average AWS data centre is almost 88% more carbon-efficient than a standard enterprise data centre.

AWS is also building large-scale wind and solar plants to power their data centre environments. Prior to 2019, they built 1,036 MW of capacity in the USA; in 2019, 14 new wind and solar projects across the USA, Europe and Australia raised that amount to 2,348 MW; and in 2020 another 3,400 MW of capacity was added, bringing AWS to a staggering 6,500 MW of renewable energy capacity worldwide. At this point, AWS believes they will be 100% renewable by 2025.

They’re also working to drive demand for more sustainable concrete in building their data centres, hoping to reduce the embodied carbon in a data centre by up to 20%, and they are pursuing other reductions through investments from the US$2bn Climate Pledge Fund.

 

In Summary

I love hearing about all the things that AWS can do because of their scale; but it’s not just about doing cool stuff, because you can do cool things anywhere — doing them well is the hard thing, and there really isn’t another cloud provider that does as much as they do, as well as they do. When I think of “cloud first”, this is what it means to me: driving efficiency and reliability at scale, for the benefit of all, from the lowest level of the hardware to the highest abstractions of the services. There really is none like it.