re:Invent 2020 Wrap-Up: Werner Vogels Keynote

And so we come to the final and, for many, most anticipated keynote: that of Werner (“guess my t-shirt”) Vogels, CTO of Amazon.com. A perennial favourite for those more interested in the builder and DevOps parts of the ecosystem, this is the keynote I live-blog for the highlights so you don’t have to watch it (although I recommend watching it anyway; there’s always nuance, customer stories, and occasionally a live demo that reading a blog post won’t give you).

DeepRacer League

First we start with the DeepRacer final race (if you haven’t heard about DeepRacer and the DeepRacer League, check out https://aws.amazon.com/deepracer/league/ ). Eight finalists from around the world have trained their virtual DeepRacer models to race against each other on a virtual track. It’s a gripping race, but I won’t spoil it for you: check out the winner (and a replay of the race) on YouTube. The most amazing thing about it was just how well the machine-learning models performed: watching those cars zoom around the virtual track, you could see how they were picking racing lines and optimising their performance lap after lap; very interesting.

The Keynote

And now we’re off to Werner, starting with a timelapse cycle around Amsterdam (a truly lovely city) and setting the stage for a keynote that links history, commerce, and technology, set in “Sugar City”, a former sugar factory just outside Amsterdam. We’re reintroduced to the Snowcone, a ruggedised and portable compute and storage unit from AWS that can be used to push IoT control into remote (and harsh) environments, for uses such as process control.

He talks about how AWS and Amazon have been able to survive as a totally distributed workplace through 2020: by using distributed, product-focussed, autonomous teams that are goal-driven rather than directed. There’s no firm expectation of a wholesale return to centralised offices, and it seems likely that heavy business travel and fully centralised office environments won’t return to their pre-2020 levels.

Then we hear from Lea von Bidder, CEO and co-founder of Ava, a company focusing on women’s health, about how they’re combining AI/ML with clinical research to provide better outcomes for women’s reproductive health, and how they were able to pivot their personal-health-data service to help detect symptoms of COVID-19 in their users and alert them to get tested. It’s an excellent example of technology combined with massive data sets being used to serve unanticipated needs.

Developer Tools

If you haven’t encountered Cloud9 yet, it’s an in-browser, developer-focused IDE; you can work collaboratively with others, and it has tight integration with many AWS services as well as built-in pipeline concepts for rapid delivery. It’s also an excellent way to address some of the challenges posed by tightly locked-down corporate devices, since all the tools and services you need to be an effective developer are right there. You can try it out via the Cloud9 service in the AWS console.

If you’ve been wanting the AWS CLI but don’t have the local capacity (or time) to set it up and manage it, the announcement of AWS CloudShell could be what you’re looking for: a Linux shell in the console, running with the same permissions as the identity you’re signed in to. This brings AWS up to date with Google Cloud Shell, and is a welcome addition. It comes with a bunch of other tools pre-installed, and up to 1GB of persistent home-directory storage that lasts between sessions.
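
To give a flavour of what “same permissions as the console” means in practice, here’s a minimal sketch in Python. It assumes boto3 is available in CloudShell’s Python runtime (if it isn’t, a quick pip3 install --user boto3 gets you there); no keys or profiles are configured anywhere in the script.

```python
# A minimal sketch to run inside AWS CloudShell.
import boto3

# CloudShell inherits the credentials of the console identity you signed in with,
# so the SDK picks them up automatically.
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])

# Quick smoke test: list the S3 buckets visible to that identity.
for bucket in boto3.client("s3").list_buckets()["Buckets"]:
    print(bucket["Name"])
```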

The Amazon Builders’ Library also gets a mention; you can think of it as a library of patterns and articles on how Amazon builds and operates software, which you can review, consume, and apply as you’re developing. It’s a great learning resource too, full of examples.

Sustainability

A big part of the AWS message this year has been around sustainability. Peter DeSantis’s Infrastructure Keynote talked a lot about improvements in efficiency and AWS’s focus on sustainable energy and building, and Werner continues the message with a description of how AWS (and partners) can help customers be more efficient with their workloads, from both an architectural and an operational perspective. By being more efficient, you’re automatically more sustainable. We hear again about the new ARM-based Graviton2 processor, which offers better price-performance (and better energy efficiency) than comparable x86-based instances.

Dependability

We’re introduced to the concept of dependability, which is a business-level concept, not a purely technological one: is the system delivered while avoiding “unacceptable failures”? Note that what counts as acceptable is defined by the business’s needs, and dependability encompasses many other “-ilities”, including availability, confidentiality, and observability.

We hear from Nicole Yip of Lego, who tells us how they’ve managed to evolve their systems to handle gigantic spikes in demand: by eliminating infrastructure ownership of slow-moving on-premises systems, adopting serverless so that capacity scales automatically to match demand, and adopting composable architectures to allow rapid innovation and boundary-pushing. They were able to do this by adopting a self-service model, doubling down on automation, and pushing strongly towards a DevOps model where teams design, build, and run their own services. A key requirement is that every service deployment must be done via a canary model, to ensure that unanticipated failures are caught with minimal impact to the end user. A fan-out operations model, where alerts are caught centrally and then distributed based on business criticality, allows diverse teams comprising members at all levels of skill and experience to contribute to the robust running of the environment.

By moving from monolithic designs to distributed, serverless ones, Lego was able to increase their dependability as well as free up engineering time to work on high-value problems.

Another way to approach dependability is formal proof of correctness, which is extremely difficult to achieve, especially with any application that’s even faintly complicated. If you could convert a program into a mathematical specification, you could then perform formal verification on it, but that’s enormously difficult to do. NASA and Intel invest in this kind of development method, as does AWS for services like S3 and KMS.

S3 Strong Consistency

This has implications for eventual consistency as well, since reasoning about systems that are eventually consistent is much harder to do. S3 has traditionally been eventually consistent (you write an update to an object, and if you read it again immediately, you might get an old version) because that was a key tradeoff accepted in order to deliver S3’s levels of availability and durability at scale, and attempting to make it strongly consistent risked raising some very hairy edge cases. The adoption of formal verification methods for S3 has allowed AWS to provide strongly consistent read-after-write behaviour for all requests, with no impact on performance and no increase in cost.
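
Concretely, the overwrite-then-read pattern that used to need special care is now guaranteed to return the latest data. A minimal sketch with boto3 (bucket and key names are illustrative):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "config/settings.json"  # illustrative names

# Overwrite an existing object...
s3.put_object(Bucket=bucket, Key=key, Body=b'{"feature_flag": true}')

# ...and read it straight back. With strong read-after-write consistency this
# GET returns the new version; previously an immediate read after an overwrite
# could return the old object.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
assert body == b'{"feature_flag": true}'
```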

VPC Reachability Analyzer

Understanding how things talk to each other in a VPC environment, transiting subnets, network ACLs, security groups, route tables, and so on, has required a lot of manual eyeballing; the introduction of the Reachability Analyzer allows you to verify automatically whether any given EC2 instance or network interface can reach another, without actually sending any packets. This is achieved through automated reasoning methods as well.
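
Reachability Analyzer is driven through the EC2 API. As a sketch of what checking a path between two instances might look like via boto3 (the instance IDs are placeholders, and it’s worth checking the exact parameters against the documentation):

```python
import boto3

ec2 = boto3.client("ec2")

# Define the path to analyse; source and destination are placeholders for
# real instance (or network interface) IDs in your account.
path = ec2.create_network_insights_path(
    Source="i-0123456789abcdef0",
    Destination="i-0fedcba9876543210",
    Protocol="tcp",
    DestinationPort=443,
)

# Kick off an analysis; no packets are sent, the verdict comes from automated
# reasoning over your route tables, security groups, and network ACLs.
analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
)
print(analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"])
```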

Zelkova

All of these things, along with Macie, AWS Config, IAM Access Analyzer, and others, make use of a back-end service called “Zelkova”, developed by the AWS Automated Reasoning Group. It uses mathematical models of systems to provide strong proof (not just assertions or heuristics) that specific properties hold, such as “this system can reach that system”. It’s cool, and we don’t have to pay any extra for the level of certainty that we get from AWS’s use of it.

Operations are Forever

Next, Werner talks about how the time spent developing an application is nothing compared to the time and energy involved in operating and maintaining it. This led to the adoption of Rust for the next generation of load balancers within AWS: it made sense for the delivery teams to learn a new language if it meant more reliable systems, because that would reduce the load on the operations end of the equation. The success they’ve had has led to the adoption of Rust by more teams within AWS, as it allows them to make stronger assertions about the readiness of a given system or service for deployment and ongoing use.

Fault Handling

Similar to automated reasoning, fault injection can be used to verify that error-handling paths in code are actually correct. One approach to fault injection is to use libraries and tools that “fuzz” inputs: generating random (or pseudo-random) inputs that, over time, should traverse the space of possible inputs a system might receive, so that you can make stronger assertions about the correctness of the error-handling code.
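
As a toy illustration of the idea (not how AWS does it internally), here’s a tiny fuzz harness that throws pseudo-random byte strings at a parser and checks that it only ever fails in the ways we’ve declared acceptable:

```python
import json
import random

def parse_message(raw: bytes) -> dict:
    # The system under test: in real life this would be your API handler,
    # decoder, or protocol parser.
    return json.loads(raw.decode("utf-8"))

def fuzz(iterations: int = 10_000, seed: int = 42) -> None:
    rng = random.Random(seed)  # seeded, so any failure is reproducible
    for i in range(iterations):
        raw = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse_message(raw)
        except (ValueError, UnicodeDecodeError):
            # Declared, expected failure modes for malformed input.
            pass
        except Exception as exc:
            # Anything else means the error handling has a gap.
            raise AssertionError(f"iteration {i}: unexpected {exc!r} for {raw!r}")

if __name__ == "__main__":
    fuzz()
```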

Fuzzing can be tricky to set up, though, as it’s difficult to think about just what “shapes” of input you should be trying to send. AWS makes extensive use of fuzzing to test both its APIs and its backend code.

Fuzzing is just one part of fault testing, though, with other kinds of failure injection required to test complex distributed systems properly. Netflix’s well-known Chaos Monkey was a tool that implemented some of the basic chaos engineering principles, randomly killing services and blocking network access in production systems to validate that the ecosystem as a whole could withstand sudden failures. Going further, you also need to be able to inject network latencies, dropped packets, and so forth.
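
To make the principle concrete, a Chaos-Monkey-style experiment can be as simple as picking a random instance from a fleet that’s supposed to be resilient and terminating it, then watching whether anything customer-facing breaks. A heavily simplified sketch (the tag name is illustrative, and you’d only ever point something like this at systems designed to withstand it):

```python
import random
import boto3

ec2 = boto3.client("ec2")

# Find running instances that have opted in to chaos experiments
# (the tag key/value here are illustrative, not any kind of standard).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}; the fleet should recover with no customer impact.")
    ec2.terminate_instances(InstanceIds=[victim])
```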

AWS Fault Injection Simulator

This leads us to the announcement of AWS Fault Injection Simulator (FIS), coming in early 2021: a new service for automated fault injection and chaos engineering against the systems you build on AWS. It allows chaos experiments against resources of all kinds, including control-plane failures such as API rate limiting. Metrics and results go to CloudWatch for analysis and decision making.

You could use FIS for game days, for CI/CD, and for testing not just resilience but also performance of your applications. It can be used to help teams build awareness and knowledge of edge states and how systems actually operate under conditions of weirdness.

This could actually be a game-changer, democratising access to fault injection for organisations that haven’t had the capacity to even consider adopting it. I look forward to giving it a whirl soon!

Observability

“How can you infer the internal state of a system from its outputs?” is the question we ask when we consider the observability property of a system. This includes functional outputs (like outputs from APIs), and non-functional outputs like metrics, logs, and traces.

Monitoring is problematic, though, as it is based on the assumption that you already know what you need to be looking at; it doesn’t allow for mistakes or unknowns. Think of the car mechanic who can tell whether a given engine has a particular problem just by listening, but may not be able to make the same diagnosis on an unfamiliar engine; or the apprentice who’s never heard that sound before.

Monitoring allows for reaction after failure; it doesn’t provide prediction of failure. As system complexity increases, you can’t put everything on a dashboard: it becomes overwhelming, and it still wouldn’t tell you what’s actually going on, because of the interactions between components.

Monitoring, as it turns out, is not the same as observability (something I’ve been banging on about for ages now).

The adage “log everything” is a great principle, but the result can be difficult to analyse. CloudWatch Logs provides the centralised log collection and analytics service within AWS, and pretty much everyone who’s used the AWS platform knows about it. Getting metrics out of those logs (e.g. publishing to CloudWatch Metrics) has traditionally been difficult, though; structured logging (e.g. in JSON format) allows high-cardinality metrics, where there might be a huge range of dimensions and values involved, to be emitted as log entries instead.
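
A sketch of the structured-logging approach (the field names here are made up; CloudWatch’s Embedded Metric Format defines a specific envelope if you want metrics extracted automatically, which this sketch doesn’t attempt):

```python
import json
import logging
import time

logger = logging.getLogger("orders")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, **dimensions) -> None:
    # Emit one JSON object per line; CloudWatch Logs Insights (or any log
    # pipeline) can then filter and aggregate on any of these fields,
    # however many distinct values they take.
    record = {"timestamp": int(time.time() * 1000), "event": event, **dimensions}
    logger.info(json.dumps(record))

# High-cardinality dimensions (customer_id, request_id) go straight into the
# log line rather than into a metric namespace.
log_event("order_placed", customer_id="c-9187", request_id="req-55a1",
          latency_ms=87, region="ap-southeast-2")
```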

CloudWatch Contributor Insights is a service that allows for real-time analysis of high-cardinality data in a rule-based model.

CloudWatch Synthetics and its canaries provide a way of continually exercising deployed services from the outside, and integrate with CloudWatch Logs and X-Ray so that you can tell not just when something has performed oddly, but, in advance, when something might be about to start performing oddly.

Amazon Managed Services for Grafana and Prometheus

Prometheus is a well-known open-source metrics collection and monitoring platform, built around a time-series database, which has been deployed in hundreds of thousands of environments; it integrates tightly and natively with Kubernetes, for example, and provides a standardised way to get metrics out of your applications and systems. Setting it up and running it effectively, though, can be difficult, and requires both time and experience; hence the new Amazon Managed Service for Prometheus offering.
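
The instrumentation side looks the same regardless of where the time-series data eventually lands, self-managed or managed. A minimal sketch using the standard prometheus_client library to expose an application’s metrics for scraping (metric names and routes are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["route"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["route"])

def handle_request(route: str) -> None:
    REQUESTS.labels(route=route).inc()
    with LATENCY.labels(route=route).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    # Expose /metrics on port 8000 for a Prometheus server (or agent) to scrape.
    start_http_server(8000)
    while True:
        handle_request(random.choice(["/home", "/checkout"]))
```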

Grafana is a similarly well-known dashboarding and metrics analytics package that’s very often paired with Prometheus. Again, deploying and managing a Grafana environment for a large ecosystem involves a lot of complexity, and so a managed service for this is very welcome too.

Both are available in preview at the moment, although at the time of writing there’s still an application form for Grafana.

OpenTelemetry

Werner also announces the AWS Distro for OpenTelemetry, a supported distribution of the OpenTelemetry (OTel) project: instrumentation that’s built into applications, plus a collector and exporters that send the resulting telemetry to different backend analytics targets. It supports CloudWatch and X-Ray, of course, as well as a number of third-party vendors and providers. The idea is that you can instrument your applications once and use the same telemetry against multiple backends without having to rewrite or re-integrate.
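
Instrumenting code with the OpenTelemetry API looks roughly like the sketch below, here using the upstream Python SDK with a console exporter purely for illustration; with the AWS Distro you’d point the processor at its collector/exporters instead, and exact module paths vary a little between SDK versions.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider with a console exporter; in an ADOT setup the
# processor would export to the OpenTelemetry collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# The instrumentation itself is backend-agnostic: the same spans can be sent
# to X-Ray, CloudWatch, or a third-party backend without touching this code.
with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.id", "o-12345")  # illustrative attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # stand-in for the actual work
```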

And That's It!

Not a lot of new product or feature announcements in Werner’s keynote this time, although the focus on Observability, Dependability, and “meeting developers where they are” is a good thing. I’m personally looking forward to having a play with the Prometheus and Grafana offerings, combining them with the OTEL distro; and I’m sure CloudShell will prove useful in many situations in the coming months.

That wraps up this series overview of the AWS re:Invent 2020 keynotes. Stay tuned for more in-depth investigations into some of the new services and features announced, both by myself and other Cevons; and remember, if you’d like a hand with any of this stuff, give us a call!
