Meltdown, Spectre and Linux on AWS: Security vs Performance?

Colin Panisset

10 January, 2018

An analysis of some corner-case performance issues with Meltdown patches

MELTDOWN, SPECTRE AND LINUX ON AWS: SECURITY VS PERFORMANCE?

The recent announcement of the Meltdown and Spectre attacks against bugs in Intel (and other) CPUs has attracted a rapid response from many vendors. Amazon Web Services’ (AWS) response shows that they’ve already patched and protected their infrastructure, but you still have work to do. AWS’ Shared Responsibility Model means that you are responsible for patching the operating system running on your EC2 instances, and this is where things get … complicated.

GIVE IT TO ME STRAIGHT, DOC

If you want the TL;DR from all this, here are a few general rules to follow:

Run your EC2 instances from the most recent AMI you can that uses the HVM virtualisation mode
Patch your operating systems to make sure you have the Meltdown fixes applied
Update to more recent EC2 instance families
Run the latest Linux kernel you can to ensure you have PCID support

PROBLEMS AT THE LOWEST LEVEL

Let’s start off with some basics. The bugs allow malicious processes to steal information that would normally be protected, such as passwords, credit card numbers, and so forth, while that data is being processed by the CPU. The Meltdown researchers report that effectively every Intel processor with out-of-order execution is affected, which in practice means most models since the mid-1990s and not only recent generations such as “Haswell” (2013) and later. This is due to flaws in the CPU itself, and has nothing to do with Windows, Linux, Mac OSX, or any other operating system. The CPU cannot be patched – it’s hardware – and so we must rely on fixes to the systems that run on top of those CPUs.

FIXES APPLIED ONE LEVEL HIGHER

There are generally two classes of system which run directly on a CPU: an operating system, like Linux or Windows; or a hypervisor, like VMware ESXi, Xen, or Amazon’s KVM-based proprietary hypervisor.

If a hypervisor is run on the CPU, it hosts other operating systems (like Linux and Windows).

Applying patches to this first layer can protect against both Spectre and Meltdown attacks, with varying degrees of performance impact.

Virtual machines running on top of the hypervisor still need to be patched in order to protect processes running within their operating systems from exploits. These patches can carry a performance cost of their own.
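To check whether a particular Linux guest has the Meltdown fix active, kernels that include the vulnerability-reporting interface expose a status line in /sys/devices/system/cpu/vulnerabilities/meltdown. The following is a minimal sketch, assuming that file is present (it is missing on older kernels, in which case the sketch reports "unknown"); the function names and the four labels are illustrative choices, not part of any kernel interface.

```python
from pathlib import Path

# Sysfs file where newer kernels report Meltdown status; absent on older kernels.
MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def classify_meltdown_status(text):
    """Map the kernel's status line to one of four illustrative labels."""
    status = text.strip().lower()
    if status.startswith("mitigation"):
        return "mitigated"      # e.g. "Mitigation: PTI"
    if status.startswith("not affected"):
        return "not_affected"
    if status.startswith("vulnerable"):
        return "vulnerable"
    return "unknown"


def read_meltdown_status(path=MELTDOWN_STATUS):
    """Classify the status file, or return 'unknown' if it cannot be read."""
    try:
        return classify_meltdown_status(path.read_text())
    except OSError:
        return "unknown"


if __name__ == "__main__":
    print(read_meltdown_status())
```

An "unknown" result does not mean the system is safe or unsafe; it only means this kernel does not report the status this way.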

WELL-KNOWN PERFORMANCE IMPACTS

Intel expects that performance impacts of around 6% will be imposed as a result of fixes for the vulnerabilities (see References); independent testing on Linux systems has measured 5-30% performance impacts (depending on the workload); Microsoft estimates performance impacts but is being cagey about actual numbers (see References).

On Linux, patches against Meltdown implement a feature called “Kernel Page Table Isolation” (KPTI), which imposes a performance cost whenever a user-land process executes a system call, transferring control from the application code into the kernel (for example, whenever data needs to be read from or written to a disk, or whenever network communication happens).
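Because the cost is paid per system call, one rough way to see it is to time a loop that makes a cheap system call and compare it against a loop that never enters the kernel. The sketch below is only illustrative: it assumes os.getppid() results in one system call per invocation, the helper names are made up for this example, and Python's own interpreter overhead makes up a large part of both numbers. It is most useful for comparing the same instance before and after patching, not for absolute figures.

```python
import os
import time


def time_per_call(func, iterations=200_000):
    """Return the average wall-clock seconds per call of func."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return (time.perf_counter() - start) / iterations


def no_syscall():
    """Baseline: a Python-level call that never enters the kernel."""
    return None


if __name__ == "__main__":
    baseline = time_per_call(no_syscall)
    # Assumed to cross the userland/kernel boundary once per iteration.
    with_syscall = time_per_call(os.getppid)
    print(f"baseline:     {baseline * 1e9:8.1f} ns/call")
    print(f"os.getppid(): {with_syscall * 1e9:8.1f} ns/call")
```

The gap between the two figures approximates the per-call cost of entering and leaving the kernel; a workload that makes many such calls per second multiplies that gap accordingly.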

These performance impacts depend on exactly what kind of work an application does, based on how often these system calls need to be executed, but in general the performance penalty should be restricted to that application and not affect other processes on the same system.

Right?

Well, not quite.

AN OBSCURE FEATURE BECOMES CRITICAL

Intel CPUs since 2010 (codename “Westmere”) have supported a feature called PCID (process-context identifier) which, for the past 7 years, has been fairly boring and unsupported by Linux kernels, because it didn’t really do much for performance or security. Starting with kernel 4.14, it’s been supported, though more for completeness and a minor capability improvement than as a critical feature.

It turns out that PCID is important in alleviating some of the performance impacts of the KPTI patches, and in preventing one application from killing system performance for all other applications. You see, the CPU maintains a Translation Lookaside Buffer (TLB), which is a cache of recently used translations from virtual memory addresses to physical ones. With KPTI, the kernel and userland use separate page tables, so every system call that crosses the userland/kernel boundary switches between them. On processors without PCID support, that switch throws away the cached TLB entries, which must then be rebuilt, increasing the amount of time it takes to execute frequent operations. With PCID, TLB entries are tagged with an identifier, so they can survive the switch.
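On an x86 Linux guest, the CPU features visible to the operating system are listed on the "flags" line of /proc/cpuinfo, and PCID shows up there as the flag "pcid". The following sketch parses that text to report whether the flag is exposed; the function names are illustrative, and it says nothing about whether the running kernel actually makes use of PCID.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # The first "flags" line is enough: all cores normally match.
            return set(value.split())
    return set()


def has_pcid(cpuinfo_text):
    """True if the exact flag 'pcid' is listed (not merely 'invpcid')."""
    return "pcid" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("could not read /proc/cpuinfo (not a Linux system?)")
    else:
        print("PCID exposed" if has_pcid(text) else "PCID not exposed")
```

Matching against a set of whole flag names avoids a false positive from related flags such as "invpcid", which a plain substring search would hit.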

But just because all modern CPUs and Linux kernels support this feature doesn’t mean that you can use it on AWS.

HVM, PV, AND INSTANCE FAMILIES, OH MY!

AWS’ original EC2 instances all ran on top of a hypervisor which provided paravirtualised (“PV”) interfaces to the guest operating systems, which hide some of the features of the underlying CPU, including the PCID capability.

More recent instance families (along with some of the older ones) run on a newer hypervisor which exposes more of the underlying capabilities; this virtualisation mode is called “HVM”, which stands for “Hardware Virtual Machine”.

Although almost all EC2 instance families (like t2, m3, c4) are available in the HVM mode, they don’t all actually expose the PCID feature, which you need in order to avoid the worst performance penalties.

HELP ME, OBI-WAN!

Lucky for you, we’ve done some research and mapped the EC2 instance families against virtualisation modes and CPU features to tell which combinations are least-affected. The following table shows what’s what:

You can see, at a glance, that no PV instance types provide PCID – avoid these, to avoid the worst performance impacts.

You can also see that even if you choose HVM as your virtualisation type, some instance families still don’t expose the PCID feature – you should avoid these as well.
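To find PV instances in an existing fleet, the EC2 DescribeInstances API reports a VirtualizationType of either "hvm" or "paravirtual" for each instance. The sketch below filters a response of that shape for paravirtual instances. It deliberately takes a plain dictionary so that it runs without AWS credentials; the sample instance IDs are made up, and in practice the dictionary would come from something like boto3's describe_instances() call, with pagination handled separately.

```python
def paravirtual_instances(response):
    """Return (instance_id, instance_type) pairs for PV instances.

    `response` is assumed to be shaped like an EC2 DescribeInstances
    result: {"Reservations": [{"Instances": [{...}, ...]}, ...]}.
    """
    found = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("VirtualizationType") == "paravirtual":
                found.append(
                    (instance.get("InstanceId"), instance.get("InstanceType"))
                )
    return found


if __name__ == "__main__":
    # Hypothetical sample data; the instance IDs are invented for illustration.
    sample = {
        "Reservations": [
            {"Instances": [
                {"InstanceId": "i-0aaa", "InstanceType": "m1.small",
                 "VirtualizationType": "paravirtual"},
                {"InstanceId": "i-0bbb", "InstanceType": "c4.large",
                 "VirtualizationType": "hvm"},
            ]}
        ]
    }
    for instance_id, instance_type in paravirtual_instances(sample):
        print(f"{instance_id} ({instance_type}) is PV: no PCID available")
```

HVM instances still need the per-guest flag check, since the virtualisation type alone does not tell you whether PCID is exposed.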

Frustratingly, the hs1 (or “high storage”) instance type is perhaps worst affected; it’s most commonly used for workloads that need to make a lot of disk I/O system calls, thereby bringing the highest amount of performance overhead from the Meltdown patches, and it doesn’t support PCID, meaning that you’re losing out twice over.

THIS IS ALL VERY CONFUSING …

I’ve tried to keep a balance between enough technical information and, where possible, useful simplifications.

If you’d like some assistance working through this mess, please contact us and we’d be happy to see what we can do to help out.

REFERENCES

No good article is complete without references, right?
