The last couple of weeks have seen the public disclosure of the Downfall and Zenbleed CPU vulnerabilities. (We discussed the Zenbleed vulnerability and how we addressed it in this blog post.) These are just the latest in a series of CPU vulnerabilities disclosed over the last few years, the most notable of the others being Spectre and Meltdown in 2018.
With these vulnerabilities in the news, we would like to take a moment to clarify what they are and how they were addressed at GoDaddy to protect our customers.
CPUs today can execute thousands of instructions in the time it takes to fetch data from main memory. This simple fact has caused CPU manufacturers to lean heavily on two ideas to improve performance:
- Long ‘pipelines’ that reorder and speculatively execute instructions to reduce the time spent waiting on memory
- Caches that keep as many results of these operations on the CPU as possible
Keeping this much state resident on the CPU conflicts with another fundamental promise of CPUs:
- Each process should be isolated from all other processes
At a high level, all the CPU vulnerabilities over the last decade can be traced back to CPU state being retained even after a CPU switches to another process (called a ‘context switch’).
Modern CPUs try to ‘read ahead’ as much as they can in the instruction stream. If they see an instruction that can be executed with data already in the CPU, they may speculatively execute that instruction and cache the result. If they see an instruction that needs data from memory, they may trigger a prefetch. So by the time the CPU gets to the instruction that needs the data, it’s already available.
All of this is possible because of the incredible number of transistors available inside a modern CPU. They can speculatively execute instructions and simply discard the results if they’re not needed.
The inherent problem with pipelines is that all of this speculative execution ends up caching a lot of data in the CPU. Over the years, as the pipelines have gotten longer and the speculative execution engines have become more complicated, unintended side effects have inevitably crept in.
Each CPU vulnerability targets a different aspect of state management within modern CPUs.
Meltdown was possible because, in many processors, speculative memory fetches happen before the privilege check that determines whether a process is allowed to read the memory location. When this happens, a read of an inaccessible memory location is cached even though the process never receives the resulting value. Since the value is already cached on the CPU, subsequent accesses to that same memory location are faster. This makes what should be an internal CPU implementation detail observable to any process that can read memory and measure how long the read takes. The result is a ‘timing attack’ that can slowly read data from outside the process’s allowed memory.
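To make the timing idea concrete, here is a minimal toy simulation in Python. It is not an exploit and does not touch real hardware: the cache is a plain set, the latencies are made-up constants, and names such as ToyCache and speculative_victim_read are hypothetical. It only illustrates why a cached line that was never architecturally returned can still reveal a value.

```python
# Toy model only: latencies and class names are illustrative assumptions.
CACHE_HIT_NS = 10     # assumed latency when a line is already cached
CACHE_MISS_NS = 200   # assumed latency when a line comes from main memory


class ToyCache:
    """A set of cached line indices with a fake access-latency model."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, line):
        """Return the simulated latency in ns and leave the line cached."""
        latency = CACHE_HIT_NS if line in self.lines else CACHE_MISS_NS
        self.lines.add(line)
        return latency


def speculative_victim_read(cache, secret_byte):
    """Model the flaw: the result is discarded, but the cache line
    indexed by the secret value stays resident."""
    cache.access(secret_byte)
    return None  # the attacker never receives the value directly


def recover_byte(cache):
    """Time all 256 probe lines; the single fast one reveals the byte."""
    timings = {line: cache.access(line) for line in range(256)}
    return min(timings, key=timings.get)


def leak(secret):
    """Recover a byte string one byte at a time via simulated timings."""
    cache = ToyCache()
    recovered = bytearray()
    for byte in secret:
        cache.flush()
        speculative_victim_read(cache, byte)
        recovered.append(recover_byte(cache))
    return bytes(recovered)
```

In this sketch, leak(b"key") reconstructs the input purely from which probe line was fast. Real attacks have to cope with measurement noise and are correspondingly slower.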
Spectre was a similar class of attack to Meltdown that focused on how pipelines try to guess which instruction in a stream will be dispatched next (called ‘branch prediction’).
Downfall similarly takes advantage of the implementation of the gather instruction on Intel CPUs. The retained state across process boundaries is a ‘temporal buffer’ that caches memory read operations. In pursuit of performance, memory addresses that may be referenced by a gather instruction are speculatively fetched into a buffer that is shared across processor security domains.
Zenbleed is a conceptually similar attack to Downfall, except it targets the vzeroupper instruction in AMD CPUs. Speculative execution fetches data into the shared CPU state (the register file, in this case) and a clever sequence of instructions from another process enables that process to read a register populated with speculatively fetched data.
The approach by CPU manufacturers to mitigate each of these vulnerabilities has been ‘turn off the part of the speculative execution feature that populates the data being leaked’. This is usually done with a microcode update from the CPU manufacturer that reconfigures proprietary internals of the CPU and selectively disables features. Operating systems can also provide additional layers of mitigation by adjusting the way they handle context switches.
Of course, these mitigations also disable CPU features that are responsible for significant performance gains for certain workloads.
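On Linux, the kernel reports what it knows about each of these vulnerabilities under /sys/devices/system/cpu/vulnerabilities, one file per issue. The sketch below reads that directory and lists anything the kernel still reports as vulnerable. The function names are hypothetical, the set of files depends on the kernel version, and the "starts with Vulnerable" check is a simplification of the status strings the kernel emits.

```python
from pathlib import Path

# Standard Linux sysfs location; the files present vary by kernel version.
SYSFS_VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=SYSFS_VULN_DIR):
    """Return {vulnerability_name: status_string} as reported by the kernel.

    Returns an empty dict when the directory does not exist, for example
    on non-Linux systems or very old kernels.
    """
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


def still_vulnerable(status):
    """Names whose status string begins with 'Vulnerable' (simplified check)."""
    return sorted(
        name for name, text in status.items() if text.startswith("Vulnerable")
    )


if __name__ == "__main__":
    for name, text in read_mitigation_status().items():
        print(f"{name}: {text}")
```

A check like this only reflects the kernel's view of the running system. It does not replace vendor advisories, and not every vulnerability has a corresponding file.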
GoDaddy’s Hosting environment is susceptible to these sorts of vulnerabilities, since in many cases customers share physical hardware. This environment is also the most sensitive to performance regressions (no one wants a slow website, right?), so each mitigation has to be evaluated to determine how it may affect customer workloads.
Since most of these vulnerabilities target specific CPU subsystems or even individual instructions, determining whether GoDaddy workloads are affected can be challenging. The instruction exploited by Zenbleed is used in system libraries, which exposes every process running on a system. Other instructions are used only in more specialized workloads, such as video encoding, that may not be applicable to GoDaddy customers.
Within GoDaddy hosted workloads, CPU mitigations over the last five years have resulted in a 25% overall reduction in server performance (averaging a variety of CPU metrics across our server fleet). This is in line with the reported impact on consumer workloads. Because we don’t want to pass these performance reductions on to our customers, we compensate by adjusting how we spread workloads across servers and by procuring new CPUs that have incorporated mitigations into their design and don’t suffer the same performance penalties.
GoDaddy’s Response to Downfall
The details of the Downfall vulnerability were published on August 8, 2023.
This vulnerability affects specific Intel CPU models, and Intel posted a page detailing which models were affected.
The first step within GoDaddy was identifying which of the thousands of servers within the GoDaddy fleet were affected. This was performed through a combination of our server inventory management system and individual teams running CPU model detection automation. It should be noted that while AMD and Intel rated the vulnerabilities as medium security risks, our internal teams understood the nature of both threats and acted quickly, treating them as critical rather than medium risks.
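As an illustration of what CPU model detection can look like on a Linux host, the sketch below parses /proc/cpuinfo-style text and checks the vendor, family, and model against a lookup set. This is not GoDaddy's actual automation. The function names are hypothetical, and the entries in PLACEHOLDER_AFFECTED are placeholders rather than values taken from Intel's advisory; a real check would populate them from the vendor's published list.

```python
# Placeholder entries for illustration only; fill from the vendor advisory.
PLACEHOLDER_AFFECTED = {
    ("GenuineIntel", 6, 85),
}


def parse_first_cpu(cpuinfo_text):
    """Extract identifying fields for the first CPU in /proc/cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "vendor_id": fields.get("vendor_id", ""),
        "family": int(fields.get("cpu family", -1)),
        "model": int(fields.get("model", -1)),
        "microcode": fields.get("microcode", ""),
    }


def is_affected(cpu, affected=PLACEHOLDER_AFFECTED):
    """True when (vendor, family, model) appears in the lookup set."""
    return (cpu["vendor_id"], cpu["family"], cpu["model"]) in affected


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        cpu = parse_first_cpu(handle.read())
    print(cpu, "affected" if is_affected(cpu) else "not in lookup set")
```

Reading only the first processor block assumes all sockets in a server carry the same CPU model, which may not hold everywhere. The microcode field is kept so a later step can confirm that an update actually took effect.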
Intel released microcode updates for affected processors, which were tested on a small group of servers while automation was written for a wider rollout. Normally these system changes would be pushed as part of our regular patch cycle, neatly packaged by each OS distribution. However, due to the time-sensitive nature of the disclosure, we opted to push the updated Intel microcode directly to the affected servers without waiting for package maintainers to release their updates. Once the impact of the microcode change was measured on the small test group and found to be negligible, the microcode was rolled out to all affected servers.
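The test-group-then-fleet pattern described above can be sketched as a small gate function. Everything here is hypothetical: staged_rollout, the apply_update and measure_slowdown callbacks, the canary count, and the 2% threshold are assumptions chosen for illustration, not a description of GoDaddy's tooling.

```python
def staged_rollout(servers, apply_update, measure_slowdown,
                   canary_count=5, max_slowdown=0.02):
    """Update a small canary group, then the rest only if impact is small.

    apply_update(server) and measure_slowdown(server) are caller-supplied
    callbacks; measure_slowdown returns a fraction (0.02 means 2% slower).
    """
    canaries = list(servers[:canary_count])
    remainder = list(servers[canary_count:])

    for server in canaries:
        apply_update(server)

    # Gate on the worst canary, not the average, to be conservative.
    worst = max((measure_slowdown(s) for s in canaries), default=0.0)
    if worst > max_slowdown:
        return {"updated": canaries, "halted": True, "worst_slowdown": worst}

    for server in remainder:
        apply_update(server)
    return {
        "updated": canaries + remainder,
        "halted": False,
        "worst_slowdown": worst,
    }
```

Gating on the worst canary rather than the mean is one possible design choice. For a security fix, a halted rollout would more likely trigger a review than leave servers unpatched.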
In the end, all servers running customer workloads were mitigated within hours of the public disclosure.
Our customers come to us for the tools they need to run their businesses. They want to focus on making their businesses successful, not on keeping up with the latest CPU vulnerabilities. Our prompt actions in mitigating emergent threats reinforce our commitment to our customers and help ensure they have the most reliable, secure, and performant websites.