A recent research study has revealed a security risk in Transparent Page Sharing (TPS), as acknowledged by VMware in KB 2080735.
The researchers showed that, from one virtual machine (A), an AES encryption key could be retrieved from another virtual machine (B) on the same host. While the steps to achieve this seem difficult to reproduce, the risk is real. In fact, it is real enough that VMware decided to disable TPS in all future versions of ESXi, as well as in the next update release of all current versions.
For example, version 5.5 is currently at Update 2: TPS will be disabled with Update 3. More precisely, inter-VM page sharing will be disabled by default. Pages can still be deduplicated within a single virtual machine, for a much smaller benefit of course.
Until these new releases are available, patches exist for those who wish to disable TPS in versions 5.5 and 5.1, and a patch is coming for version 5.0.
Let’s go back to the late 2000s and the release of CPUs able to virtualize the Memory Management Unit (MMU) in hardware (Intel EPT or AMD RVI). These hardware MMUs greatly improved performance for some workloads but reduced it for others, because of the higher cost of a TLB miss. To reduce the pressure on the TLB and get the best of both worlds, VMware recommended using large memory pages (2 MB instead of 4 KB) together with these hardware MMUs. Newer operating systems would use large memory pages anyway, so everything came together quite well.
But there is a downside to these large pages: it’s very unlikely that two large pages are identical! Therefore, TPS does not share large pages unless the host is low on memory. When this happens, the large pages are split back into small pages (4 KB) and TPS can kick in to reclaim memory. But performance may go down!
Back to our problem: on any modern ESXi server where memory is not under pressure, there is no practical risk, because large pages are enabled by default, which leaves TPS with almost nothing to share. Consequently, the future updates disabling TPS will have no further impact on such servers.
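To make the large-page effect concrete, here is a minimal Python sketch. It is a toy model, not ESXi's actual algorithm: the memory layout, the function names and the ratio of VM-specific pages are all assumptions chosen for illustration. Two simulated guests hold mostly identical 4 KB pages, with one VM-specific page out of every 64, and we count how many pages could be collapsed at each page size.

```python
import hashlib

SMALL = 4 * 1024          # 4 KB page
LARGE = 2 * 1024 * 1024   # 2 MB page

def build_memory(vm_id, n_small_pages, unique_every):
    """Toy guest memory: every `unique_every`-th 4 KB page is VM-specific,
    all other pages are identical across VMs (think shared OS code)."""
    pages = []
    for i in range(n_small_pages):
        if i % unique_every == 0:
            tag = f"vm{vm_id}-page{i}".encode()
        else:
            tag = f"common-page{i}".encode()
        pages.append(tag.ljust(SMALL, b"\0"))
    return b"".join(pages)

def shared_fraction(memories, page_size):
    """Fraction of pages that duplicate a page already seen (by content hash)."""
    seen = set()
    total = 0
    duplicates = 0
    for mem in memories:
        for off in range(0, len(mem), page_size):
            digest = hashlib.sha1(mem[off:off + page_size]).digest()
            total += 1
            if digest in seen:
                duplicates += 1
            else:
                seen.add(digest)
    return duplicates / total

# Two hypothetical VMs with 4 MB of memory each (1024 small pages).
memories = [build_memory(vm, 1024, 64) for vm in (1, 2)]

print(f"4 KB pages: {shared_fraction(memories, SMALL):.1%} shareable")  # 49.2%
print(f"2 MB pages: {shared_fraction(memories, LARGE):.1%} shareable")  # 0.0%
```

A single differing 4 KB page is enough to make a whole 2 MB page unique, which is why almost nothing can be shared until the large pages are split again.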
However, be careful. TPS may be unused in your environment now, but what if a server fails? Are you sure that the remaining servers in your HA cluster can recover all virtual machines without activating TPS?
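A quick back-of-the-envelope check can answer this question. The sketch below is only a first approximation under stated assumptions: it compares configured VM memory with the RAM of the surviving hosts, assumes no page sharing at all, and ignores per-VM overhead, reservations and placement constraints. The function name and all figures are made up for illustration.

```python
def fits_without_tps(host_mem_gb, vm_mem_gb, failed_hosts=1):
    """Return True if the configured VM memory fits on the hosts that
    survive the loss of the `failed_hosts` largest hosts, with zero sharing."""
    surviving = sorted(host_mem_gb)[:len(host_mem_gb) - failed_hosts]
    return sum(vm_mem_gb) <= sum(surviving)

hosts = [256, 256, 256]   # GB of RAM per host (made-up figures)
vms_light = [16] * 30     # 480 GB configured in total
vms_dense = [16] * 40     # 640 GB configured in total

print(fits_without_tps(hosts, vms_light))  # True: 480 GB <= 512 GB
print(fits_without_tps(hosts, vms_dense))  # False: 640 GB > 512 GB
```

If the second case looks like your cluster, a host failure would push the remaining servers into memory pressure, and TPS (or ballooning and swapping) would become part of the recovery.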
In any case, for all environments relying on TPS to increase their consolidation ratio (which sometimes goes up by 50% with this feature), the question will arise: should you re-enable TPS after the next update? (VMware is disabling it by default but leaves the possibility to enable it again.)
Let’s analyse some use cases and see if we can propose a reasonable answer.
Use case analysis
Use case 1. A corporate server environment
In this use case, the VMware environment hosts the servers of a single company. The servers and applications are managed by the IT team and are used internally.
In this case, why would an attacker gain access to one of the virtual machines and launch a complex attack via TPS to extract an AES key from another virtual machine… instead of just gaining access to the target virtual machine in the first place? 🙂 In most companies, all servers share a single identity source (generally Active Directory). When you have an administrator account on one machine, you can generally use it to access the other machines too. Therefore, the practical risk of this attack in a regular corporate environment is very low.
But still, very low is not zero! The mechanism used by this attack could be improved and become a greater risk in the future.
For this reason, our take would be to leave TPS activated for the time being. If there are some critical virtual machines (DMZ, critical application…) for which TPS should be disabled, this is feasible with the latest patches and a technique called salting*, but managing salting is still quite complicated (as it involves editing the .vmx files manually). I would do it only with a strong business and security requirement!
However, you should plan to disable TPS in the medium term, and plan your investments accordingly. For instance, renew the servers with appropriately sized memory, or just increase the memory of the existing servers, so that you can disable TPS in the future.
Use case 2. Multi-tenant environment
This is typically a hosting platform, but it could also apply to a large company that offers infrastructure services to subsidiaries, or to departments that have their own identity management, virtual environment and network but share the same physical hosts for compute.
In this case, even if the security risk is not critical, you must disable TPS. You cannot allow a customer to access any data belonging to another customer, and this is clearly a possibility here. And an SLA probably commits you to maintaining the best possible security level anyway.
Therefore, you should plan the necessary investment (probably memory, which is not so expensive anyway) to disable TPS as soon as possible without impacting performance on the customer side. And if you can already disable it without impacting performance too much, you should probably do it right away.
Use case 3. VDI
VDI has always been the perfect use case for TPS: on a single server, tens of identical VMs, running the same applications… TPS can greatly improve the consolidation ratio here! Should we really disable it?
Even more than in our first use case (corporate environment), the security risk with TPS in a VDI environment is extremely low. Typically, VDI virtual machines are deployed as a pool, and permissions are given at the pool level. If you have access to one VM, you have access to all of them. It would make no sense to try exploiting the TPS weakness in this scenario! Additionally, VDI VMs are typically virtual machines where no critical information is found. User data is stored centrally and application data is on servers. An attacker targeting this data would not start by hacking a VDI desktop (and would not exploit the TPS issue anyway).
However, you could find environments where the VDI cluster also hosts some servers, in which case an AES key from a server could be retrieved from a desktop virtual machine. But in this case, the right solution is to apply best practices and avoid mixing workloads, not to disable TPS!
Finally, let’s mention the special case of multi-tenant VDI environments, where VDI machines from several customers could be executed on a single host (well, Microsoft’s licensing would make this difficult, but let’s imagine that scenario for a second). Here we have again the unacceptable case where a customer could access data from another customer.
This is a perfect use case for the new salting* technique. In this case, we would recommend configuring all virtual machines of a single customer with an identical salt to enjoy the benefits of TPS while keeping customers isolated from each other. Configuring salting is still a bit complicated, but it’s probably worth the additional work!
In the VDI use case, we would recommend keeping TPS enabled, with salting if required, while keeping an eye on this security risk in case new information pops up!
Without any doubt, there is a real security issue with TPS. A first proof of concept retrieved an AES key from another virtual machine, but who knows what else the discovered technique could achieve? We cannot ignore the risk, and administrators of multi-tenant environments in particular should probably react quickly to mitigate this issue.
In many other cases, TPS isn’t used on a regular basis anymore (typical server environment) or is heavily used, but without a real security issue (VDI). In both cases, we must also keep in mind that the implementation of the attack scenario is complex, and the results limited. Therefore, an immediate response is not necessarily required.
To respond properly to this new risk, the first step is to analyse the environment carefully, and to keep in mind that TPS, as we know it, is no longer a future-proof solution!
* salting: in the latest patches shipped for ESXi 5.5 and 5.1 (a similar option will likely be found in future versions of ESXi), TPS can be activated selectively for a group of virtual machines with a virtual machine attribute called “salt”. All VMs with the same salt can share their memory pages (unless they use large pages…). By default, the salt is based on the virtual machine UUID, which makes it unique and effectively disables sharing between VMs. But configure the same salt for a group of VMs and they will share their pages again. Clever!
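The effect of the salt can be illustrated with a small Python toy model. This is a sketch of the idea only, not ESXi's implementation: the function names, the salt strings and the page contents are hypothetical. A page can be shared only with pages that have the same content and the same salt, so the sharing key is derived from both.

```python
import hashlib
import uuid
from collections import Counter

def share_key(salt, page):
    """Two pages may be shared only if both salt and content match."""
    return hashlib.sha256(salt.encode() + b"\0" + page).digest()

def pages_saved(vms):
    """vms: list of (salt, list_of_pages). Returns how many page copies
    could be reclaimed by keeping a single copy per (salt, content)."""
    counts = Counter(share_key(salt, p) for salt, pages in vms for p in pages)
    return sum(c - 1 for c in counts.values())

# Four hypothetical desktops booted from the same image: identical pages.
image = [b"kernel", b"libc", b"browser"]

# Default behaviour: one unique salt per VM (modelled here with a random UUID).
default = [(str(uuid.uuid4()), image) for _ in range(4)]

# One salt per customer: sharing inside a customer, none across customers.
per_customer = [("customer-a", image), ("customer-a", image),
                ("customer-b", image), ("customer-b", image)]

# Same salt everywhere: equivalent to the old, unrestricted TPS.
unrestricted = [("same", image) for _ in range(4)]

print(pages_saved(default))       # 0
print(pages_saved(per_customer))  # 6
print(pages_saved(unrestricted))  # 9
```

The per-customer configuration gives up a third of the savings in this toy example, in exchange for never sharing a page between two customers.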