Performance impact of snapshots

VMware recommends keeping a snapshot for no more than three days. The main justification for this recommendation is disk space: snapshots can grow quickly on disk-intensive virtual machines.

But if you read carefully, you can also find some performance-related risks. Interesting! We all know, more or less, that disk performance can be impacted by snapshots. But to what extent? To get some clues, we are going to run some tests and evaluate the performance impact of snapshots ourselves.

You might be surprised!

Here is our test environment: a vSphere 5.5u1 host and a Windows Server 2008 R2 virtual machine with a dedicated 100 GB disk. Inside the VM, we run IOmeter with the following test pattern: 32k blocks, 80% read, 90% random. We use two SAN devices for our testing: a Storwize V7000 equipped with SAS disks only, and another V7000 with SAS disks and around 10% SSDs configured for automatic tiering.

We start by running IOmeter on a non-snapshotted VM to gather reference results, then we create a snapshot, wait a few minutes and test performance. Then we start the snapshot consolidation and measure performance again. At the end of the consolidation we gather performance results one last time.

Full SAS V7000

Phase                              IO/s (V7000 full SAS)
Initial situation                  2450
Performance with snapshot          300
Performance during consolidation   180
Performance after consolidation    2650

With an active snapshot, the disk performance inside the VM only reaches 12% of the initial performance. This number drops to 7% during the consolidation! Here is what it looks like from the vCenter point of view (VM disk performance):

[Figure: 01.V7000-SAS]

The snapshot is taken at 3:22 and the performance drops quickly to reach a plateau. At 3:35 we start to consolidate the snapshot. There seems to be an increase in throughput, but what we see is the disk activity related to the consolidation. Inside the virtual machine, the performance is reduced even further. As soon as the consolidation is finished, performance returns to normal (even a bit better, but that’s not related to our test).
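To relate the IO/s figures measured by IOmeter to a throughput graph, the numbers can be converted using the block size of the test pattern. The following is only a rough sketch: it assumes that "32k" means 32 KiB per I/O and that every I/O moves exactly one block, and the function name is ours, not part of any tool.

```python
# Rough conversion from IO/s to throughput for the full SAS results.
# Assumption: "32k" in the test pattern means 32 KiB per I/O.
BLOCK_SIZE_KIB = 32


def iops_to_mib_per_s(iops, block_size_kib=BLOCK_SIZE_KIB):
    """Return the throughput in MiB/s for a given IO/s rate."""
    return iops * block_size_kib / 1024


# IO/s values taken from the full SAS table above.
full_sas_iops = {
    "Initial situation": 2450,
    "Performance with snapshot": 300,
    "Performance during consolidation": 180,
    "Performance after consolidation": 2650,
}

for phase, iops in full_sas_iops.items():
    print(f"{phase}: {iops} IO/s = {iops_to_mib_per_s(iops):.1f} MiB/s")
```

Keep in mind that these values only describe the I/O issued inside the VM. During consolidation, the datastore also carries the consolidation traffic itself, which is why the vCenter graph can show more activity than the VM actually gets.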

SAS + SSD tiering V7000

Let’s run our test again on another V7000 which has SSD tiering in addition to the SAS drives. Hot data is moved to SSDs in a 24-hour cycle. For our test, we let IOmeter run long enough to get the 100 GB disk moved to the SSDs. But as soon as we create a snapshot, the blocks of the new snapshot disk, being new data, are written to the SAS drives. While the absolute performance should be better, the delta between regular performance and snapshotted performance should be even higher! And indeed:

Phase                              IO/s (V7000 SAS + SSD tiering)
Initial situation                  11500
Performance with snapshot          420
Performance during consolidation   300
Performance after consolidation    11500

In this situation, with an active snapshot we only reach 3.7% of the non-snapshotted performance, and this drops even further to 2.6% during the consolidation! The vCenter performance graph is quite clear too:

[Figure: 02.V7000-SAS-SSD]

It is easy to see when the snapshot is created (around 3:55), and we quickly reach the performance plateau. During the consolidation (starting at around 4:09) it may look as if the performance increases, but again this is just the disk activity of the snapshot consolidation. Actual performance within the VM is even lower. Around 4:13, the consolidation is finished and we immediately recover the original performance.
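The percentages quoted for both arrays can be recomputed from the two result tables. Here is a minimal sketch of that arithmetic; the dictionary layout and the function name are our own choices for illustration.

```python
# IO/s measured inside the VM, taken from the two result tables.
results = {
    "V7000 full SAS": {
        "initial": 2450, "snapshot": 300, "consolidation": 180,
    },
    "V7000 SAS + SSD tiering": {
        "initial": 11500, "snapshot": 420, "consolidation": 300,
    },
}


def relative_performance(measured, baseline):
    """Return the measured IO/s as a percentage of the baseline IO/s."""
    return 100 * measured / baseline


for array, iops in results.items():
    for phase in ("snapshot", "consolidation"):
        pct = relative_performance(iops[phase], iops["initial"])
        slowdown = iops["initial"] / iops[phase]
        print(f"{array}, {phase}: {pct:.1f}% of baseline, "
              f"slowdown factor {slowdown:.1f}")
```

The slowdown factor makes the comparison between the two arrays more obvious: the tiered array is faster in absolute terms, but it loses a much larger share of its baseline once a snapshot is active.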

Summary

We can learn several things from these tests.

  1. The performance impact of snapshots is largely underestimated. We have seen a specific case where we only got about 3% of the initial performance! Of course, our test is extreme and designed to demonstrate this phenomenon. Nevertheless, you can be sure that a disk-intensive virtual machine will be greatly impacted by an active snapshot. Therefore, snapshots should really be used with caution on these machines!
  2. Devices like the V7000, which move “hot” data to SSDs after an analysis cycle of 24 hours, do nothing to improve snapshot performance. Only full SSD storage devices, or at least devices which are able to use flash memory as a write cache, can mitigate the performance issue linked to active snapshots. In fact, on “slow-tiering” devices, the delta between native performance and snapshotted performance is much higher than on regular, full SAS devices!
  3. As snapshots are universally used for the backup of virtual machines, you can expect problems on disk-intensive virtual machines during backups: on machines which constantly and synchronously write huge amounts of data, you could reach a situation where the writes of the virtual machine cannot be committed while the snapshot disks are being consolidated. As a result, the virtual machine will be frozen and the consolidation could fail.

Of course, we don’t question the use of snapshots: it’s one of the virtualization benefits we now couldn’t live without! But we must definitely be aware of the performance limitations. And from what I’ve read, vSphere 6 could come with nice improvements in this area, which would be great as the logic behind snapshotting hasn’t changed much since the very early versions of ESX. Let’s hope for the best, and have a first good reason to migrate when vSphere 6 is released! 🙂