Those of us who use VCB (VMware Consolidated Backup) to back up our SAN-based virtual machines may know how time-consuming and frustrating it can be to find and clean up stale snapshots left behind on virtual machines by failed VCB backups. This is even more time-consuming if you have a large-scale virtual environment with hundreds or even thousands of virtual machines that need to be backed up on a daily basis.
Let me first give a brief explanation of how VCB goes about backing up virtual machines, and why having stale snapshots on virtual machines prior to a VCB backup job spells trouble.
Every time a VCB backup job kicks off, a snapshot is created on the VM that is going to be backed up. Whilst this snapshot is in place, all changes that take place in the VM’s guest OS are written to delta VMDK files, one delta file for every virtual disk on the VM. These files grow in 16MB chunks, and on a busy VM, say a VM that hosts a large database, these 16MB increments may add up to several gigabytes per delta file. Whilst changes are being written to these delta files, VCB can go ahead and mount the main VMDK files on the VCB proxy server in order to make the VMDK files or their contents available to your backup software, e.g. NetBackup. When the backup job completes, VCB removes the snapshot by merging the changes recorded in the delta files into the main VMDK files and deleting the delta files from the SAN.
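To put rough numbers on that growth, here is a minimal sketch of the arithmetic. It assumes a delta file is always a whole number of 16MB chunks and ignores any metadata overhead; the function name and the example figures are made up for illustration and are not taken from VMware documentation.

```python
CHUNK_BYTES = 16 * 1024 * 1024  # the 16MB growth increment described above


def delta_file_size(changed_bytes):
    """Estimate the size of one delta file, rounded up to whole 16MB chunks.

    Assumption: the file only ever grows in full chunks and an untouched
    disk needs no chunk at all. Real delta files also carry some metadata.
    """
    if changed_bytes <= 0:
        return 0
    chunks = -(-changed_bytes // CHUNK_BYTES)  # ceiling division
    return chunks * CHUNK_BYTES


# Hypothetical example: a VM with three virtual disks, each receiving
# 5 GB of changes while the snapshot is in place.
total = sum(delta_file_size(5 * 1024**3) for _ in range(3))
print(total / 1024**3, "GB of delta files on the datastore")
```

The point is simply that the space is consumed per virtual disk, so a multi-disk VM with a forgotten snapshot eats into the datastore several times faster than you might expect.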
Now, in theory this sounds very neat, and in reality it is. That is, until it goes wrong. Sometimes when a VCB backup job fails (and they do fail from time to time), the snapshot on the VM doesn’t get removed. In this case, all changes in the guest OS continue to be written to the delta files. To make things even worse, I’ve seen cases where the snapshot failed to be removed even though the VCB backup job completed successfully. In this case, NetBackup will show a successful backup, yet the snapshot still exists on the VM. You simply can’t assume that all virtual machine snapshots have been cleared just because NetBackup, or whatever you use as your backup application, reports successful backups.
So why are stale snapshots a problem, you might ask? Well, not only can they grow to huge sizes, which may cause the datastore to fill up and crash all the other VMs on that datastore, but VCB will probably not be able to perform backup operations on a VM that already has snapshots. So yes, a stale snapshot on a VM will cause your next VCB job to fail. You also run the risk of your snapshot delta files going out of sync with each other, which in the worst case could cause a loss of data. I have first-hand experience of all of these.
My advice is simple. Make sure that none of the virtual machines in scope for backup have any snapshots before the backup window opens. This sounds simple, but if you have hundreds of virtual machines, going through each VM to check for snapshots is insane! So, a colleague and I came up with a Perl script that checks for delta files in all datastores seen by the ESX host and returns a list of those delta files via email.
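To illustrate the idea, here is a minimal sketch in Python. It is not the Perl script itself. It assumes the datastores are visible under /vmfs/volumes, that snapshot delta files can be recognised by names ending in -delta.vmdk, and that a Python 3 interpreter can see those mounts. The function names and the email addresses in the usage comments are hypothetical.

```python
import os
from email.message import EmailMessage


def find_delta_files(root="/vmfs/volumes"):
    """Walk every datastore under root and return (path, size_bytes) pairs
    for each file whose name ends in -delta.vmdk."""
    found = []
    # os.walk does not follow symlinks by default, so datastores that are
    # reachable through a symlinked name are not listed twice.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith("-delta.vmdk"):
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                found.append((path, size))
    return sorted(found)


def build_report(found, sender, recipient):
    """Build (but do not send) an email listing the delta files found."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(found)} snapshot delta file(s) found"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{size / 2**20:10.1f} MB  {path}" for path, size in found]
    msg.set_content("\n".join(lines) or "No delta files found.")
    return msg


# Hypothetical usage, to be run before the backup window opens:
#   report = build_report(find_delta_files(), "esx@example.com", "admin@example.com")
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
#       smtp.send_message(report)
```

Scheduling something like this shortly before the backup window gives you time to commit or remove the offending snapshots by hand before VCB trips over them.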