Snapshots? Don’t have a C-O-W about it!

Unfortunately, I am having a COW about it!

Snapshots are the inherent offspring of the copy-on-write technique used in shadow-paging filesystems. NetApp’s WAFL and Oracle Solaris ZFS are commercial implementations of shadow-paging filesystems and they are typically promoted as Copy-on-Write filesystems.

As we may already know, snapshots are point-in-time copies of the active file system in the storage world. They provide a quick backup of the active file system by copying the block addresses (pointers) of the filesystem; changes made after the snapshot is taken update the pointer maps to the inodes in the fsinfo root inode of the WAFL filesystem. The equivalent of fsinfo is the uberblock in the ZFS filesystem.
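To make that concrete, here is a minimal Python sketch (purely illustrative, not WAFL or ZFS code) of why such a snapshot is so quick: taking one amounts to preserving the current root pointer map (the fsinfo in WAFL, the uberblock in ZFS) rather than copying any data blocks.

```python
# Illustrative shadow-paging root: a snapshot preserves the top-level pointer
# map (fsinfo in WAFL, uberblock in ZFS), not the data blocks themselves.

active_block_map = {0: 101, 1: 102, 2: 103}   # logical block -> block address
snapshots = {}

def take_snapshot(name):
    # Point-in-time copy: remember the pointers exactly as they are right now.
    snapshots[name] = dict(active_block_map)

take_snapshot("daily.0")

# Later updates simply repoint entries in the active map to new block addresses;
# the snapshot's map keeps referencing the original, unchanged blocks.
active_block_map[1] = 205
```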

However, contrary to popular belief, the snapshots from WAFL and ZFS are not copy-on-write implementations even though the shadow paging filesystem tree employs the copy-on-write technique.

Consider this for a while when a snapshot is being taken … Copy … On … Write. If the definition is (1) copy, then (2) write, it means that there are several steps to perform a copy-on-write snapshot update. The filesystem has to make a copy of the original data block (1 x Read I/O), then write the original data block to a new location (1 x Write I/O), and then write the new data block to the location of the original data block (1 x Write I/O).

This is a 3-step process that can be summarized as follows (a short code sketch follows the list):

  1. Read the original data block (1 x Read I/O)
  2. Copy this data block to a new, unused location (1 x Write I/O)
  3. Write the new, modified data block to the location of the original data block (1 x Write I/O)
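As an illustration only (a toy model, not EMC's actual implementation), the Python sketch below shows where the three I/Os come from: the original block is read, preserved into a separate snapshot area (the savvol), and then overwritten in place.

```python
# Toy model of Copy-on-(First-)Write: the original block is preserved in a
# separate snapshot area (EMC's "savvol") before being overwritten in place.

class CofwVolume:
    def __init__(self, nblocks):
        self.blocks = [None] * nblocks   # active filesystem, updated in place
        self.savvol = {}                 # snapshot area: block number -> old data

    def write_with_snapshot(self, blkno, new_data):
        old = self.blocks[blkno]         # 1 x Read I/O: read the original block
        if blkno not in self.savvol:
            self.savvol[blkno] = old     # 1 x Write I/O: copy it to the savvol
        self.blocks[blkno] = new_data    # 1 x Write I/O: overwrite in place

    def read_active(self, blkno):
        return self.blocks[blkno]        # active data stays where it was laid out

    def read_snapshot(self, blkno):
        # Snapshot view: preserved copy if the block has changed, else the live block.
        return self.savvol.get(blkno, self.blocks[blkno])

vol = CofwVolume(nblocks=4)
vol.blocks[0] = "original"
vol.write_with_snapshot(0, "modified")
assert vol.read_active(0) == "modified" and vol.read_snapshot(0) == "original"
```

Because the active data is always rewritten in place, its on-disk layout stays as contiguous as it was originally laid out, which is the read-performance advantage discussed further below.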

This implementation IS the copy-on-write technique for snapshots, but NetApp and possibly the Oracle folks have been saying for years that their snapshots are based on copy-on-write. This is a misnomer that needs to be corrected. EMC, in its SnapSure and SnapView implementations, calls this technique Copy-on-First-Write (COFW), probably to avoid the confusion. The data blocks are copied to a savvol, a separate location that stores the snapshot changes and defaults to 10% of the total capacity of their storage solutions.

As you have seen, this method is a 3 x I/O operation and it is an expensive solution. Therefore, when we compare the speed of NetApp/ZFS snapshots to EMC’s snapshots, the EMC COFW snapshot technique will be a tad slower.

However, this method has one superior advantage over the NetApp/ZFS snapshot technique. The data blocks in the active filesystem are almost always laid out in a more contiguous fashion, resulting in a more consistent read performance throughout the life of the active file system.

[Diagram: how copy-on-write snapshots are implemented]

What is NetApp/ZFS’s snapshot method then?

It is known as Redirect-on-Write. Using the same breakdown … Redirect … On … Write. When a data block is about to be modified, the original data block is read (1 x Read I/O) and the modified data block is written to a new location (1 x Write I/O). The active file system then updates the filesystem tree and its inode addresses to reflect the location of the new data block. The original data block remains unchanged.

In summary (a short code sketch follows the list),

  1. Read the original data block (1 x Read I/O)
  2. Write the modified data block to a new location (1 x Write I/O)

The Redirect-on-Write method results in one less Write I/O, making snapshot updates faster. This is the NetApp/ZFS method, and it is superior when compared to the Copy-on-Write snapshot technique discussed earlier.

However, as the life of the filesystem progresses, fragmentation and holes will cause the performance of the active filesystem to degrade. The reason is that most related data blocks are no longer contiguous, and the active file system ends up busy seeking the scattered data blocks across the volume. A fragmented filesystem would have to be “cleaned and reorganized” to regain its performance lustre.

Another unwanted problem with the Redirect-on-Write snapshot technique is that the snapshots reside within the same boundary as the active filesystem. Over time, the capacity consumed by the snapshots can overwhelm the active filesystem if their recycling schedule goes unchecked.

I guess this is a case of “SUFFER NOW/ENJOY LATER” or “ENJOY NOW/SUFFER LATER”. We have to make a conscious effort to understand what snapshots are all about.

Can snapshots replace traditional backups?

Backup is a necessary evil. In IT, every operator, administrator, engineer, manager, and C-level executive knows that you have got to have backup. When it comes to the protection of data and information in a business, backup is the only way.

Backup has also become the bane of IT operations. Every product out there in the market is trying to cram as much production data into backup as possible just to fit into the backup window. We only have 24 hours in a day, so there is no way the backup window can be increased unless

  • You reduce the size of the primary data to be backed up – think compression, deduplication, archiving
  • You replicate the primary data to a secondary device and back up the secondary device – which is ironic, because when you replicate you are creating a copy of the primary data, which technically is a backup. So you are technically backing up a backup
  • You speed up the transfer of primary data to the backup device

Either way, IT operations are trying to overcome the challenges of the backup window. And the whole purpose of backup is to be cock-sure that data can be restored when it comes to recovery. It’s like insurance: you pay the premium so that you are able to use the insurance facility to recover in times of need. We have heard that analogy many times before.

On the flip side of the coin, a snapshot is also a backup. Snapshots are point-in-time copies of the primary data, and many a time snapshots are taken and then used as the source of a “true” backup to a secondary device, be it disk-based or tape-based. However, snapshots suffered the perception of being a pseudo-backup until the last couple of years.

Here is some food for thought …

WHAT IF we eliminate backing up data to a secondary device?

WHAT IF IT operations are ready to embrace snapshots as the true backup?

WHAT IF we rely on snapshots for backup and replicated snapshots for disaster recovery?

First of all, it will solve the perennial issues of backing up to a “secondary device”. The operative term here is “secondary device”, because that secondary device is usually external to the primary storage.

Tape subsystems and tape are constantly being ridiculed as the culprits of missed backup windows. Duplication after duplication of the same set of files in every backup set triggered the adoption of deduplication solutions from Data Domain, Avamar, PureDisk, ExaGrid, Quantum and so on. Networks are also blamed because network backup runs through the LAN. LANless backup uses another conduit, usually Fibre Channel, to transport data to the secondary device.

If we eliminate the “secondary device” and perform backup in the primary storage itself, then networks are no longer part of the backup. There is no need for deduplication because the data could already have been deduplicated and compressed in the primary storage.

Note that what I have suggested is to back up, compress and dedupe, AND also restore, from the primary storage. There is no secondary storage device for backup, compression, deduplication or restore.

Wouldn’t that paint a better way of doing backup?

Snapshots will be the only backup mechanism. Snapshots are quick, usually taking minutes and sometimes only seconds. Most snapshot implementations today are space efficient, consuming storage only for the delta changes. The primary device will compress and dedupe, depending on the data’s characteristics.

For DR, snapshots are shipped to a remote storage device of equal prowess at the DR site, where the snapshots can be rebuilt and kept ready to become the primary data when required. NetApp SnapVault is one example; ZFS snapshot replication is another.
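If the DR shipping were done with ZFS, the flow would look roughly like the sketch below, driven from Python for consistency with the earlier examples. The pool/dataset name, snapshot name and DR hostname are made-up placeholders; the zfs snapshot, zfs send and zfs receive commands themselves are the standard ZFS tools for this.

```python
# A minimal sketch (assumed dataset and host names) of replicating a ZFS
# snapshot to a DR site using the standard zfs send/receive commands.
import subprocess

def replicate_snapshot(dataset="tank/data", snap="hourly.1", dr_host="dr-site"):
    # 1. Take a point-in-time snapshot on the primary storage.
    subprocess.run(["zfs", "snapshot", f"{dataset}@{snap}"], check=True)

    # 2. Stream the snapshot to the DR system, where it is rebuilt and kept
    #    ready to be promoted to primary data if the primary goes down.
    send = subprocess.Popen(["zfs", "send", f"{dataset}@{snap}"],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", dr_host, "zfs", "receive", "-F", dataset],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()
```

Incremental sends (zfs send -i) would keep subsequent replication runs to just the delta changes, in line with the space-efficient snapshots described above.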

And when it comes to recovery, quick restores of primary data will be from snapshots. If the primary storage goes down, clients and host initiators can be rerouted quickly to the DR device for services to resume.

I believe that with the convergence of multi-core processing power, 10GbE networks, SSDs and very large capacity drives, we could be seeing a shift in the backup design model and possibly in the entire IT landscape. Snapshots could very likely replace traditional backups in the near future, and the secondary device may become a thing of the past.