btrfs – a better butter from a better COW?

There’s better a lot of chatter about the default file system in the recently released Fedora 16. Fedora 16 was released on November 8, 2011. For months, there were rumours that the default file system for Fedora 16 will be btrfs (btree file system, or better known as butter-FS). But after the release, the default file system of Fedora 16 is still ext4, much to the dismay of many. btrfs holds a lot of potential because in the space of less than 4 years, it has moved up the hierarchy to be one of the top file systems in the Linux ecosystem.

File systems in Linux has not seen had a knight in shining armour for the long time. The file systems kings of the Linux world were ext2/3 for the RedHat and Debian flavoured distros and reiserfs for the SuSE  flavoured distros. ext4 is now default in most Linux distros but the concept of  ext2/3/4 has not changed much since its inception decades ago.

At the same time, reiserfs  had a lot of promise as well, but its development and progress have lost it lustre after its principal developer, Han Reiser was convicted of the murder of his wife a few years ago. If you are KPC (Malaysian Chinese colloquialism, meaning busybody), you can read the news here.

btrfs is going to be the new generation of file systems for Linux and even Ted T’so, the CTO of Linux Foundation and principal developer admitted that he believed btrfs is the better direction because “it offers improvements in scalability, reliability, and ease of management”.

For those who has studied computer science, B-Tree is a data structure that is used in databases and file systems. B-Tree is an excellent data structure to store billions and billions of objects/data and is able to provide fast data retrieval in logarithmic time. And the B-Tree implementation is already present in some of the file systems such as JFS, XFS and ReiserFS. However, these file systems are not shadow-paging filesystems (popularly known as copy-on-write file systems).

You see, B-Tree, in its native form of implementation, is very incompatible with COW file systems. In fact, the implementation was thought of impossible, until someone by the name of Ohad Rodeh came along. He presented a paper in Usenix FAST ’07 which described the challenges of combining the B-Tree concept with shadow paging file systems. The solution, as he described it, was to introduce insert/remove key into the tree structure, and removing the dependency of intra-leaf linking.

Chris Mason, one of the developers of reiserfs, took Ohad’s idea and created a shadow-paging file system based on the B-Tree idea.

Traditional file systems tend to follow the idea of the Berkeley Fast File Systems. Different cylinder groups have its own inode, bitmap and disk blocks. The used space in one cylinder group cannot be shared to another cylinder group, resulting in wastage. At the same time, performance can be an issue as the disk read/head frequently have to move to the inodes to find out where the next used or free blocks will be. In a way, it looks like the diagram below.

Chris Mason’s idea of btrfs made the file system looked like this:

Today, a quick check in btrfs wiki page, shows that the main Btrfs features available at the moment include:

  • Extent based file storage
  • 2^64 byte == 16 EiB maximum file size
  • Space-efficient packing of small files
  • Space-efficient indexed directories
  • Dynamic inode allocation
  • Writable snapshots, read-only snapshots
  • Subvolumes (separate internal filesystem roots)
  • Checksums on data and metadata
  • Compression (gzip and LZO)
  • Integrated multiple device support
  • RAID-0, RAID-1 and RAID-10 implementations
  • Efficient incremental backup
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Online filesystem defragmentation

And Chris Mason and his team have still plenty more to offer for btrfs. It will likely be the default file system in Fedora 17 and at the rate it is going, could be the file system of choice for RedHat, Debian and SuSE distros very soon.

There’s been a lot of comparisons between btrfs and ZFS, since both are part of Oracle now. ZFS is obviously a much more mature file system, with more enterprise features and more robust, (Incidentally, ZFS just celebrated its 10th birthday on Halloween 2011 – see Matt Ahren’s blog) and the btrfs is the rising star in the Linux world. But at this moment, the 2 file systems are set apart in their market positioning and deployment.

ZFS is based on the CDDL (Common Development and Distribution License) while btrfs is based on GNU GPL, open source licensing. There are controversies surrounding CDDL licensing scheme, while is incompatible with GNU GPL scheme.

Oracle can count itself very lucky to have 2 of the most promising and prominent COW file systems around. It will be interesting to see what Oracle will do next. As a proponent of innovation, community and sharing, I sincerely hope that both file systems will continue to thrive in Oracle’s brutal, sales-driven organization. We certainly don’t want to see controversies about dual ownership BS of Oracle and mySQL that could end in 2015.

ONTAP vs ZFS

I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!

I have been studying both similar Copy-on-Write (COW) file systems at the data structure level for a while now and I strongly believe ZFS is a better implementation of the COW file systems (also known as “shadow-paging” file system) than WAFL. How are both similar and how are both different? The angle we are looking at is not performance but about resiliency and reliability.

(Note: btrfs or “Butter File System” is another up-and-coming COW file system under GPL license and is likely to be the default file system for the coming Fedora 16)

In Computer Science, COW file system are tree-like data structures as shown below. They are different than the traditional Berkeley Fast File System data structure as shown below:

As some of you may know, Berkeley Fast File System is the foundation of some modern day file systems such as Windows NTFS, Linux ext2/3/4, and Veritas VxFS.

COW file system is another school of thought and this type of file system is designed in a tree-like data structure.

In a COW file system or more rightly named shadow-paging file system, the original node of the data block is never modified. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node and that parent node is linked to a higher parent node and so on all the way to the top-most root node, each parent and higher-parent nodes are modified as it traverses through the tree ending at the root node.

The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.

 

As each data block of either the leaf node (the last node in the tree) or the parent nodes are being modified, pointers to either the original data blocks or the copied data blocks are modified accordingly relative to the original tree structure, until the last root node at the top of the shadow tree is modified. Then, the COW file system commit is considered complete. Take note that the entire process of changing pointers and modifying copies of the nodes of the data blocks is done is a single I/O.

The root at the top for ZFS is called uberblock and called fsinfo in WAFL. Because an exact shadow of the tree-like file system is created when the data blocks are modified, this also gives birth to how snapshots are created in a COW file system. It’s all about pointers, baby!

Here’s how it looks like with the original data tree and the snapshot data tree once the shadow paging modifications are complete.

 

However, there are a few key features from the data integrity and reliability point of view where ZFS is better than WAFL. Let me share that with you.

In a nutshell, ZFS is a layered architecture that looks like this

The Data Management Unit (DMU) layer is one implementation that ensures stronger data integrity. The DMU maintains a checksum on the data in each data block by storing the checksum in the parent’s blocks. Thus if something is messed up in the data block (possibly by Silent Data Corruption), the checksum in the parent’s block will be able to detect it and also repair the data corruption if there is sufficient data redundancy information in the data tree.

WAFL will not be able to detect such data corruptions because the checksum is applied at the disk block level and the parity derived during the RAID-DP write does not flag this such discrepancy. An old set of slides I found portrayed this comparison as shown below.

 

Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks stores 3 copies of the metadata and this allows the recovery of lost metadata even if 2 copies of the metadata are deleted.

Therefore, the ability of ZFS to survive data corruption, metadata deletion is stronger when compared to WAFL .This is not discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern day file systems.

There are many other features within ZFS that have improved upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is a superset implementation of the traditional RAID-5 but with a different twist. Instead of using fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2 uses a dynamic variable stripe width. This addressed the parity RAID-4/5 “write hole” flaw, where incomplete or partial stripes will result in a “hole” that leads to file system fragmentation. RAID-Z/Z2 address this by filling up all blocks with variable stripe width. A parity can be calculated and assigned with any striped width, as shown below.

 

Other really cool stuff are Hybrid Storage Pool and the ability to create software-based caching using fast disk drives such as SSDs. This approach of creating ReadZilla (read caching) and LogZilla (write caching) eliminates the need for proprietary NVRAM as implemented in NetApp’s WAFL.

The only problem is, despite the super cool features of ZFS, most Oracle (not Sun) sales does not have much clue how to sell ZFS storage. NetApp, with its well trained and tuned, sales force is beating Oracle to pulp.

Copy-on-Write and SSDs – A better match than other file systems?

We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems is to ensure that there are few key features

  • Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
  • Simple naming convention for files
  • Performance in storing and retrieving files – hence our write and read I/Os
  • Resilience in restoring full or part of a file when there are discrepancies

In file systems performance design, one of the most important factors is locality. By locality, I mean that data blocks of a particular file should be as nearby as possible. Hence, in most file systems designs originated from the Berkeley Fast File System (BFFS), requires the file system to seek the data block to be modified to ensure locality, i.e. you try not to split up the contiguity of the data blocks. The seek time to find the require data block takes time, but you are compensate with faster reads because the read-ahead feature allows you to read extra blocks ahead in anticipation that the data blocks are related.

In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present because the new modified block is written somewhere else, not the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminated the seek time and it also skipped the read-modify-write action to the original location of the data block. Therefore, write is likely to be faster.

However, the read portion will be slower because if you want to read a file, the file system has to go around looking for the data blocks because it lacks the locality. Therefore, as the COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make the file systems one of the better storage file systems in the market.

So, that’s Copy-on-Write file systems. But what about SSDs?

Solid State Drives (SSDs) will make enemies with file systems that tend prefer locality. Remember that some file systems prefer its data blocks to be contiguous? Well, SSDs employ “wear-leveling” and required writes to be spread out as much as possible across the SSDs device to prolong the life of the SSD device to reduce “wear-and-tear”. That’s not good news because SSDs just told the file systems, “I don’t like locality and I will spread out the data blocks“.

NAND Flash SSDs (the common ones we find in the market and not DRAM-based SSDs) are funny creatures. When you write to SSDs, you must ERASE first, WRITE AGAIN to the SSDs. This is the part that is creating the wear-and tear of the device. When I mean ERASE first, WRITE AGAIN, I describe it below

  • Writing 1 –> 0 (OK, no problem)
  • Writing 0 –> 1 (not OK, because NAND Flash can’t do that)

So, what does the SSD do? It ERASES everything, writing the entire data blocks on the device to 1s, and then converting some of them to 0s. Crazy, isn’t it? The firmware in the SSDs controller will also spread out the erase-and-then write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.

Since SSDs shun locality and avoid the data blocks to be nearby, and Copy-on-Write file systems are already doing this because its nature to write new data blocks somewhere else, the combination of both COW file system and SSDs seems like a very good fit. It even looks symbiotic because it is a case of “I help you; and you help me“.

From this perspective, the benefits of COW file systems and SSDs extends beyond resiliency of the SSD device but also in performance. Since the data blocks are spread out at different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Make sense, doesn’t it?

I have not learned about other file systems and how they behave with SSDs, but it is pretty clear that Copy-on-Write file systems works well with Solid State Devices. Have a good week ahead :-)!

Snapshots? Don’t have a C-O-W about it!

Unfortunately, I am having a COW about it!

Snapshots are the inherent offspring of the copy-on-write technique used in shadow-paging filesystems. NetApp’s WAFL and Oracle Solaris ZFS are commercial implementations of shadow-paging filesystems and they are typically promoted as Copy-on-Write filesystems.

As we may already know, snapshots are point-in-time copy of the active file system in the storage world. They perform quick backup of the active file system by making a copy of the block addresses (pointers) of the filesystem and then updating the pointer maps to the inodes in the fsinfo root inode of the WAFL filesystem for new changes after the snapshot has been taken. The equivalent of fsinfo is the uberblock in the ZFS filesystem.

However, contrary to popular belief, the snapshots from WAFL and ZFS are not copy-on-write implementations even though the shadow paging filesystem tree employs the copy-on-write technique.

Consider this for a while when a snapshot is being taken … Copy —- On —- Write. If the definition is (1) Copy then (2) Write, this means that there are several several steps to perform a copy-on-write snapshot. The filesystem has to to make a copy of the original data block (1 x Read I/O), then write the original data block to a new location (1 x Write I/O) and then write the new data block to the location of the original data block (1 x Write I/O).

This is a 3-step process that can be summarized as

  1. Read location of original data block (1 x Read I/O)
  2. Copy this data block to new unused location (1 x Write I/O)
  3. Write the new and modified data block to the location of original data block (1 x Write I/O)

This implementation, IS THE copy-on-write technique for snapshot but NetApp and possibly Oracle guys have been saying for years that their snapshots are based on copy-on-write. This is pretty much a misnomer that needs to be corrected. EMC, in its SnapSure and SnapView implementation, called this technique Copy-on-First-Write (COFW), probably to avoid the confusion. The data blocks are copied to a savvol, a separate location to store the changes of snapshots and defaults to 10% of the total capacity of their storage solutions.

As you have seen, this method is a 3 x I/O operation and it is an expensive solution. Therefore, when we compare the speed of NetApp/ZFS snapshots to EMC’s snapshots, the EMC COFW snapshot technique will be a tad slower.

However, this method has one superior advantage over the NetApp/ZFS snapshot technique. The data blocks in the active filesystem are almost always laid out in a more contiguous fashion, resulting in a more consistent read performance throughout the life of the active file system.

Below is a diagram of how copy-on-write snapshots are implemented:

 

What is NetApp/ZFS’s snapshot method then?

It is is known as Redirect-on-Write. Using the same step … REDIRECT —- ON —– WRITE. When a data block is about to be modified, the original data block is read (1 x Read I/O) and then the data block is written to a new location (1 x Write I/O). The active file system then updates the filesystem tree and its inode address to reflect the location of the new data block. The original data block remained unchanged.

In summary,

  1. Read location of original data block (1 x Read I/O)
  2. Write modified data block to new location (1 x Write I/O)

The Redirect-on-Write method resulted in 1 Write I/O less, making snapshot creation faster. This is the NetApp/ZFS method and it is superior when compared to the Copy-on-Write snapshot technique discussed earlier.

However, as the life of the filesystem progresses, fragmentation and holes will cause the performance of the active filesystem to degrade. The reason is most related data blocks are no longer contiguous and the active file system will be busy seeking the scattered data blocks across the volume. Fragmented filesystem would have to be “cleaned and reorganized” to regain its performance lustre.

Another unwanted problem using the Redirect-on-Write snapshot technique is the snapshot resides in the same boundary as the active filesystem. Over time, if the capacity consumed by the snapshots could overwhelm the active filesystem, if their recycle schedule is unchecked.

I guess this is a case of “SUFFER NOW/ENJOY LATER” or “ENJOY NOW/SUFFER LATER”. We have to make a conscious effort to understand what snapshots are all about.