I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!
I have been studying these two similar Copy-on-Write (COW) file systems at the data structure level for a while now, and I strongly believe ZFS is a better implementation of a COW file system (also known as a “shadow-paging” file system) than WAFL. How are the two similar, and how are they different? The angle we are looking at is not performance but resiliency and reliability.
(Note: btrfs, the “B-tree file system” often pronounced “Butter FS”, is another up-and-coming COW file system under the GPL license and is likely to be the default file system for the coming Fedora 16.)
In Computer Science, COW file systems are tree-like data structures, as shown below. They differ from the traditional Berkeley Fast File System data structure, also shown below:
As some of you may know, the Berkeley Fast File System influenced, to varying degrees, modern-day file systems such as Windows NTFS, Linux ext2/3/4, and Veritas VxFS.
The COW file system is another school of thought; this type of file system is designed as a tree-like data structure.
In a COW file system, or more rightly named shadow-paging file system, the original node of the data block is never modified. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node, and that parent node is linked to a higher parent node and so on all the way to the top-most root node, the parent and every higher parent must also be copied and updated to point at the new child, all the way up the tree to the root node.
The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.
As the data blocks of the leaf node (the last node in the tree) and of its parent nodes are modified, the pointers in each copied parent are updated to reference either the original data blocks or the newly copied ones, relative to the original tree structure, until the root node at the top of the shadow tree is modified. Only then is the COW file system commit considered complete. Take note that the entire set of pointer changes and node copies is committed as a single atomic operation: until the new root is written, the original tree is still intact and still the valid one.
The root at the top is called the uberblock in ZFS and fsinfo in WAFL. Because the modified tree is a shadow that leaves the original tree intact, this also gives birth to how snapshots are created in a COW file system. It’s all about pointers, baby!
Here’s what it looks like with the original data tree and the snapshot data tree once the shadow-paging modifications are complete.
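To make the shadow-paging idea concrete, here is a minimal Python sketch of path copying in an immutable tree. It is an illustration only and not how ZFS or WAFL lay out blocks on disk; the names `Node`, `cow_update`, `old_root`, `new_root` and `snapshot` are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """An immutable tree block: a leaf holding data, or an interior node."""
    data: Optional[bytes] = None
    children: Tuple["Node", ...] = ()


def cow_update(node: Node, path: Tuple[int, ...], new_data: bytes) -> Node:
    """Return a new root in which the leaf at `path` holds `new_data`.

    Only the nodes along `path` are copied. Every other subtree is shared
    with the original tree, which is never modified.
    """
    if not path:
        return Node(data=new_data)          # shadow copy of the leaf
    index, rest = path[0], path[1:]
    children = list(node.children)
    children[index] = cow_update(children[index], rest, new_data)
    return Node(children=tuple(children))   # shadow copy of the parent


# A small tree: root -> two interior nodes -> two leaves each.
old_root = Node(children=(
    Node(children=(Node(b"A"), Node(b"B"))),
    Node(children=(Node(b"C"), Node(b"D"))),
))

# Modify leaf "C". In a real file system the last step would be writing the new root.
new_root = cow_update(old_root, (1, 0), b"C2")

# Holding on to the old root is, in effect, a snapshot: it still sees "C".
snapshot = old_root
```

The untouched left half of the tree is shared between `snapshot` and `new_root`, which is why a snapshot in this model costs little more than keeping one extra pointer.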
However, there are a few key features from the data integrity and reliability point of view where ZFS is better than WAFL. Let me share that with you.
In a nutshell, ZFS is a layered architecture that looks like this:
The Data Management Unit (DMU) layer is one implementation that ensures stronger data integrity. The DMU maintains a checksum of the data in each data block and stores that checksum in the parent block rather than alongside the data. Thus, if something is messed up in the data block (possibly by Silent Data Corruption), the checksum in the parent block will detect it, and ZFS can also repair the corruption if there is sufficient redundancy in the data tree.
WAFL will not be able to detect such data corruption because its checksum is applied at the disk block level, and the parity derived during the RAID-DP write does not flag this kind of discrepancy. An old set of slides I found portrayed this comparison as shown below.
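The parent-held checksum described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the "disk" is a plain dictionary, SHA-256 stands in for whichever checksum the pool uses, and `BlockPointer`, `write_block` and `read_block` are hypothetical names rather than ZFS functions.

```python
import hashlib
from typing import Dict, List


def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


class BlockPointer:
    """Parent-side record: where the child's copies live and what they must hash to."""

    def __init__(self, addresses: List[int], expected: str):
        self.addresses = addresses
        self.expected = expected


def write_block(disk: Dict[int, bytes], addresses: List[int], payload: bytes) -> BlockPointer:
    """Write the payload to every address and return the pointer the parent would store."""
    for addr in addresses:
        disk[addr] = payload
    return BlockPointer(addresses, checksum(payload))


def read_block(disk: Dict[int, bytes], ptr: BlockPointer) -> bytes:
    """Return the first copy matching the parent's checksum and repair any bad copies."""
    good = None
    for addr in ptr.addresses:
        if checksum(disk[addr]) == ptr.expected:
            good = disk[addr]
            break
    if good is None:
        raise IOError("all copies failed checksum verification")
    for addr in ptr.addresses:
        if checksum(disk[addr]) != ptr.expected:
            disk[addr] = good               # self-heal from the good copy
    return good


disk: Dict[int, bytes] = {}
ptr = write_block(disk, [10, 20], b"payload")
disk[10] = b"pay1oad"                       # simulate silent corruption of one copy
recovered = read_block(disk, ptr)
```

Because the expected checksum lives in the parent and not next to the data, a block that was silently corrupted cannot vouch for itself; the read path notices the mismatch and falls back to the redundant copy.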
Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks store up to 3 copies of the metadata, which allows the metadata to be recovered even if 2 of the copies are lost or corrupted.
Therefore, the ability of ZFS to survive data corruption and metadata loss is stronger when compared to WAFL. This is not to discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern-day file systems.
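Here is a rough Python sketch of the ditto block idea: write several replicas of a metadata block, preferably on different devices, and read back whichever one survives. The device model (a list of dictionaries), the `write_ditto` and `read_ditto` helpers, and the fixed offset used when devices run out are all assumptions for illustration, not the real ZFS allocator.

```python
from typing import Dict, List, Optional, Tuple

Location = Tuple[int, int]  # (device index, block number), hypothetical addressing


def write_ditto(devices: List[Dict[int, bytes]], block_no: int,
                payload: bytes, copies: int = 3) -> List[Location]:
    """Write `copies` replicas, each on a different device where possible."""
    locations: List[Location] = []
    for i in range(copies):
        dev = i % len(devices)
        # With fewer devices than copies, reuse a device at a distant block number.
        blk = block_no + (i // len(devices)) * 1_000_000
        devices[dev][blk] = payload
        locations.append((dev, blk))
    return locations


def read_ditto(devices: List[Dict[int, bytes]],
               locations: List[Location]) -> Optional[bytes]:
    """Return the first replica that is still present, or None if all are gone."""
    for dev, blk in locations:
        payload = devices[dev].get(blk)
        if payload is not None:
            return payload
    return None


devices: List[Dict[int, bytes]] = [{}, {}, {}]
locations = write_ditto(devices, 7, b"metadata")

# Lose two of the three replicas; the third still serves the read.
del devices[0][7]
del devices[1][7]
survivor = read_ditto(devices, locations)
```

A fuller model would also verify each replica against its checksum before trusting it; that part is left out here to keep the placement idea visible.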
There are many other features within ZFS that improve upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is similar to the traditional RAID-5 but with a different twist. Instead of using a fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2/Z3 uses a dynamic, variable stripe width. This addresses the parity RAID-4/5 “write hole” flaw, where a crash or power loss partway through a partial-stripe update can leave the data and parity inconsistent with each other. RAID-Z avoids this by making every write a full stripe of whatever width the data needs, so no stripe is ever partially updated. Parity can be calculated and assigned for any stripe width, as shown below.
Other really cool stuff includes the Hybrid Storage Pool and the ability to create software-based caching using fast drives such as SSDs. This approach of creating ReadZilla (read caching) and LogZilla (write caching) eliminates the need for proprietary NVRAM as implemented in NetApp’s WAFL.
The only problem is, despite the super cool features of ZFS, most Oracle (not Sun) sales people do not have much of a clue how to sell ZFS storage. NetApp, with its well-trained and well-tuned sales force, is beating Oracle to a pulp.
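The point that parity works for any stripe width can be shown with a small Python sketch using plain XOR single parity. It is a simplified model: real RAID-Z also deals with sector sizes, padding and multiple parity levels, and the helper names `xor_blocks`, `make_stripe` and `rebuild` are invented for this example.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: List[bytes]) -> List[bytes]:
    """Build a full stripe of any width: the data blocks plus one parity block."""
    return data_blocks + [xor_blocks(data_blocks)]


def rebuild(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Recover a single missing block (marked None) from the surviving blocks."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# A wide write and a narrow write each get their own complete stripe.
wide = make_stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
narrow = make_stripe([b"\xff\x00"])

# Lose one block of the wide stripe and rebuild it from the rest.
damaged: List[Optional[bytes]] = list(wide)
damaged[1] = None
restored = rebuild(damaged)
```

Since each write carries its own parity over exactly the blocks it wrote, there is no existing stripe to read, modify and rewrite, which is where the write hole comes from in fixed-width parity RAID.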
Does ZFS work perfectly with SSDs?
SSD is perfect for ZFS because with HSP (Hybrid Storage Pools), you can create both Readzilla with Read Bias SSD (NAND Flash) and LogZilla with Write Bias SSD (DRAM-based or NAND Flash with Super Capacitors), leaving SATA for the data volumes.
This means that you can easily build an x86-based server into a ZFS storage appliance without proprietary hardware such as NVRAM for NetApp and LCC for EMC VNX. This drives the cost down and makes it significantly cheaper to integrate.
Hope this helps.
Can a zpool be extended on the fly?
# zpool add <pool> <raidz|raidz2|mirror> cXtXdX cXtXdX
and the zpool capacity grows on the fly. You can also add different RAID levels to the zpool but some best practice considerations have to be in place.
And the coolest feature of zfs (which is not mentioned) – it does software dynamic provisioning too! Not to mention it’s 128 bit.
You can create multiple file systems with their own mount paths in one zpool, and add disks on the fly when they’re running out of space. No more orphaned or extra disk space left at /var or /tmp or /opt.
By far the most impressive filesystem I’ve worked with – but….limited to Solaris (and Linux?) at the moment.
You are spot on, but I am curious how you are involved with ZFS. Is your company using it?
The reason I ask is that not many people use Sun Storage solutions, let alone ZFS. I wish there were more people like you out there appreciating how awesome ZFS is.
See you on Thursday. Thanks
I think the primary issue in creating an open source filer based on ZFS, or maybe even btrfs, is finding hardware which would support redundant controllers.
I’m trying to create a redundant storage system for VM storage, evaluating gluster and drbd.
Do you have any input on this?
Thanks for reading my blog. I am surprised it got all the way to you.
Yes, I agree with you. This has been our challenge ever since we started shipping our ZFS storage appliance 3 months ago. My partner and I, him being the more technical one, are very much Sun/Solaris inclined. We started testing with Sun Cluster 3.2 (or whatever was left of it after being devoured by Oracle), and we still do not have a working clustering feature yet. Right now, we position replication as an availability feature, and address customers that do not fully need clustering.
We have experimented with Oracle Solaris, OpenIndiana, Illumos, and NexentaCore but have yet to settle the clustering portion. Our HW partner, Supermicro, has been getting feedback from us.
I like Gluster. I watched a webcast of it and I was impressed. But I have yet to test it out. My partner is kinda like a Solaris bigot, so there are personal challenges as well when it comes to developing our product roadmap. Ha ha.
Thanks and all the best
It’s a good blog and you write about a lot of stuff that I google 😉
Sounds exciting about your own storage appliances – I hope it’s going well.
If you have any ideas for the clustering part let me know, I have around 12 months of testing before I must decide, so I have time.
My main problem is that if I want the redundancy of a NetApp filer, I need to double the storage servers and hard drives to get cluster functionality with Gluster, NexentaCore, etc.
This way the cost of a netapp is not that much higher than buying commodity hardware, because I have to buy 2 of everything which comes with 2 x power costs.
I guess that two raid controllers connected to the same disks is too complicated for commodity hardware? Are you aware of any HW providers offering this kind of system?
Sorry for the late reply. It’s been a busy week.
The clustering bit is something that we get questioned about in the field. We want to ensure that the clustering on the HW part remains as simple as possible without the proprietary HW. NetApp uses proprietary NVRAM and the Intel VI interconnect for their clustering. We don’t want that.
We want to just use a Gigabit or 10Gigabit Interconnect without any special internal PCIe card. We were using Solaris OpenHA for a while but Oracle is really killing the whole open-source thing.
We were thinking of a newer approach to the clustering, à la Oracle RAC. HP LeftHand (P4000) has an interesting concept called Network RAID which bypasses the interconnect concept. The clustering is based on network nodes, which I believe is easier to do. We are still arguing about it ;-(
If I come across any specific HW providers providing dual RAID controllers, I will let you know.
Have a good weekend