The OpenZFS (virtual) Developer Summit ended a little over a week ago. I stayed up a bit (though not much) to listen to some of the talks, because they started at midnight my time and ran till 5am on the first day and 2am on the second. Like a giddy schoolboy, I was excited, not because I am working for iXsystems™ now, but because I have been a fan and follower of the ZFS file system for a long time.
History-wise, ZFS was conceived at Sun Microsystems in 2005. I started working with ZFS in 2009, reselling Nexenta (my first venture into business with my company nextIQ), after I was released by EMC early that year. I bought a Sun X4150 from one of Sun’s distributors and started building a lab server. I didn’t like the workings of NexentaStor (and NexentaCore) very much, and it was priced per 8TB increment. Later, I started my second company with a partner, and it was he who showed me the elegance and beauty of ZFS through the command line. The idea of ZFS as a volume manager and a file system at the same time, all through the CLI, had an effect on me. I was in love.
Among the many talks shared at the OpenZFS Developer Summit 2020, a few ideas and developments were exciting to me. Here are three which I liked, along with some commentary about them.
- Block Reference Table
- dRAID (declustered RAID)
- Persistent L2ARC
Block Reference Table (BRT)
Deduplication in ZFS has always been a point of contention. The size of the deduplication table (DDT) in the ARC (memory) grows as the deduplicated dataset ages. When it overflows and spills over to disk, performance can get ugly. Each entry in the DDT is about 392 bytes, which gave rise to the “5GB of RAM per 1TB of deduplicated data” sizing guideline. With RAM requirements that high, turning on the dedupe function was often impractical.
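As a back-of-envelope sketch of where that sizing guideline comes from (the per-entry size is from above; the average block sizes are my assumptions, and the real figure depends on the workload), the arithmetic looks roughly like this:

```python
DDT_ENTRY_BYTES = 392          # approximate in-core DDT entry size

def ddt_ram_bytes(pool_bytes, avg_block_bytes):
    """Estimate DDT memory for a fully deduplicated pool."""
    entries = pool_bytes // avg_block_bytes   # one entry per unique block
    return entries * DDT_ENTRY_BYTES

TIB = 2 ** 40
GIB = 2 ** 30

# Smaller average blocks inflate the table quickly, which is why
# guidelines hover around "5GB of RAM per 1TB of deduplicated data".
for block_kib in (16, 64, 128):
    est = ddt_ram_bytes(TIB, block_kib * 1024) / GIB
    print(f"avg block {block_kib:>3} KiB -> ~{est:.1f} GiB of DDT per TiB")
```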
I followed this for a number of years, and I know that Matt Ahrens posted the dedupe challenge a couple of years back to address this issue. This was what he posted.
The aim of both BRT and DDT is space saving. With the DDT, each block is hashed to produce a checksum, ensuring that blocks with the same checksum are stored only once. Along with the reference count and other details, this is kept in the DDT, hence the 392 bytes or so per entry. When the dedupe flag is on, the ZFS pool is deduplicated with relatively high overhead.
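The mechanism can be illustrated with a toy model (this is not ZFS code; SHA-256 stands in for the pool’s configured checksum, and real DDT entries also carry block pointers and other on-disk details):

```python
import hashlib

# Toy model of DDT-style dedupe: blocks are keyed by their checksum.
# A block whose checksum is already in the table is stored once and
# only its reference count is bumped.

class ToyDDT:
    def __init__(self):
        self.table = {}            # checksum -> [refcount, block]

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        entry = self.table.get(key)
        if entry:
            entry[0] += 1          # duplicate: just bump the refcount
        else:
            self.table[key] = [1, block]   # unique: store the block
        return key

ddt = ToyDDT()
for data in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
    ddt.write(data)

stored = sum(len(e[1]) for e in ddt.table.values())
print(f"wrote 3 blocks, stored {len(ddt.table)} unique ({stored} bytes)")
```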
The OpenZFS Summit BRT technology talk was shared by Pawel Dawidek. Details were scant, but I find the concept of BRT rather similar to Changed Block Tracking (CBT) in VMware®. Changed and unchanged blocks are tagged, and block reference details are kept in the BRT. This definitely simplifies data cloning, and speeds up operations like snapshot recovery and file copies or moves.
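Since details were scant, the following is only my reading of the concept: a clone or copy adds references to existing block addresses in a table instead of duplicating the data, and a block is freed only when its last reference goes away. All names here are illustrative, not actual OpenZFS BRT interfaces:

```python
# Toy sketch of reference-tracked cloning in the spirit of a block
# reference table: a "copy" touches only the table, never the data.

class ToyBRT:
    def __init__(self):
        self.refs = {}             # block address -> reference count

    def clone(self, addresses):
        # Cloning just adds references to the existing blocks.
        for addr in addresses:
            self.refs[addr] = self.refs.get(addr, 0) + 1
        return list(addresses)

    def free(self, addresses):
        # A block is only really freed when its last reference goes.
        freed = []
        for addr in addresses:
            self.refs[addr] -= 1
            if self.refs[addr] == 0:
                del self.refs[addr]
                freed.append(addr)
        return freed

brt = ToyBRT()
original = brt.clone([100, 101, 102])   # first writer takes a reference
copy = brt.clone(original)              # instant "copy": refcounts only
print(brt.free(original))               # prints [] : copy still refs them
```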
However, I am still hoping that OpenZFS will improve the dedupe situation. Robert Petrocelli, the CTO of Datto, was the architect of Greenbytes, one of the early performant dedupe leaders in ZFS. Panzura has a temporal ZFS deduplication technology which has been discussed in the context of OpenZFS plans. Both may have a say in OpenZFS dedupe development in the future.
Here are a few noteworthy news items on OpenZFS and ZFS deduplication.
- [ 2013 ] Ex-Sun Micro CTO reveals Greenbytes ‘world beating’ dedupe
- [ 2014 ] Oracle acquires VDI startup Greenbytes
- [ 2019 ] OpenZFS could soon see much better Deduplication Support
Some years ago, I was chatting with a friend about 60TB disk drives. He joked that the RAID reconstruction would probably finish when his daughter finished college, and she was in her mid-teens at the time. Dual and triple parity protection were introduced to reduce the risk of a RAID volume catastrophe, because rebuilding large-capacity drives takes days and weeks. I wrote about the RAID conundrum aeons ago, when the 4TB drives came into being back in 2012.
Declustered RAID plans for OpenZFS RAID-Z1/Z2/Z3 have been in the works since 2015. The objective is to speed up resilvering, reducing the time-related risk of RAID reconstruction and rebuild. dRAID utilizes all of the drives in the pool (including spare drives) instead of just the specific drives in the affected OpenZFS VDEV. dRAID is implemented as a separate driver, layered upon the existing RAID-Z1/Z2/Z3 code as shown below.
During normal dRAID operations, all data are randomly distributed across all the drives. Spare drives are basically idle in RAID-Z1/Z2/Z3, but in the dRAID implementation, spare blocks are distributed as logical spares rather than kept on a physical drive unit. Therefore, during reconstruction or a sequential resilver, reads come from all the relevant drives, and writes go to the logical spare blocks spread across all the drives in the dRAID volume. The resilver thus utilizes all the spindles in parallel to accelerate RAID recovery.
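A back-of-envelope model shows why this matters: a traditional rebuild is bottlenecked by the write throughput of the single physical spare, while dRAID spreads the rewritten data across spare blocks on every drive. The capacity and throughput numbers below are illustrative assumptions, not benchmark figures:

```python
# Why distributed spares rebuild faster: the rebuild writes are spread
# over many drives instead of funnelled into one dedicated spare.

def rebuild_hours(data_tb, per_drive_write_mbps, drives_absorbing_writes):
    """Hours to rewrite data_tb, given how many drives share the writes."""
    data_mb = data_tb * 1_000_000
    return data_mb / (per_drive_write_mbps * drives_absorbing_writes) / 3600

DRIVE_TB = 16       # assumed capacity of the failed drive
WRITE_MBPS = 150    # assumed per-drive sustained write throughput

classic = rebuild_hours(DRIVE_TB, WRITE_MBPS, 1)    # one physical spare
draid = rebuild_hours(DRIVE_TB, WRITE_MBPS, 24)     # spare space on 24 drives
print(f"classic spare: ~{classic:.0f} h, dRAID: ~{draid:.1f} h")
```

The model is deliberately crude (it ignores read-side parallelism and parity math), but it captures the headline: recovery time shrinks roughly in proportion to the number of drives sharing the spare space.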
Tests have shown that recovery performance is significantly faster compared to traditional RAID-Z1/Z2/Z3, ranging between 2x and 10x depending on group size.
According to the presentation by Mark Maybee, the present lead architect for dRAID, the feature will miss the OpenZFS 2.0 release but will land in a coming 2.x release after that.
I last visited the L2ARC 8 years ago. ZFS is performant because it utilizes memory and fast storage media very well, and the L2ARC (Level 2 Adaptive Replacement Cache) is a big part of that. It catches and caches ZFS read data blocks evicted from memory, so an ARC read miss in RAM gets an additional read-hit opportunity from the L2ARC.
Up until now, the data blocks in the L2ARC did not survive a reboot or a controller failover. This meant the L2ARC had to be populated all over again, which takes time and hurts the performance of the storage system. Persistent L2ARC enables the metadata of the L2ARC to be written to a persistent storage medium, such as SSDs, in a structure called the L2ARC log block.
In the implementation explained by George Amanakis at the OpenZFS Summit, once the L2ARC accumulates 1022 buffer entries, they are automatically written to disk. The diagram below shows how the L2ARC log blocks store the buffer header entries.
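The flush behaviour just described can be sketched in a toy model (the 1022-entry threshold is from the talk; everything else here, including the data structures, is invented for illustration):

```python
# Toy sketch of persistent L2ARC logging: buffer header entries
# accumulate in an open log block, and once 1022 entries are collected
# the log block is committed to the cache device so it survives a reboot.

ENTRIES_PER_LOG_BLOCK = 1022

class ToyL2ARCLog:
    def __init__(self):
        self.open_entries = []       # headers not yet persisted
        self.committed_blocks = []   # log blocks "on disk"

    def record(self, header):
        self.open_entries.append(header)
        if len(self.open_entries) == ENTRIES_PER_LOG_BLOCK:
            # Full log block: write it out and start a fresh one.
            self.committed_blocks.append(self.open_entries)
            self.open_entries = []

log = ToyL2ARCLog()
for i in range(2500):                # cache 2500 evicted buffers
    log.record({"block_id": i})
print(len(log.committed_blocks), len(log.open_entries))   # prints: 2 456
```

On import after a reboot, the committed log blocks are what lets the cache contents be rediscovered instead of rebuilt from scratch.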