RAIDZ expansion and dRAID excellent OpenZFS adventure

RAID (Redundant Array of Independent Disks) is the foundation of almost every enterprise storage array in existence. Thus a change to a RAID implementation is a big deal. In recent weeks, we have witnessed not one but two seismic updates to the volume management RAID subsystem of the OpenZFS open source storage platform.


For the uninitiated, ZFS is one of the rarities in the storage industry that combines the volume manager and the file system as one. Unlike traditional volume management, ZFS merges the physical data storage representations (e.g. hard disk drives, solid state drives) and the logical data structures (e.g. RAID stripe, mirror, Z1, Z2, Z3) together with a highly reliable file system that scales. For a storage practitioner like me, working with ZFS always brings an "I get it!" moment, because its beauty lies in power and simplicity rolled into one.

So, when OpenZFS founding developer, and also the co-creator of the ZFS file system, Matt Ahrens announced the review of the RAID-Z expansion feature in OpenZFS in June 2021, there was elation! Within weeks of the RAID-Z expansion announcement, OpenZFS 2.1 was out, and finally, dRAID (distributed RAID) was GAed as well! I went nuts! Double Happiness!

The announcements of both RAID-Z expansion and dRAID within weeks of each other were special. They were rare events, and I hope to do them justice as I learn about them.

Rebuilds taking too damn long 

The risk of data loss during the time it takes to rebuild or reconstruct (in ZFS terms, resilver) a RAID volume has been a bugbear in the storage industry like forever. Rebuilding a volume from a handful of hard disk drives has become far too slow (stretching from hours to days, even weeks), further exacerbated by dual-parity RAID-6 configurations. With hard disk drive capacities soon hitting 20TB and beyond, many storage vendors already offer triple-parity RAID as well. And each vendor has its own proprietary RAID rebuild mechanism, with the prime objective of returning the volume to a healthy state as fast as possible and recovering the data blocks from the failed disk(s).

I broached the subject in 2012 in my blog "4TB disks – The end of RAID", but even earlier, Enterprise Storage Forum ran an article titled "Can Declustering save RAID Storage Technology?" Both my blog and the ESF article mentioned Dr. Garth Gibson's work on RAID declustering, a precursor to dRAID.

IBM® also offers distributed RAID, provisioned via its SAN Volume Controller, and its technology likewise predates OpenZFS dRAID.

OpenZFS dRAID (Distributed RAID)

I have followed OpenZFS dRAID since Isaac Huang's presentation at the OpenZFS Developer Summit of 2015. dRAID is a vdev (virtual device, similar to a RAID volume) topology. This topology is a logical overlay over the traditional fixed-disk vdevs, and the storage admin can define the number of data, parity and hot-spare sectors per logical dRAID stripe. Because the stripe width in OpenZFS is not fixed (unlike many other storage arrays' RAID implementations), this allows many different permutations of a dRAID vdev, each with a different performance, reliability, capacity and resilvering-speed profile.
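As a sketch of how these permutations are expressed on the command line (the pool name and device names below are hypothetical), OpenZFS 2.1 encodes the parity level, data sectors, child drives and distributed spares directly in the dRAID vdev type:

```shell
# Create a pool with one dRAID vdev:
#   draid2  -> double parity per logical stripe
#   :4d     -> 4 data sectors per logical stripe
#   :11c    -> 11 children (physical drives) in the vdev
#   :1s     -> 1 distributed (virtual) hot spare
zpool create tank draid2:4d:11c:1s \
    sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk

# Inspect the resulting topology; the distributed spare shows up
# as a virtual device named draid2-0-0
zpool status tank
```

Each choice of data width, child count and spare count trades capacity against reliability and resilver speed, which is exactly the flexibility described above.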

dRAID – Parity distributed over group rows unlike traditional OpenZFS RAID

The diagram above shows the differences between a traditional OpenZFS RAID and a parity-declustered dRAID. The rows are distributed into groups based on the dRAID group and size definitions. The effect is that parity and spare sectors are now distributed along with the data "drives", and together they reinvent how resilvering performs, speeding up vdev (RAID volume) recovery. In theory, all drives can now participate in parallel recovery work, instead of involving only the disk drives of the affected RAID volume.

To differentiate between traditional resilvering and dRAID resilvering, two new terms have been added to the OpenZFS lexicon.

  • Traditional resilvering on traditional vdevs = Healing Resilver
  • dRAID resilvering = Sequential Resilver
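A hedged sketch of how the two resilver styles are triggered (pool and device names are hypothetical): on a dRAID vdev, rebuilding onto a distributed spare uses the fast sequential resilver, and a scrub afterwards verifies checksums; sequential reconstruction can also be requested explicitly on mirrors, but not on raidz vdevs.

```shell
# dRAID: kick a failed child over to the distributed spare
# (named draid2-0-0 by convention); this rebuild is a
# sequential resilver, followed automatically by a scrub
zpool replace tank sdc draid2-0-0

# Mirror vdev: request a sequential reconstruction explicitly
# with -s (not supported for raidz configurations)
zpool replace -s tank sdc sdx

# A healing resilver is the default whenever -s is not given
zpool replace tank sdc sdx
```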

And the impact of dRAID on resilvering speed is dramatic, as shown in the graph from the dRAID documentation.

dRAID Sequential Resilver times based on the number of disk drives

The caveat of dRAID is that it needs a large drive count to realize the benefits of its faster resilvering. Perfect for High Performance Computing environments!

RAID-Z Expansion

Today, each vdev "size" is fixed. That means each RAID-Z1/2/3 vdev can only have a fixed number of drives. Once created, you cannot attach a drive to a RAID-Z1/2/3 vdev; the only way to grow is to destroy the vdev and create a new one with a different drive count. That is the problem at present.
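A quick illustration of the limitation (pool and device names are hypothetical; the error wording is approximately what current releases print): attaching a disk to an existing raidz vdev is simply rejected.

```shell
# Try to widen a raidz1 vdev by attaching one more disk
zpool attach tank raidz1-0 sdd
# cannot attach sdd to raidz1-0: can only attach to mirrors
# and top-level disks
```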

You can’t attach a drive to a RAID-Z vdev

As eloquently described by Matt Ahrens at the FreeBSD Developer Summit in June 2021 (YouTube video), RAID-Z expansion is not possible at present. But this will soon be a thing of the past, because the early development work on RAID-Z expansion allows additional drive(s) to be added to a RAID-Z1/2/3 vdev, extending the vdev's width (but not the stripe width of existing data) and retaining the RAID-Z level. Only newly written data is striped over the vdev's new logical width, while existing data keeps its original stripe width. The storage capacity of the vdev grows, but the data-to-parity ratio of existing stripes does not change.
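Under the proposed feature, expansion reuses the familiar `zpool attach` command on a raidz vdev. A hypothetical sketch (pool and device names are mine, and the command form follows the design presented at the Summit):

```shell
# Widen an existing 4-disk raidz1 vdev to 5 disks by
# attaching one new drive to the vdev (not to the pool)
zpool attach tank raidz1-0 sde

# The expansion (reflow) runs in the background;
# status shows its progress, and the pool stays online
zpool status tank
```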

In many parity-based RAID implementations, the stripe widths are fixed. In some, there are also dedicated parity drives. This means that when a new drive (or drives) is added, the data and parity blocks are typically read, the parity recalculated, and everything rewritten at the new stripe width, as shown in the box on the left above.

In RAID-Z expansion (box above on the right), the existing stripe distribution is merely adjusted by copying only the blocks that need to be moved. The space allocation map provides the guidance for this adjustment, a technique now called Reflow. And it does not matter where the parity segments land, because OpenZFS logical stripes are dynamic, variable in width, and each is a full stripe.

You can read about RAID-Z expansion in more depth in Matt Ahrens' full slide deck from the Summit.

It could take some time for RAID-Z expansion to reach GA (general availability) in a future OpenZFS release. But because it is one of the most sought-after features in OpenZFS, I am pretty sure it could be out very soon! Many smaller deployments using single vdevs would greatly benefit from RAID-Z expansion. Perfect for small and medium businesses and home users!


I conclude this blog with my tribute to the excellent RAID duo in OpenZFS. To both RAID-Z expansion and dRAID, in the wise words of Bill and Ted, "Party on, dudes!"

Bill and Ted's Excellent Adventure


About cfheoh

I am a technology blogger with 25+ years of IT experience. I write heavily on technologies related to storage networking and data management because that is my area of interest and expertise. I introduce technologies with the objective of getting readers to *know the facts*, and to use that knowledge to cut through the marketing hype, FUD (fear, uncertainty and doubt) and other fancy stuff. Only then will there be progress. I am involved in SNIA (Storage Networking Industry Association) and, as of October 2013, I have been appointed as the SNIA South Asia & SNIA Malaysia non-voting representative to the SNIA Technical Council. I currently run a small system integration and consulting company focusing on storage and cloud solutions, with occasional consulting work on high performance computing (HPC).

2 Responses to RAIDZ expansion and dRAID excellent OpenZFS adventure

  1. Kendrick says:

    I have looked around and not found much talking about performance differences between z1 and d1. I got curious about the new tech out there and decided to toss it on an 8-drive 10k SAS pool to see what it would run like on my system. If I do a z1 pool on that set of drives, the system hits about 730MB/s transfer. d1 with 7:8c gives me 330MB/s average using the same write of data. 3:8c gives the same speed even though it is across 2 vdevs.

    dd if=/dev/zero of=/tank/temp oflag=direct bs=128k count=500000 was used to test the speed. Any thoughts on the speed and recommended benchmarks would be appreciated.

    • cfheoh says:

      Hello Kendrick

      Performance is both a science and an art. I cannot comment on the performance of the setup you have assembled, but there are many factors that affect the performance of the storage service (I presume you are using NFS here) end-to-end. I stress the "end-to-end" part because there is the client part and the server part, and everything in between.

      I would have to know the specifics of what you have to explain the performance you may be able to achieve. All the best.
