4TB disks – the end of RAID

Seriously? 4 freaking terabyte disk drives?

Enterprise SATA/SAS disks have grown larger yet again, up to 4TB now. Just a few days ago, Hitachi boasted of shipping the first 4TB HDD, the 7,200 RPM Ultrastar™ 7K4000 Enterprise-Class Hard Drive.

And just weeks ago, Seagate touted that their Heat-Assisted Magnetic Recording (HAMR) technology will bring forth 6TB hard disk drives in the near future, with 60TB HDDs not far on the horizon. 60TB is a lot of capacity, but it is a big, big nightmare for disk availability and data backup. My NetApp Malaysia friend joked that the RAID reconstruction of a 60TB HDD would probably finish by the time his daughter finishes college, and his daughter is still in primary school!

But the joke reflects something very serious we are facing: HDD capacities keep growing into something that could become unmanageable if the traditional implementation of RAID does not change to meet such monstrous capacities.

Yes, RAID has changed since 1988, as every vendor approaches RAID differently. NetApp was always about RAID-4 and later RAID-DP, and I remember the days when EMC had RAID-S. There was even a vendor in the past who marketed RAID-7, but it was proprietary and wasn’t an industry standard. Fundamentally, though, RAID did not change in a revolutionary way and continued to withstand the ever-ballooning capacities (and pressures) of the HDDs. RAID-6 was introduced when the first 1TB HDDs came out, to address the risk of a possible second disk failure in a parity-based RAID like RAID-4 or RAID-5. But today, the 4TB HDDs could be the last straw that breaks the camel’s back, or in this case, RAID’s back.

RAID-5 obviously is dead. Even RAID-6 might be considered insufficient now. Having a 3rd parity drive (3P) is an option, and the only commercial technology I know of that supports 3 parity drives is ZFS. But having 3P will incur additional overhead in performance and usable capacity. Will the fickle customer ever accept such trade-offs?

Note that 3P is not RAID-7. RAID-7 is a trademark of an old company called Storage Computer Corporation, and RAID-7 is not a standard RAID definition.

One of the biggest concerns is rebuild times. If a 4TB HDD fails, the rebuild could take days. The failure of a second HDD could stretch the rebuild to a week or so … and the data is vulnerable while the disks are being rebuilt.
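
Some back-of-the-envelope numbers show why this worries me. Below is a quick sketch in Python, assuming an optimistic sustained rebuild rate of 200MB/s; the capacities and the rate are my own assumptions for illustration only, and real rebuilds under production I/O load run far slower, which is how hours on paper become days in practice.

# Rough rebuild-time estimate: time to stream an entire drive's capacity
# at a fixed, assumed rate. Real arrays rebuild slower under live workloads.
TB = 1000 ** 4  # terabyte in bytes (decimal, as drive vendors count it)

def rebuild_hours(capacity_tb, rate_mb_s=200):
    seconds = (capacity_tb * TB) / (rate_mb_s * 1000 ** 2)
    return seconds / 3600

for size_tb in (1, 4, 6, 60):
    print(f"{size_tb:>2} TB drive: ~{rebuild_hours(size_tb):5.1f} hours at best")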

There is a lot of talk about declustered RAID, and I think it is about time we learn about this RAID technology. At the same time, we should demand this technology before we even consider buying storage arrays with 4TB hard disk drives!

I have said this before: I am still trying to wrap my head around declustered RAID. So I invite the gurus on this matter to comment on the concept, but here is my understanding of declustered RAID.

Panasas’ founder, Dr. Garth Gibson, is one of the people who proposed RAID declustering way back in 1999. He is a true visionary.

One of the issues of traditional RAID today is that we still treat the hard disk component in a RAID domain as a whole device. Traditional RAID is designed to protect whole disks with block-level redundancy. An array of disks is treated as a RAID group, or protection domain, that can tolerate one or more failures and still recover a failed disk from the redundancy encoded on the other drives. The RAID recovery requires reading all the surviving blocks on the other disks in the RAID group to recompute the blocks lost on the failed disk. In short, the recovery, in the event of a disk failure, operates on the whole device, and therefore an entire 4TB HDD has to be reconstructed. This is not good.

The concept of RAID declustering is to break away from the whole-device idea and apply RAID at a more granular scale. IBM GPFS works with logical tracks, and RAID is applied at the logical track level. Here’s an overview of how it compares to traditional RAID:

The logical tracks are spread algorithmically across all physical HDDs, and the RAID protection layer is applied at the track level, not at the HDD device level. So, when a disk actually fails, the RAID rebuild happens at the track level. This significantly improves the rebuild time of the failed device, and does not affect the performance of the entire RAID volume much. The diagram below shows the declustered RAID’s time and performance impact when compared to traditional RAID:
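
To make the placement idea concrete, here is a minimal sketch in Python. It is purely illustrative; the disk count, stripe width and pseudo-random placement are my own assumptions, not GPFS internals. The point it demonstrates is that the chunks held by a failed disk belong to stripes whose other chunks are scattered across every surviving disk, so the rebuild reads come from the whole pool in parallel instead of hammering a handful of drives.

import random
from collections import defaultdict

NUM_DISKS = 20          # assumed pool size
CHUNKS_PER_STRIPE = 8   # e.g. 6 data + 2 parity chunks per stripe
NUM_STRIPES = 10_000

rng = random.Random(42)
placement = defaultdict(list)   # disk -> list of (stripe, chunk) it holds

# Declustered layout: each stripe picks its own subset of disks.
for stripe in range(NUM_STRIPES):
    for chunk, disk in enumerate(rng.sample(range(NUM_DISKS), CHUNKS_PER_STRIPE)):
        placement[disk].append((stripe, chunk))

failed = 0
affected = {stripe for stripe, _ in placement[failed]}
helpers = {d for d in placement
           if d != failed and any(s in affected for s, _ in placement[d])}
print(f"Disk {failed} held {len(placement[failed])} chunks; "
      f"{len(helpers)} of {NUM_DISKS - 1} surviving disks share the rebuild work.")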

While the IBM GPFS approach to declustered RAID is applied at a semi-device level, the future is leaning towards OSD. OSD, or object storage device, is the next generation of storage, and I blogged about it some time back. Panasas is the leader when it comes to OSD, and their radical approach is to apply RAID at the object level. They call this Object RAID.

With object RAID, data protection occurs at the file-level. The Panasas system integrates the file system and data protection to provide novel, robust data protection for the file system.  Each file is divided into chunks that are stored in different objects on different storage devices (OSD).  File data is written into those container objects using a RAID algorithm to produce redundant data specific to that file.  If any object is damaged for whatever reason, the system can recompute the lost object(s) using redundant information in other objects that store the rest of the file.

The above is a quote from the blog of Brent Welch, Panasas’ Director of Software Architecture. As mentioned, the RAID protection of the objects in the Panasas OSD architecture occurs at the file level, and the file or files constitute the object. Therefore, the recovery domain in Object RAID is at the file level, confining the risk and damage of data loss to the file and not the entire device. Consequently, the speed of recovery is much, much faster, even for 4TB HDDs.
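
To illustrate the per-file recovery domain, here is a toy sketch in Python. It is not Panasas’ actual Object RAID algorithm, just the concept: a file is split into chunks destined for different devices, plus one XOR parity chunk computed over that file alone, so a single lost chunk is recomputed from the other chunks of the same file without touching the rest of the system.

from functools import reduce

CHUNK = 4  # tiny chunk size, purely for the demo

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data):
    # Pad the tail chunk so all chunks are the same length.
    chunks = [data[i:i + CHUNK].ljust(CHUNK, b"\0") for i in range(0, len(data), CHUNK)]
    parity = reduce(xor, chunks)
    return chunks, parity

def recover(chunks, parity, lost):
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return reduce(xor, survivors + [parity])

chunks, parity = split_with_parity(b"object RAID demo")
assert recover(chunks, parity, 2) == chunks[2]
print("lost chunk rebuilt from the other chunks of the same file")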

Reliability is the key objective here. Without reliability, there is no availability. Without availability, there are no performance factors to consider. Therefore, the system’s reliability is paramount when it comes to keeping the data protected. RAID has been the guardian all these years. It’s time for a revolutionary approach to safeguard reliability and ensure data availability.

So, how many vendors can claim they have declustered RAID?

Panasas is a big YES, and they apply their intelligence in large HPC (high performance computing) environments. Their technology is tried and tested. IBM GPFS is another. But where are the rest?

 



29 Responses to 4TB disks – the end of RAID

  1. Pingback: 4TB disk – The end of RAID « Storage Gaga

  2. Lee Johns says:

    Great article. The ever-growing capacity of HDDs really is a problem for traditional RAID controllers. In addition to those you mention above, Starboard Storage Systems provides a Dynamic Pooling architecture that does not use a RAID controller. On our AC72 Storage System we break the HDDs into extents, stripe the data across the HDDs in variable-size chunks and allow the user to stripe using custom disk layouts that can have data chunks, parity chunks and spare chunks. 9D+1P+1S, for instance, would be a RAID 5 equivalent striped over 11 drives. The layouts also enable you to specify chunk size and class of disk. This enables us to deliver a system that can provide rapid rebuilds of any failed disk, since we only reconstruct what is required and we do it from multiple disks in the system. Also, because of the architecture, we can add in larger capacity drives without having to have them in an identical RAID group and still use all of the capacity in the system. I would be happy to have our CTO Kirill Malkin talk to you more about this.

  3. Kyle Bader says:

    Another problem is that the amount of reads before unrecoverable error is not going up commensurate with the increase in capacity. The consequence is that if you are using a mirror setup and have to copy from the remaining pair there is a good chance that the only remaining copy will have an unrecoverable error and in some cases will cause the controller to boot that disk and crash the entire array.

    • cfheoh says:

      Hi Kyle

      Very, very sorry for the late reply. I have been too busy to go through some of the comments.

      Yes, I totally agree. Users have to understand that accepting larger capacities will mean higher risks for reliability and performance. I wrote something yesterday about vendors giving higher priority to storage efficiency (thin provisioning, RAID-6, space-efficient snapshots and cloning, deduplication) than to performance. These storage efficiencies are creating an I/O performance issue which all-Flash SSD vendors lament as the I/O gap. To completely re-architect the existing design will be difficult.

      What I am saying is the storage industry has created these problems for themselves. Going the way of very large capacities such as 4 or 6TB in the near future basically is creating a bigger issue than just capacity alone.

      Oh well, I suppose these issues will breed new innovations. 🙂

      Cheers and sorry again for the late reply. Thank you
      /Chin-Fah

  4. ally says:

    Just ran some dirty math on the rebuild time for a 60TB HDD. Talking about 70 days.

    • cfheoh says:

      Hello Ally

      Ha, ha, ha. Thanks for running the maths. Looks like my NetApp friend doesn’t have to wait for his daughter to finish college after all. I’ll go tell him.

      Come to think of it, 70 days is a big deal, holding the reliability of the storage to ransom. No fun at all, but why the heck do we make storage capacities so big?

      Thanks
      /Chin-Fah

  5. Nathan says:

    Not too sure about that dirty maths…

    Considering the media speed of most new disks (using largeish blocks, which you could reasonably expect a RAID array to use), you should be able to get up to 200MB/s.

    So assuming this simple fact (which I’ll agree is optimistic, however, not that far off *if* the raid controller is smart, and has sufficient bandwidth…)

    60 TB of data to resync is
    60,000 GB, which is
    60,000,000 MB;
    at 200MB/s, that is 300,000 seconds to rebuild, which is
    5,000 minutes, or
    83.3 hours, or
    about 3.5 days.

    Again – this assumes the RAID subsystem is up to the task. A PC-class server running ZFS can easily do this, assuming it’s not having to replicate pools with billions of tiny files. If it *did* need to do that, you could easily see it taking up to 10 times longer…

    • cfheoh says:

      Hi Nathan

      Thanks for your comment.

      You said it! The time required to rebuild the volume is too long and risks the integrity of the data. Many have been predicting the end of RAID and looking into newer data protection techniques such as erasure coding and RAID declustering.

      Thanks
      /Chin-Fah

    • Eren Kotan says:

      Disks will not operate at their full rated speed whilst a degraded array is being rebuilt. There’s a lot of I/O and parity calculation going on as the controller reconstructs the array, so in practice it would take much longer than your calculations suggest.

      • cfheoh says:

        Hi Eren

        Thanks for your comments. Yes, what I have described is the rebuild under “perfect” conditions, with few of the variables and factors that would further delay the rebuild.

        I recall one experience I had with Shell back in 2006, when they had to rebuild tens of TBs on an HDS USP600. That whole fiasco, with RAID-6 and the environment going live, took almost 30 days to complete the rebuild. The risk and the teething experience were nightmares for us and the customer.

        Yup, in real life, there are a lot of what-ifs and it is scary that we are relying so much on general RAID to ensure data availability.

        Cheers and thanks for reading my blog.

        /Chin-Fah

  6. Capt. Am. says:

    Great analysis. You might also want to check out what Isilon has done with file protection, using distributed parity, Reed-Solomon encoding and rebuilding files to free space instead of to a single hot spare. Details here:
    http://simple.isilon.com/doc-viewer/1812/high-availability–data-protection-with-emc-isilon-scale-out-nas.pdf?gid=GitpRtihtYN&utm_campaign=www&utm_medium=doctab-f&utm_source=onefs_tech

    • cfheoh says:

      Hi Eric

      Thanks for sharing the Isilon link with me. I am reading the document with glee as I regain my momentum to learn new technologies again.

      All the best in 2013 to you!

      Regards
      /Chin-Fah

  7. Jerker Nyberg says:

    Hello, you may want to have a look at Ceph (open source). The CRUSH function for placement of object data, together with intelligent OSDs (object storage devices), rebuilds to the required level of replication when there is a failure of a disk, node or rack. The POSIX file system is not stable yet (native Linux kernel client), but the object storage (S3/Swift compatible) and block device are stable (with native kernel or QEMU/KVM clients).

    • cfheoh says:

      Hi Jerker

      Thank you for your comments. I am quite aware of Ceph and the commercial company behind it, Inktank.

      I was going after an opportunity in Jakarta for a Cloud Service Provider some months ago. The CTO was evaluating my very own ZFS storage solution against Ceph, and he felt that ZFS was not so good for scale-out, so he went with Ceph. I wished him well because, given his ambitions, Ceph was better suited.

      Thanks for sharing Ceph links! All the best to you!

      /Chin-Fah

  8. Mark says:

    No doubt, RAID has challenges. But so do all the alternatives. I have seen Isilon take 3 days to rebuild a single terabyte. The challenge with “file-level RAID” is that you lose the implied, mathematical relationships between blocks you get with RAID. Instead of simply starting the rebuild with standard RAID (which takes time, but is straightforward all things considered), the system needs to search the entire cluster, file by file, to see which files were impacted, then rebuild them individually. Hash tables help, but apparently not enough. I’ll be the first to acknowledge when the superior alternative to RAID is here, but so far all the advanced forms of data protection tend to hit issues with complexity and unpredictable rebuild times, which results in more challenges maintaining compliance and service levels than they solve.

  9. Jack says:

    I just heard IBM can rebuild a TB in 13 minutes, so a 4TB drive in 52 minutes. Can someone please comment?

    • cfheoh says:

      Hi Jack

      Thanks for highlighting the IBM technology on RAID rebuilding.

      I can neither confirm nor deny that IBM can rebuild a TB in 13 minutes. Parity-based rebuilds are no longer what we think they are. There have been many developments in RAID that have superseded the foundation described in the 1988 RAID paper. As the drives get larger and larger, we know well that foundational RAID just won’t cut it.

      Technologies such as IBM XIV’s RAID-X are one way to circumvent the weaknesses of parity-based RAID. In addition, to speed up RAID rebuilds, forward error correction codes such as Reed-Solomon and Hamming codes were brought into parity-based RAID. IBM also has RAID declustering, which takes RAID beyond the rigid stripe width and thus uses more spindles with a larger stripe width to improve performance and RAID rebuilds.

      So, I believe IBM (and other storage vendors) have the right RAID technology in place for RAID resiliency. But when it comes to RAID rebuild performance, I have not heard much about any storage vendor beating their drums about it.

      Otherwise, why would they continue to promote RAID-6 protection instead of RAID-5 type of protection?

      Appreciate if you can share a bit of your IT and storage experience with me. I would like to know you better.

      Thanks for your comments

      /Chin-Fah

      • lee johns says:

        Rebuild times vary with new-style RAID technologies. The secret is in three areas: 1) do not rebuild the entire disk, only rebuild the written data; 2) rebuild that written data from multiple drives in parallel; 3) pre-provision spare data chunks so that a drive rebuild goes to spare space on existing drives and does not require a drive replacement to recover parity.

        At Starboard Storage we do all of these things, and this can mean very fast drive recovery even for large drives. However, it is dependent on the amount of written data.

    • Mark says:

      Very few people really understand the algorithms used to rebuild lost data and parity when dealing with declustered parity. RAID has challenges, but not using traditional RAID does not necessarily mean the situation is better. I’ve seen an Isilon 36NL have the rebuild time on a 1TB SATA drive pull increase from 10 minutes to 3.5 days simply by taking it from 0.1% full (from the factory) to 40% full. Instead of nice, simple, straightforward RAID mapping, it needs to rebuild file by file, which means each disk pull results in a CLUSTER-WIDE search through EACH protection group of EACH file to see which files were impacted. It also needs to consolidate all RAID rebuild calculations onto a single node for each impacted protection group. Even EMC won’t officially commit to rebuilds finishing under a day, with the latest gear using SSDs.

      Some day we will probably move beyond RAID, or maybe not. I’ll be eager to embrace the change, but cautious enough not to embrace it just because it’s different.

  10. Dave says:

    RAID is more for performance and safeguarding data. A 60TB drive won’t end RAID.

    RAID will never end. You always get better performance from anything in parallel, or you get safer data by using RAID 5, mirroring, or advanced combinations.

    The title of the article is terrible. Drive size will NEVER affect the life of RAID.

    • cfheoh says:

      Hello DDutton

      Thanks for your comments.

      The capacity of the HDD has gotten to a point where data reconstruction, and its impact on performance and availability, will determine the relevance of RAID in the coming years. While RAID will not entirely go away, its value is likely to be greatly diminished.

      Regards
      /Chin-Fah

  11. Dave says:

    Oh and raid 5 isn’t dead lol. I used it at home and at work
    All of my servers at work use raid 5, with a spare. All hot swappable. I could get into nested raids, but I don’t need 24/7/365 uptime

    RAID 5 provides a good performance boost with redundancy, and the chances of TWO failures at once are astronomically low. Plus, if you have a drive on standby hooked in, the controller auto-grabs it and rebuilds.

    Yes, the rebuild time on three 60TB drives will be long, and performance degraded to that of a single drive.

    But to say it’s dead is wrong. It’s very much alive.
    It’s just that now consumers and enterprises have MORE RAID solutions than 0, 1, and 5.

    I use raid 5 at home

    I have 5 drives in my Alienware box, 4 as RAID 5, so I get a 16TB array. If one fails, the controller grabs the 5th and rebuilds. Once done, I grab the spare 4TB from my closet and replace the dead one. Then I go buy one for the closet.

    RAID 5 is perfect. It gives me a performance boost for games and protects data. Then I have a 16TB RAID 5 USB 3.0 box that gets nightly backups and hourly Windows 8 File History updates.

    • cfheoh says:

      Hello again DDutton

      I do understand that from your perspective, RAID is good. I use RAID-5 at home too, but at work, where most environments scale to petabytes and even exabytes, the amplification of RAID’s effects on performance and availability relative to the capacity of the drives is great. The impact is significant, so much so that other data protection mechanisms are considered. Parity declustering, forward error correction and erasure coding are some techniques used in combination with RAID or in place of it entirely.

      Perhaps you may not have encountered the impact of RAID because it appears that your environment has lower requirements compared to many enterprises that we encounter daily. But trust me, in what we face, degradation in RAID is significant.

      Have a great day! Thank you
      /Chin-Fah

    • computer user says:

      Two failures is not something astronomical. We’ve had it in an EqualLogic array, which should be far more resilient to error than your desktop setup. But there are lots of factors in play. If you’re doing hypothetical math on two MTBF numbers, it seems unlikely. If you have a real-world array with lots of drives and something goes wrong, two drives (out of 24, let’s say) can get kicked out in a hurry. Nobody uses RAID 5 anymore in the enterprise space.

      • cfheoh says:

        Hello

        You are right that 2 failures are not out of the question, and the impact goes beyond the technical. It affects the business and other intangible aspects as well, resulting in loss of revenue and future business.

        The problem is that RAID-6 in most vendors is merely an extension of how RAID-5 behaved for them, making the entire RAID recovery sluggish and poor. EqualLogic does not entirely subscribe to the traditional RAID 5/6 protection model, using something they call Network RAID instead. That is much more resilient, from my point of view, and hence is able to overcome the weaknesses of RAID 5/6.

        In extremely loaded environments, RAID 5 and even RAID 6 are just not good enough.

        Thanks for sharing your info.

        /Chin-Fah

  12. James says:

    Funny that ZFS is mentioned in the article, and then the author goes on to talk about OSD as though it’s something new. ZFS already implements file-level striping and redundancy. RAID-Z does not work by striping the drive like traditional RAID; rather, it stripes the file. Thus rebuild times are directly related to the amount of data that has to be recovered. For instance, if a drive drops out of a zpool for some reason and is then reinserted, only the data that has changed is resilvered. Or let’s say you have a drive failure but only 50% drive utilization; when you resilver, you are only going to write the actual data. Thus rebuild times are directly dependent on data sets and data distribution, not drive sizes. The ZFS system is always in a consistent state; in fact, the ZFS approach completely eliminates the RAID5 write hole which RAID5/6 suffer from, and thus no battery backup is required.

  13. Pingback: Nutanix Disk Self-Healing: Laser Surgery vs The Scalpel | Nutanix
