SMB Witness Protection Program

No, no, FBI is not in the storage business and there are no witnesses to protect.

However, SMB 3.0 has introduced an RPC-based mechanism to inform clients of any state change in the SMB servers. Microsoft calls it the Service Witness Protocol [SWP], and its objective is to provide a much faster notification service that allows SMB 3.0 clients to fail over quickly. In SMB 1.0 and even SMB 2.x, the SMB clients relied on time-outs. The time-outs, whether SMB or TCP, could take as much as 30-45 seconds, and this creates a latency that is disruptive to enterprise applications.

SMB 3.0, as mentioned in my previous post, had a total revamp and is now enterprise ready. In what Microsoft calls “Continuously Available” File Service, SMB 3.0 supports clustered or scale-out file servers. The SMB shares must be exported as “Continuously Available” shares and mapped to SMB 3.0 clients, as shown in the diagram below (provided by SNIA’s webinar):

SMB 3.0 CA Shares

Client A maps to a share on Server 1 (\\srv1\CAshr). Client A has a share “handle” that establishes a connection with a corresponding state of the session. The state of the session is kept synchronously consistent with a corresponding state on Server 2.

The Service Witness Protocol is not responsible for the synchronization of the states in the SMB file server cluster. Microsoft has left the HA/cluster/scale-out capability to the proprietary technology of the NAS vendor. However, SWP regularly observes the status of all services under its watch. (more…)
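To make the idea concrete, here is a toy Python sketch of what a witness-style push notification looks like. This is purely conceptual – the class and function names are made up and this is not the actual MS-SWN RPC interface – but it shows why a pushed event beats waiting 30-45 seconds for a timeout.

```python
# Conceptual sketch only -- not the actual MS-SWN RPC interface.
# Clients register interest in a clustered share, and the witness pushes a
# notification the moment the owning node changes, instead of the client
# waiting out an SMB/TCP timeout.

class WitnessService:
    def __init__(self):
        self.registrations = {}          # resource name -> list of client callbacks

    def register(self, resource, client_callback):
        """A client registers to be told about state changes on a resource."""
        self.registrations.setdefault(resource, []).append(client_callback)

    def resource_state_changed(self, resource, new_owner_node):
        """Cluster reports a failover; push the event to every registered client."""
        for notify in self.registrations.get(resource, []):
            notify(resource, new_owner_node)

def client_a_handler(resource, new_owner_node):
    # On notification the client immediately re-establishes its session against
    # the surviving node, instead of waiting 30-45 seconds for a timeout.
    print(f"Reconnecting {resource} via {new_owner_node}")

witness = WitnessService()
witness.register(r"\\srv1\CAshr", client_a_handler)
witness.resource_state_changed(r"\\srv1\CAshr", "srv2")
```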

SMB on steroids but CIFS lord isn’t pleased

I admit it!

I am one of the guilty parties who continues to use CIFS (Common Internet File System) to represent the Windows file sharing protocol. And a lot of vendors continue to use the “CIFS” word loosely without knowing that it was something from a bygone era. One of my friends even pronounced it as “See Fist”, which sounded even funnier when he said it. (This is for you Adrian M!)

And we couldn’t be more wrong, because we shouldn’t be using the CIFS word anymore. It is so 90’s, man! The tell-tale signs have been there all along, but most of us chose to ignore them with gusto. A recent SNIA Webinar titled “SMB 3.0 – New opportunities for Windows Environment” aims to dispel our incompetence and change our CIFS-venture to the correct word – SMB (Server Message Block).

A selfie photo of Dennis Chapman, Senior Technical Director for Microsoft Solutions at NetApp, taken from the SNIA webinar slides above, informs all of us about … the SMB history. (more…)

Supercharging Ethernet … with a PAUSE

It’s been a while since I wrote. I had just finished a 2-week stint in Melbourne, conducting 2 Data ONTAP classes and had a blast.

But after almost 3 1/2 months of doing little except teaching NetApp classes, the stint is ending. I wanted it that way, to take a break and also to take on a new challenge. I will be taking on a job with Hitachi Data Systems, going back to the industry that I have termed the “Wild, wild west”. After a 4 1/2-year hiatus, I think that industry still behaves the way it did … brash, exclusive, rich! The oligarchy of the oilmen are still laughing their way to the bank. And it will be my job to sell storage (and cloud) solutions to them.

In my NetApp (and EMC) engagements in the past 6 months, I have seen greater adoption of iSCSI over Fibre Channel, and many have predicted that 10 Gigabit Ethernet will be the inflection point where iSCSI can finally stand shoulder-to-shoulder with Fibre Channel. After all, 10 Gigabit/sec is definitely faster than 8 Gigabit/sec Fibre Channel, right? WRONG! (I am perfectly aware there is 16 Gigabit/sec Fibre Channel, but can’t you see I am trying to start an argument here?)

Delivering SCSI data load over iSCSI on 10 Gigabit/sec Ethernet does not necessarily mean that it would be faster than delivering the same payload over 8 Gigabit/sec Fibre Channel. This statement can be viewed in many different ways and hence the favourite IT reply would be … “It depends“.

I will leave the performance argument for another day. Today we are going to talk about some of the key additions that supercharge 10 Gigabit Ethernet for data delivery in a storage networking capacity. In addition, 10 Gigabit Ethernet is the primary transport for Fibre Channel over Ethernet (FCoE), and it is absolutely critical that 10 Gigabit Ethernet be close to as reliable as Fibre Channel for data delivery in a storage network.

Ethernet is a non-deterministic protocol, and therefore its delivery is dependent on many factors; 10 Gigabit Ethernet has inherited that trait. The delivery of data over Ethernet can be lossy, i.e. packets can get lost, and the upper layer application protocols have to detect the dropped packets and ensure the lost packets are redelivered to complete the consignment. But delivering data in a storage network cannot be lossy, and in most SANs the requirement is that data arrives in the sequence in which it was sent. The SAN fabric (especially with the common services of Layer 3 of the FC protocol stack) and the deterministic nature of the Fibre Channel protocol are the reasons many have relied on Fibre Channel SAN technology for more than a decade. How can 10 Gigabit Ethernet respond?
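To picture what the coming enhancements are trying to achieve, here is a toy Python sketch of the 802.3x PAUSE idea – the receiver tells the sender to hold off before its buffer overflows, so frames are held rather than dropped. The classes and numbers are invented for illustration; this is not how a real NIC or switch implements it.

```python
# Toy illustration of the 802.3x PAUSE idea, not a real NIC driver:
# when the receiver's buffer nears full, it signals PAUSE back to the sender,
# which stops transmitting instead of letting frames get dropped.

from collections import deque

class Receiver:
    def __init__(self, buffer_limit=8):
        self.buffer = deque()
        self.buffer_limit = buffer_limit

    def accept(self, frame):
        self.buffer.append(frame)
        # Ask the sender to pause when the buffer is nearly full.
        return "PAUSE" if len(self.buffer) >= self.buffer_limit - 2 else "OK"

    def drain(self, n=4):
        for _ in range(min(n, len(self.buffer))):
            self.buffer.popleft()

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.paused = False

    def send(self, frame):
        if self.paused:
            return False                   # sender holds off; frame is not transmitted (and not dropped)
        if self.receiver.accept(frame) == "PAUSE":
            self.paused = True             # honour the PAUSE request
        return True

rx = Receiver()
tx = Sender(rx)
sent = [tx.send(f"frame-{i}") for i in range(10)]
print(sent)                       # later frames are held back, not lost
rx.drain(); tx.paused = False     # buffer drained / quanta expires, resume sending
```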

(more…)

Time for Fujitsu Malaysia to twist and shout and yet …

The worldwide storage market is going through unprecedented change as it takes baby steps out of one of the longest recessions in history. We are not exactly out of the woods yet, given the Eurozone crisis, slowing growth in China and the little sputters in the US economy.

Back in early 2012, Fujitsu showed good signs of taking market share in enterprise storage, but what happened to that? In the last 2 quarters, the storage market shares of the server boys, the likes of HP, IBM and Dell, have either shrunk (in the case of HP and Dell) or tanked (as in IBM). I would have expected Fujitsu to continue its impressive run and capture more of the enterprise market, and yet it didn’t. Why?

I was given an Eternus storage technology update by the Fujitsu Malaysia pre-sales team more than a year ago. It has made some significant gains in technology such as Advanced Copy, Remote Copy, Thin Provisioning, and Eco-Mode, but I was unimpressed. The technology features were more like a follower’s, since every other storage vendor in town already had those features.

(more…)

4TB disks – the end of RAID

Seriously? 4 freaking terabyte disk drives?

The enterprise SATA/SAS disks have just grown larger, up to 4TB now. Just a few days ago, Hitachi boasted the shipment of the first 4TB HDD, the 7,200 RPM Ultrastar™ 7K4000 Enterprise-Class Hard Drive.

And just weeks ago, Seagate touted that their Heat-Assisted Magnetic Recording (HAMR) technology will bring forth 6TB hard disk drives in the near future, with 60TB HDDs not far on the horizon. 60TB is a lot of capacity but a big, big nightmare for disk availability and data backup. My NetApp Malaysia friend joked that the RAID reconstruction of a 60TB HDD would probably finish by the time his daughter finishes college, and his daughter is still in primary school!

But the joke reflects something very serious we are facing: the capacity of HDDs is forever growing into something that could become unmanageable if the traditional implementation of RAID does not change to meet such monstrous capacities.

Yes, RAID has changed since 1988, as every vendor approaches RAID differently. NetApp was always about RAID-4 and later RAID-DP, and I remember the days when EMC had RAID-S. There was even a vendor in the past who marketed RAID-7, but it was proprietary and wasn’t an industry standard. Fundamentally, though, RAID did not change in a revolutionary way and continued to withstand the ever ballooning capacities (and pressures) of the HDDs. RAID-6 was introduced when the first 1TB HDDs came out, to address the risk of a possible second disk failure in a parity-based RAID like RAID-4 or RAID-5. But today, the 4TB HDDs could be the last straw that breaks the camel’s back, or in this case, RAID’s back.

RAID-5 obviously is dead. Even RAID-6 might be considered insufficient now. Having a 3rd parity drive (3P) is an option, and the only commercial technology that I know of which supports 3 parity drives is ZFS. But having 3P will cause additional overhead in performance and usable capacity. Will the fickle customer ever accept such trade-offs?

Note that 3P is not RAID-7. RAID-7 is a trademark of an old company called Storage Computer Corporation, and RAID-7 is not a standard definition of RAID.

One of the biggest concerns is rebuild time. If a 4TB HDD fails, the rebuild could take days. The failure of a second HDD could stretch the rebuild to a week or so … and the RAID group is vulnerable while the disks are being rebuilt.
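A simple back-of-envelope calculation (in Python) shows why. The rebuild rates below are my own assumptions – real arrays throttle rebuilds so production I/O can continue – but the orders of magnitude tell the story:

```python
# Back-of-envelope rebuild-time estimate. The rebuild rate is an assumption;
# on a busy array the effective rate is often throttled well below the drive's
# raw sequential speed.

def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    capacity_mb = capacity_tb * 1000 * 1000      # decimal TB, as drive vendors count it
    return capacity_mb / rebuild_mb_per_s / 3600

for rate in (30, 50, 100):                        # MB/s, assumed throttled rates
    print(f"4TB at {rate} MB/s -> {rebuild_hours(4, rate):.1f} hours")
# 4TB at 30 MB/s -> ~37 hours; at 50 MB/s -> ~22 hours; at 100 MB/s -> ~11 hours
```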

There is a lot of talk about declustered RAID, and I think it is about time we learn about this RAID technology. At the same time, we should demand this technology before we even consider buying storage arrays with 4TB hard disk drives!

I have said this before: I am still trying to wrap my head around declustered RAID. So I invite the gurus on this matter to comment, but here is my understanding of the subject of declustered RAID.

Panasas’ founder, Dr. Garth Gibson, is one of the people who proposed RAID declustering way back in 1999. He is a true visionary.

One of the issues of traditional RAID today is that we still treat the hard disk component in a RAID domain as a whole device. Traditional RAID is designed to protect whole disks with block-level redundancy. An array of disks is treated as a RAID group, or protection domain, that can tolerate one or more failures and still recover a failed disk by the redundancy encoded on other drives. The RAID recovery requires reading all the surviving blocks on the other disks in the RAID group to recompute the blocks lost on the failed disk. In short, the recovery, in the event of a disk failure, is on the whole object and therefore an entire 4TB HDD has to be recovered. This is not good.

The concept of RAID declustering is to break away from the whole-device idea and apply RAID at a more granular scale. IBM GPFS works with logical tracks, and RAID is applied at the logical track level. Here’s an overview of how it compares to traditional RAID:

The logical tracks are spread out algorithmically across all physical HDDs, and the RAID protection layer is applied at the track level, not at the HDD device level. So, when a disk actually fails, the RAID rebuild is applied at the track level. This significantly improves the rebuild time of the failed device, and does not much affect the performance of the entire RAID volume. The diagram below shows the declustered RAID’s time and performance impact when compared to traditional RAID:
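Here is a simplified Python model of why the rebuild is so much faster when the work is spread across all surviving disks instead of being bottlenecked on a single spare. The throughput figures are assumptions for illustration, not any vendor’s specification:

```python
# Simplified model of why declustered RAID rebuilds faster. Numbers are
# illustrative assumptions, not vendor specifications.

def traditional_rebuild_hours(disk_tb, write_mb_per_s):
    # The whole replacement disk must be rewritten, bottlenecked by one spindle.
    return disk_tb * 1_000_000 / write_mb_per_s / 3600

def declustered_rebuild_hours(disk_tb, per_disk_mb_per_s, total_disks):
    # Only the tracks that lived on the failed disk are rebuilt, and the work
    # is spread across every surviving disk in the pool.
    surviving = total_disks - 1
    return disk_tb * 1_000_000 / (per_disk_mb_per_s * surviving) / 3600

print(traditional_rebuild_hours(4, 50))            # ~22 hours against one spindle
print(declustered_rebuild_hours(4, 50, 60))        # well under an hour across 59 disks
```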

While the IBM GPFS approach to declustered RAID is applied at a semi-device level, the future is leaning towards OSD. OSD, or object storage device, is the next generation of storage and I blogged about it some time back. Panasas is the leader when it comes to OSD, and their radical approach is to apply RAID at the object level. They call this Object RAID.

With object RAID, data protection occurs at the file-level. The Panasas system integrates the file system and data protection to provide novel, robust data protection for the file system.  Each file is divided into chunks that are stored in different objects on different storage devices (OSD).  File data is written into those container objects using a RAID algorithm to produce redundant data specific to that file.  If any object is damaged for whatever reason, the system can recompute the lost object(s) using redundant information in other objects that store the rest of the file.

The above was a quote from the blog of Brent Welch, Panasas’ Director of Software Architecture. As mentioned, the RAID protection of the objects in the Panasas OSD architecture occurs at the file level, and the file or files constitute the object. Therefore, the recovery domain in Object RAID is the file, confining the risk and damage of data loss to the file level and not the entire device. Consequently, the speed of recovery is much, much faster, even for 4TB HDDs.
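To illustrate the per-file idea, here is a minimal Python sketch: a file is chunked across several OSDs, a parity chunk is computed for that file alone, and a single damaged object is recomputed without touching any other file or rebuilding a whole disk. This is just the concept, not Panasas’ actual Object RAID algorithm:

```python
# Minimal sketch of per-file parity: a file is chunked across different OSDs,
# a parity chunk is computed for that file only, and a single lost chunk can
# be recomputed without touching any other file or whole disk.
# Conceptual illustration only, not Panasas' actual algorithm.

def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

file_data = b"0123456789abcdef"
chunks = [file_data[i:i + 4] for i in range(0, len(file_data), 4)]   # spread over 4 OSDs
parity = xor_chunks(chunks)                                          # stored on a 5th OSD

lost_index = 2                                                       # one OSD object damaged
survivors = [c for i, c in enumerate(chunks) if i != lost_index]
recovered = xor_chunks(survivors + [parity])
assert recovered == chunks[lost_index]                               # only this file is rebuilt
```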

Reliability is the key objective here. Without reliability, there is no availability. Without availability, there are no performance factors to consider. Therefore, the system’s reliability is paramount when it comes to keeping data protected. RAID has been the guardian all these years. It’s time for a revolutionary approach to safeguard reliability and ensure data availability.

So, how many vendors can claim they have declustered RAID?

Panasas is a big YES, and they apply their intelligence in large HPC (high performance computing) environments. Their technology is tried and tested. IBM GPFS is another. But where are the rest?

 

Chink in NetApp MetroCluster?

Ok, let me clear the air about the word “Chink” (before I get into trouble), which is not racially offensive unlike the news about ESPN having to fire 2 of their employees for using the word “Chink” on Jeremy Lin.  According to my dictionary (Collins COBUILD), chink is a very narrow crack or opening on a surface and I don’t really know the derogatory meaning of “chink” other than the one in my dictionary.

I have been doing a spot of work for a friend who has just recently proposed NetApp MetroCluster. When I was at NetApp many years ago, I did not have a chance to get to know more about the solution, but I do know of its capability. After 6 years away, coming back to do a bit of NetApp was fun for me, because I was always very comfortable with the NetApp technology. But NetApp MetroCluster, and in this opportunity, NetApp Fabric MetroCluster presented me an opportunity to get closer to the technology.

I have no doubt in my mind, this is one of the highest available storage solutions in the market, and NetApp is not modest about beating its own drums. It touts “No SPOF (Single Point of Failure)”, and rightly so, because it has put in all the right plugs for all the points that can fail.

NetApp Fabric MetroCluster is a continuous availability solution that stretches up to 100km. It is basically a NetApp cluster with mirrored storage, but with the two halves of its mirrored infrastructure linked very far apart over Fibre Channel components and dark fiber. Here’s a diagram of how NetApp Fabric MetroCluster works in a VMware FT (Fault Tolerant) environment.

There’s a lot of simplicity in the design, because when I started explaining it to the prospect, I was amazed how easy it was to articulate, without all the fancy technical jargon or fuss. I just said … “imagine a typical cluster, with an interconnect heartbeat, and the storage is mirrored. Then imagine the 2 halves being pulled very far apart … That’s NetApp Fabric MetroCluster”. It was simply blissful.

But then there were a lot of FUDs (fear, uncertainty, doubt) thrown in by the competitor, feeding the prospect with plenty of ammunition. Yes, I agree with some of the limitations, such as no SATA support for now. But then again, there is no perfect storage solution. In fact, Chris Mellor of The Register wrote about God’s box, the perfect storage, but to get to that level, be prepared to spend lots and lots of money! Furthermore, once you fix one limitation or bottleneck in one part of the storage, it introduces a few more challenges here and there. It’s never ending!

Side note: The conversation triggered the team to check with NetApp on SATA support in Fabric MetroCluster. Yes, it is already supported in ONTAP 8.1, and with the present version at 8.1RC3, SATA support will be here soon.

More FUDs came as we went along, and when I was doing my research, some HP storage guys on the web were hitting at NetApp MetroCluster. Poor HP! If you do a search on NetApp MetroCluster, I am sure you will come across these 2 HP blogs from 2010 deriding the MetroCluster solution. Check out this and the follow-up on the first blog. What these guys chose to do was to break the MetroCluster apart into 2 single controllers after a network failure, and attack it from that level.

Yes, when you break up the halves, it is basically a NetApp system with several single points of failure (SPOF). But then again, who isn’t? Almost every vendor’s storage will have some SPOFs when you break the mirror.

Well, what I can tell you is that the weakness of NetApp MetroCluster is that it is not continuous data protection (CDP). Once your applications have written garbage to one volume, the garbage is reflected on the mirrored volume. You can’t roll back and you live with the data corruption. That is why storage vendors, including NetApp, offer snapshots – point-in-time copies that let you roll back to a point before the data corruption occurred. CDP, by contrast, gives complete granularity of recovery at every write I/O, and that’s something NetApp does not have. That’s NetApp MetroCluster’s weakness.
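A toy Python sketch makes the difference plain. It models nothing from NetApp or any CDP vendor – just the concept of discrete point-in-time snapshots versus a journal of every write that can be replayed to any point:

```python
# Toy contrast between point-in-time snapshots and a CDP-style journal.
# Purely conceptual -- no vendor implementation is being modelled here.

volume = {}
snapshots = {}            # snapshot name -> frozen copy of the volume
journal = []              # CDP-style log of every single write

def write(key, value):
    journal.append((key, value))
    volume[key] = value

def take_snapshot(name):
    snapshots[name] = dict(volume)

def restore_snapshot(name):
    return dict(snapshots[name])            # you only get the points you captured

def restore_to_write(n):
    state = {}
    for key, value in journal[:n]:          # CDP can replay to ANY write boundary
        state[key] = value
    return state

write("blockA", "good data")
take_snapshot("8am")
write("blockB", "good data")
write("blockA", "garbage")                  # corruption -- mirrored instantly on MetroCluster
print(restore_snapshot("8am"))              # loses blockB, written after the snapshot
print(restore_to_write(2))                  # recovers right up to the last good write
```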

But CDP is aimed at data recovery, NOT data availability. It is focused on customers whose requirement is the ability to get the data back to some usable state or form after a disaster (big or small), while the MetroCluster solution is focused on having the data available all the time. They are 2 different sets of requirements. So, it depends on what the customer’s requirement is.

Then again, come to think of it, NetApp has no CDP technology of their own … isn’t it?

Apple chomps Anobit

A few days ago, Apple paid USD$500 million to buy an Israeli startup, Anobit, a maker of flash storage technology.

Obviously, one of the reasons Apple did so is to move up a notch to differentiate itself from the competition and position itself as a premier technology innovator. It won the MP3 war with its iPod, but in the smartphone, tablet and notebook space, Apple is being challenged strongly.

Today, flash storage technology is prevalent, and the demand to pack more capacity into a small real estate of flash will eventually lead to reliability issues. The most common type of NAND flash storage is MLC (multi-level cell), versus the more expensive type called SLC (single-level cell).

Physically, the internal build of MLC and SLC is exactly the same, except that in SLC, one cell contains 1 bit of data, while in MLC, 2 or more bits occupy one cell. That’s the only difference in the physical structure of NAND flash. However, as you can see from the diagram below, SLC has advantages over MLC.

 

NAND Flash uses electrical voltage to program a cell, and it is always a challenge to store bits of data in a very, very small cell. If you apply too little voltage, the bit in the cell does not register and the result is something unreadable or an error. If you apply too much voltage, the adjacent cells are disturbed, resulting in errors in the flash. Voltage leak is not uncommon.

The demand to pack more and more data (i.e. more bits) into one cell geometry results in greater unreliability. Though the reliability of NAND Flash storage is predictable, i.e. we roughly know when it will fail, we will eventually reach a point where the reliability of MLC will no longer be desirable if we continue the trend of packing in more and more capacity.
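Some rough arithmetic (sketched in Python) shows why. If we assume a usable voltage window of 1.0V – an arbitrary illustrative figure, not a datasheet value – the margin between adjacent states shrinks quickly as the bits per cell go up:

```python
# Rough arithmetic on why packing more bits per cell hurts reliability:
# n bits per cell needs 2**n distinguishable voltage states, so the margin
# between adjacent states shrinks fast. The 1.0 V window is an arbitrary
# illustrative figure, not a datasheet value.

VOLTAGE_WINDOW = 1.0      # assumed usable window, in volts

for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    states = 2 ** bits
    margin = VOLTAGE_WINDOW / (states - 1)
    print(f"{name}: {states} states, ~{margin:.2f} V between adjacent states")
# SLC: 2 states, ~1.00 V ... TLC: 8 states, ~0.14 V -- far less room for
# voltage leak or disturbs before a bit is misread.
```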

That’s where Anobit comes in. Anobit has designed and implemented architectural changes to the way NAND Flash storage is used. In layman’s terms, the technology comes in 2 stages.

  1. Error reduction – by understanding what causes flash impairment. This could be cross-coupling, read disturbs, data retention impairments, program disturbs or endurance impairments.
  2. Error correction and signal processing – advanced ECC (error-correcting code), plus the patented (and other patents pending) Memory Signal Processing (TM) to improve the reliability and performance of the NAND Flash storage, as shown in the diagram below:

In a nutshell, Anobit’s new and innovative approach will result in

  • More reliable MLCs
  • Better performing MLCs
  • Cheaper NAND Flash technology

This will indeed extend NAND Flash technology into greater innovation of flash storage in the near future. Whatever Apple will do with Anobit’s technology is anybody’s guess, but one thing is certain: it’s going to propel Apple to new heights.

Storage must go on a diet

Nowadays, the capacity of hard disk drives (HDDs) is really big. 3TB is out and 4TB is on the horizon. What’s next?

For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient, with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.

If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, but also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID them with RAID-0 (not a good idea, but to save costs most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about a lack of storage experience.

And RAID isn’t really keeping up with the tremendous growth of HDD capacity either. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue to provide the LUN or volume reliability and data availability, because it takes too damn long to rebuild the volume after the failure of a disk.

Back in the days when HDDs were less than 500GB, RAID-5 would still hold up, but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity, but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.

Experts have been speaking about parity declustering, but that’s something only a few vendors have right now. Panasas, founded by one of the forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Carnegie Mellon University’s Parallel Data Lab (PDL) presented a paper about parity declustering more than 10 years ago.

Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in the hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.

Here are a few ways you can consider to put your storage on a diet.

  • Compression
  • Thin Provisioning
  • Deduplication
  • Storage Tiering
  • Tapes and SSDs

To me, compression has not taken the storage world by storm. But then again, there aren’t many vendors that tout compression as a feature for storage optimization. Most of them prefer to push the darling of data reduction, data deduplication, as the main feature for saving space. Theoretically, data deduplication makes more sense when the data is inactive and has a high occurrence of duplicated data. That is why secondary storage such as backup deduplication targets like Data Domain, HP StoreOnce and Quantum DXi can publish 20:1 ratios, and over time that ratio can get even higher.
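Here is a minimal Python sketch of fixed-block deduplication to show where ratios like that come from – identical chunks hash to the same fingerprint and are stored once. The chunk size and data are arbitrary illustrations:

```python
# Minimal fixed-block deduplication sketch: repeated blocks hash to the same
# fingerprint and are stored once. Chunk size and data are arbitrary.

import hashlib

def dedupe_ratio(data, chunk_size=4096):
    store = {}
    total_chunks = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        total_chunks += 1
        store[hashlib.sha1(chunk).hexdigest()] = chunk   # identical chunks collapse
    return total_chunks / len(store)

# A weekly full backup of mostly unchanged data dedupes extremely well.
week1 = b"".join(bytes([i % 256]) * 4096 for i in range(100))   # 100 distinct blocks
week2 = week1[:-4096] + b"\xff" * 4096                          # only one block changed
print(dedupe_ratio(week1 + week2))                              # close to 2:1 after just two fulls
```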

NetApp has also been pushing its A-SIS data deduplication on primary storage. Yes, it helps with storage savings on primary, but when higher data transfer rates and faster access to “manipulated” data (deduped or compressed) are needed, compression is likely the better choice for primary, active data.

So who has compression? NetApp ONTAP 8.0.1 has compression now, and IBM’s Storwize V7000 started as a compression device. Read about IBM Storwize in my blog here. Dell has Ocarina Networks, which was recently unleashed. I am a big fan of Ocarina Networks and I wrote about the technology in my previous blog. EMC had compression during the Celerra days of DART, but I don’t hear much about it in their VNX. Compression is there, believe me, buried under all the loads of EMC marketing.

Thin Provisioning is now a must-have and standard feature of all storage vendors. What is Thin Provisioning? The diagram below shows you:

In the past, storage systems weren’t so intelligent. You asked for 10TB, you were given 10TB, and that 10TB was “deducted” from the storage capacity. That leads to wastage and storage inefficiencies. Today, Thin Provisioning will give you 10TB, but storage capacity is consumed only as it is being used. The capacity is not pre-allocated as in the past. Thin Provisioning is a great diet pill for bloated storage projects.
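Here is a tiny Python sketch of the accounting difference between thick and thin provisioning – hypothetical numbers, just to show why thin provisioning is such a good diet pill:

```python
# Simple sketch of thick vs thin allocation accounting. Hypothetical numbers,
# just to show the difference in when capacity is consumed.

class Pool:
    def __init__(self, physical_tb):
        self.physical_tb = physical_tb
        self.thick_reserved = 0.0     # capacity deducted up front
        self.thin_written = 0.0       # capacity consumed only as data lands

    def provision_thick(self, size_tb):
        self.thick_reserved += size_tb

    def provision_thin(self, size_tb):
        pass                          # nothing deducted until writes arrive

    def write_thin(self, written_tb):
        self.thin_written += written_tb

pool = Pool(physical_tb=20)
pool.provision_thick(10)              # old way: 10TB gone the moment you ask
pool.provision_thin(10)               # thin: the 10TB is a promise, not a deduction
pool.write_thin(2)                    # the application has only written 2TB so far
print(pool.physical_tb - pool.thick_reserved - pool.thin_written)   # 8TB still free
```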

Another up-and-coming feature is storage tiering. Storage tiering, when associated with storage optimization, should include hierarchical storage management (HSM) and tape-out as well. Storage optimization should not be offered only within the storage array itself. Storage tiering within the storage array is available from most vendors – IBM EasyTier, EMC FAST2, Dell Fluid Data Management and many others. But what about data being moved out of the storage array? What about reducing the capacity of the data online or near-line? Why not put it offline if there isn’t a need for it?

I term this Active Archiving, something I learned while I was at EMC. Here’s a look at EMC’s style of Active Archiving:

Active Archiving promotes the concept of data archiving and is not unique to EMC. Almost all storage vendors, either natively or with 3rd-party software, can perform fairly efficient data archiving in one way or another. One piece of software I liked (and it is not unique!) is Quantum StorNext. Here’s a video of how Quantum StorNext helps reduce the fat of the storage.

With the single-copy sharing feature of Quantum StorNext across multiple disparate OSes, there are fewer duplicate files in storage as well.

Tapes have been getting a bad name in the past few years. They have been repositioned and repurposed as an archive medium rather than a backup medium. But tape is the greenest and most powerful storage diet pill around. And we should not discount tapes, because tapes are fighting back. Pretty soon you will be hearing about the Linear Tape File System (LTFS). In a nutshell, LTFS allows you to use the tape almost as if it were a hard disk. You can drag and drop files from your server to the tape, see the list of saved files using a standard operating system directory (no backup software catalog needed), and use point-and-click to restore. How cool is that!

And Solid State Drives (SSDs) make sense as well.

There are times when we need IOPS, and with spinning drives, we have to set up many disk spindles to achieve the IOPS that we want. For example, using the diagram below from the godfather of storage, Greg Schulz,

The set of 16 spinning HDDs on the left can only deliver 3,520 IOPS. The problem is, we have wasted a lot of disk space, as seen in the diagram below. This design, which most customers would be accustomed to, may look cheaper but in actual fact is NOT.

If the price of a Fibre Channel HDD is RM2,000, the total for 16 would be RM32,000.00. That is not inclusive of additional power and cooling, rack space and data management costs. Assume the SSDs cost 5 times more than the Fibre Channel HDD. SSDs are capable of delivering very high IOPS, and here I am putting a modest 5,000 IOPS per SSD. With just 2 SSDs (as the design on the right suggests), the total cost is only RM20,000. It has greater performance room to grow, and also savings in data management, power and cooling.
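Redoing that arithmetic in a few lines of Python (using the figures above, with the ~220 IOPS per FC spindle inferred from 16 drives delivering 3,520 IOPS):

```python
# Redoing the arithmetic above. The RM2,000 HDD price, the 5x SSD multiplier
# and 5,000 IOPS per SSD are the assumptions stated in the text; the 220 IOPS
# per FC spindle is inferred from 16 drives ~ 3,520 IOPS.

HDD_PRICE, HDD_IOPS = 2000, 220
SSD_PRICE, SSD_IOPS = 2000 * 5, 5000

hdd_count, ssd_count = 16, 2
print("HDD design:", hdd_count * HDD_IOPS, "IOPS for RM", hdd_count * HDD_PRICE)
print("SSD design:", ssd_count * SSD_IOPS, "IOPS for RM", ssd_count * SSD_PRICE)
# HDD design: 3520 IOPS for RM 32000
# SSD design: 10000 IOPS for RM 20000 -- fewer devices, less power and cooling too.
```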

Folks, consider SSDs as part of your storage diet plan.

All these features are available, in whole or in part, as part of the storage technology offerings out there. With all that said, are you doing something about it? Get off your lazy bum and start managing your storage; put your storage on a diet!!!

Playing with NetApp … After Rightsizing

It has been a tough week for me and that’s why I haven’t been writing much. So, right now, right after dinner, I am back on the keyboard again, continuing where I left off with NetApp’s usable capacity.

A blog and a half ago, I wrote about the journey of getting to NetApp’s usable capacity, stopping at the point of the disk capacity after rightsizing. We ended with the table below.

Manufacturer Marketing Capacity | NetApp Rightsized Capacity
36GB | 34.0/34.5GB*
72GB | 68GB
144GB | 136GB
300GB | 272GB
600GB | 560GB
1TB | 847GB
2TB | 1.69TB
3TB | 2.48TB

* The size of 34.5GB applied to the Fibre Channel zone checksum mechanism of 512 bytes per sector, employed prior to ONTAP 6.5. From ONTAP 6.5 onward, block checksums of 520 bytes per sector were employed for greater data integrity protection and resiliency.

At this stage, the next variable to consider is RAID group sizing. NetApp’s ONTAP employs 2 RAID levels – RAID-4 and the default RAID-DP (a unique implementation of RAID-6, employing 2 dedicated disks as double parity).

Before all the physical hard disk drives (HDDs) are pooled into a logical construct called an aggregate (upon which ONTAP’s FlexVols are built), the HDDs are grouped into RAID groups. A RAID group is also a logical construct, in which HDDs are designated as data or parity disks. The RAID group is the building block of the aggregate.

So why a RAID group? Well, first of all (although it is likely possible), it is not prudent to group a large number of HDDs into a single group with only 2 parity drives supporting the RAID. Even though one can maximize the allowable aggregated capacity from the HDDs, the data reconstruction or resilvering operation following an HDD failure (disks are supposed to fail once in a while, remember?) would slow the RAID operations to a trickle because of the large number of HDDs the operation has to address. Therefore, it is best to spread them out into multiple RAID groups with a recommended fixed number of HDDs per RAID group.

The RAID group is important because it is used to balance a few considerations:

  • Performance in recovery if there is a disk reconstruction or resilvering
  • Combined RAID performance and availability through a Mean Time Between Data Loss (MTBDL) formula

Different ONTAP versions (and also different disk types) have different numbers of HDDs that constitute a RAID group. For ONTAP 8.0.1, the table below shows its recommendations.

 

So, given a large pool of HDDs, the NetApp storage administrator has to figure out the best layout and the optimal number of HDDs to get to the capacity he/she wants. There is also a best practice, in a RAID-DP configuration, to set aside 2 HDDs as spares for every 30 or so HDDs. And it is best practice to take the default recommended RAID group size most of the time.

I would presume that this is all getting very confusing, so let me show it with an example. Let’s use the common 2TB SATA HDD and assume the customer has just bought a FAS6000 with 100 HDDs. From the table above, the default (and recommended) RAID group size is 14. The customer wants maximum usable capacity as well. In a step-by-step guide,

  1. Consider the hot sparing best practice. The customer wants to ensure that there will always be enough spares, so using the rule-of-thumb of 2 HDDs per 30 HDDs, 6 disks are set aside as hot spares. That leaves 94 HDDs from the initial 100 HDDs.
  2. There is a root volume, rootvol, and it is recommended to put this into an aggregate of its own so that it gets maximum performance and availability. To standardize, the storage administrator configures 3 HDDs as 1 RAID group to create the rootvol aggregate, aggr0. Even though the total capacity used by the rootvol is just a few hundred GBs, it is not recommended to place data into rootvol. Of course, this situation cannot be avoided in most of the FAS2000 series, where smaller HDD counts are sold and implemented. With 3 HDDs used up as rootvol, the customer now has 91 HDDs.
  3. With 91 HDDs, and using the default RAID group size of 14, for the next aggregate of aggr1, the storage administrator can configure 6 x full RAID group of 14 HDDs (6 x 14 = 84) and 1 x partial RAID group of 7. (91/14 = 6 remainder 7). And 84 + 7 = 91 HDDs.
  4. RAID-DP requires 2 disks per RAID group to be used as parity disks. Since there are a total of 7 RAID groups from the 91 HDDs, 14 HDDs are parity disks, leaving 77 HDDs as data disks.

This is where the rightsized capacity comes back into play. 77 x 2TB HDDs is really 77 x 1.69TB = 130.13TB, down from an initial 100 x 2TB = 200TB.
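For those who like to check the math, here is the same layout as a quick Python calculation. The figures come straight from the example above; the script itself is just illustrative arithmetic:

```python
# Step-by-step recap of the layout above. The figures (100 x 2TB drives,
# RAID group size 14, 1.69TB rightsized, 6 spares, 3-disk rootvol aggregate)
# come straight from the example; the helper is just illustrative arithmetic.

TOTAL_HDDS = 100
RIGHTSIZED_TB = 1.69
RAID_GROUP_SIZE = 14

spares = 6                                   # hot spares, per step 1
rootvol = 3                                  # dedicated aggr0, per step 2
data_pool = TOTAL_HDDS - spares - rootvol    # 91 HDDs left for aggr1

full_groups, remainder = divmod(data_pool, RAID_GROUP_SIZE)   # 6 full groups + 7 leftover
raid_groups = full_groups + (1 if remainder else 0)           # 7 RAID groups in total
parity = raid_groups * 2                                      # RAID-DP: 2 parity disks per group
data_disks = data_pool - parity                               # 77 data disks

print(data_disks, "data disks ->", round(data_disks * RIGHTSIZED_TB, 2), "TB usable so far")
# 77 data disks -> 130.13 TB usable so far (before the further deductions covered in part 3)
```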

If you intend to create more aggregates (in our example here, we have only 2 aggregates – aggr0 and aggr1), there will be more consideration for RAID group sizing and parity disks, further reducing the usable capacity.

This is just part 2 of our “Playing with NetApp Capacity” series. We have not arrived at the final usable capacity yet and I will further share that with you over the weekend.

ONTAP vs ZFS

I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!

I have been studying both of these similar Copy-on-Write (COW) file systems at the data structure level for a while now, and I strongly believe ZFS is a better implementation of the COW file system (also known as a “shadow-paging” file system) than WAFL. How are they similar and how are they different? The angle we are looking at is not performance, but resiliency and reliability.

(Note: btrfs or “Butter File System” is another up-and-coming COW file system under GPL license and is likely to be the default file system for the coming Fedora 16)

In Computer Science, a COW file system is a tree-like data structure, as shown below. It is different from the traditional Berkeley Fast File System data structure, also shown below:

As some of you may know, Berkeley Fast File System is the foundation of some modern day file systems such as Windows NTFS, Linux ext2/3/4, and Veritas VxFS.

The COW file system is another school of thought, and this type of file system is designed as a tree-like data structure.

In a COW file system, or more rightly named a shadow-paging file system, the original node of the data block is never modified. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node, and that parent node is linked to a higher parent node and so on all the way to the top-most root node, each parent and higher parent node is modified as the change traverses through the tree, ending at the root node.

The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.

 

As the data blocks of either the leaf node (the last node in the tree) or the parent nodes are modified, pointers to either the original data blocks or the copied data blocks are updated accordingly relative to the original tree structure, until the last root node at the top of the shadow tree is modified. Then, the COW file system commit is considered complete. Take note that the entire process of changing pointers and modifying copies of the nodes is done in a single I/O.
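Here is a minimal, generic Python sketch of shadow paging on a tree – not ZFS or WAFL code – showing how the leaf is never modified in place, parent copies are created up the tree, and the “commit” is simply pointing to the new root while the old root still describes the snapshot view:

```python
# Minimal sketch of shadow paging on a tree: the leaf is never modified in
# place -- a copy is made, and new parent copies are created all the way up
# until a new root is swapped in. Generic illustration, not ZFS or WAFL code.

class Node:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children or []

def cow_update(root, path, new_value):
    """Return a NEW root; every node along `path` is copied, the rest is shared."""
    if not path:                                  # reached the leaf
        return Node(value=new_value)
    index = path[0]
    new_children = list(root.children)            # shallow copy: siblings are shared
    new_children[index] = cow_update(root.children[index], path[1:], new_value)
    return Node(value=root.value, children=new_children)

leaf = Node("old data")
old_root = Node("root", [Node("parent", [leaf])])
new_root = cow_update(old_root, [0, 0], "new data")   # commit = atomically point to new_root

print(old_root.children[0].children[0].value)   # "old data" -- the snapshot view
print(new_root.children[0].children[0].value)   # "new data" -- the live view
```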

The root at the top is called the uberblock in ZFS and fsinfo in WAFL. Because an exact shadow of the tree-like file system is created when the data blocks are modified, this also gives birth to how snapshots are created in a COW file system. It’s all about pointers, baby!

Here’s how it looks with the original data tree and the snapshot data tree once the shadow-paging modifications are complete.

 

However, there are a few key features from the data integrity and reliability point of view where ZFS is better than WAFL. Let me share them with you.

In a nutshell, ZFS is a layered architecture that looks like this

The Data Management Unit (DMU) layer is one implementation that ensures stronger data integrity. The DMU maintains a checksum on the data in each data block by storing the checksum in the parent block. Thus if something is messed up in a data block (possibly by silent data corruption), the checksum in the parent block will detect it and can also repair the corruption if there is sufficient redundancy information in the data tree.

WAFL will not be able to detect such data corruption because its checksum is applied at the disk block level, and the parity derived during the RAID-DP write does not flag such a discrepancy. An old set of slides I found portrayed the comparison as shown below.

 

Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks store up to 3 copies of the metadata, which allows the recovery of lost metadata even if 2 copies of the metadata are lost.

Therefore, the ability of ZFS to survive data corruption and metadata loss is stronger when compared to WAFL. This is not to discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern-day file systems.
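To illustrate the DMU idea of keeping the child’s checksum in the parent, here is a simplified Python sketch – not the actual ZFS on-disk format – showing how silent corruption in a data block is caught on read and healed from a redundant copy:

```python
# Sketch of the DMU idea: the parent holds the checksum of the child block,
# so corruption in the child is caught on read and can be repaired from a
# redundant copy. Simplified illustration, not the actual ZFS on-disk format.

import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

child_block = b"application data"
mirror_copy = bytes(child_block)                       # redundancy elsewhere in the pool
parent_block = {"child_pointer": 0, "child_checksum": checksum(child_block)}

child_block = b"applicatiom data"                      # silent bit rot on the medium

if checksum(child_block) != parent_block["child_checksum"]:
    # Detected on read via the parent's checksum; self-heal from the good copy.
    assert checksum(mirror_copy) == parent_block["child_checksum"]
    child_block = mirror_copy

print(child_block)   # b'application data' -- corruption detected and repaired
```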

There are many other features within ZFS that improve upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is a superset implementation of traditional RAID-5, but with a twist. Instead of using a fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2 uses a dynamic, variable stripe width. This addresses the parity RAID-4/5 “write hole” flaw, where incomplete or partial stripes result in a “hole” that leads to file system fragmentation. RAID-Z/Z2 addresses this by filling up all blocks with variable stripe widths. A parity can be calculated and assigned to any stripe width, as shown below.

 

Other really cool stuff includes the Hybrid Storage Pool and the ability to create software-based caching using fast drives such as SSDs. This approach of creating ReadZilla (read caching) and LogZilla (write caching) eliminates the need for the proprietary NVRAM implemented in NetApp’s WAFL.

The only problem is, despite the super cool features of ZFS, most Oracle (not Sun) sales folks do not have much of a clue about how to sell ZFS storage. NetApp, with its well-trained and well-tuned sales force, is beating Oracle to a pulp.