Highroad to Parallel Road

Unless you are working with highly, parallelized access to files in a large scale-out NAS environment, you probably don’t get to work much with Parallel NFS (pNFS). pNFS is part of the NFSv4.1 (RFC 5661) standard, and was introduced in January 2010 to address NFS protocol in the clustered, scale-out NAS environment. It is to provide parallel file access across distributed servers.

pNFS is heavily driven by Panasas, NetApp, EMC, IBM, Sun (now Oracle) among others. And funnily enough, the company that sticks out from the bunch is one that used to tout block storage as the way to go, not files. That’s EMC, the company that more well known for its SAN solutions than its NAS (remember Celerra and IP4700?). And EMC has embraced pNFS in a big, big way. Read EMC’s CTO for Global Marketing, Chuck Hollis’ blog here and here.

However, unknown to many, a lot of the thinking that goes into pNFS are very similar to an EMC product some years ago. That product is EMC Highroad, which in the later years, renamed as Multi-path File System (MPFS).

Note: If you want to know more about the history of HighRoad/MPFS, read this blog.

The cornerstone of EMC MPFS is their File Mapping Protocol or FMP, which is a robust protocol that lines the mapping of files to their addressable blocks in storage. In a nutshell, when I was made responsible for this product during my time at EMC, I used to pitch to companies that MPFS was a file request is through NFS but respond to the requester can be in blocks (iSCSI or Fibre Channel). The beauty of this was, NFSv3 was chatty and heavy but the delivery of data through blocks via iSCSI or Fibre Channel has lesser overhead compared to NFSv3.

Hence the delivery is faster and EMC touted that the performance was 2-4x faster than NFS. Indeed, I have seen some lab tests results from EMC’s work with Schlumberger High Performance Lab in Houston, and the numbers were impressive. I still have them on Powerpoint somewhere.

In circa of 2003-2004, EMC donated the FMP code to IETF and as they say the rest was history.

The picture below basically summarizes what pNFS is all about.

 

A NFSv4 client will basically check with a Metadata server via the pNFS protocol about the layout of the distributed cluster of server. The file, could be striped across multiple nodes of the cluster, and it is the metadata server that provides a map to the client to access the blocks or files from these nodes. This is exactly what the EMC HighRoad/MPFS File Mapping Protocol (FMP) did, mapping the file requests to its respective corresponding blocks. See diagram below:

 

And one of the powerful feature of pNFS is that it is not just about NFS. The green arrow you see in the above diagram is the storage-access protocol. That access protocol can be NFSv4.1, CIFS, iSCSI, Fibre Channel, FCoE, Infiniband, and Object Storage Device (OSD).

In order to have pNFS working, the NFS client must be NFSv4.1 ready and that code has been made available in Linux and OpenSolaris. Other Unix vendors, no doubt, will be coming out with their NFSv4.1 implementation soon. Oooooh, there will be a Windows NFSv4.1 client coming as well!

But I want to dispel the notion that EMC is a SAN company. EMC is a very strong NAS company and if you have seen the IDC market share (ok, ok, many of you out there will argue about it), EMC is #1 in NAS. And their contribution to pNFS is immense.

Stop stroking your …

A few days after I wrote about the performance benchmark bag of tricks, EMC was the first to fire the first salvo at NetApp’s SPECSfs2008 world records on NFS IOPS.

EMC is obviously using all its ammo to deflate NetApp chest thumping act, with Storagezilla‘s blog. Mark Twomey,  who is the alter ego of Storagezilla posted several observations about NetApp apparent use of disk short stroking to artificially boost its performance numbers. This puts NetApp against the wall, with Alex MacDonald (who incidentally is SNIA NFSv4 co-chairman) of the office of the CTO responding hard to Storagezilla’s observation.

The news of this appeared in The Register. Read all about it.

With no letting up, the article also mentioned EMC Isilon’s CTO, Rob Pegler, adding more fuel to the fire.

I spoke about short stroking as some of the tricks used to gain better numbers in benchmark. And I also mentioned that these numbers have little use to the real work and I would like to add that these numbers are just there for marketing reasons. So, for you readers out there, benchmark is really not big of a deal.

Have a great weekend!

Data Deduplication – Dell is first and last

A very interesting report surfaced in front of me today. It is Information Week’s IT Pro ranking of Data Deduplication vendors, just made available a few weeks ago, and it is the overview of the dedupe market so far.

It surveyed over 400 IT professionals from various industries with companies ranging from less than 50 employees to over 10,000 employees and revenues of less than USD5 million to USD1 billion. Overall, it had a good mix of respondents. But the results were quite interesting.

It surveyed 2 segments

  1. Overall performance – product reliability, product performance, acquisition costs, operations costs etc.
  2. Technical features – replication, VTL, encryption, iSCSI and FCoE support etc.

When I saw the results (shown below), surprise, surprise! Here’s the overall performance survey chart:

Dell/Compellent scored the highest in this survey while EMC/Data Domain ranked the lowest. However, the difference between the first place and the last place vendor is only 4%, and this is to suggest that EMC/Data Domain was about just as good as the Dell/Compellent solution, but it scored poorly in the areas that matters most to the customer. In fact, as we drill down into the requirements of the overall performance one-by-one, as shown below,

there is little difference among the 7 vendors.

However, when it comes to Technical Features, Dell/Compellent is ranked last, the complete opposite. As you can see from the survey chart below, IBM ProtecTier, NetApp and HP are all ranked #1.

The details, as per the technical requirements of the customers, are shown below:

These figures show that the competition between the vendors is very, very stiff, with little edge difference from one to another. But what I was more interested were the following findings, because these figures tell a story.

In the survey, only 34% of the respondents say they have implemented some data deduplication solutions, while the rest are evaluating and plan to evaluation. This means that the overall market is not saturated and there is still a window of opportunity for the vendors. However, the speed of the a maturing data deduplication market, from early adopters perhaps 4-5 years ago to overall market adoption, surprised many, because the storage industry tend to be a bit less trendy than most areas of IT. With the way the rate of data deduplication is going, it will be very much a standard feature of all storage vendors in the very near future.

The second figures that is probably not-so-surprising is, for most of the customers who have already implemented the data deduplication solution, almost 99% are satisfied or somewhat satisfied with their solutions. Therefore, the likelihood of these customer switching vendors and replacing their gear is very low, perhaps partly because of the reliability of the solution as well as those products performing as they should.

The Information Week’s IT Pro survey probably reflected well of where the deduplication market is going and there isn’t much difference in terms of technical and technology features from vendor to vendor. Customer will have to choose beyond the usual technology pitch, and look for other (and perhaps more important) subtleties such as customer service, price and flexibility of doing business with. EMC/Data Domain, being king-of-the-hill, has not been the best of vendor when it comes to price, quality of post-sales support and service innovation. Let’s hope they are not like the EMC sales folks of the past, carrying the “Take it or leave it” tag when they develop their relationship with their future customers. And it will not help if word-of-mouth goes around the industry about EMC’s arrogance of their dominance. It may not be true, and let’s hope it is not true because the EMC of today has changed plenty compared to the Symmetrix days. EMC/Data Domain is now part of their Backup Recovery Service (BRS) team, and I have good friends there at EMC Malaysia and Singapore. They are good guys but remember guys, customer is still king!

Dell, new with their acquisition of Compellent and Ocarina Networks, seems very eager to win the business and kudos to them as well. In fact, I heard from a little birdie that Dell is “giving away” several units of Compellents to selected customers in Malaysia. I did not and cannot ascertain if this is true or not but if it is, that’s what I call thinking-out-of-the-box, given Dell as a late comer into the storage game. Well done!

One thing to note is that the survey took in 17 vendors, including Exagrid, Falconstor, Quantum, Sepaton and so on, but only the top-7 shown in the charts qualified.

In the end, I believe the deduplication vendors had better scramble to grab as much as they can in the coming months, because this market will be going, going, gone pretty soon with nothing much to grab after that, unless there is a disruptive innovation to the deduplication technology

Playing with NetApp … final usable capacity

This is the third and last blog entry of how do we get the ONTAP final capacity.

In my first blog, we ran through a gamut of explanations how disk rightsizing came about for NetApp’s ONTAP. And the importance of disk rightsizing is to give ONTAP a level set of disks, regardless of manufacturer, model, make, firmware versions and so on, and ONTAP is pretty damn sure that the disks that it gets will not mess up.

In my second blog, progressing from the disk rightsizing stage, was the RAID group sizing stage, where different RAID group size affected the number of disks used for data and for parity in an aggregate. An aggregate, for the uninformed, is the disks pool in which the flexible volume, FlexVol, is derived. In a simple picture below,

OK, the diagram’s in Japanese (I am feeling a bit cheeky today :P)!

But it does look a bit self explanatory with some help which I shall provide now. If you start from the bottom of the picture, 16 x 300GB disks are combined together to create a RAID Group. And there are 4 RAID Groups created – rg0, rg1, rg2 and rg3. These RAID groups make up the ONTAP data structure called an aggregate. From ONTAP version 7.3 onward, there were some minor changes of how ONTAP reports capacity but fundamentally, it did not change much from previous versions of ONTAP. And also note that ONTAP takes a 10% overhead of the aggregate for its own use.

With the aggregate, the logical structure called the FlexVol is created. FlexVol can be as small as several megabytes to as large as 100TB, incremental by any size on-the-fly. This logical structure also allow shrinking of the capacity of the volume online and on-the-fly as well. Eventually, the volumes created from the aggregate become the next-building blocks of NetApp NFS and CIFS volumes and also LUNs for iSCSI and Fibre Channel. Also note that, for a more effective organization of logical structures from the volumes, using qtree is highly recommended for files and ONTAP management reasons.

However, for both aggregate and the FlexVol volumes created from the aggregate, snapshot reserve is recommended. The aggregate takes a 5% overhead of the capacity for snapshot reserve, while for every FlexVol volume, a 20% snapshot reserve is applied. While both snapshot percentage are adjustable, it is recommended to keep them as best practice (except for FlexVol volumes assigned for LUNs, which could be adjusted to 0%)

Note: Even if the snapshot reserve is adjusted to 0%, there are still some other rule sets for these LUNs that will further reduce the capacity. When dealing with NetApp engineers or pre-sales, ask them about space reservations and how they do snapshots for fat LUNs and thin LUNs and their best practices in these situations. Believe me, if you don’t ask, you will be very surprised of the final usable capacity allocated to your applications)

In a nutshell, the dissection of capacity after the aggregate would look like the picture below:

 

We can easily quantify the overall usable in the little formula that I use for some time:

Rightsized Disks capacity x # Disks x 0.90 x 0.95 = Total Aggregate Usable Capacity

Then remember that each volume takes a 20% snapshot reserve overhead. That’s what you have got to play with when it comes to the final usable capacity.

Though the capacity is not 100% accurate because there are many variables in play but it gives the customer a way to manually calculate their potential final usable capacity.

Please note the following best practices and this is only applied to 1 data aggregate only. For more aggregates, the same formula has to be applied again.

  1. A RAID-DP, 3-disk rootvol0, for the root volume is set aside and is not accounted for in usable capacity
  2. A rule-of-thumb of 2-disks hot spares is applied for every 30 disks
  3. The default RAID Group size is used, depending on the type of disk profile used
  4. Snapshot reserves default of 5% for aggregate and 20% per FlexVol volumes are applied
  5. Snapshots for LUNs are subjected to space reservation, either full or fractional. Note that there are considerations of 2x + delta and 1x + delta (ask your NetApp engineer) for iSCSI and Fibre Channel LUNs, even though snapshot reserves are adjusted to 0% and snapshots are likely to be turned off.

Another note that remember is not to use any of those Capacity Calculators given. These calculators are designed to give advantage to NetApp, not necessarily to the customer. Therefore, it is best to calculate these things by hand.

Regardless of how the customer will get as the overall final usable capacity, it is the importance to understand the NetApp philosophy of doing things. While we have perhaps, went overboard explaining the usable capacity and the nitty gritty that comes with it, all these things are done for a reason to ensure simplicity and ease of navigating data management in the storage networking world. Other NetApp solutions such as SnapMirror and SnapVault and also the SnapManager suite of product rely heavily on this.

And the intangible benefits of NetApp and ONTAP definitely have moved NetApp forward since its early years, into what NetApp is today, a formidable storage juggernaut.

Playing with NetApp … After Rightsizing

It has been a tough week for me and that’s why I haven’t been writing much this week. So, right now, right after dinner, I am back on keyboard again, continuing where I have left off with NetApp’s usable capacity.

A blog and a half ago, I wrote about the journey of getting NetApp’s usable capacity and stopping up to the point of the disk capacity after rightsizing. We ended with the table below.

Manufacturer Marketing Capacity NetApp Rightsized Capacity
36GB 34.0/34.5GB*
72GB 68GB
144GB 136GB
300GB 272GB
600GB 560GB
1TB 847GB
2TB 1.69TB
3TB 2.48TB

* The size of 34.5GB was for the Fibre Channel Zone Checksum mechanism employed prior to ONTAP version 6.5 of 512 bytes per sector. After ONTAP 6.5, block checksum of 520 bytes per sector was employed for greater data integrity protection and resiliency.

At this stage, the next variable to consider is RAID group sizing. NetApp’s ONTAP employs 2 types of RAID level – RAID-4 and the default RAID-DP (a unique implementation of RAID-6, employing 2 dedicated disks as double parity).

Before all the physical hard disk drives (HDDs) are pooled into a logical construct called an aggregate (which is what ONTAP’s FlexVol is about), the HDDs are grouped into a RAID group. A RAID group is also a logical construct, in which it combines all HDDs into data or parity disks. The RAID group is the building block of the Aggregate.

So why a RAID group? Well, first of all, (although likely possible), it is not prudent to group a large number of HDDs into a single group with only 2 parity drives supporting the RAID. Even though one can maximize the allowable, aggregated capacity from the HDDs, the data reconstruction or data resilvering operation following a HDD failure (disks are supposed to fail once in a while, remember?) would very much slow the RAID operations to a trickle because of the large number of HDDs the operation has to address. Therefore, it is best to spread them out into multiple RAID groups with a recommended fixed number of HDDs per RAID group.

RAID group is important because it is used to balance a few considerations

  • Performance in recovery if there is a disk reconstruction or resilvering
  • Combined RAID performance and availability through a Mean Time Between Data Loss (MTBDL) formula

Different ONTAP versions (and also different disk types) have different number of HDDs to constitute a RAID group. For ONTAP 8.0.1, the table below are its recommendation.

 

So, given a large pool of HDDs, the NetApp storage administrator has to figure out the best layout and the optimal number of HDDs to get to the capacity he/she wants. And there is also a best practice to set aside 2 HDDs for a RAID-DP configuration with every 30 or so HDDs. Also, it is best practice to take the default recommended RAID group size most of the time.

I would presume that this is all getting very confusing, so let me show that with an example. Let’s use the common 2TB SATA HDD and let’s assume the customer has just bought a 100 HDDs FAS6000. From the table above, the default (and recommended) RAID group size is 14. The customer wants to have maximum usable capacity as well. In a step-by-step guide,

  1. Consider the hot sparing best practice. The customer wants to ensure that there will always be enough spares, so using the rule-of-thumb of 2 HDDs per 30 HDDs, 6 disks are set aside as hot spares. That leaves 94 HDDs from the initial 100 HDDs.
  2. There is a root volume, rootvol, and it is recommended to put this into an aggregate of its own so that it gets maximum performance and availability. To standardize, the storage administrator configures 3 HDDs as 1 RAID group to create the rootvol aggregate, aggr0. Even though the total capacity used by the rootvol is just a few hundred GBs, it is not recommended to place data into rootvol. Of course, this situation cannot be avoided in most of the FAS2000 series, where a smaller HDDs count are sold and implemented. With 3 HDDs used up as rootvol, the customer now has 91 HDDs.
  3. With 91 HDDs, and using the default RAID group size of 14, for the next aggregate of aggr1, the storage administrator can configure 6 x full RAID group of 14 HDDs (6 x 14 = 84) and 1 x partial RAID group of 7. (91/14 = 6 remainder 7). And 84 + 7 = 91 HDDs.
  4. RAID-DP requires 2 disks per RAID group to be used as parity disks. Since there are a total of 7 RAID groups from the 91 HDDs, 14 HDDs are parity disks, leaving 77 HDDs as data disks.

This is where the rightsized capacity comes back into play again. 77 x 2TB HDDs is really 77 x 1.69TB = 130.13TB from an initial of 100 x 2TB = 200TB.

If you intend to create more aggregates (in our example here, we have only 2 aggregates – aggr0 and aggr1), there will be more consideration for RAID group sizing and parity disks, further reducing the usable capacity.

This is just part 2 of our “Playing with NetApp Capacity” series. We have not arrived at the final usable capacity yet and I will further share that with you over the weekend.

Playing with NetApp … (Capacity) BR

Much has been said about usable disk storage capacity and unfortunately, many of us take the marketing capacity number given by the manufacturer in verbatim. For example, 1TB does not really equate to 1TB in usable terms and that is something you engineers out there should be informing to the customers.

NetApp, ever since the beginning, has been subjected to the scrutiny of the customers and competitors alike about their usable capacity and I intend to correct this misconception. And the key of this misconception is to understand what is the capacity before rightsizing (BR) and after rightsizing (AR).

(Note: Rightsizing in the NetApp world is well documented and widely accepted with different views. It is part of how WAFL uses the disks but one has to be aware that not many other storage vendors publish their rightsizing process, if any)

Before Rightsizing (BR)

First of all, we have to know that there are 2 systems when it comes to system of unit prefixes. These 2 systems can be easily said as

  • Base-10 (decimal) – fit for human understanding
  • Base-2 (binary) – fit for computer understanding

So according the International Systems of Units, the SI prefixes for Base-10 are

Text Factor Unit
kilo 103 1,000
mega 106 1,000,000
giga 109 1,000,000,000
tera 1012 1,000,000,000,000

In computer context, where the binary, Base-2 system is relevant, that SI prefixes for Base-2 are

Text Factor Unit
kilo-byte 210 1,024
mega-byte 220 1,048,576
giga-byte 230 1,073,741,824
tera-byte 240 1,099,511,627,776

And we must know that the storage capacity is in Base-2 rather than in Base-10. Computers are not humans.

With that in mind, the next issue are the disk manufacturers. We should have an axe to grind with them for misrepresenting the actual capacity. When they say their HDD is 1TB, they are using the Base-10 system i.e. 1TB = 1,000,000,000,000 bytes. THIS IS WRONG!

Let’s see how that 1TB works out to be in Gigabytes in the Base-2 system:

1,000,000,000/1,073,741,824 = 931.3225746154785 Gigabytes

Note: 230 =1,073,741,824

That result of 1TB, when rounded, is only about 931GB! So, the disk manufacturers aren’t exactly giving you what they have advertised.

Thirdly, and also the most important factor in the BR (Before Rightsizing) phase is how WAFL handles the actual capacity before the disk is produced to WAFL/ONTAP operations. Note that this is all done before all the logical structures of aggregates, volumes and LUNs are created.

In this third point, WAFL formats the actual disks (just like NTFS formats new disks) and this reduces the usable capacity even further. As a starting point, WAFL uses 4K (4,096 bytes) per block

For Fibre Channel disks, WAFL formats them with a 520 byte per sector. Therefore, for each block, 8 sectors (520 x 8 = 4160 bytes) fill 1 x 4K block, with remainder of 64 bytes (4,160 – 4,096 = 64 bytes) for the checksum of the 1 x 4K block. This additional 64 bytes per block checksum is not displayed by WAFL or ONTAP and not accounted for in its usable capacity.

512 bytes per sector are used for formatting SATA/SAS disks and it consumes 9 sectors (9 x 512 = 4,608 bytes). 8 sectors will be used for WAFL’s 4K per block (4,096/512 = 8 sectors), the remainder of 1 sector (the 9th sector) of 512 bytes is used partially for its 64 bytes checksum. Again, this 448 bytes (512 – 64 = 448 bytes) is not displayed and not part of the usable capacity of WAFL and ONTAP.

And WAFL also compensates for the ever-so-slightly irregularities of the hard disk drives even though they are labelled with similar marketing capacities. That is to say that 1TB from Seagate and 1TB from Hitachi will be different in terms actual capacity. In fact, 1TB Seagate HDD with firmware 1.0a (for ease of clarification) and 1TB Seagate HDD with firmware 1.0b (note ‘a’ and ‘b’) could be different in actual capacity even when both are shipped with a 1.0TB marketing capacity label.

So, with all these things in mind, WAFL does what it needs to do – Right Size – to ensure that nothing get screwed up when WAFL uses the HDDs in its aggregates and volumes. All for the right reason – Data Integrity – but often criticized for their “wrongdoing”. Think of WAFL as your vigilante superhero, wanted by the law for doing good for the people.

In the end, what you are likely to get Before Rightsizing (BR) from NetApp for each particular disk capacity would be:

Manufacturer Marketing Capacity NetApp Rightsized Capacity Percentage Difference
36GB 34.0/34.5GB* 5%
72GB 68GB 5.55%
144GB 136GB 5.55%
300GB 272GB 9.33%
600GB 560GB 6.66%
1TB 847GB 11.3%
2TB 1.69TB 15.5%
3TB 2.48TB 17.3%

* The size of 34.5GB was for the Fibre Channel Zone Checksum mechanism employed prior to ONTAP version 6.5 of 512 bytes per sector. After ONTAP 6.5, block checksum of 520 bytes per sector was employed for greater data integrity protection and resiliency.

From the table, the percentage of “lost” capacity is shown and to the uninformed, this could look significant. But since the percentage value is relative to the Manufacturer’s Marketing Capacity, this is highly inaccurate. Therefore, competitors should not use these figures as FUD and NetApp should use these as a way to properly inform their customers.

You have been informed about NetApp capacity before Right Sizing.

I will follow on another day with what happens next after Right Sizing and the final actual usable capacity to the users and operations. This will be called After Rightsizing (AR). Till then, I am going out for an appointment.