The recipe for storage performance modeling

Good morning, afternoon, evening, Ladies & Gentlemen, wherever you are.

Today, we are going to learn how to bake, errr … I mean, make a storage performance model. Before we begin, allow me to set the stage.

Don’t you just hate it when you are asked to do storage performance sizing and you don’t have a freaking idea how to get started? A typical techie would probably say, “Aiya, just use the capacity lah!”, and usually, they will proceed to size the storage according to capacity. In fact, sizing by capacity is the worst way to do storage performance modeling.

Bear in mind that storage is not a black box, although some people wish it were. Performance sizing is not black magic either, because it can be done in a very scientific and logical manner.

SNIA (Storage Networking Industry Association) has made a storage performance modeling methodology (that’s quite a mouthful), and basically simplified it into these few key ingredients. This recipe is for storage performance modeling in general and I am advising you guys out there to engage your storage vendor’s professional services. They will know their storage solutions best.

And I am going to say to you – don’t be cheap and skip professional services – go and get to the experts out there. I was having a chat with a consultant just now at McDonald’s. I have known this friend of mine for about 6-7 years now and his name is Sugen Sumoo, the Director of DBORA Consulting. They specialize in Oracle and database performance tuning and performance forecasting and it is something that a typical DBA can’t do. DBORA Consulting is the Professional Service that brings this expertise and value to Oracle customers. Likewise, you have to engage your respective storage professional services as well.

In a cook book or a cooking show, you are presented with the ingredients used and in this recipe for storage performance modeling, the ingredients (in no particular order) are:

  • Application block size
  • Read and Write ratio
  • Application access patterns
  • Working set size
  • IOPS or throughput
  • Demand intensity

Application Block Size

First of all, the storage is there to serve applications. We always have to look from the applications’ point of view, not storage’s point of view. Different applications have different block sizes. Databases typically range from 8K-64K and backup applications usually deal with larger block sizes. Video applications can have 256K block sizes or higher. It all depends.

The best way is to find out from the DBA, email administrator or application developers. The unfortunate thing is most so-called technical people or administrators in Malaysia don’t have a clue about the applications they manage. So, my advice to you storage professionals: do your research on the application and take the default value. These clueless fellas are likely to have left it at the default.

Read and Write ratio

Applications behave differently at different times of the day, and at different times of the month (no, it’s not PMS). At the end of the financial year or calendar, there are some tasks that these applications do as well. But in a typical day, there is a different weightage or percentage of read operations versus write operations.

Most OLTP (online transaction processing)-based applications tend to be read heavy and write light, but we need to find out the ratio. Typically, it can be a 2:1 ratio or 60%:40%, but it is best to speak to the application administrators about the ratio. DSS (Decision Support Systems) and data warehousing applications could have much higher reads than writes while seismic-analysis applications can have multiple writes during the analysis periods. It all depends.
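To make the ratio concrete, here is a minimal sketch of how a read:write ratio turns a total IOPS figure into separate read and write numbers. The 6,000 IOPS workload and the function name `split_iops` are made up for illustration; only the 2:1 and 60%:40% ratios come from the text above.

```python
def split_iops(total_iops: float, read_parts: float, write_parts: float) -> tuple[float, float]:
    """Split a total IOPS figure into (read, write) IOPS for a read:write ratio."""
    total_parts = read_parts + write_parts
    if total_parts <= 0:
        raise ValueError("ratio parts must add up to a positive number")
    read_iops = total_iops * read_parts / total_parts
    return read_iops, total_iops - read_iops


# Hypothetical OLTP workload of 6,000 IOPS (an assumed figure, not a measurement)
for label, r, w in (("2:1", 2, 1), ("60%:40%", 60, 40)):
    reads, writes = split_iops(6000, r, w)
    print(f"{label:>8}: {reads:.0f} read IOPS, {writes:.0f} write IOPS")
```

The split matters because reads and writes usually cost the storage differently, so it is worth carrying them as two numbers through the rest of the sizing.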

To counter the “clueless” administrators, ask lots of questions. Find out the workflow of several key tasks and ask what each particular task does at different checkpoints of the application’s processing. If you are lazy (please don’t be lazy, because it degrades your value as a storage professional), use a rule of thumb.

Application access patterns

Applications behave differently in general. They can be sequential, like backup or video streaming. They can be random like emails, databases at certain times of the day, and so on. All these behavioral patterns affect how we design and size the disks in the storage.

Some RAID levels tend to work well with sequential access and others, with random access. It is not difficult to find out about the applications’ pattern and if you read more about the different RAID levels in storage, you can easily identify the type of RAID level suitable for each type of behavioral pattern.

Working set size

This variable is a bit more difficult to determine. This means that a chunk of the application has to be loaded into a working area, usually memory and cache memory, to be used and abused by the application users.

Unless someone is well versed with the applications, one would not be able to determine how much of the applications would be placed in memory and in cache memory. Typically, this can only be determined after the application has been running for some time.

The flexibility of having SSDs, especially the DRAM-type SSDs, is very useful to ensure that there is sufficient “working space” for these applications.

IOPS or Throughput

According to the SNIA model, for I/O less than 64K, IOPS should be used as the yardstick for storage performance modeling. For anything larger, use throughput, for which MB/sec is the measurement unit.
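The rule above is easy to put into a few lines. This is only a sketch: the function names are mine, the text does not say which side an I/O of exactly 64K falls on (I put it with throughput here), and I am assuming 1 MB = 1,048,576 bytes for the conversion.

```python
KIB = 1024
MIB = 1024 * KIB
THRESHOLD_BYTES = 64 * KIB  # the 64K cut-off quoted above


def sizing_metric(block_size_bytes: int) -> str:
    """Pick the yardstick: IOPS below 64K, MB/sec otherwise (64K itself is an assumption)."""
    return "IOPS" if block_size_bytes < THRESHOLD_BYTES else "MB/sec"


def throughput_mb_per_sec(iops: float, block_size_bytes: int) -> float:
    """Convert an IOPS figure at a given block size into MB/sec."""
    return iops * block_size_bytes / MIB


# Hypothetical workloads: an 8K database and a 256K video application
for name, iops, block in (("database", 5000, 8 * KIB), ("video", 400, 256 * KIB)):
    print(f"{name}: size by {sizing_metric(block)}, "
          f"{iops} IOPS x {block // KIB}K = {throughput_mb_per_sec(iops, block):.1f} MB/sec")
```

The two numbers are tied together by the block size, which is why the application block size is the first ingredient to pin down.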

The application guy would be able to tell you what kind of IOPS their application is expecting or what kind of throughput they want. Again, ask a lot of questions, because this will help you determine the type of disks and the kind of performance you give to the application guys.

If the application guy is clueless again, ask someone more senior or ask the vendor. If the vendor engineers cannot give you an answer, then they should not be working for the vendor.

Demand intensity

This part is usually overlooked when it comes to performance sizing. Demand intensity refers to how intense the I/O requests are. They could come from 1 channel or 1 part of the application, or they could come from several parts of the application in parallel. It is as if the storage is being ‘bombarded’ by applications and this is the part that is hard to determine as well.

In some applications, the degree of intensity or parallelism can be tuned and to find out, ask the application administrator or developer. If not, ask the vendor. Also do a lot of research on the application’s architecture.

And one last thing. What I have learned is to add buffers to the storage performance model. Typically I would add about 10-20% extra but you never know. To you storage professionals, I would strongly encourage you to engage professional services, because it is worthwhile, especially in the early stages of the sizing. It is usually a more expensive affair to size it after the applications have been installed and running.
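As a quick illustration of the buffer, here is a small sketch that pads a required figure by a percentage. The 10,000 IOPS requirement is a made-up number and `with_headroom` is just a name I picked; the 10-20% range is the one mentioned above.

```python
def with_headroom(required: float, buffer_pct: float) -> float:
    """Return the required figure padded by buffer_pct percent."""
    if buffer_pct < 0:
        raise ValueError("buffer_pct must not be negative")
    return required * (1 + buffer_pct / 100)


required_iops = 10_000  # hypothetical requirement from the application team
for pct in (10, 20):
    print(f"{pct}% buffer -> size for {with_headroom(required_iops, pct):.0f} IOPS")
```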

“Failure to plan is planning to fail”.  The recipe isn’t that difficult. Go figure it out.

3TB Seagate – a performance sloth

I can’t get home. I am stuck here at the coffee shop waiting out the traffic jam after the heavy downpour an hour ago.

It has been an interesting week for me, which began last week when we were testing the new Seagate 3TB Constellation ES.2 hard disk drives. It doesn’t matter if it was SAS or SATA because the disks were 7,200 RPM, and basically built the same. SAS or SATA is merely the conduit to the disks and we were out there maneuvering the issue at hand.

Here’s an account of the testing done by my team. My team has been testing the drives meticulously, using every trick in the book to milk performance from the Seagate drives. In the end, it wasn’t performance we got but more like duds from Seagate where this type of drive is concerned.

How did the tests go?

We were using a Unix operating system to test the sequential writes on different partitions of the disks, each with a sizable GB per partition. In one test, we used 100GB per partition. With each partition, we were testing the outer cylinders to the inner cylinders, and as the storage gurus will tell you, the outer rings run at a faster speed than the inner rings.

We thought it could be the file system we were using, so we switched the sequential writes to raw disks. We tweaked the OS parameters and tried various combinations of block sizes and so on. And what we discovered was a big surprise.

The throughput we got from the sequential writes was horrible. It started out almost 25% lower in MB/sec than a 2TB Western Digital RE4 disk, and as the test went on, the throughput in the inner rings went down to single digit MB/sec. According to reliable sources, the slowest published figures by Seagate were in the high 60’s for MB/sec but what we got was closer to 20+MB/sec. The Western Digital RE4 was giving out consistent throughput numbers throughout the test. We couldn’t believe it!
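For anyone who wants to try something similar, here is a rough sketch of a sequential write timing test. To be clear, this is not the tool my team used; it is a simplified stand-in that writes to an ordinary file in a temporary directory (writing to a raw disk needs root and destroys whatever is on it). The file name, sizes and block sizes are all arbitrary choices for illustration.

```python
import os
import tempfile
import time


def sequential_write_mb_per_sec(path, total_mb=64, block_kb=1024):
    """Write about total_mb of zeros to path in block_kb chunks and return MB/sec."""
    block = b"\0" * (block_kb * 1024)
    block_count = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:  # no Python-level buffering
        for _ in range(block_count):
            f.write(block)
        os.fsync(f.fileno())  # ask the OS to flush to the device before stopping the clock
    elapsed = time.perf_counter() - start
    written_mb = block_count * block_kb / 1024
    return written_mb / elapsed


# Small run so it finishes quickly; a real test would use far more data per partition
with tempfile.TemporaryDirectory() as tmp:
    for block_kb in (64, 1024):
        rate = sequential_write_mb_per_sec(
            os.path.join(tmp, "seqwrite.bin"), total_mb=16, block_kb=block_kb
        )
        print(f"{block_kb:>5} KB blocks: {rate:.1f} MB/sec")
```

A file-based run like this goes through the file system and its cache, so treat the numbers as relative only. To compare outer and inner cylinders you would have to repeat the same write against partitions placed at different offsets on the disk.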

We scoured the forums looking for similar issues, but we did not find much about this. This could be a firmware bug. We are in the midst of opening an escalation channel to Seagate to seek an explanation. I would like to share what we have discovered, and the issue can be easily reproduced. For customers who have purchased storage arrays with 2 or 3TB Seagate Constellation ES.2 drives, please take note. We were disappointed with the disks but thanks to my team for their diligent approach that resulted in this discovery.

Does all SSDs make sense?

I have been receiving a lot of email updates from Texas Memory Systems for many months now. I am a subscriber to their updates and Texas Memory Systems is the granddaddy of flash and DRAM-based storage systems. They are not cheap but they are blazingly fast.

Lately, more and more vendors have been coming out with all SSDs storage arrays. Startups such as Pure Storage, Violin Memory and Nimbus Data Systems have been pushing the envelope selling all SSDs storage arrays. A few days ago, EMC also announced their all SSDs storage array. As quoted, the new EMC VNX5500-F utilizes 2.5-in, single-level cell (SLC) NAND flash drives to deliver 10 times the performance of the hard-drive based VNX arrays. And that is important because EMC has just become one of the earliest big gorillas to jump on the bandwagon.

But does it make sense? Can one justify investing in an all SSDs storage array?

At this point, especially in this part of the world, I predict that not many IT managers are willing to put their head on the chopping board and invest in an all SSDs storage array. They would become guinea pigs for a very expensive exercise and the state of the economy is not helping. Therefore the automatic storage tiering (AST) might stick better than having an all SSDs storage array. The cautious and prudent approach is less risky as I have mentioned in a past blog.

I wrote about Pure Storage in a previous blog and the notion that SSDs will offer plenty of IOPS and throughput. If the performance gain translates into higher productivity and getting the job done quicker, then I am all for SSDs. In fact, given the extra performance numbers

There is no denying the fact that the industry is moving towards SSDs and it makes sense. That day will come in the near future but not now for customers in this part of the world.

Playing with NetApp … final usable capacity

This is the third and last blog entry on how we get the ONTAP final capacity.

In my first blog, we ran through a gamut of explanations how disk rightsizing came about for NetApp’s ONTAP. And the importance of disk rightsizing is to give ONTAP a level set of disks, regardless of manufacturer, model, make, firmware versions and so on, and ONTAP is pretty damn sure that the disks that it gets will not mess up.

In my second blog, progressing from the disk rightsizing stage, was the RAID group sizing stage, where different RAID group sizes affected the number of disks used for data and for parity in an aggregate. An aggregate, for the uninformed, is the disk pool from which the flexible volume, FlexVol, is derived. In a simple picture below,

OK, the diagram’s in Japanese (I am feeling a bit cheeky today :P)!

But it does look a bit self explanatory with some help which I shall provide now. If you start from the bottom of the picture, 16 x 300GB disks are combined together to create a RAID Group. And there are 4 RAID Groups created – rg0, rg1, rg2 and rg3. These RAID groups make up the ONTAP data structure called an aggregate. From ONTAP version 7.3 onward, there were some minor changes of how ONTAP reports capacity but fundamentally, it did not change much from previous versions of ONTAP. And also note that ONTAP takes a 10% overhead of the aggregate for its own use.

With the aggregate, the logical structure called the FlexVol is created. A FlexVol can be as small as several megabytes or as large as 100TB, and can grow by any size on-the-fly. This logical structure also allows shrinking of the capacity of the volume online and on-the-fly as well. Eventually, the volumes created from the aggregate become the next building blocks of NetApp NFS and CIFS volumes and also LUNs for iSCSI and Fibre Channel. Also note that, for a more effective organization of logical structures from the volumes, using qtrees is highly recommended for files and ONTAP management reasons.

However, for both the aggregate and the FlexVol volumes created from the aggregate, a snapshot reserve is recommended. The aggregate takes a 5% overhead of the capacity for snapshot reserve, while for every FlexVol volume, a 20% snapshot reserve is applied. While both snapshot percentages are adjustable, it is recommended to keep them as best practice (except for FlexVol volumes assigned for LUNs, which could be adjusted to 0%).

(Note: Even if the snapshot reserve is adjusted to 0%, there are still some other rule sets for these LUNs that will further reduce the capacity. When dealing with NetApp engineers or pre-sales, ask them about space reservations and how they do snapshots for fat LUNs and thin LUNs and their best practices in these situations. Believe me, if you don’t ask, you will be very surprised by the final usable capacity allocated to your applications)

In a nutshell, the dissection of capacity after the aggregate would look like the picture below:

 

We can easily quantify the overall usable capacity with the little formula that I have used for some time:

Rightsized Disks capacity x # Disks x 0.90 x 0.95 = Total Aggregate Usable Capacity

Then remember that each volume takes a 20% snapshot reserve overhead. That’s what you have got to play with when it comes to the final usable capacity.
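Here is the formula as a small sketch, so you can plug in your own numbers. The example figures are assumptions for illustration only: 272GB as the rightsized size of a 300GB disk, and 56 disks counted toward capacity (4 RAID groups of 16 disks, assuming RAID-DP takes 2 parity disks from each group). The function names are mine.

```python
ONTAP_OVERHEAD = 0.10      # 10% of the aggregate kept by ONTAP for its own use
AGGR_SNAP_RESERVE = 0.05   # default aggregate snapshot reserve
VOL_SNAP_RESERVE = 0.20    # default FlexVol snapshot reserve


def aggregate_usable_gb(rightsized_gb: float, disk_count: int) -> float:
    """Rightsized capacity x # disks x 0.90 x 0.95."""
    return rightsized_gb * disk_count * (1 - ONTAP_OVERHEAD) * (1 - AGGR_SNAP_RESERVE)


def volume_usable_gb(volume_gb: float, snap_reserve: float = VOL_SNAP_RESERVE) -> float:
    """Capacity left in a FlexVol after its snapshot reserve."""
    return volume_gb * (1 - snap_reserve)


# Assumed example: 56 data disks at 272GB rightsized each
aggr = aggregate_usable_gb(272, 56)
print(f"Aggregate usable: {aggr:.1f} GB")
print(f"One volume spanning it, 20% reserve: {volume_usable_gb(aggr):.1f} GB")
print(f"Same volume with reserve set to 0%: {volume_usable_gb(aggr, 0.0):.1f} GB")
```

Setting the volume reserve to 0% in the last line only removes the snapshot reserve itself; it does not model the LUN space reservation rules mentioned earlier.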

The result is not 100% accurate because there are many variables in play, but it gives the customer a way to manually calculate their potential final usable capacity.

Please note the following best practices. They apply to 1 data aggregate only. For more aggregates, the same formula has to be applied again.

  1. A RAID-DP, 3-disk rootvol0, for the root volume is set aside and is not accounted for in usable capacity
  2. A rule-of-thumb of 2-disks hot spares is applied for every 30 disks
  3. The default RAID Group size is used, depending on the type of disk profile used
  4. Snapshot reserves default of 5% for aggregate and 20% per FlexVol volumes are applied
  5. Snapshots for LUNs are subjected to space reservation, either full or fractional. Note that there are considerations of 2x + delta and 1x + delta (ask your NetApp engineer) for iSCSI and Fibre Channel LUNs, even though snapshot reserves are adjusted to 0% and snapshots are likely to be turned off.

Another thing to remember is not to use any of those Capacity Calculators given. These calculators are designed to give advantage to NetApp, not necessarily to the customer. Therefore, it is best to calculate these things by hand.

Regardless of what the customer gets as the overall final usable capacity, it is important to understand the NetApp philosophy of doing things. While we have, perhaps, gone overboard explaining the usable capacity and the nitty gritty that comes with it, all these things are done for a reason: to ensure simplicity and ease of navigating data management in the storage networking world. Other NetApp solutions such as SnapMirror and SnapVault and also the SnapManager suite of products rely heavily on this.

And the intangible benefits of NetApp and ONTAP definitely have moved NetApp forward since its early years, into what NetApp is today, a formidable storage juggernaut.

Playing with NetApp … (Capacity) BR

Much has been said about usable disk storage capacity and unfortunately, many of us take the marketing capacity number given by the manufacturer verbatim. For example, 1TB does not really equate to 1TB in usable terms and that is something you engineers out there should be informing the customers about.

NetApp, ever since the beginning, has been subjected to the scrutiny of customers and competitors alike about their usable capacity and I intend to correct this misconception. And the key to this misconception is to understand what the capacity is before rightsizing (BR) and after rightsizing (AR).

(Note: Rightsizing in the NetApp world is well documented and widely accepted with different views. It is part of how WAFL uses the disks but one has to be aware that not many other storage vendors publish their rightsizing process, if any)

Before Rightsizing (BR)

First of all, we have to know that there are 2 systems of unit prefixes. These 2 systems can be summarized as

  • Base-10 (decimal) – fit for human understanding
  • Base-2 (binary) – fit for computer understanding

So according to the International System of Units, the SI prefixes for Base-10 are

Prefix  Factor  Value
kilo    10^3    1,000
mega    10^6    1,000,000
giga    10^9    1,000,000,000
tera    10^12   1,000,000,000,000

In computer context, where the binary, Base-2 system is relevant, the corresponding Base-2 prefixes (strictly speaking these are not SI prefixes; the IEC names them kibi, mebi, gibi and tebi) are

Prefix     Factor  Value
kilo-byte  2^10    1,024
mega-byte  2^20    1,048,576
giga-byte  2^30    1,073,741,824
tera-byte  2^40    1,099,511,627,776

And we must know that the storage capacity is in Base-2 rather than in Base-10. Computers are not humans.

With that in mind, the next issue is the disk manufacturers. We should have an axe to grind with them for misrepresenting the actual capacity. When they say their HDD is 1TB, they are using the Base-10 system i.e. 1TB = 1,000,000,000,000 bytes. THIS IS WRONG!

Let’s see how that 1TB works out to be in Gigabytes in the Base-2 system:

1,000,000,000,000/1,073,741,824 = 931.3225746154785 Gigabytes

Note: 2^30 = 1,073,741,824

That result of 1TB, when rounded, is only about 931GB! So, the disk manufacturers aren’t exactly giving you what they have advertised.
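If you want to check this yourself for other disk sizes, here is a tiny sketch. The function name is my own; the only inputs are the Base-10 marketing size in bytes and the Base-2 gigabyte of 2^30 bytes.

```python
BASE2_GB = 2 ** 30  # 1,073,741,824 bytes


def marketing_bytes_to_base2_gb(marketing_bytes: int) -> float:
    """Convert a Base-10 marketing capacity in bytes to Base-2 gigabytes."""
    return marketing_bytes / BASE2_GB


for tb in (1, 2, 3):
    gb = marketing_bytes_to_base2_gb(tb * 10 ** 12)
    print(f"{tb}TB marketing capacity = {gb:.2f} GB in Base-2")
```

The gap grows with each prefix step, which is why it is more noticeable on terabyte drives than it was on gigabyte drives.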

Thirdly, and also the most important factor in the BR (Before Rightsizing) phase, is how WAFL handles the actual capacity before the disk is presented to WAFL/ONTAP operations. Note that this is all done before all the logical structures of aggregates, volumes and LUNs are created.

In this third point, WAFL formats the actual disks (just like NTFS formats new disks) and this reduces the usable capacity even further. As a starting point, WAFL uses 4K (4,096 bytes) per block.

For Fibre Channel disks, WAFL formats them with 520 bytes per sector. Therefore, for each block, 8 sectors (520 x 8 = 4,160 bytes) hold 1 x 4K block, with the remainder of 64 bytes (4,160 – 4,096 = 64 bytes) used for the checksum of the 1 x 4K block. These additional 64 bytes of checksum per block are not displayed by WAFL or ONTAP and not accounted for in its usable capacity.

SATA/SAS disks are formatted with 512 bytes per sector, and each block consumes 9 sectors (9 x 512 = 4,608 bytes). 8 sectors are used for WAFL’s 4K block (4,096/512 = 8 sectors), and the remaining 9th sector of 512 bytes is used partially for the 64-byte checksum. Again, the leftover 448 bytes (512 – 64 = 448 bytes) are not displayed and not part of the usable capacity of WAFL and ONTAP.
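The sector arithmetic for the two formats can be laid side by side with a short sketch. The function and field names are mine; the sector sizes, sector counts, 4K block and 64-byte checksum are the figures quoted above.

```python
WAFL_BLOCK_BYTES = 4096
CHECKSUM_BYTES = 64


def block_layout(sector_bytes: int, sectors_per_block: int) -> dict:
    """Work out what one 4K WAFL block really occupies on disk."""
    raw = sector_bytes * sectors_per_block
    overhead = raw - WAFL_BLOCK_BYTES
    return {
        "raw_bytes": raw,                              # bytes consumed on disk
        "overhead_bytes": overhead,                    # everything beyond the 4K of data
        "unused_bytes": overhead - CHECKSUM_BYTES,     # overhead not needed by the checksum
        "data_fraction": WAFL_BLOCK_BYTES / raw,       # share of raw bytes that is data
    }


for name, sector, count in (("520 bytes/sector", 520, 8), ("512 bytes/sector", 512, 9)):
    layout = block_layout(sector, count)
    print(f"{name}: raw={layout['raw_bytes']}, overhead={layout['overhead_bytes']}, "
          f"unused={layout['unused_bytes']}, data={layout['data_fraction']:.2%}")
```

Seen this way, the 9-sector layout gives up a noticeably larger slice of the raw disk per block than the 8 x 520 layout does.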

And WAFL also compensates for the ever-so-slight irregularities of the hard disk drives even though they are labelled with similar marketing capacities. That is to say that 1TB from Seagate and 1TB from Hitachi will be different in terms of actual capacity. In fact, 1TB Seagate HDD with firmware 1.0a (for ease of clarification) and 1TB Seagate HDD with firmware 1.0b (note ‘a’ and ‘b’) could be different in actual capacity even when both are shipped with a 1.0TB marketing capacity label.

So, with all these things in mind, WAFL does what it needs to do – Right Size – to ensure that nothing gets screwed up when WAFL uses the HDDs in its aggregates and volumes. All for the right reason – Data Integrity – but often criticized for their “wrongdoing”. Think of WAFL as your vigilante superhero, wanted by the law for doing good for the people.

In the end, what you are likely to get Before Rightsizing (BR) from NetApp for each particular disk capacity would be:

Manufacturer Marketing Capacity    NetApp Rightsized Capacity    Percentage Difference
36GB                               34.0/34.5GB*                  5%
72GB                               68GB                          5.55%
144GB                              136GB                         5.55%
300GB                              272GB                         9.33%
600GB                              560GB                         6.66%
1TB                                847GB                         15.3%
2TB                                1.69TB                        15.5%
3TB                                2.48TB                        17.3%

* The size of 34.5GB was for the Fibre Channel Zone Checksum mechanism employed prior to ONTAP version 6.5 of 512 bytes per sector. After ONTAP 6.5, block checksum of 520 bytes per sector was employed for greater data integrity protection and resiliency.

From the table, the percentage of “lost” capacity is shown and to the uninformed, this could look significant. But since the percentage value is relative to the Manufacturer’s Marketing Capacity, this is highly inaccurate. Therefore, competitors should not use these figures as FUD and NetApp should use these as a way to properly inform their customers.
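For reference, the percentage column is simply the drop from the marketing capacity to the rightsized capacity, divided by the marketing capacity. Here is a minimal sketch using a few of the gigabyte rows; the function name is mine.

```python
def percent_lost(marketing_gb: float, rightsized_gb: float) -> float:
    """Percentage difference relative to the manufacturer's marketing capacity."""
    return (marketing_gb - rightsized_gb) / marketing_gb * 100


# (marketing GB, rightsized GB) pairs taken from the gigabyte rows of the table
for marketing, rightsized in ((72, 68), (144, 136), (300, 272), (600, 560)):
    print(f"{marketing}GB -> {rightsized}GB: {percent_lost(marketing, rightsized):.2f}%")
```

Because the divisor is the Base-10 marketing number, the percentage mixes the manufacturer's rounding with the actual rightsizing overhead, which is the reason it should not be taken at face value.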

You have been informed about NetApp capacity before Right Sizing.

I will follow on another day with what happens next after Right Sizing and the final actual usable capacity to the users and operations. This will be called After Rightsizing (AR). Till then, I am going out for an appointment.

Storage Tiering – Responsible and Prudent

Does your IT have a bottomless budget? If not, storage tiering is likely to be considered as one of IT’s weapons to combat the ever growing need for storage capacity.

Storage tiering is not new and in the past, features such as HSM (Hierarchical Storage Management) and ILM (Information Lifecycle Management) addressed storage tiering in different capacities, ranging from simple movement and migration of aging files, to data objects being moved within the data infrastructure of an organization with some kind of workflow and searching capabilities.

Lately, storage tiering, and especially automated storage tiering, has been gaining prominence, thanks to the 2 high profile acquisitions – HP 3PAR and Dell Compellent. According to Wikibon,

Tiered storage is a system of assigning applications to different
types of storage media based on application requirements. Factors
considered in the allocation of storage type include the level of
protection needed, performance requirements, speed of recovery,
and many other considerations. Since assigning application data to
specific media may be complex, some vendors provide software for
automatically managing the process.

For the sake of simplicity, this blog talks about automated storage tiering within the storage array itself, where different data blocks are moved within several tiers to achieve just-right storage provisioning. Why do we need to achieve this “just-right provisioning”? Rather than discussing this from an IT, technical angle, the just-right storage provisioning should be addressed from a business and operational angle, and more rightly so, costs and benefits.

Business and operations are about managing costs and increasing profits. In the past, many storage administrators employed a single storage tier architecture. Using the same type of disks, for example, 146G 10,000RPM Fibre Channel disks, there were usually 1 or 2 RAID levels for the entire data storage requirement. Usually RAID 1+0 volumes/LUNs were for the applications that require the highest performance and availability but they come with a big cost. So, the rest of the data was kept in RAID-5 volumes/LUNs. The introduction of enterprise SATA hard disk drives basically changed the rules of the ball game, giving storage administrators another option, a cheaper alternative to store their data. Obviously, storage vendors saw the great need to address this requirement, and hence created automated storage tiering as part of their offerings.

There are quite a few storage solutions that offer the storage tiering feature, and most of them are automated as well, meaning that the data blocks are moved between the different tiers of storage within the array itself automatically. 3PAR, long before they were acquired by HP, had their Dynamic Autonomic Tiering. Today, with HP, 3PAR offers 2 key strengths in their Autonomic Tiering offering.

  • Adaptive Optimization
  • Dynamic Optimization

As HP puts it,

 

Not to be outdone, Compellent (also long before its acquisition by Dell) had the Data Progression feature as part of the Automated Storage Tiering offering. In a nutshell, their solution (which, from a 10,000-foot view, is basically similar to most of the competitors’) is shown below.

 

The idea is to put the most frequently accessed data blocks on the most expensive, fastest storage tier and then dynamically move the less frequently accessed data blocks to the least expensive, most economical tier.

I had the privilege of learning more about Compellent (before Dell) technology about 2.5 years ago, thanks to my friends Chyr and Winston, the bosses at Impact Business Solutions. What Compellent had was pretty cool stuff and I would like to share what I have picked up about the Dell Compellent storage solution. But some of the information could be a little out of date.

The foundation of Dell Compellent automated storage tiering feature, called Data Progression, is their Dynamic Block Architecture (as shown below)

 

From a high level, all data blocks are bunched together into a logical data structure called a page. A page is by default 2MB but can be configured between 512KB and 4MB. The page is the granular unit required to initiate and implement the Data Progression feature in Compellent’s automated storage tiering solution. Every page comes with attached metadata about the page such as

  • When was this page created
  • When was this page last accessed
  • Which RAID level is it currently in (RAID 1+0, RAID 5-9, RAID 5-5 and so on)
  • Which Tier does it currently reside (Tier 1, 2 or 3)
  • Which kind of disk track does it live in (Fast or Standard)

Meanwhile, there are different storage Tiers, notably Tier 1, 2 or 3, where different disk profiles reside. Typically, the SSDs or the 15K RPM disk drives will be in Tier 1, the 10K RPM disk drives will be in Tier 2 and the slowest 7200 RPM disks will be in Tier 3. Each of the 3 tiers is further divided into the outer Fast disk cylinders (where the tracks pass under the heads the fastest) and the Standard disk cylinders (running in the inner tracks and slower).

As data chunks or blocks are accessed, their frequency of access and their data movement statistics are gathered in real-time, giving the Compellent solution a fairly good intelligence of how the pages should be laid out on the most relevant tiers. As the pages become more stale, and less relevant, the pages of data chunks are progressively relegated to the lower tiers, while the more active, and most relevant pages relative to importance of access, are progressively promoted to the higher tiers.

Different policies can also be configured to ensure that some important pages stay where they are regardless of their frequency of access or their relevance.
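To picture how page metadata can drive the movement between tiers, here is a toy sketch. This is not Compellent's actual Data Progression logic, which I have not seen. The `Page` fields mirror the metadata listed earlier, while the `pinned` flag, the `rebalance` function, the one-tier-at-a-time movement and both time thresholds are my own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

FASTEST_TIER, SLOWEST_TIER = 1, 3


@dataclass
class Page:
    created: datetime
    last_accessed: datetime
    raid_level: str        # e.g. "RAID 1+0"
    tier: int              # 1 (fastest) to 3 (slowest)
    track: str             # "fast" (outer) or "standard" (inner)
    pinned: bool = False   # hypothetical policy flag: never move this page


def rebalance(page: Page, now: datetime,
              hot_window: timedelta = timedelta(hours=24),
              stale_after: timedelta = timedelta(days=14)) -> int:
    """Promote a recently accessed page or demote a stale one; return its tier."""
    if page.pinned:
        return page.tier
    idle = now - page.last_accessed
    if idle <= hot_window and page.tier > FASTEST_TIER:
        page.tier -= 1     # promote one tier
    elif idle >= stale_after and page.tier < SLOWEST_TIER:
        page.tier += 1     # demote one tier
    return page.tier


now = datetime(2011, 6, 1, 12, 0)
born = now - timedelta(days=90)
pages = {
    "hot": Page(born, now - timedelta(hours=2), "RAID 1+0", 2, "fast"),
    "stale": Page(born, now - timedelta(days=30), "RAID 1+0", 1, "fast"),
    "pinned": Page(born, now - timedelta(days=30), "RAID 1+0", 1, "fast", pinned=True),
}
for name, page in pages.items():
    before = page.tier
    print(f"{name}: tier {before} -> tier {rebalance(page, now)}")
```

A real array works from access statistics gathered over time rather than a single last-accessed timestamp, so treat this purely as a mental model.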

There is a very nice whitepaper from Dell detailing their Data Progression technology.

Another big automated storage tiering player is HP 3PAR. I admit that I don’t know the inner details of the HP 3PAR Dynamic Tiering solution, though I had some glossy lessons from a 3PAR Systems Engineer called Nathan Boeger (thanks to my friends at PTC Singapore, the 3PAR distributor back then) about the same time I learned about Dell Compellent. I hope HP can offer a more in-depth introduction to how the 3PAR technology works, now that I have gotten cosy with some of the HP Malaysia folks.

Similarly, the other big boys are offering the automated storage tiering solution as well. IBM has been offering Easy Tier for almost 18 months and EMC has its FAST2 for about the same time.

Funnily, the odd one out in this automated storage tiering game is NetApp. I was in a partner conference call about 1 year ago and there were questions asking NetApp about their views of automated storage tiering. At that time of the concall, NetApp did not believe in automated storage tiering, preferring to market their FlashCache PCIe (previously called the PAM card) solution. Take note that the FlashCache is a Read-Only “extension” to their NVRAM, and used to accelerate read operations of WAFL. And also take note that NetApp, at the time of writing, does not have an “engine” that performs automated storage tiering, regardless of how they spin it.

There are also host-based file tiering solutions as well. Since I am familiar with the NetApp universe, Arkivio and Enigma Data Solutions are 2 of the main partners that NetApp works with. Recently NetApp also resells StorNext from Quantum. But note that these host-based solutions are file-based, making them less granular, less dynamic and less efficient. They are usually marketed as file archiving solutions, and the host-based licenses are usually charged per TB. In large enterprises, this might make sense but for the everyday Joes (with tight IT budgets), host-based file archiving solutions are expensive. And it is nowhere close to the efficiencies of automated storage tiering.

Overall, automated storage tiering, when applied, should help the IT operations and the organization’s business reduce costs. There is no longer a one-size-fits-all model and associating the right storage tier to the relevance and importance of the data at a very granular sub-LUN/sub-volume level will help any organization define a more prudent approach in managing their data actively and more importantly their cost of operations.

This is called Responsible IT. 😀

Don’t just look at disk reliability!

I am sure that many of you in the storage networking industry can relate to this very well.

When 1 or 2 disk drives fail, the customer will usually press you for an answer and usually this question will pop up: “How come the MTBF is 1.5 million hours but the drive(s) failed after a few months?” We also get asked “How reliable are the disks?” and “How sure are you that the storage disks I buy will last?”

And for us in this line, we cannot deny the fact that the customer should be better informed (or at least we get cheesed off by these questions). A few blogs ago, I took the easy way out and educated the customer about MTBF (Mean Time Between Failure). This is only a quarter of the story because MTBF alone does not determine the reliability of the storage ecosystem and the reliability of the storage ecosystem (which translates to data availability) is something that the customer should ask rather than spending their time pressing their annoyance onto you about 1 or 2  disk failures.

I also want to say a little about another disk reliability statistic called AFR. More about that later.

Let’s get a little deeper with disk MTBF. Disk MTBF is a statistically calculated, pre-production measurement. The key word here is “PRE” meaning that THIS IS NOT A FIELD TESTED statistic! This is a statistical likelihood of how long a disk device will last.

One thing to note is how MTBF is derived. In fact, MTBF is established before the entire disk drive line goes into volume production. Typically, there is a process called the Reliability Demonstration Test (RDT). RDT involves putting about 1,000 or more drives into a testing chamber and running them very hard, at elevated temperatures with 100% I/O, for about 6-8 weeks. This is to simulate the harshest of operating environments and inevitably, some disk drives will fail. From these failures, the MTBF is calculated.
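
To see how a figure in the millions of hours can come out of a test that lasts only a few weeks, here is a minimal sketch of the usual textbook estimate: total accumulated drive-hours divided by the number of failures. The drive count and duration are picked from the ranges mentioned above, the failure counts are made up for illustration, and the acceleration factor (to account for the elevated temperature and workload) is purely an assumption. Real drive vendors use their own statistical models.

```python
HOURS_PER_WEEK = 7 * 24


def estimate_mtbf(drives, weeks, failures, acceleration_factor=1.0):
    """Rough MTBF estimate: accumulated device-hours / observed failures.

    acceleration_factor is a hypothetical multiplier for how much harder
    the test chamber is on the drives than a normal environment.
    """
    if failures <= 0:
        raise ValueError("need at least one failure to estimate MTBF this way")
    device_hours = drives * weeks * HOURS_PER_WEEK * acceleration_factor
    return device_hours / failures


# 1,000 drives for 6 weeks with a single failure (illustrative numbers)
mtbf_hours = estimate_mtbf(drives=1000, weeks=6, failures=1)
print(f"Estimated MTBF: {mtbf_hours:,.0f} hours")

# Same test, 2 failures, assuming the chamber ages drives 3x faster
mtbf_accelerated = estimate_mtbf(1000, 6, 2, acceleration_factor=3.0)
print(f"Estimated MTBF (accelerated): {mtbf_accelerated:,.0f} hours")
```

The point of the sketch is that the big number comes from many drives running for a short time, not from any single drive running for more than a century.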

An enterprise hard disk drive’s MTBF will usually be between 1.2 million and 2.0 million hours, while consumer-grade drives usually have an MTBF of about 300,000-600,000 hours. Therefore, it is important to educate customers because they like to compare home office/SMB storage solutions with the enterprise storage solution you are about to propose to them.

One of the war stories I heard was from a high-definition video production house. They got hundreds of thousands of Malaysian Ringgit worth of contracts from a satellite TV content provider. But being less “educated” (could also be translated to being cheapo), they decided to store their valuable video content on Buffalo NAS storage. And video production environments can be harsh. The I/O stress on the disks is strenuous and the Buffalo NAS disks crashed. They lost all their content (I don’t know what happened to their backup), and they were fined hundreds of thousands of Malaysian Ringgit and had their contract terminated on the spot. This is not to say that the Buffalo NAS is a poor product, but they got the wrong product for the job. You can’t expect to race in Formula 1 with an old jalopy, can you? You have got to get the right solution for the job, even if it costs more.

So the moral of the story is – “Educate yourself and be prepared to invest if the dollar value of the data is more important than what you think you might be saving in cost.”

Over the years, MTBF (even though it is still very much in use today) is getting less and less useful as a reliability measurement. So, what’s better? AFR!

AFR or Annualized Failure Rate has been in use for almost 10 years now, and Seagate, the hard disk manufacturer, uses the AFR value heavily. AFR is the percentage of the installed base of hard disk drives that failed and were returned to the factory in a given year. This is a more realistic figure because it is a statistic from the field. The typical value for enterprise disk drives is usually between 0.7-1.0%, although a few years ago, Google created a splash in the industry when they reported much higher AFRs in their own drive population, ranging from about 1.7% for drives in their first year to more than 8% for three-year-old drives. For those who would like to read Google’s paper, click here.

Therefore AFR is a more reliable measurement of disk reliability than MTBF.
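
For those who want to relate the two numbers, a common textbook conversion assumes a constant failure rate (an exponential failure model) and a drive that is powered on 24x7, i.e. 8,760 hours a year. This is a sketch under those assumptions and not necessarily how Seagate or any other vendor arrives at its published AFR.

```python
import math

HOURS_PER_YEAR = 8760  # assumes the drive is powered on 24x7


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF, constant failure rate assumed."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(num_drives, mtbf_hours):
    """Expected number of drive failures per year in an installed base."""
    return num_drives * afr_from_mtbf(mtbf_hours)


for mtbf in (1_200_000, 1_500_000):
    print(f"MTBF {mtbf:,} hours -> AFR {afr_from_mtbf(mtbf) * 100:.2f}%")

print(f"Failures/year in 1,000 drives at 1.2M hours MTBF: "
      f"{expected_failures_per_year(1000, 1_200_000):.1f}")
```

Under these assumptions, a 1.2 million hour MTBF works out to an AFR of roughly 0.7%, so a customer with 1,000 such drives should still expect around 7 failures a year. That is a much easier conversation to have than explaining what 1.2 million hours means.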

But disk reliability is just 1/4 of the story. We need to be out there educating the customers about the reliability of the storage ecosystem rather than of a specific component. Data availability is paramount because components will fail throughout the lifecycle of the solution. That is why there are technologies like RAID, snapshots, backup, mirroring and so on to ensure that the data is made available for the operations and businesses to continue.

Ultimately, if the customer wants to use the disk MTBF against you, he’s basically shooting at you with the wrong bullet. It’s time you storage networking professionals out there educated the customers.

Using simple MTBF to determine reliability to Finance

The other day, a prospect was requesting quotation after quotation from a friend of mine to make a so-called “apple-to-apple” comparison with another storage vendor. But it was difficult to make that sort of comparison because one guy would propose SAS, the other SATA and so on. I was roped in by my friend to help. So in the end I asked this prospect which of these 3 criteria mattered to him most – Performance, Capacity or Reliability.

He gave me an answer and the reliability criterion was leading his requirements. Then he asked me if I could help, in a “quick-and-dirty manner”, by using the MTBF (Mean Time Between Failure) of the disks to convince his finance people on the question of reliability.

Well, most HDD vendors publish their MTBF as a measuring stick for the reliability of their manufactured disks. MTBF is by no means accurate but it is useful to define HDD reliability in a crude manner. If you have seen the components that go into an HDD, you would be amazed that they operate in a tremendously stressful environment. The Read/Write head operates at a flying height (head gap) above the platters thinner than a human hair, and the servo-controlled technology maintains a constant, never-lagging 7200/10,000/15,000 RPM day after day, month after month, year after year. And yet, we seem to take the HDD for granted, rarely thinking about how much technology goes into it on a nanoscale. That’s technology at its best – bringing something so complex and making it so simple for all of us.

I found that the Seagate Constellation.2 Enterprise-class 3TB 7200 RPM disk MTBF is 1.2 million hours while the Seagate Cheetah 600GB 10,000 RPM disk MTBF is 1.5 million hours. So, the Cheetah is about 25% more reliable than the Constellation.2, right?

Wrong! There are other factors involved. In order to achieve 3TB usable, a RAID 1 configuration (average write performance, very good read performance) would require 2 units of the 3TB 7200 RPM disks. On the other hand, using 10,000 RPM disks, with the largest shipping capacity of 600GB, you would need 10 units of such HDDs in a RAID-DP configuration (this is NetApp by the way). RAID-DP would give average write performance (better than RAID 1 in some cases) and very good read performance (for sequential access).

So, I broke down the above 2 examples to this prospect (to achieve 3TB usable)

  1. Seagate Constellation.2 3TB 7200 RPM HDD MTBF is 1.2 million hours x 2 units
  2. Seagate Cheetah 600GB 10,000 RPM HDD MTBF is 1.5 million hours x 10 units

By using a simple calculation of

    RF (Reliability Factor) = MTBF/#HDDs

the prospect will be able to determine which of the 2 HDD types above could be more reliable.

In case #1, the RF is 600,000 hours and in case #2, the RF is 150,000 hours. Suddenly you can see that the Constellation.2 HDDs, which have a lower MTBF, have a higher RF compared to the Cheetah HDDs. Quick and simple, isn’t it?
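
For anyone who wants to plug in their own drive counts, the RF arithmetic can be sketched in a few lines. The function and variable names are mine; the MTBF values and unit counts are the ones from the two cases above.

```python
def reliability_factor(mtbf_hours, num_hdds):
    """RF (Reliability Factor) = MTBF / #HDDs, the quick-and-dirty measure."""
    return mtbf_hours / num_hdds


# (MTBF in hours, number of HDDs needed for 3TB usable)
configs = {
    "Constellation.2 3TB 7200 RPM x 2": (1_200_000, 2),
    "Cheetah 600GB 10,000 RPM x 10": (1_500_000, 10),
}

rf = {name: reliability_factor(mtbf, n) for name, (mtbf, n) in configs.items()}

for name, value in rf.items():
    print(f"{name}: RF = {value:,.0f} hours")
```

The more spindles you need to reach the same usable capacity, the more chances there are for one of them to fail, and that is all the RF is trying to capture.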

Note that I did not bring SAS versus SATA into the mix because it doesn’t matter here. SAS and SATA are merely the data channels that drive data in and out of the spinning HDDs. So, folks, don’t be fooled into thinking that a SAS drive is more reliable than a SATA drive. Sometimes, they are just the same old spinning HDDs. In fact, the Seagate Constellation.2 HDD (3TB, 7200 RPM) mentioned above comes with either a SAS or a SATA interface.

Of course, this is just one factor in the whole Reliability universe. Other factors such as RAID level, checksums, CRC, and single or dual controllers also determine the reliability of the entire storage array.

In conclusion, we all know that the MTBF alone does not determine the reliability of the solution the prospect is about to purchase. But this is one way you can use to help the finance people to get the idea of reliability.

All SSDs storage array? There’s more than meets the eye at Pure Storage

Wow, after an entire week off with the holidays, I am back and excited about the many happenings in the storage world.

One of the more prominent news items was the announcement of Pure Storage launching its enterprise storage array built entirely with flash-based solid state drives. In addition to that, there were other start-ups also offering SSD storage arrays. The likes of Nimbus Data, Avere and Violin Memory Systems all made the news, as did the granddaddy of solid state storage arrays, Texas Memory Systems.

The first thing that came to my mind was, “Wow, this is great because this will push down the $/GB of SSDs closer to the range of $/GB for spinning disks”. But then skepticism crept in and I thought, “Do we really need an entire enterprise storage array of SSDs? That’s going to cost the world”.

At the same time, we in the storage industry know that no two pieces of data are alike. They can be large, small, random, sequential, accessed frequently or infrequently and so on. It is obviously better to tier the storage, using SSDs for Tier 0, 10K/15K RPM spinning HDDs for Tier 1, SATA for Tier 2 and perhaps tape for the archive tier. I was already tempted to write about my pessimism on Pure Storage when something interesting caught my attention.

Besides the usual marketing jive of sub-millisecond, predictable latency, green messaging, global inline deduplication and compression, and data integrity built into its Purity Operating Environment (POE), I was very surprised to find out who the team behind Pure Storage is. Here’s their line-up:

  • Scott Dietzen, CEO – starting from principal technologist of Transarc (sold to IBM), principal architect of WebLogic (sold to BEA Systems), CTO of BEA (sold to Oracle), CTO of Zimbra (sold to Yahoo! and then to VMware)
  • John “Coz” Colgrove, Founder & CTO – Veritas Fellow, CTO of Symantec Data Management group, principal architect of Veritas Volume Manager (VxVM) and Veritas File System (VxFS) and holder of 70 patents
  • John Hayes, Founder & Chief Architect – formerly of Yahoo! office of Chief Technologist
  • Bob Wood, VP of Engineering – formerly NetApp’s VP of File System Engineering
  • Michael Cornwell, Director of Technology & Strategy – formerly the lead technologist of Sun Microsystems’ Sun Storage F5100 Flash Array and also Quantum’s storage architect for their storage telemetry, VTL and DXi solutions
  • Ko Yamamoto, VP of System Engineering – previously NetApp’s director of platform engineering, Quantum DXi director of hardware engineering, and also key contributor to 4-generations of Tandem NonStop technology

In addition to that, there are 3 key individual investors worth mentioning:

  • Diane Greene – Founder of VMware and former CEO
  • Dr. Mendel Rosenblum – Founder, creator and former Chief Scientist of VMware
  • Frank Slootman – formerly CEO of Data Domain (acquired by EMC)

All these industry big guns are flocking to Pure Storage for a reason and it looks to me that Pure Storage ain’t your ordinary, run-of-the-mill enterprise storage company. There’s definitely more than meets the eye.

On top of the enterprise storage array platform is Pure Storage’s Purity Operating Environment (POE). POE focuses on 3 key storage services which are

  • High Performance Data Reduction
  • Mission Critical Reliability
  • Predictable Sub-millisecond Performance

After going through the deep-dive videos by Pure Storage’s CTO, John Colgrove, it is clear that they are very much banking the success of their solution on SSDs. Everything that they have done is based on SSDs. For example, in order to achieve a larger capacity as well as a much cheaper $/GB, data reduction techniques such as global deduplication, high compression and fine-grained thin provisioning at 512 bytes are used. By trading off IOPS (which SSDs have plenty of, since they are several times faster than conventional spinning disks), a larger usable capacity is achieved.
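
The arithmetic behind the data reduction argument is simple enough to sketch. Every number below is hypothetical and chosen only to show the shape of the calculation. None of them are Pure Storage figures, and the function names are my own.

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Effective $/GB once dedupe/compression stores reduction_ratio:1."""
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return raw_cost_per_gb / reduction_ratio


def break_even_reduction_ratio(ssd_raw_cost_per_gb, hdd_cost_per_gb):
    """Data reduction ratio at which SSD $/GB matches HDD $/GB."""
    return ssd_raw_cost_per_gb / hdd_cost_per_gb


# Hypothetical prices: SSD at $10/GB raw, spinning disk at $2/GB
ssd_raw, hdd = 10.0, 2.0

ratio_needed = break_even_reduction_ratio(ssd_raw, hdd)
print(f"Reduction ratio needed to match HDD: {ratio_needed:.1f}:1")
print(f"SSD effective $/GB at 5:1: {effective_cost_per_gb(ssd_raw, 5.0):.2f}")
```

In other words, the whole pitch depends on how much reduction the data set actually allows, which will differ from one workload to another.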

In their RAID 3D, they also incorporated several high reliability techniques and data integrity algorithms that are specifically for SSDs. One point that was mentioned was that traditional RAID, and especially the parity-based RAID levels, were designed in the beginning to protect against an entire device failure. However, in SSDs, the failure does not necessarily occur in the entire device. Because of the way SSDs are built, the failure hotspots tend to happen at the much more granular bit level of the SSDs. The erase-then-write techniques that are inherent in NAND Flash SSDs cause the bit error rate (BER) of the SSD device to go up as the device ages. Therefore, it is more likely to get a read/write error from within the SSD’s memory itself rather than having the entire SSD device fail. Pure Storage RAID 3D is meant to address such occurrences of bit errors.

I spoke a bit about storage tiering earlier in this article because every corporation employs storage tiering to be financially responsible. However, John Colgrove’s argument was: why tier the storage when there are plenty of IOPS and the $/GB is comparable to spinning disks? That is true when the $/GB of SSDs can match the $/GB of spinning disks. Other factors we must also take into account are the rack-space savings from the smaller profile of SSDs and the power savings of SSDs versus conventional HDD-based enterprise storage arrays. Taken in its entirety, there are strong indications that the $/GB of SSD-based systems will match or perhaps go lower than the $/GB of HDD-based systems. And since present-day applications have not demanded super-high IOPS and multi-core processing is cheap, there’s plenty of headroom for Pure Storage and other similar enterprise storage array companies to grow.

The tides are changing for the storage industry and it is good to see a start-up like Pure Storage boldly coming forth to announce their backing for SSDs. It’s good for the consumer and good for the industry. But more importantly, they are driving innovation and making us rethink how we build storage arrays. I am looking forward to more things to come.

Having fun with your storage vendor and get the information to fit your data center

I was on my way to Singapore yesterday. At the departure lounge, I started reading “Data Center Storage” by Hubbert Smith (ISBN#: 978-1439834879) and I learned something very interesting immediately. Then my thoughts started stirring and I thought I would have a bit of fun with what I had learned from the book.

The single, most significant piece of the storage solution is the hard disk drive (HDD). Regardless of SAN or NAS protocols, the data is stored and served from the hard disk drives. And there are 4 key metrics of a HDD, which are

  • Price
  • Performance
  • Capacity
  • Power

As storage professionals, we are often challenged to deliver the best storage solution to meet the customer’s requirements. Therefore, it is not about providing the fastest IOPS or the best availability or the lowest price. It is about providing the best balance of the 4 key metrics above.

The 4 metrics are of little help when they stand alone, but if they are combined in relation to each other, you, as a customer, can obtain some measurable ratios that will be useful in sizing for a requirement. This keeps the balance of the 4 key metrics better defined, rather than getting fluff and BS from the storage vendor.

In the book, the following table was displayed and I found it to be extremely useful:

Key Ratios for HDDs

  Ratio                Unit
  Performance/Price    IOPS/$
  Performance/Power    IOPS/watt
  Capacity/Price       GB/$
  Capacity/Power       GB/watt

The key ratios in the table above are going to be useful in determining the right type of storage for the requirement, and we will come back to them later. We begin our quest to obtain the information that we want – Performance, Capacity, Price, Power.

Capacity is the easy one because the size of the HDD is a given fact.

IOPS for each type of HDD is also easy to obtain. See table below:

  Disk Type    RPM       IOPS Range
  SATA         5,400     50-75
  SATA         7,200     75-100
  SAS/FC       10,000    100-125
  SAS/FC       15,000    175-200
  SSD          N/A       5,000-10,000

The wattage of each HDD is also quite easy. Just ask the vendor to give the specification of the HDDs.

The pricing part is where we can have a bit of fun with the storage vendor. Usually, storage vendors do not release the price of a single HDD in the quotation. The total price is lumped together with everything else, making it harder to decipher the price. So, what can the customer do?

Easy. Get 4-5 quotations from the storage vendor, each with a different type of HDD. This is the customer’s right. For example, I have created several fictitious quotations, each with a different type of HDD/SSD and pricing.

Quote #1 (SATA 7200 RPM)
Quote #2 (SAS 10,000 RPM)
Quote #3 (SAS 15,000 RPM)
Quote #4 (SSD)

From the 4 quotations, we cannot ascertain the true price of a single disk, but we can assume that the 12 units of HDDs/SSDs take up 50% of the entire quotation. With all things being equal, especially the quantity of 12, we can establish a very rough estimate of the price per disk. Having fun asking the storage vendor to run around with the quotations is the added bonus.

But we can derive the following figures (rough estimates but useful when we apply them to the key ratios above):

1TB SATA = 3333.33; 300GB 10,000 RPM SAS = 5000.00; 300GB 15,000 RPM SAS = 6250.00; 100GB SSD = 10416.66
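
The estimate is simply the quotation total, multiplied by the assumed 50% disk share, divided by the 12 disks. Here is a minimal sketch of that arithmetic. Note that the quotation totals in the code are not taken from the quotations themselves. They are back-calculated from the per-disk figures above, so treat them as assumptions.

```python
DISK_SHARE = 0.5       # assumption from the text: disks are 50% of the quote
DISKS_PER_QUOTE = 12   # every quotation has 12 HDDs/SSDs


def price_per_disk(quote_total, disk_share=DISK_SHARE, num_disks=DISKS_PER_QUOTE):
    """Very rough per-disk price hidden inside a lump-sum quotation."""
    return quote_total * disk_share / num_disks


# Quotation totals are back-calculated, hypothetical values
quotes = {
    "1TB SATA 7200 RPM": 80_000,
    "300GB SAS 10,000 RPM": 120_000,
    "300GB SAS 15,000 RPM": 150_000,
    "100GB SSD": 250_000,
}

estimates = {disk: price_per_disk(total) for disk, total in quotes.items()}

for disk, price in estimates.items():
    print(f"{disk}: {price:,.2f} per disk")
```

If you believe the disks make up more or less than half of the quotation, change DISK_SHARE and see how far the estimates move.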

When we juxtapose the information that we have collected, i.e. price, performance and capacity (ok, I am skipping power/watt because I am too lazy to find out), we come up with the table below:

In the boxed area, we can now easily determine which HDDs/SSDs give the best value for money, either in Performance/$ or Capacity/$. The higher the key ratio, the better the value.

From this aspect, the customer can now determine methodically which type of disk he should invest in, in order to get the best value.
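
The same comparison can be worked out in a few lines of code. This is a sketch using the estimated prices and the IOPS ranges given earlier. Taking the midpoint of each IOPS range and treating 1TB as 1,000GB are my own simplifications.

```python
# (disk, capacity in GB, (IOPS low, IOPS high), estimated price per disk)
disks = [
    ("1TB SATA 7200 RPM", 1000, (75, 100), 3333.33),
    ("300GB SAS 10,000 RPM", 300, (100, 125), 5000.00),
    ("300GB SAS 15,000 RPM", 300, (175, 200), 6250.00),
    ("100GB SSD", 100, (5000, 10000), 10416.66),
]


def key_ratios(capacity_gb, iops_range, price):
    """Return (IOPS/$, GB/$) using the midpoint of the IOPS range."""
    iops = sum(iops_range) / 2
    return iops / price, capacity_gb / price


table = {name: key_ratios(cap, rng, price) for name, cap, rng, price in disks}

print(f"{'Disk':<24}{'IOPS/$':>10}{'GB/$':>10}")
for name, (iops_per_dollar, gb_per_dollar) in table.items():
    print(f"{name:<24}{iops_per_dollar:>10.4f}{gb_per_dollar:>10.4f}")

best_performance = max(table, key=lambda n: table[n][0])
best_capacity = max(table, key=lambda n: table[n][1])
print(f"Best Performance/$: {best_performance}")
print(f"Best Capacity/$:    {best_capacity}")
```

With these rough figures, the SSD comes out on top for Performance/$ and the 1TB SATA for Capacity/$, which is exactly the kind of trade-off the key ratios are meant to expose.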

This is just a very simplistic method to find the value of the storage solution to be purchased. Bear in mind that there are many other factors to consider as well, such as rack unit height, total power consumption, storage efficiency, data protection and many more.

I am not taking credit for what Hubbert Smith has proposed. All kudos to him, but I am using his method and applying it to what is relevant to us in the field.

In conclusion, the customer won’t be left baffled and confused, wondering whether they got the best deal at the lowest price or the fastest performance. This crude method can help turn perception into something that is more concrete and analytical. It’s time we, as customers, know our rights, know what we are buying into and have a bit of fun with the storage vendor too.