Hail Hydra!

The last of the Storage Field Day 6 on November 7th took me and the other delegates to NEC. There was an obvious, yet eerie silence among everyone about this visit. NEC? Are you kidding me?

NEC isn’t exactly THE exciting storage company in the Silicon Valley, yet I was pleasantly surprised with their HydraStorprowess. It is indeed quite a beast, with published numbers of backup throughput of 4PB/hour, and scales to 100PB of capacity. Most impressive indeed, and HydraStor deserves this blogger’s honourable architectural dissection.

HydraStor is NEC’s grid-based, scale-out storage platform with an object storage backend. The technology, powered by the DynamicStor ™ software, a distributed file system laid over the HydraStor grid architecture. At the same time, it has the DataRedux™ technology that provides the global in-line deduplication as the HydraStor ingests data for data protection, replication, archiving and WORM purposes. It is a massive data consolidation platform, storing gazillion loads of data (100PB you say?) for short-term and long-term retention and recovery.

The architecture is indeed solid, and its data availability goes beyond traditional RAID-level resiliency. HydraStor employs their proprietary erasure coding, called Distributed Resilient Data™. The resiliency knob can be configured to withstand 6 concurrent disks or nodes failure, but by default configured with a resiliency level of 3.

We can quickly deduce that DynamicStor™, DataRedux™ and Distributed Resilient Data™ are the technology pillars of HydraStor. How do they work, and how do they work together?

Let’s look a bit deeper into the HydraStor architecture.

HydraStor is made up of 2 types of nodes:

  • Accelerator Nodes
  • Storage Nodes

The Accelerator Nodes (AN) are the access nodes. They interface with the HydraStor front end, which could be CIFS, NFS or OST (Open Storage Technology). The AN nodes chunks the in-coming data and performs in-line deduplication at a very high speed. It can reach speed of 300TB/hour, which is blazingly fast!

The AN nodes also runs DynamicStor™, handling the performance heavy-lifting portion of HydraStor. The chunked data from the AN nodes are then passed on to the Storage Nodes (SN), where they are further “deduped in-line” to determined if the chunks are unique or not. It is a two-step inline deduplication process. Below is a diagram showing the ANs built above the SNs in the HydraStor grid architecture.

NEC AN & SN grid architecture


The HydraStor grid architecture is also a very scalable architecture, allow the dynamic scale-in and scale-out of both ANs and SNs. AN nodes and SN nodes can be added or removed into the system, auto-configuring and auto-optimizing while everything stays online. This capability further strengthens the reliability and the resiliency of the HydraStor.

NEC Hydrastor dynamic topology

Moving on to DataRedux™. DataRedux™ is HydraStor’s global in-line data deduplication technology. It performs dedupe at the sub-file level, with variable length window. This is performed at the AN nodes and the SN nodes level,chunking and creating unique hash values. All unique chunks are further compressed with a modified LZ compression algorithm, shrinking the data to its optimized footprint on the disk storage. To maintain the global in-line deduplication, the hash table is available across the HydraStor cluster.

NEC Deduplication & Compression

The unique data chunk resulting from deduplication and compression are then written to disks using the configured Distributed Resilient Data™ (DRD) algorithm, at its set resiliency level.

At the junction of DRD, with erasure coding parity, the data is broken up into multiples of fragments and assigned a parity to a grouping of fragments. If the resiliency level is set to 3 (the default), the data is broken into 12 pieces, 9 data fragments + 3 parity fragments. The 3 parity fragments corresponds to the resiliency level of 3. See diagram below of the 12 fragments spread across a group of selected disks in the storage pool of the Storage Nodes.

NEC DRD erasure coding on Storage Nodes


If the HydraStor experiences a failure in the disks or nodes, and has resulted in the loss of a fragment or fragments, the DRD self-healing function will auto-rebuild and auto-reconfigure the recovered fragments in another set of disks, maintaining the level of 3 parities.

The resiliency level, as mentioned earlier, can be set up to 6, boosting the HydraStor survival factor of 6 disks or nodes failure in the grid. See below of how the autonomous DRD recovery works:

NEC Autonomous Data recovery

Despite lacking the razzle dazzle of most Silicon Valley storage startups and upstarts, credit be given where credit is due. NEC HydraStor is indeed a strong show stopper.

However, in a market that is as fickle as storage, deduplication solutions such as HydraStor, EMC Data Domain, and HP StoreOnce, are being superceded by Copy Data Management technology, touted by Actifio. It was rumoured that EMC restructured their entire BURA (Backup Recovery Archive) division to DPAD (Data Protection and Availability Division) to go after the burgeoning copy data management market.

It would be good if NEC can take notice and turn their HydraStor “supertanker” towards the Copy Data Management market. That would be something special to savour.

P/S: NEC. Sorry about the title. I just couldn’t resist it ;-)

4TB disks – the end of RAID

Seriously? 4 freaking terabyte disk drives?

The enterprise SATA/SAS disks have just grown larger, up to 4TB now. Just a few days ago, Hitachi boasted the shipment of the first 4TB HDD, the 7,200 RPM Ultrastar™ 7K4000 Enterprise-Class Hard Drive.

And just weeks ago, Seagate touted their Heat-Assisted Magnetic Recording (HAMR) technology will bring forth the 6TB hard disk drives in the near future, and 60TB HDDs not far in the horizon. 60TB is a lot of capacity but a big, big nightmare for disks availability and data backup. My NetApp Malaysia friend joked that the RAID reconstruction of 60TB HDDs would probably finish by the time his daughter finishes college, and his daughter is still in primary school!.

But the joke reflects something very serious we are facing as the capacity of the HDDs is forever growing into something that could be unmanageable if the traditional implementation of RAID does not change to meet such monstrous capacity.

Yes, RAID has changed since 1988 as every vendor approaches RAID differently. NetApp was always about RAID-4 and later RAID-DP and I remembered the days when EMC had a RAID-S. There was even a vendor in the past who marketed RAID-7 but it was proprietary and wasn’t an industry standard. But fundamentally, RAID did not change in a revolutionary way and continued to withstand the ever ballooning capacities (and pressures) of the HDDs. RAID-6 was introduced when the first 1TB HDDs first came out, to address the risk of a possible second disk failure in a parity-based RAID like RAID-4 or RAID-5. But today, the 4TB HDDs could be the last straw that will break the camel’s back, or in this case, RAID’s back.

RAID-5 obviously is dead. Even RAID-6 might be considered insufficient now. Having a 3rd parity drive (3P) is an option and the only commercial technology that I know of which has 3 parity drives support is ZFS. But having 3P will cause additional overhead in performance and usable capacity. Will the fickle customer ever accept such inadequate factors?

Note that 3P is not RAID-7. RAID-7 is a trademark of a old company called Storage Computer Corporation and RAID-7 is not a standard definition of RAID.

One of the biggest concerns is rebuild times. If a 4TB HDD fails, the average rebuild speed could take days. The failure of a second HDD could up the rebuild times to a week or so … and there is vulnerability when the disks are being rebuilt.

There are a lot of talks about declustered RAID, and I think it is about time we learn about this RAID technology. At the same time, we should demand this technology before we even consider buying storage arrays with 4TB hard disk drives!

I have said this before. I am still trying to wrap my head around declustered RAID. So I invite the gurus on this matter to comment on this concept, but I am giving my understanding on the subject of declustered RAID.

Panasas‘ founder, Dr. Garth Gibson is one of the people who proposed RAID declustering way back in 1999. He is a true visionary.

One of the issues of traditional RAID today is that we still treat the hard disk component in a RAID domain as a whole device. Traditional RAID is designed to protect whole disks with block-level redundancy.  An array of disks is treated as a RAID group, or protection domain, that can tolerate one or more failures and still recover a failed disk by the redundancy encoded on other drives. The RAID recovery requires reading all the surviving blocks on the other disks in the RAID group to recompute blocks lost on the failed disk. In short, the recovery, in the event of a disk failure, is on the whole object and therefore, a entire 4TB HDD has to be recovered. This is not good.

The concept of RAID declustering is to break away from the whole device idea. Apply RAID at a more granular scale. IBM GPFS works with logical tracks and RAID is applied at the logical track level. Here’s an overview of how is compares to the traditional RAID:

The logical tracks are spread out algorithmically spread out across all physical HDDs and the RAID protection layer is applied at the track level, not at the HDD device level. So, when a disk actually fails, the RAID rebuild is applied at the track level. This significant improves the rebuild times of the failed device, and does not affect the performance of the entire RAID volume much. The diagram below shows the declustered RAID’s time and performance impact when compared to a traditional RAID:

While the IBM GPFS approach to declustered RAID is applied at a semi-device level, the future is leaning towards OSD. OSD or object storage device is the next generation of storage and I blogged about it some time back. Panasas is the leader when it comes to OSD and their radical approach to this is applying RAID at the object level. They call this Object RAID.

With object RAID, data protection occurs at the file-level. The Panasas system integrates the file system and data protection to provide novel, robust data protection for the file system.  Each file is divided into chunks that are stored in different objects on different storage devices (OSD).  File data is written into those container objects using a RAID algorithm to produce redundant data specific to that file.  If any object is damaged for whatever reason, the system can recompute the lost object(s) using redundant information in other objects that store the rest of the file.

The above was a quote from the blog of Brent Welch, Panasas’ Director of Software Architecture. As mentioned, the RAID protection of the objects in the OSD architecture in Panasas occurs at file-level, and the file or files constitute the object. Therefore, the recovery domain in Object RAID is at the file level, confining the risk and damage of data loss within the file level and not at the entire device level. Consequently, the speed of recovery is much, much faster, even for 4TB HDDs.

Reliability is the key objective here. Without reliability, there is no availability. Without availability, there is no performance factors to consider. Therefore, the system’s reliability is paramount when it comes to having the data protected. RAID has been the guardian all these years. It’s time to have a revolutionary approach to safeguard the reliability and ensure data availability.

So, how many vendors can claim they have declustered RAID?

Panasas is a big YES, and they apply their intelligence in large HPC (high performance computing) environments. Their technology is tried and tested. IBM GPFS is another. But where are the rest?


We raid vRAID

I took a bit of time off to read through Violin’s vRAID technology because I realized that vRAID (other than Violin’s vXM architecture) is the other most important technology that differentiates Violin Memory from the other upstarts. I blogged at a high-level about Violin a few entries ago, and we are continuing Violin impressive entrance with a storage technology that have been around for almost 25 years – RAID. Incidentally, I found this picture of the original RAID paper (see below):

Has RAID evolved with solid state storage? Evidently, no, because I have not read of any vendors (so far) touting any RAID revolution in their solid state offerings. There has been a lot of negative talks about RAID, but RAID has been the cornerstone and the foundation of storage ever since the beginning. But with the onslaughts of very large capacity HDDs, the demands of packing more bits-per-inch and the insatiable needs for reliability, RAID is slowly beginning to show its age. Cracks in the armour, I would say. And there are many newer, slightly more refined versions of RAID, from the Network RAID-style of HP P4000 or the Dell EqualLogic, to the RAID-X of IBM XIV, to innovations of declustered RAID in Panasas. (Interestingly, one of the early founders of the actual RAID concept paper, Garth Gibson, is the founder of Panasas).

And the new vRAID from Violin-System doesn’t sway much from the good ol’ RAID, but it has been adapted to address the issues of Solid State Devices.

Solid State devices (notably NAND Flash since everyone is using them) are very different from the usual spinning disks of HDDs. They behave differently and pairing solid state devices with the present implementations of RAID could be like mixing oil and water. I am not saying that the present RAID cannot work with solid state devices, but has RAID adapted to the idiosyncrasies of Flash?

It is like putting an old crank shaft into a new car. It might work for a while, but in the long run, it could damage the car. Similarly, conventional RAID might have detrimental performance and availability impact with solid state devices. And we have hardly seen storage vendors coming up to say that their RAID technology has been adapted to the solid state devices that they are selling. This silence could likely mean that they are just adapting to market requirements and not changing their RAID codes very much to take advantage of Flash or other solid state storage for that matter. Violin Memory has boldly come forward to meet that requirement and vRAID is their answer.

Violin argues that there are bottlenecks at the external RAID controller or software RAID level as well as use of legacy disk drive interfaces. And this is indeed true, because this very common RAID implementation squeezes performance at the expense of the other components such as CPU cycles.

Furthermore, there are plenty of idiosyncrasies in Flash with things such as erase-first, then write mechanism. The nature of NAND Flash, unlike DRAM, requires a block to be erased first before a write to the block is allowed. It does not “modify” per se, where the operations of read-modify-write is often applied in parity-based RAIDs of 5 and 6. Because of this nature, it is more like read-erase-write, and when the erase of the block is occurring, the read operation is stalled. That is why most SSDs will have impressive read latency (in microseconds), but very poor writes (in milliseconds). Furthermore, the parity-based RAID’s write penalty, can further aggravate the situation when the typical RAID technology is applied to NAND Flash solid state storage.

As the blocks in the NAND Flash build up, the accumulation of read-erase-write will not only reduce the lifespan of the blocks in the NAND Flash, it will also reduce the IOPS to a state we called Normalized Steady State. I wrote about this in my blog, “Not all SSDs are the same” some moons ago. In my blog, SNIA Solid State Storage Performance Testing Suite (SSS-PTS), there were 3 distinct phases of a typical NAND Flash SSD:

  • Fresh of out the Box (FOB)
  • Transition
  • Steady State
This performance degradation is part of what vendors call “Write Cliff”, where there is a sudden drop in IOPS performance as the NAND Flash SSD ages. Here’s a graph that shows the performance drop.
Violin’s vRAID, implemented within its switched vXM architecture itself, and using proprietary high performance flash controllers and the flash-optimized vRAID technology, is able deliver sustained IOPS throughout the lifespan of the flash SSD, as shown below:
To understand vRAID we have to understand the building blocks of the Violin storage array. NAND Flash chips of 4GB are packed into a Flash Package of 8 giving it 32GB. And 16 of these 32GB Flash Package are then consolidated into a 512GB VIMM (Violin Inline Memory Module). The VIMM is the starting block and can be considered as a “disk”, since we are used to the concept of “disk” in the storage networking world. 5 of these VIMMs will create a RAID group of 4+1 (four data and one parity), giving the redundancy, performance and capacity similar to RAID-5.
The block size used is 4K block and this 4K block is striped across the RAID group with 1K pages each on each of the VIMMs in the RAID group. Each of this 1K page is managed independently and can be placed anywhere in any flash block in the VIMMs, and spread out for lowest possible latency and bandwidth. This contributes to the “spike free latency” of Violin Memory. Additionally, there is ECC protection within each 1K page to correct flash bit error.
To protect against metadata corruption, there is an additional, built-in RAID Check bit to correct the VIMM errors. Lastly, one important feature that addresses the read-erase-write weakness of NAND Flash, the vRAID ensures that the slow erases never block a Read or a Write. This architectural feature enable spike-free latency in mixed Read/Write environments.
Here’s a quick overview of Violin’s vRAID architecture:
I still feel that we need a radical move away from the traditional RAID and vRAID is moving in the right direction to evolve RAID to meet the demands of the data storage market. Revolutionary and radical it may not be, but then again, is the market ready for anything else?
As I said, so far Violin is the only all-Flash vendor that has boldly come forward to meet the storage latency problem head-on, and they have been winning customers very quickly. Well done!

Storage must go on a diet

Nowadays, the capacity of the hard disk drives (HDDs) are really big. 3TB is out and 4TB is in the horizon. What’s next?

For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient  and with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.

If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, it will also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID it with RAID-0 (not a good idea but to save costs, most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about lack of storage experience.

And RAID isn’t really keeping up with the tremendous growth of HDD’s capacity as well. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue provide the LUN or volume reliability and data availability because it just takes too damn long to rebuild the volume after the failure of a disk.

Back in the days where HDDs were less than 500GB, RAID-5 would still hold up but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.

Experts have been speaking about parity-declustering,  but that’s something that a few vendors have right now. Panasas, founded by one of forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Cargenie-Mellon University’s Parallel Data Lab (PDL) presented a paper about parity-declustering more than 10 years ago.

Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.

Here are a few ways you can consider to put your storage on a diet.

  • Compression
  • Thin Provisioning
  • Deduplication
  • Storage Tiering
  • Tapes and SSDs

To me, compression has not taken the storage world by storm. But then again, there aren’t many vendors that tout compression as a feature for storage optimization. Most of them rather prefer to push the darling of data reduction, data deduplication, as the main feature for save more space. Theoretically, data deduplication makes more sense when the data is inactive, and has high occurrence of duplicated data. That is why secondary storage such  as backup deduplication targets like Data Domain, HP StoreOnce, Quantum DXi can publish 20:1 rates and over time, that rate can get even higher.

NetApp also has been pushing their A-SIS data deduplication on primary storage. Yes, it helps with the storage savings in primary but when the need for higher data transfer rates and time to access “manipulated” data (deduped or compressed), it is likely that compression is a better choice for primary, active data.

So who has compression? NetApp ONTAP 8.0.1 has compression now and IBM with its Storewize V7000 started as a compression device. Read about IBM Storewize in my blog here. Dell has Ocarina Networks, which was recently unleashed. I am a big fan of Ocarina Networks and I wrote about the technology in my previous blog. EMC, during the Celerra days of DART has compression but I don’t hear much about it in their VNX. Compression is there, believe me, embedded all the loads of EMC marketing.

Thin Provisioning is now a must-have and standard feature of all storage vendors. What is Thin Provisioning? The diagram below shows you:

In the past, storage systems aren’t so intelligent. You ask for 10TB, you are given 10TB and that 10TB is “deducted” from the storage capacity. That leads to wastage and storage inefficiencies. Today, Thin Provisioning will give you 10TB but storage capacity is consumed as it is being used. The capacity is not pre-allocated as in the past. Thin provisioning is a great diet pill for bloated storage projects. 

Another up and coming feature is storage tiering. Storage tiering, when associated to storage optimization, should include hierarchical storage management (HSM) and tape-out as well. Storage optimization solutions should not offer only in the storage array itself. Storage tiering within the storage array is available with most vendors – IBM EasyTier, EMC FAST2, Dell Fluid Data Management and many others. But what about data being moved out of the storage array? What about reducing the capacity of the data online or near-line? Why not put them offline if there isn’t a need for it?

I term this as Active Archiving, something I learned while I was at EMC. Here’s a look at EMC’s style of Active Archiving:

Active Archiving promotes the concept of data archiving and is not unique only to EMC. Almost all storage vendors, either natively or with 3rd party vendors, can perform fairly efficient data archiving in one way or another. One of the software that I liked (and not unique!) is Quantum Stornext. Here’s a video of how Quantum Stornext helps reduce the fat of the storage.

With the single-copy sharing feature of Quantum Stornext to multiple disparate OSes, there are lesser duplicate files in storage as well.

Tapes have been getting a bad name in the past few years. It has been repositioned and repurposed as an archive medium rather than a backup medium. But tape is the greenest and most powerful storage diet pill around. And we should not be discount tapes because tapes are fighting back. Pretty soon you will be hearing about Linear Tape File System (LTFS). In a nutshell, Linear Tape File System (LTFS) allows you to use the tape almost as if it were a hard disk. You can drag and drop files from your server to the tape, see the list of saved files using a standard operating system directory (no backup software catalog needed), and use point and click to restore. How cool is that!

And Solid State Drives (SSDs) makes sense as well.

There are times that we need IOPS and using spinning drives, we have to set up many disk spindles to achieve the IOPS that we want.  For example, using the diagram below from the godfather of storage, Greg Schulz,

The set of 16 spinning HDD drives on the left can only deliver 3,520 IOPS. The problem is, we have wasted a lot of disk space, as seen in the diagram below. This design, which most customer would be accustomed to, may look cheaper but in actual fact, is NOT.

If the price of a Fibre Channel HDD is RM2,000, the total of 16 would make up RM32,000.00. That is not inclusive of additional power and cooling and rack space and also the data management costs. Assuming the SSDs costs 5 times more than the Fibre Channel HDD. SSDs are capable of delivering very high IOPS. Here I am putting a modest 5,000 IOPS per SSDs. With just 2 SSDs (as the right design suggests), the total costs is only RM20,000. It has greater performance room to grow, and also savings in data management, power and cooling.

Folks, consider SSDs as part of your storage diet plan.

All these features are available, in whole or in part, and they are part of the storage technology offerings that is out there. With all these being said, are you doing something about it? Get off your lazy bum and start managing your storage and put your storage on a diet!!!

The recipe for storage performance modeling

Good morning, afternoon, evening, Ladies & Gentlemen, wherever you are.

Today, we are going to learn how to bake, errr … I mean, make a storage performance model. Before we begin, allow me to set the stage.

Don’t you just hate it when you are asked to do storage performance sizing and you don’t have a freaking idea how to get started? A typical techie would probably say, “Aiya, just use the capacity lah!”, and usually, they will proceed to size the storage according to capacity. In fact, sizing by capacity is the worst way to do storage performance modeling.

Bear in mind that storage is not a black box, although some people wished it was. It is not black magic when it comes to performance sizing because things can be applied in a very scientific and logical manner.

SNIA (Storage Networking Industry Association) has made a storage performance modeling methodology (that’s quite a mouthful), and basically simplified it into these few key ingredients. This recipe is for storage performance modeling in general and I am advising you guys out there to engage your storage vendors professional services. They will know their storage solutions best.

And I am going to say to you – Don’t be cheap and not engage professional services – to get to the experts out there. I was having a chat with an consultant just now at McDonald’s. I have known this friend of mine for about 6-7 years now and his name is Sugen Sumoo, the Director of DBORA Consulting. They specialize in Oracle and database performance tuning and performance forecasting and it is something that a typical DBA can’t do, because DBORA Consulting is the Professional Service that brings expertise and value to Oracle customers. Likewise, you have to engage your respective storage professional services as well.

In a cook book or a cooking show, you are presented with the ingredients used and in this recipe for storage performance modeling, the ingredients (in no particular order) are:

  • Application block size
  • Read and Write ratio
  • Application access patterns
  • Working set size
  • IOPS or throughput
  • Demand intensity

Application Block Size

First of all, the storage is there to serve applications. We always have to look from the applications’ point of view, not storage’s point of view.  Different applications have different block size. Databases typically range from 8K-64K and backup applications usually deal with larger block sizes. Video applications can have 256K block sizes or higher. It all depends.

The best way is to find out from the DBA, email administrator or application developers. The unfortunate thing is most so-called technical people or administrators in Malaysia doesn’t have a clue about the applications they manage. So, my advice to you storage professionals, do your research on the application and take the default value. These clueless fellas are likely to take the default.

Read and Write ratio

Applications behave differently at different times of the day, and at different times of the month (no, it’s not PMS). At the end of the financial year or calendar, there are some tasks that these applications do as well. But in a typical day, there are different weightage or percentage of read operations versus write operations.

Most OLTP (online transaction processing)-based applications tend to be read heavy and write light, but we need to find out the ratio. Typically, it can be a 2:1 ratio or 60%:40%, but it is best to speak to the application administrators about the ratio. DSS (Decision Support Systems) and data warehousing applications could have much higher reads than writes while a seismic-analysis applications can have multiple writes during the analysis periods. It all depends.

To counter the “clueless” administrators, ask lots of questions. Find out the workflow of several key tasks and ask what that particular tasks do at different checkpoints of the application’s processing. If you are lazy (please don’t be lazy, because it degrades your value as a storage professional), use a rule of thumb.

Application access patterns

Applications behave differently in general. They can be sequential, like backup or video streaming. They can be random like emails, databases at certain times of the day, and so on. All these behavioral patterns affect how we design and size the disks in the storage.

Some RAID levels tend to work well with sequential access and others, with random access. It is not difficult to find out about the applications’ pattern and if you read more about the different RAID-levels in storage, you can easily identify the type of RAID levels suitable for each type of behavioral patterns.

Working set size

This variable is a bit more difficult to determine. This means that a chunk of the application has to be loaded into a working area, usually memory and cache memory, to be used and abused by the application users.

Unless someone is well versed with the applications, one would not be able to determine how much of the applications would be placed in memory and in cache memory. Typically, this can only be determined after the application has been running for some time.

The flexibility of having SSDs, especially the DRAM-type of SSDs, are very useful to ensure that there is sufficient “working space” for these applications.

IOPS or Throughput

According to SNIA model, for I/O less than 64K, IOPS should be used as a yardstick to do storage performance modeling. Anything larger, use throughput, in which MB/sec is the measurement unit.

The application guy would be able to tell you what kind of IOPS their application is expecting or what kind of throughput they want. Again, ask a lot of questions, because this will help you determine the type of disks and the kind of performance you give to the application guys.

If the application guy is clueless again, ask someone more senior or ask the vendor. If the vendor engineers cannot give you an answer, then they should not be working for the vendor.

Demand intensity

This part is usually overlooked when it comes to performance sizing. Demand intensity refers to how intense is the I/O requests. It could come from 1 channel or 1 part of the applications, or it could come from several parts of the applications in parallel. It is as if the storage is being ‘bombarded’ by applications and this is the part that is hard to determine as well.

In some applications, the degree of intensity or parallelism can be tuned and to find out, ask the application administrator or developer. If not, ask the vendor. Also do a lot of research on the application’s architecture.

And one last thing. What I have learned is to add buffers to the storage performance model. Typically I would add about 10-20% extra but you never know. As storage professionals, I would strongly encourage to engage professional services, because it is worthwhile, especially in the early stages of the sizing. It is usually a more expensive affair to size it after the applications have been installed and running.

“Failure to plan is planning to fail”.  The recipe isn’t that difficult. Go figure it out.

Don’t just look at disk reliability!

I am sure that many of you in the storage networking industry can relate to this very well.

When 1 or 2 disk drives fail, the customer will usually press you for an answer and usually this question will pop up. “How come the MTBF is 1.5 million hours but the drive(s) failed after a few months? We also get asked of “How reliable are the disks?” “How sure are you that the storage disks I buy will last?”

And for us in this line, we cannot deny the fact that the customer should be better informed (or at least we get cheesed off by these questions). A few blogs ago, I took the easy way out and educated the customer about MTBF (Mean Time Between Failure). This is only a quarter of the story because MTBF alone does not determine the reliability of the storage ecosystem and the reliability of the storage ecosystem (which translates to data availability) is something that the customer should ask rather than spending their time pressing their annoyance onto you about 1 or 2  disk failures.

I also want to say a little about another disk reliability statistics called AFR. More about that later.

Let’s get a little deeper with disk MTBF. Disk MTBF is a statistically calculated, pre-production measurement. The key word here is “PRE” meaning that THIS IS NOT A FIELD TESTED statistics! This is a statistical likelihood of how long a disk device will last.

One thing to note is how MTBF is derived. In fact, MTBF is established before the entire disk drive line goes into volume production. Typically, there is a process called Real Demonstration Test (RDT). RDT involves putting about 1,000 or more drives into a testing chamber, running them very hard, in elevated temperatures with 100% I/O for about 6-8 weeks. This is to simulate the harshest of an operating environment and inevitably, some disk drives will fail. From these failures, the MTBF is calculated.

A enterprise hard disk drives MTBF will usually be between 1.2 million to 2.0 million hours while the consumer grade drives usually have MTBF of about 300,000-600,000 hours. Therefore, it is important to educate customers because customers like to use some home office/SMB storage solutions to compare with the enterprise storage solution you are about to propose to him.

One of the war stories I heard was from a high-definition video production house. They get hundreds of thousands of Malaysian Ringgit worth of contract from a satellite TV content provider. But being less “educated” (could also be translated to being cheapo), they decided to store their valuable video contents on Buffalo NAS storage. And video production environments can be harsh. The I/O stress on the disks are strenuous and the Buffalo NAS disks crashed. They lost all contents (I don’t know what happened to their backup), and they were fined hundreds of thousands of Malaysian Ringgit and had their contract terminated on the spot. This is not to say that the Buffalo NAS is a poor product, but they got the wrong product for the job. You can’t expect to race the Formula 1 with an old jalopy, can you? You got to get the right solution for the job, even if it costs more.

So the moral of the story is – “Educate yourself and be prepared to invest if the dollar value of the data is more important than what are you think you might be cost-saving”

Over the years, MTBF (even though it is still very much in use today) is getting less and less useful as a reliability measurement. So, what’s better? AFR!

AFR or Annualized Failure Rate has been in use for almost 10 years now, and Seagate, the hard disk manufacturer, uses the AFR value heavily. AFR is the percentage of the installed bases of hard disk drives that failed and returned to factory in a given year. This is a more realistic figure and it is the statistics from the field. The typical value for enterprise disk drives  is usually between 0.7-1.0% although a few years ago, Google created a splash in the industry when they reported in an AFR of 36%. For those who would like to read Google’s paper, click here.

Therefore AFR is a more reliable measurement of disk reliability than MTBF.

But disk reliability is just a 1/4 of the story. We need to be out there educating the customers about the storage ecosystem reliability rather than a specific component. The data availability is paramount because components will fail throughout the lifecycle of the solution. That is why there are technology like RAID, snapshots, backup, mirroring and so on to ensure that the data is made available for the operations and businesses to continue.

Ultimately, if the customer wants to use the disk MTBF onto you, he’s basically shooting at you with the wrong bullet. It’s time you storage networking professional out there educate the customers.

Using simple MTBF to determine reliability to Finance

The other day, a prospect was requesting quotations after quotations from a friend of mine to make so-called “apple-to-apple” comparison with another storage vendor. But it was difficult to have that sort of comparisons because one guy would propose SAS, and the other SATA and so on. I was roped in by my friend to help. So in the end I asked this prospect, which 3 of these criteria matters to him most – Performance, Capacity or Reliability.

He gave me an answer and the reliability criteria was leading his requirement. Then he asked me if I could help determine in a “quick-and-dirty manner” by using MTBF (Mean Time Between Failure) of the disks to convince his finance about the question of reliability.

Well, most HDD vendors published their MTBF as a measuring stick to determine the reliability of their manufactured disks. MTBF is by no means accurate but it is useful to define HDD reliability in a crude manner. If you have seen the components that goes into a HDD, you would be amazed that the HDD components go through a tremendously stressed environment. The Read/Write head operating at a flight height (head gap)  between the platters thinner than a human hair and the servo-controlled technology maintains the constant, never-lagging 7200/10,000/15,000 RPM days-after-days, months-after-months, years-after-years. And it yet, we seem to take the HDD for granted, rarely thinking how much technology goes into it on a nanoscale. That’s technology at its best – bringing something so complex to make it so simple for all of us.

I found that the Seagate Constellation.2 Enterprise-class 3TB 7200 RPM disk MTBF is 1.2 million hours while the Seagate Cheetah 600GB 10,000 RPM disk MTBF is 1.5 million hours. So, the Cheetah is about 30% more reliable than the Constellation.2, right?

Wrong! There are other factors involved. In order to achieve 3TB usable, a RAID 1 (average write performance, very good read performance) would require 2 units of 3TB 7200 RPM disks. On the other hand, using a 10, 000 RPM disks, with the largest shipping capacity of 600GB, you would need 10 units of such HDDs. RAID-DP (this is NetApp by the way) would give average write performance (better than RAID 1 in some cases) and very good read performance (for sequential access).

So, I broke down the above 2 examples to this prospect (to achieve 3TB usable)

  1. Seagate Constellation.2 3TB 7200 RPM HDD MTBF is 1.2 million hours x 2 units
  2. Seagate Cheetah 600GB 10,000 RPM HDD MTBF is 1.5 million hours x 10 units

By using a simple calculation of

    RF (Reliability Factor) = MTBF/#HDDs

the prospect will be able to determine which of the 2 HDD types above could be more reliable.

In case #1, RF is 600,000 hours and in case #2, the RF is 125,000 hours. Suddenly you can see that the Constellation.2 HDDs which has a lower MTBF has a higher RF compared to the Cheetah HDDs. Quick and simple, isn’t it?

Note that I did not use the SAS versus SATA technology into the mixture because they don’t matter. SAS and SATA are merely data channels that drives data in and out of the spinning HDDs. So, folks, don’t be fooled that a SAS drive is more reliable than a SATA drive. Sometimes, they are just the same old spinning HDDs. In fact, the mentioned Seagate Constellation.2 HDD (3TB, 7200 RPM) has both SAS and SATA interface.

Of course, this is just one factor in the whole Reliability universe. Other factors such as RAID-level, checksum, CRC, single or dual-controller also determines the reliability of the entire storage array.

In conclusion, we all know that the MTBF alone does not determine the reliability of the solution the prospect is about to purchase. But this is one way you can use to help the finance people to get the idea of reliability.

Copy-on-Write and SSDs – A better match than other file systems?

We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems is to ensure that there are few key features

  • Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
  • Simple naming convention for files
  • Performance in storing and retrieving files – hence our write and read I/Os
  • Resilience in restoring full or part of a file when there are discrepancies

In file systems performance design, one of the most important factors is locality. By locality, I mean that data blocks of a particular file should be as nearby as possible. Hence, in most file systems designs originated from the Berkeley Fast File System (BFFS), requires the file system to seek the data block to be modified to ensure locality, i.e. you try not to split up the contiguity of the data blocks. The seek time to find the require data block takes time, but you are compensate with faster reads because the read-ahead feature allows you to read extra blocks ahead in anticipation that the data blocks are related.

In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present because the new modified block is written somewhere else, not the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminated the seek time and it also skipped the read-modify-write action to the original location of the data block. Therefore, write is likely to be faster.

However, the read portion will be slower because if you want to read a file, the file system has to go around looking for the data blocks because it lacks the locality. Therefore, as the COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make the file systems one of the better storage file systems in the market.

So, that’s Copy-on-Write file systems. But what about SSDs?

Solid State Drives (SSDs) will make enemies with file systems that tend prefer locality. Remember that some file systems prefer its data blocks to be contiguous? Well, SSDs employ “wear-leveling” and required writes to be spread out as much as possible across the SSDs device to prolong the life of the SSD device to reduce “wear-and-tear”. That’s not good news because SSDs just told the file systems, “I don’t like locality and I will spread out the data blocks“.

NAND Flash SSDs (the common ones we find in the market and not DRAM-based SSDs) are funny creatures. When you write to SSDs, you must ERASE first, WRITE AGAIN to the SSDs. This is the part that is creating the wear-and tear of the device. When I mean ERASE first, WRITE AGAIN, I describe it below

  • Writing 1 –> 0 (OK, no problem)
  • Writing 0 –> 1 (not OK, because NAND Flash can’t do that)

So, what does the SSD do? It ERASES everything, writing the entire data blocks on the device to 1s, and then converting some of them to 0s. Crazy, isn’t it? The firmware in the SSDs controller will also spread out the erase-and-then write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.

Since SSDs shun locality and avoid the data blocks to be nearby, and Copy-on-Write file systems are already doing this because its nature to write new data blocks somewhere else, the combination of both COW file system and SSDs seems like a very good fit. It even looks symbiotic because it is a case of “I help you; and you help me“.

From this perspective, the benefits of COW file systems and SSDs extends beyond resiliency of the SSD device but also in performance. Since the data blocks are spread out at different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Make sense, doesn’t it?

I have not learned about other file systems and how they behave with SSDs, but it is pretty clear that Copy-on-Write file systems works well with Solid State Devices. Have a good week ahead :-)!

What kind of IOPS and throughput do you get from RAID-5/6? – Part 2

In my previous blog entry, I mentioned the write penalty for RAID-5/6. This factor will figure heavily in the way we size the RAID-level for performance capacity planning.

It is difficult to ascertain what kind of IOPS and throughput that are required for an application, especially a database, to run well with additional room to grow. From a DBA or an application developer, I believe they would have adequate information to tell what is the numbers of users that the application can support, both average and peak, transactions per second (TPS), block size required for logs, database files and so on.

But as we are all aware, most of the time, these types of information are not readily available. So, coming from a storage angle, the storage administrator can advise the DBA or the application developer that the configured RAID group or volume or LUN is capable of delivering a certain number of IOPS and is able to achieve a certain throughput MB/sec. These numbers will be off the box itself immediately. Of course, other factors such as HBA speed, the FC/iSCSI configurations, the network traffic and so on will affect the overall performance delivery to the application. But we can safely inform the DBA and/or the application developer that this is what the storage is delivering out of the box.

The building blocks of all storage RAID groups/volumes/LUNs are pretty much your hard disk drives (HDDs) and/or Solid State Drives (SSDs). The manufacturer of these disks will usually publish the IOPS and throughput of individual drives but if these information is not available, we can construct IOPS of an individual HDD from its seek and latency times.

For example, if the HDD’s

average latency = 2.8 ms;          average read seek = 4.2 ms;              average write seek = 4.8 ms

then the IOPS can be calculated as

         IOPS = ---------------------------------------
                (average latency) + (average seek time)

Therefore from the details above,

         IOPS = -------------------  = 136.986 IOPS
                (0.0028) + (0.0045)

That’s pretty simple, right? But of course, it is easier to just accept that a certain type of disk will have a range of IOPS as shown in the table below:

Disk Type RPM IOPS Range
SATA 5,400 50-75
SATA 7,200 75-100
SAS/FC 10,000 100-125
SAS/FC 15,000 175-200
SSD N/A 5,000-10,000

The information from the table above is just for reference only and by no means a very accurate one but it is good enough for us to determine the IOPS of a RAID group/volume/LUN. Let’s look at the RAID write penalty again in the table below:

RAID-level Number of I/O Reads
Number of I/O for Writes
RAID Write Penalty
0 1 1 1
1 (1+0, 0+1) 1 2 2
5 1 4 4
6 1 6 6

Next, we need to know what is the ratio of Reads vs Writes for that particular database or application. I mentioned earlier that in OLTP-type of applications, we usually take a 2:1 or 3:1 ratio in favour of Reads.

To make things simpler, let’s assume we create a RAID-6 volume of 6 data disks and 2 parity disks in a RAID-6 (6+2) configuration. The disks used are SATA disks of 7,200 RPM, with each individual disk of 100 IOPS. Assume we are using a ratio of 2:1 in favour of Reads, which gives us 66.666% and 33.333% respectively for Reads and Writes.

Therefore, the combined IOPS of the 8 disks in the RAID-6 configuration is probably about 800 IOPS. However, because of the write penalty of RAID-6, the effective IOPS for the RAID-6 volume will be lower than that. Let’s do some calculation to see what happens:

1)  Read IOPS + Write IOPS = 800 IOPS

2)  (0.66666 x 800) + (0.33333 x 800) = 800 IOPS

3) Read IOPS will be 0.66666 x 800 = 533.328 IOPS

4) Write IOPS will be 0.33333 x 800 = 266.664 IOPS. However, since RAID-6 has a write penalty of 6, this number has to be divided by 6. 266.664/6 will be 44.444 IOPS for Writes

Therefore, what the RAID-6 volume is capable of is approximately 533 IOPS for Reads and 44 IOPS for Writes.

We have determined IOPS for the RAID volume but what about throughput. Throughput is determined by the block size used. Assume that our RAID-6 volume uses a 4-K block size. With a combined effective IOPS of 577 (533+44), we multiply the IOPS with the block size

     Throughput = 577 IOPS x 4-KB
                = 2308KB/sec

Therefore when I/O is sustained in a sequential manner, the effective throughput is 2308KB/sec.

On the other hand, we often were told to add more spindles to the volume to increase the IOPS. This is true, to a point, where the maximum amount of IOPS that can be delivered will taper into a flatline, because the I/O channel to the RAID volume  has been saturated. Therefore, it is best to know that adding more spindles does not always equate to a higher IOPS.

Performance sizing for a database or an application is both a science and an art. Mathematically, we can prove things to a a certain amount of accuracy and confidence but each storage platform is very different in the way they handle RAID. Newer storage platforms have proprietary RAID that nowadays, it does not matter much what kind of RAID is best for the application. Vendors such as IBM XIV has RAID-X which both radical in design and implementation. NetApp will almost always say RAID-DP is the best no matter what, because RAID-DP is all NetApp.

So there is no right or wrong to choose the RAID-level for the application. But it is VERY important to know what are the best practice are and my advice is everyone is to do Proof-of-Concepts, and TEST, TEST, TEST! And ASK QUESTIONS!

Don’t RAID-5/6 everything! – Part 1

It’s a beautiful Saturday morning … the sun is out, and the birds are chirping … and here I am, thinking about RAID-5/6. What’s wrong with me?

Anyway, have you ever wondered almost all your volumes are in a RAID-5/6 configuration? Like an obedient child, the answer would probably be “Oh, my vendor said it is good for me …”

In storage, the rule is applications-read, applications-write. And different applications have different behaviors but typically, they fall under 2 categories:

  • Random access
  • Sequential access

The next question to ask is how much Read/Writes ratio (or percentage) is in that Random Access behavior and how much of Read/Write ratio in Sequential Access behavior.

We usually pigeonhole transactional databases such as SQL Server, Oracle into OLTP-type characteristics with random access being the dominant access method. Similarly, email applications such as Exchange, Lotus and even SMTP into similar OLTP-type characteristics as well. We typically do a 2:1 or 3:1 ratio for OLTP-type applications with Read heavy and less of Writes. Data warehouse type of databases tend to be more sequential.

However, even within these OLTP applications, there are also sequential access behaviors as well, as the following table for a database shows:

Operation Random or Sequential Read/Write Heavy Block Size
DB-Log Random (Sequential in log recovery) Write Heavy unless you are doing log recovery 1KB – 64KB
DB-Data Files Random Read/Write mix dependent on load 4KB – 32KB
Batch insert Sequential Write Heavy 8KB – 128KB
Index scan Sequential Read Heavy 8KB – 128KB

We will look into 4 RAID-levels in this scenario and see how each RAID-level applies to an OLTP-type of environment. These RAID levels are RAID-0, RAID-1 (1+0, 0+1 included), RAID-5 and RAID-6.

RAID-0 is the baseline, with 1 x Read and 1 x Write being processed as per normal.

In RAID-1, it would require 2 x Writes and 1 x Read, because the write operation is mirrored. The RAID penalty is 2.

To avoid the cost of RAID-1, RAID-5 is almost always the RAID level of choice (unless you speak to those NetApp fellas). RAID-5 is a parity-based RAID and require 2 x Read (1 to read the data block and 1 to read the parity block) AND 2 x Write (1 to write the modified block and 1 to write the modified parity). Hence it has a RAID penalty of 4.

RAID-6 was to address the risk of RAID-5 because disk capacity are so freaking large now (3TB just came out). To rebuild a large-TB drive would take longer time and the RAID-5 volume is at risk if a second disk failure occurs. Hence, double parity RAID in RAID-6. But unfortunately, the RAID penalty for RAID-6 is 6!

To summarize the RAID write penalty,

RAID-level Number of I/O Reads
Number of I/O for Writes
RAID Write Penalty
0 1 1 1
1 (1+0, 0+1) 1 2 2
5 1 4 4
6 1 6 6

So, it is well known that RAID 0 has good performance for reads and writes but with absolutely no protection. RAID-1 would be good for random reads and writes but it is costly. RAID-5 is good for applications with a high ratio of sequential reads vs writes (2:1, 3:1 as mentioned), and RAID-6, errr … should be taken similarly as RAID-5 with some additional performance penalty.

With that in mind, a storage administrator must question why a particular RAID-level was proposed to the database or any like-applications.

I am going out to enjoy the Saturday now … and today, August 13th is the World’s Left-Handed Day. More about this RAID penalty and IOPS in my next entry.