No Flash in the pan

The storage networking market is now teeming with flash solutions. Consumers are probably sick to their stomachs trying to get a better insight into which flash solution they should be considering. There is so much hype, fuss and buzz, and yet, like a swarm of bees in the chaos of the moment, there is actually a calm and discerning pattern slowly, but surely, emerging. Storage networking folks would probably know this well, but for the benefit of other readers, how we view flash (and other solid state storage) becomes clear with the picture below: Flash performance gap

(picture courtesy of  http://electronicdesign.com/memory/evolution-solid-state-storage-enterprise-servers)

Right at the top, we have the CPU/Memory complex (labelled as Processor). Our applications, albeit bytes and pieces of them, run in this CPU/Memory complex.

Therefore, we can see Pattern #1 showing up. Continue reading

Has Object Storage become the everything store?

I picked up a copy of Brad Stone’s latest book, “The Everything Store: Jeff Bezos and the Age of Amazon”, at the airport on my way to Beijing last Saturday. I have been reading it the whole time I have been in Beijing, in awe of the turbulent ups and downs of Amazon.com.

The Everything Store cover

In its own serendipitous ways, Object-based Storage Devices (OSDs) have been floating in my universe in the past few weeks. Seems like OSDs have been getting a lot of coverage lately and suddenly, while in the shower, I just had an epiphany!

Are storage vendors now positioning Object-based Storage Devices (OSDs) as the Everything Store?

Continue reading

Correcting NCQ incorrect portrayal with SSDs

A kind reader, Baruch Even, has pointed out my ignorance with SATA Native Command Queuing (NCQ) working with Solid State Drives (SSDs) in my previous blog.

In the post, I have haphazardly stated that NCQ was meant for spinning mechanical drives. I was wrong.

NCQ does indeed improve the performance of SSDs using SATA interfaces, sometimes by as much as 15-20%. I know there is a statement on the SATA Wikipedia page that says NCQ boosts IOPS by 100%, but I would take a much more realistic view of things rather than set expectations too high.

The typical SSD consists of flash storage spread across multiple chips, which are in turn made up of a number of flash packages. Within each flash package there are several dies (as in the manufacturing term “die”, nothing to do with death) which house planes (nothing to do with aeroplanes), which are divided further into blocks and pages.
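To make that hierarchy a bit more concrete, here is a tiny Python sketch that walks package → die → plane → block → page and adds up the raw capacity. The fan-outs and sizes are hypothetical, picked purely for illustration (real geometries vary by vendor and generation):

```python
# A minimal, illustrative model of the NAND flash hierarchy described above.
# All sizes and fan-outs below are hypothetical examples, not any vendor's actual geometry.

PAGE_SIZE_KB     = 8      # one page: smallest unit that can be read/programmed
PAGES_PER_BLOCK  = 256    # one block: smallest unit that can be erased
BLOCKS_PER_PLANE = 1024
PLANES_PER_DIE   = 2
DIES_PER_PACKAGE = 4
PACKAGES_PER_SSD = 8

def ssd_capacity_gb():
    """Walk the hierarchy: package -> die -> plane -> block -> page."""
    block_kb = PAGE_SIZE_KB * PAGES_PER_BLOCK
    plane_kb = block_kb * BLOCKS_PER_PLANE
    die_kb   = plane_kb * PLANES_PER_DIE
    pkg_kb   = die_kb * DIES_PER_PACKAGE
    return PACKAGES_PER_SSD * pkg_kb / (1024 * 1024)

if __name__ == "__main__":
    print(f"Raw capacity of this hypothetical SSD: {ssd_capacity_gb():.0f} GB")
    # Because packages, dies and planes can operate in parallel, a deeper command
    # queue (which is what NCQ provides on SATA) lets the controller keep more of
    # them busy at once -- which is where the 15-20% gain tends to come from.
```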

Continue reading

Boosting Solid States beyond SATA

Lately, I have been getting deeper and deeper into low-level implementation related to storage technologies. In my previous blog, I wrote about my learning adventure with Priority Flow Control (PFC), and I intend to explore the Data Center Bridging concepts further in future blog entries.

Before I left for Sydney for a holiday last week, I got sidetracked by exciting stuff that’s happening in my daily encounters with friends and new friends. 2 significant storage related technologies fell into my lap. One is NVMe (Non-Volatile Memory Express) and the other is FPGA (Field Programmable Gate Array).

While this blog is going to be about NVMe, I actually found FPGA much more exciting. Through conversations, I found that there are 2 “biggies” in the FPGA world, and they are designed and manufactured by Xilinx and Altera. I admit that I have not done my homework on FPGA yet, having just returned from Sydney last night. I will blog about FPGA in future blogs.

But NVMe is also an important technology direction to the storage world as well.

I think most of us are probably already mesmerized by solid state drives. The bombardment of marketing, presentations, advertising and whatever else the vendors do to promote (and self-promote) solid state drives is inundating the intellectual senses of consumers and enterprises alike. And yet, many vendors do not explain both the pros and cons of integrating solid state into an IT environment. Even worse, many don’t even know the strengths and weaknesses of solid state themselves, creating exaggerations that feed a spiral of inaccuracies. Like a self-feeding frenzy, the industry seems to have positioned solid state storage as the saviour of the enterprise storage world. Go figure!

Continue reading

The big boys better be flash friendly

An interesting article came up in the news this week. The article, from the ever popular The Register, mentioned 3 up-and-coming storage stars, Nimble Storage, Tintri and Tegile, and their assault on a flash strategy “blind spot” of the big boys, notably EMC and NetApp.

I have known about Nimble Storage and Tintri for a couple of years now, and I did take some time to read up on their storage technology offerings. Tegile was new to me; it appeared on my radar after SearchStorage.com announced it as the Gold Winner of the enterprise storage category for 2012.

The Register article intrigued me because it implied that traditional storage vendors such as EMC and NetApp are probably doing a “band-aid” job when putting together their flash storage strategies. Typically, I see these strategic concepts introduced by these 2 vendors:

  1. Have a server-side cache strategy by putting a PCIe card on the hosting server
  2. Have a network-based all-flash caching area
  3. Have a PCIe-based flash card on the storage system
  4. Have solid state drives (SSDs) in its disk shelves enclosures

In (1), EMC has VFCache (the server-side caching software has since been renamed XtremSW Cache and is being repackaged under the Xtrem brand name) and NetApp has its FlashAccel solution. Previously, as I was informed, FlashAccel was using the Fusion-io ioTurbine solution, but just days ago NetApp added the LSI Nytro WarpDrive to its FlashAccel solution as well. The main objective of a server-side caching strategy using flash is to accelerate mostly read-based I/O operations for specific application workloads at the server side.
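To illustrate the concept in (1) – and only the concept, not how XtremSW Cache or FlashAccel are actually implemented – here is a minimal read-through cache sketch in Python, where the host-side flash card is stood in for by an in-memory LRU map:

```python
from collections import OrderedDict

class ServerSideReadCache:
    """A toy read-through cache, standing in for the flash card on the host.

    This is only a sketch of the *concept* of server-side read caching;
    it is not how any vendor's product is actually implemented.
    """
    def __init__(self, capacity_blocks, backend_read):
        self.capacity = capacity_blocks
        self.backend_read = backend_read    # function: lba -> data (the array behind the SAN)
        self.cache = OrderedDict()          # lba -> data, kept in LRU order

    def read(self, lba):
        if lba in self.cache:               # cache hit: served from local flash, no SAN round trip
            self.cache.move_to_end(lba)
            return self.cache[lba]
        data = self.backend_read(lba)       # cache miss: fetch from the storage array
        self.cache[lba] = data
        if len(self.cache) > self.capacity: # evict the least recently used block
            self.cache.popitem(last=False)
        return data

# Usage sketch (read_block_from_array is a hypothetical SAN read function):
#   cache = ServerSideReadCache(10_000, backend_read=read_block_from_array)
#   data = cache.read(lba=42)
# Writes typically still go straight to the array (write-through), which is why
# this approach mainly accelerates read-heavy workloads.
```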

Continue reading

“I want to put in my own hard disk”

“I want to put in my own hard disk“.

If a customer ever utters that sentence, it will trigger a storage vendor meltdown. Panic buttons, alarm bells, and everything else that will make a salesman go berserk. That’s a big NO, NO!

For decades, storage vendors have relied on proprietary hardware to keep customers in line, and to have customers continue to sign hefty maintenance contracts until the next tech refresh. The maintenance contract, with support, software upgrades and hardware spares replacement, defines the storage networking industry that we are in. Even as some vendors have commoditized their hardware on x86 platforms, and on standard enterprise hard disk drives (HDDs), NICs and HBAs, the openness and cost savings of that commodity hardware are usually not passed on to the customers.

It is easy to explain to customers that keeping their enterprise data in reliable and high performance storage hardware with performance optimization and special firmware is paramount, and any unwarranted and unvalidated hardware would put the customer’s data at high risk.

There is a choice now. The ripple of enterprise-grade, open storage kernel and file system has just started its first ring, and we hope that this small ripple will reverberate across the storage industry in the next few years.

Continue reading

4TB disks – the end of RAID

Seriously? 4 freaking terabyte disk drives?

The enterprise SATA/SAS disks have just grown larger, up to 4TB now. Just a few days ago, Hitachi boasted the shipment of the first 4TB HDD, the 7,200 RPM Ultrastar™ 7K4000 Enterprise-Class Hard Drive.

And just weeks ago, Seagate touted that their Heat-Assisted Magnetic Recording (HAMR) technology will bring forth 6TB hard disk drives in the near future, with 60TB HDDs not far on the horizon. 60TB is a lot of capacity, but a big, big nightmare for disk availability and data backup. My NetApp Malaysia friend joked that the RAID reconstruction of a 60TB HDD would probably finish by the time his daughter finishes college, and his daughter is still in primary school!

But the joke reflects something very serious we are facing: the capacity of HDDs keeps growing into something that could become unmanageable if the traditional implementation of RAID does not change to meet such monstrous capacities.

Yes, RAID has changed since 1988, as every vendor approaches RAID differently. NetApp was always about RAID-4 and later RAID-DP, and I remember the days when EMC had RAID-S. There was even a vendor in the past who marketed RAID-7, but it was proprietary and not an industry standard. Fundamentally though, RAID has not changed in any revolutionary way as it continued to withstand the ever ballooning capacities (and pressures) of the HDDs. RAID-6 was introduced when the first 1TB HDDs came out, to address the risk of a possible second disk failure in a parity-based RAID like RAID-4 or RAID-5. But today, the 4TB HDD could be the last straw that breaks the camel’s back, or in this case, RAID’s back.

RAID-5 is obviously dead. Even RAID-6 might be considered insufficient now. Having a third parity drive (3P) is an option, and the only commercial technology I know of that supports 3 parity drives is ZFS. But 3P adds additional overhead in performance and usable capacity. Will the fickle customer ever accept such trade-offs?

Note that 3P is not RAID-7. RAID-7 is a trademark of an old company called Storage Computer Corporation, and RAID-7 is not a standard definition of RAID.

One of the biggest concerns is rebuild time. If a 4TB HDD fails, the rebuild could take days at typical rebuild speeds. The failure of a second HDD could stretch the rebuild to a week or so … and the data is vulnerable the whole time the disks are being rebuilt.
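A rough back-of-the-envelope sketch, with assumed rebuild rates, shows why “days” is not an exaggeration:

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_sec):
    """Rough time to sequentially rewrite a whole replacement drive."""
    capacity_mb = capacity_tb * 1000 * 1000     # decimal TB, as drive vendors count it
    return capacity_mb / rebuild_mb_per_sec / 3600

# A busy array rarely rebuilds at full disk speed; the rates below are assumptions.
for rate in (20, 50, 100):
    print(f"4TB at {rate} MB/s -> {rebuild_hours(4, rate):.0f} hours")
# 4TB at  20 MB/s -> ~56 hours
# 4TB at  50 MB/s -> ~22 hours
# 4TB at 100 MB/s -> ~11 hours
```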

There are a lot of talks about declustered RAID, and I think it is about time we learn about this RAID technology. At the same time, we should demand this technology before we even consider buying storage arrays with 4TB hard disk drives!

I have said this before: I am still trying to wrap my head around declustered RAID. So I invite the gurus on this matter to comment, but here is my understanding of the subject.

Panasas’ founder, Dr. Garth Gibson, is one of the people who proposed RAID declustering way back in 1999. He is a true visionary.

One of the issues of traditional RAID today is that we still treat the hard disk component in a RAID domain as a whole device. Traditional RAID is designed to protect whole disks with block-level redundancy. An array of disks is treated as a RAID group, or protection domain, that can tolerate one or more failures and still recover a failed disk from the redundancy encoded on the other drives. The RAID recovery requires reading all the surviving blocks on the other disks in the RAID group to recompute the blocks lost on the failed disk. In short, the recovery, in the event of a disk failure, is of the whole device, and therefore an entire 4TB HDD has to be recovered. This is not good.

The concept of RAID declustering is to break away from the whole-device idea and apply RAID at a more granular scale. IBM GPFS works with logical tracks, and RAID is applied at the logical track level. Here’s an overview of how it compares to traditional RAID:

The logical tracks are spread out algorithmically across all the physical HDDs, and the RAID protection layer is applied at the track level, not at the HDD device level. So when a disk actually fails, the RAID rebuild is applied at the track level. This significantly improves the rebuild time of the failed device, and does not affect the performance of the entire RAID volume much. The diagram below shows the declustered RAID’s rebuild time and performance impact when compared to a traditional RAID, and the rough sketch that follows puts some illustrative numbers on the same idea.
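Here is that deliberately simplified model (my own illustration, ignoring parity computation, host load and rebuild throttling) of why spreading the rebuild across the whole pool shrinks the rebuild window so dramatically:

```python
def traditional_rebuild_hours(disk_tb, spare_write_mb_s=100):
    """Classic RAID: the single spare drive is the write bottleneck."""
    return disk_tb * 1_000_000 / spare_write_mb_s / 3600

def declustered_rebuild_hours(disk_tb, pool_disks, per_disk_mb_s=100):
    """Declustered RAID: the lost tracks are rebuilt in parallel, with every
    surviving disk contributing a share of the reads and writes."""
    work_mb = disk_tb * 1_000_000
    aggregate_mb_s = (pool_disks - 1) * per_disk_mb_s
    return work_mb / aggregate_mb_s / 3600

print(f"Traditional 4TB rebuild  : {traditional_rebuild_hours(4):.1f} h")    # ~11 hours
print(f"Declustered, 60-disk pool: {declustered_rebuild_hours(4, 60):.2f} h") # ~11 minutes, in this idealized model
```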

While the IBM GPFS approach to declustered RAID is applied at a semi-device level, the future is leaning towards OSD. OSD or object storage device is the next generation of storage and I blogged about it some time back. Panasas is the leader when it comes to OSD and their radical approach to this is applying RAID at the object level. They call this Object RAID.

With object RAID, data protection occurs at the file-level. The Panasas system integrates the file system and data protection to provide novel, robust data protection for the file system.  Each file is divided into chunks that are stored in different objects on different storage devices (OSD).  File data is written into those container objects using a RAID algorithm to produce redundant data specific to that file.  If any object is damaged for whatever reason, the system can recompute the lost object(s) using redundant information in other objects that store the rest of the file.

The above was a quote from the blog of Brent Welch, Panasas’ Director of Software Architecture. As mentioned, the RAID protection of the objects in the OSD architecture in Panasas occurs at file-level, and the file or files constitute the object. Therefore, the recovery domain in Object RAID is at the file level, confining the risk and damage of data loss within the file level and not at the entire device level. Consequently, the speed of recovery is much, much faster, even for 4TB HDDs.

Reliability is the key objective here. Without reliability, there is no availability. Without availability, there are no performance factors to consider. Therefore, the system’s reliability is paramount when it comes to having the data protected. RAID has been the guardian all these years. It’s time for a revolutionary approach to safeguard reliability and ensure data availability.

So, how many vendors can claim they have declustered RAID?

Panasas is a big YES, and they apply their intelligence in large HPC (high performance computing) environments. Their technology is tried and tested. IBM GPFS is another. But where are the rest?

 

We raid vRAID

I took a bit of time off to read through Violin’s vRAID technology because I realized that vRAID (other than Violin’s vXM architecture) is the other most important technology that differentiates Violin Memory from the other upstarts. I blogged at a high level about Violin a few entries ago, and we are continuing with Violin’s impressive entrance with a storage technology that has been around for almost 25 years – RAID. Incidentally, I found this picture of the original RAID paper (see below):

Has RAID evolved with solid state storage? Evidently not, because I have not read of any vendor (so far) touting any RAID revolution in their solid state offerings. There has been a lot of negative talk about RAID, but RAID has been the cornerstone and the foundation of storage since the beginning. With the onslaught of very large capacity HDDs, the demands of packing more bits per inch and the insatiable need for reliability, RAID is slowly beginning to show its age. Cracks in the armour, I would say. And there are many newer, slightly more refined versions of RAID, from the Network RAID style of the HP P4000 or the Dell EqualLogic, to the RAID-X of IBM XIV, to the innovations of declustered RAID in Panasas. (Interestingly, one of the authors of the original RAID paper, Garth Gibson, is the founder of Panasas.)

And the new vRAID from Violin Memory doesn’t stray much from the good ol’ RAID, but it has been adapted to address the issues of solid state devices.

Solid State devices (notably NAND Flash since everyone is using them) are very different from the usual spinning disks of HDDs. They behave differently and pairing solid state devices with the present implementations of RAID could be like mixing oil and water. I am not saying that the present RAID cannot work with solid state devices, but has RAID adapted to the idiosyncrasies of Flash?

It is like putting an old crankshaft into a new car. It might work for a while, but in the long run, it could damage the car. Similarly, conventional RAID might have a detrimental performance and availability impact with solid state devices. And we have hardly seen storage vendors coming forward to say that their RAID technology has been adapted to the solid state devices that they are selling. This silence likely means that they are just adapting to market requirements and not changing their RAID code very much to take advantage of flash, or other solid state storage for that matter. Violin Memory has boldly come forward to meet that requirement, and vRAID is their answer.

Violin argues that there are bottlenecks at the external RAID controller or software RAID level, as well as in the use of legacy disk drive interfaces. And this is indeed true, because this very common RAID implementation squeezes out performance at the expense of other components, such as CPU cycles.

Furthermore, there are plenty of idiosyncrasies in flash, such as its erase-first, then write mechanism. The nature of NAND flash, unlike DRAM, requires a block to be erased before a write to it is allowed. It does not “modify” per se, in the way the read-modify-write operation is applied in parity-based RAID levels 5 and 6. It is more like read-erase-write, and while the erase of the block is occurring, read operations are stalled. That is why most SSDs have impressive read latency (in microseconds) but much poorer write latency (in milliseconds). On top of that, the write penalty of parity-based RAID can further aggravate the situation when typical RAID technology is applied to NAND flash solid state storage.
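A quick back-of-the-envelope sketch of that write penalty, using hypothetical device numbers, shows how quickly parity RAID eats into flash write performance even before the erase stalls are counted:

```python
# Back-of-envelope: how many device operations one small host write turns into.
# RAID-5: read old data + read old parity + write new data + write new parity = 4
# RAID-6: 3 reads + 3 writes = 6
WRITE_PENALTY = {"RAID-5": 4, "RAID-6": 6}

def effective_write_iops(device_write_iops, raid_level, devices):
    """Aggregate small-random-write IOPS the RAID group can sustain."""
    return device_write_iops * devices / WRITE_PENALTY[raid_level]

# Hypothetical flash device numbers, for illustration only.
for level in WRITE_PENALTY:
    print(level, effective_write_iops(device_write_iops=10_000, raid_level=level, devices=5))
# On NAND flash, the "write" half of each of those operations may additionally have
# to wait for a slow block erase -- the read-erase-write stall described above --
# so real-world numbers can be worse still.
```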

As the writes to the blocks in the NAND flash build up, the accumulation of read-erase-write cycles will not only reduce the lifespan of the blocks, it will also reduce the IOPS to a state we call the normalized steady state. I wrote about this in my blog, “Not all SSDs are the same”, some moons ago. In that blog, referencing the SNIA Solid State Storage (SSS) Performance Test Specification (PTS), I described the 3 distinct phases of a typical NAND flash SSD:

  • Fresh Out of the Box (FOB)
  • Transition
  • Steady State
This performance degradation is part of what vendors call the “write cliff”, where there is a sudden drop in IOPS performance as the NAND flash SSD ages. Here’s a graph that shows the performance drop.

Violin’s vRAID, implemented within its switched vXM architecture itself, using proprietary high performance flash controllers and the flash-optimized vRAID technology, is able to deliver sustained IOPS throughout the lifespan of the flash, as shown below:

To understand vRAID, we have to understand the building blocks of the Violin storage array. NAND flash chips of 4GB are packed into a flash package of 8, giving it 32GB. 16 of these 32GB flash packages are then consolidated into a 512GB VIMM (Violin Inline Memory Module). The VIMM is the starting block and can be considered a “disk”, since we are used to the concept of a “disk” in the storage networking world. 5 of these VIMMs form a RAID group of 4+1 (four data and one parity), giving redundancy, performance and capacity similar to RAID-5.

The block size used is 4K, and this 4K block is striped across the RAID group as 1K pages, one on each of the VIMMs in the RAID group. Each of these 1K pages is managed independently and can be placed anywhere in any flash block in the VIMMs, spread out for the lowest possible latency and best bandwidth. This contributes to the “spike-free latency” of Violin Memory. Additionally, there is ECC protection within each 1K page to correct flash bit errors.
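To make the 4+1 striping concrete, here is my own small sketch (not Violin’s code) of a 4K block being cut into four 1K data pages plus one XOR parity page, one page per VIMM, and of how any single lost page can be rebuilt:

```python
import os

PAGE = 1024  # 1K pages, per the description above

def stripe_4k_block(block: bytes):
    """Split a 4KB block into four 1KB data pages plus one XOR parity page."""
    assert len(block) == 4 * PAGE
    data_pages = [block[i * PAGE:(i + 1) * PAGE] for i in range(4)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data_pages))
    return data_pages + [parity]          # one page per VIMM in the 4+1 group

def recover_missing_page(pages, missing_index):
    """Any single lost page (a failed VIMM) is the XOR of the other four."""
    survivors = [p for i, p in enumerate(pages) if i != missing_index]
    return bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*survivors))

if __name__ == "__main__":
    block = os.urandom(4 * PAGE)
    pages = stripe_4k_block(block)
    assert recover_missing_page(pages, 2) == pages[2]   # lose VIMM #2, rebuild it
```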
To protect against metadata corruption, there is an additional, built-in RAID check bit to correct VIMM errors. Lastly, one important feature addresses the read-erase-write weakness of NAND flash: vRAID ensures that the slow erases never block a read or a write. This architectural feature enables spike-free latency in mixed read/write environments.

Here’s a quick overview of Violin’s vRAID architecture:

I still feel that we need a radical move away from traditional RAID, and vRAID is moving in the right direction to evolve RAID to meet the demands of the data storage market. Revolutionary and radical it may not be, but then again, is the market ready for anything else?

As I said, so far Violin is the only all-flash vendor that has boldly come forward to meet the storage latency problem head-on, and they have been winning customers very quickly. Well done!

Lightning about to strike

Watch out for February 6th, 2012 folks! The Lightning is about to strike!

Yes, it is likely that EMC will be announcing their server-based, 8-lane PCIe flash memory card in the early week of February. The PCIe card was dubbed “Project Lightning” when it was first announced at EMC World in May last year. It represents EMC’s first foray into products that sit on the server side, giving the impression that EMC could be entering the server business. I blogged about this way back in September last year. As explained by the EMC folks, they are not going into the server business but rather “extending” their performance tiering into the server space. Think of it like an umbilical cord that sucks the server’s CPU processing power to give a maximum performance boost to the EMC storage.

The card will sport solid state storage from the LSI WarpDrive and comes in 100/200/300GB capacities. Here’s a picture of what the Lightning card looks like:

The SSD is SLC (Single Level Cell) and is capable of delivering 150,000 random read IOPS based on 4K blocks and 190,000 random write IOPS. It can squeeze out 1.4GB/sec in read throughput. While it is not on par with the performance of Fusion-io, it can definitely do well leveraging EMC’s huge customer base. Furthermore, PCIe-based flash memory cards such as Fusion-io’s are not able to take advantage of the bridge that links the server and the storage, and remain confined to the server’s resources. The advantage definitely goes to EMC when you explore the possibilities.

Here’s a view of a slide from Virtual Geek summarizing the Project Lightning:

The Lightning card is aimed at customers who demand the highest performance, even higher than Tier 0. It will be integrated with EMC’s FAST (Fully Automated Storage Tiering) technology and will be available for the VNX and VMAX platforms.

So watch out folks, because Lightning is about to strike soon!

Not all SSDs are the same

Happy Lunar New Year! The Chinese around the world have just ushered in the Year of the Water Dragon yesterday. To all my friends and family, and readers of my blog, I wish you a prosperous and auspicious Chinese New Year!

Over the holidays, I have been keeping up with the progress of Solid State Drives (SSDs). I am sure many of us are mesmerized by SSDs, and the storage vendors are touting the best that SSDs have to offer. But let me tell you one thing – you are probably getting the least of what the best SSDs have to offer. You might be puzzled why I say this.

Let me share a common sales pitch. Most (if not all) storage vendors will tout performance (usually IOPS) as the greatest benefit of SSDs. The performance numbers have to be compared to something, and that something is your regular spinning Hard Disk Drives (HDDs). Even the slowest SSDs are many times faster than HDDs in IOPS terms. A single SSD can churn out at least 5,000 IOPS, whereas the fastest 15,000 RPM HDDs churn out about 200 IOPS (depending on the HDD vendor). That makes even the slowest SSDs roughly 25x faster than the fastest HDDs when measured in IOPS.
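The arithmetic is simple enough to show in a few lines, using the same illustrative numbers:

```python
# The simple arithmetic behind that claim, with illustrative numbers.
ssd_iops = 5_000   # a modest SSD
hdd_iops = 200     # a fast 15,000 RPM HDD
print(f"SSD vs HDD, small random IOPS: ~{ssd_iops / hdd_iops:.0f}x")   # ~25x
```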

But the intent of this blogger is to share with you more about SSDs. There’s more to know, because SSDs are not all built the same. There are write-biased SSDs and read-biased SSDs; there are SLC (single level cell) and MLC (multi level cell) SSDs, and so on. How do you differentiate them when Vendor A touts their SSDs and Vendor B touts their SSDs as well? You are not comparing SSDs and HDDs anymore. How do you know what questions to ask when they show you their performance statistics?

SNIA has recently released a methodology called the “Solid State Storage (SSS) Performance Test Specification (PTS)” that helps customers evaluate and compare SSD performance from a vendor-neutral perspective. There is also a whitepaper related to the SSS PTS. This is something very important, because we have to continue to educate the community about what is right and what is wrong.

In a recent webcast, the presenters from the SNIA SSS TWG (Technical Working Group) mentioned a few questions that I think we, as vendors and customers, should ask when faced with an SSD sales pitch. I thought I would share them with you.

  • Was the performance testing done at the SSD device level or at the file system level?
  • Was the SSD pre-conditioned before the testing? If so, how?
  • Were the performance results taken at steady state?
  • How much data was written during the testing?
  • Where was the data written to?
  • What data pattern was tested?
  • What was the test platform used to test the SSDs?
  • What hardware or software package(s) were used for the testing?
  • Was the HBA bandwidth, queue depth and other parameters sufficient to test the SSDs?
  • What type of NAND Flash was used?
  • What is the target workload?
  • What was the percentage weight of the mix of Reads and Writes?
  • Are there warranty or design life issues?

I thought that these questions were very relevant to understanding SSD performance. I also got to know that SSDs behave differently throughout the life stages of the device. From a performance point of view, there are 3 distinct performance life stages:

  • Fresh out of the box (FOB)
  • Transition
  • Steady State

 

As you can see from the graph below, an SSD fresh out of the box (FOB) displays considerable performance numbers. Over a period of time (the graph shows minutes), it transitions into a mezzanine stage of lower IOPS and finally normalizes into the stage called the steady state. The steady state is the desirable test range that will give the most accurate IOPS numbers. Therefore, it is important that your storage vendor’s performance numbers are taken during this life stage.
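For the curious, here is a rough sketch of a steady-state check in the spirit of the PTS (my paraphrase of the criteria – the spec itself is the normative reference): the window of measurements must stay within a small excursion of its average and must not show a meaningful upward or downward trend.

```python
def looks_like_steady_state(iops_window, max_excursion=0.20, max_drift=0.10):
    """Rough steady-state check over a measurement window of IOPS samples.
    Thresholds and method are my approximation of the PTS idea, not the spec text."""
    n = len(iops_window)
    avg = sum(iops_window) / n
    # 1) data excursion: max - min stays within 20% of the window average
    if max(iops_window) - min(iops_window) > max_excursion * avg:
        return False
    # 2) trend: slope of the least-squares fit, scaled to the window width,
    #    stays within 10% of the window average
    x_mean = (n - 1) / 2
    slope = sum((x - x_mean) * (y - avg) for x, y in enumerate(iops_window)) \
            / sum((x - x_mean) ** 2 for x in range(n))
    return abs(slope * (n - 1)) <= max_drift * avg

# Example: the last five one-minute rounds of an IOPS run (illustrative numbers)
print(looks_like_steady_state([41_200, 40_800, 40_950, 41_050, 40_900]))   # True
print(looks_like_steady_state([95_000, 70_000, 55_000, 47_000, 43_000]))   # False (still transitioning)
```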

Another consideration when interpreting SSD performance numbers is what type of test was used. The test could be done at the file system level or at the device level. As shown in the diagram below, the numbers could be taken from many different elements along the stack of the data path.

 

Performance for cached data would give impressive numbers, but it is not accurate. File system level performance is not that useful either, because the data travels through different layers, masking the true performance capability of the SSDs. Therefore, SNIA’s performance testing is based on a synthetic, device-level test to achieve consistency and more accurate IOPS numbers.

There are many other factors that determine the most relevant performance numbers. The SNIA PTS has 4 main test suites that address different aspects of an SSD’s performance (a small sketch of what such a test sweep might enumerate follows the list). They are:

  • Write Saturation test
  • Latency test
  • IOPS test
  • Throughput test
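As a sketch of what such device-level testing sweeps through, here is a hypothetical test matrix generator; the block sizes and read/write mixes shown are illustrative stand-ins, not the normative PTS values:

```python
from itertools import product

# Illustrative sweep dimensions -- the actual PTS tables are defined in the spec.
BLOCK_SIZES_KB = [0.5, 4, 8, 16, 32, 64, 128, 1024]
READ_WRITE_MIXES = [(100, 0), (95, 5), (65, 35), (50, 50), (35, 65), (5, 95), (0, 100)]

def iops_test_matrix():
    """Enumerate every (block size, R/W mix) combination an IOPS sweep would run.
    Each cell would only be measured after the device has been pre-conditioned
    and has reached steady state, per the questions listed earlier."""
    for bs_kb, (read_pct, write_pct) in product(BLOCK_SIZES_KB, READ_WRITE_MIXES):
        yield {"block_kb": bs_kb, "read_pct": read_pct, "write_pct": write_pct}

if __name__ == "__main__":
    cells = list(iops_test_matrix())
    print(f"{len(cells)} test cells, e.g. {cells[0]} ... {cells[-1]}")
```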

The SSS PTS would be able to reveal which is the better SSD. Here’s a sample report on latency.

Once again, it is important to know the details and not to take vendors’ numbers verbatim. As the SSD market continues to grow, the responsibility lies on both sides of the fence – the vendor and the customer.