VAAI to go!

First of all, let me apologize. I am guilty of not updating my blogs as regularly as I did in the past. Things got a bit crazy after Christmas and I had to juggle several things that demanded more of my attention, but I am confident they will sort themselves out soon enough.

Today’s topic is VMware’s VAAI (vSphere vStorage APIs for Array Integration). This feature was announced more than 3 years ago but was only introduced in vSphere 4.1 in July 2010, and it has been further enhanced in the latest release, vSphere 5.0.

What is this VAAI and what does this mean from a storage perspective?

When VMware came into prominence in the version 3.0/3.5 days, the whole world revolved around the ESX hypervisor. It tried to do everything on its own, in its own proprietary way. Given its nascent existence then, ESX had to do what it had to do and control everything within its hypervisor universe. Yes, it was a good move then and it did what it was supposed to do. This was back when server virtualization was in its infancy, and resource requirements were less demanding.

Hence when VMware wanted to initialize VMs, create VMDK files on the datastore, create clones or snapshots, or even execute VMotion and Storage VMotion, it tended to execute these operations at the hypervisor level. For example, when creating virtual disks with VMFS, most of the commands to initialize the disks were issued at the VMFS level. Zeroing the virtual disks meant sending zeroing commands to the actual physical disks on the shared storage. This went on back and forth, taxing the CPU cycles and memory at the hypervisor layer, and sending wasteful and unnecessary zeroes over the network to the storage array. This was inefficient and wasteful, and it degraded performance tremendously, especially at the hypervisor layer (compute and memory).

There were also other operations, such as virtual disk locking, that locked up the entire LUN housing several datastores. Again, not good.

But VMware took off like a rocket, and quickly established itself as a Tier 1, enterprise server virtualization solution addressing the highest demands of the enterprise. It is also defining the future of Cloud Computing, driving ever more demanding requirements as it pushes forward. And VMware began to realize that if the hypervisor was to scale, it needed to leave the I/O operations to the “experts”, the “experts” here being the respective storage arrays themselves.

So, in version 4.1, VAAI (vStorage API for Array Integration) was introduced as an API suite, following 3 earlier APIs – vStorage API for Site Recovery Manager (SRM), vStorage API for Data Protection and vStorage API for Multipathing.

In a nutshell, as I have mentioned before, VAAI offloads I/O and storage-related operations to the VAAI-capable storage array (leave it to the experts), as shown in the diagram below:

 

Of course, the storage vendors themselves have to rework their array OS layer to integrate with the VAAI APIs. You can say that the VAAI primitives are “hooks” that enhance storage connectivity and communications with vSphere’s hypervisor. But then again, if you look at it from the other angle, vSphere needs the storage vendors even more in order for its universe to scale. Good thing VMware has a big, big market share. Imagine if there were no takers for the VAAI APIs. That would be a strange predicament indeed.

What is the big deal that we get from VAAI? The most significant and noticeable benefit is increased performance. By offloading the I/O functionality and operations to the storage array itself, the hypervisor and its compute and memory resources are not bogged down, resulting in higher performance and better response times to serve its VMs and other VM operations.
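To put a rough number on that offload benefit, here is a little back-of-the-envelope Python sketch. The write sizes and command overheads are made-up figures and this is purely my own illustration, not VMware’s actual API, but it shows why pushing every zero block across the wire is so much more wasteful than telling the array to do the zeroing itself (which is what the Block Zeroing primitive does):

# Purely illustrative numbers: compare the traffic a host pushes over the
# storage network when zeroing a new 10GB virtual disk, with and without an
# array-offload primitive in the spirit of VAAI Block Zeroing (WRITE SAME).
BLOCK_SIZE = 1 * 1024 * 1024            # assume the host zeroes in 1MB writes
VDISK_SIZE = 10 * 1024 * 1024 * 1024    # a 10GB virtual disk to be zeroed
CMD_OVERHEAD = 512                      # rough size of a single SCSI command

def zero_without_offload(disk_size, block_size):
    # The hypervisor writes every zero block itself, so the zeroes travel the wire.
    writes = disk_size // block_size
    bytes_on_wire = writes * (block_size + CMD_OVERHEAD)
    return writes, bytes_on_wire

def zero_with_offload(disk_size, extent_size=256 * 1024 * 1024):
    # The hypervisor sends one small command per extent; the array zeroes internally.
    commands = disk_size // extent_size
    bytes_on_wire = commands * CMD_OVERHEAD
    return commands, bytes_on_wire

writes, wire = zero_without_offload(VDISK_SIZE, BLOCK_SIZE)
print(f"No offload  : {writes} writes, {wire / 1e9:.2f} GB over the wire")
cmds, wire = zero_with_offload(VDISK_SIZE)
print(f"With offload: {cmds} commands, {wire / 1e3:.1f} KB over the wire")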

I am going off to another meeting and I shall write about VAAI in more detail later. Until the next entry, adios and have a great year ahead.

Is there IOPS for Cloud Storage? – Nasuni style

I was in Singapore last week attending the Cloud Infrastructure Services course.

In the class, one of the foundation components of Cloud Computing is, of course, storage. As the students and the instructor talked about storage, one very interesting argument surfaced. It revolved around storage offered in the cloud. A lot of people assumed that Cloud Storage would be for their databases and their virtual machines, which of course is true when the communication between the applications, virtual machines and databases stays within the local area network of the Cloud Service Provider (CSP).

However, if the storage is offered through the cloud to applications that are sitting on-premise in the customer’s server room, then we have to think twice about how we perceive Cloud Storage. In this aspect, the Cloud Storage offered by the CSP is an Infrastructure-as-a-Service (IaaS) where the key service is Storage. We have to recognize that this Storage functions as a data container, and usually not for I/O performance reasons.

Though this concept is probably easily understood by storage professionals like us, it can cause a bit of confusion for someone new to the concepts of Cloud Computing and Cloud Storage. This confusion, unfortunately, is caused by many of us who are vendors or solution providers, or even publications and magazines. We are responsible for disseminating correct information to customers, but due to our lack of knowledge and experience in this extremely new market of Cloud Storage, we have created FUD (Fear, Uncertainty and Doubt) and hype.

Therefore, it is the duty of this blogger to clear the vapourware, and hopefully pass on the right information to accelerate the adoption of Cloud Storage in the near future. At this moment, given factors such as network costs, high network latency and the lack of key network technologies similar to the LAN in Cloud Computing, Cloud Storage is, most of the time, for data containership and archiving only. And there are no IOPS or any performance-related statistics for Cloud Storage. If any engineer or vendor tells you that they have the fastest Cloud Storage in the industry, do me a favour. Give him/her a knock on the head for me!

Of course, as technologies evolve, this could change in the near future. For now, Cloud Storage is a container, NOT high performance storage in the cloud. It is usually not meant for transactional data. There are many vendors in the Cloud Storage space, from real CSPs to storage companies offering re-packaged storage boxes that are “cloud-ready”. A good example of a CSP offering Cloud Storage is Amazon S3 (Simple Storage Service). And storage vendors such as EMC and HDS are repackaging and rebranding their storage technologies as object storage, ready for the cloud. EMC Atmos is really a repackaged and rebranded Centera with some slight modifications, while HDS, using their archiving solution, has HCP (aka HCAP). There’s nothing wrong with what EMC and HDS have done, but before the overhyping of the world of Cloud Computing, these platforms were meant for immutable data archiving. Just thought you should know.

One particular company that has captured my imagination and addresses the storage performance portion is Nasuni. They are quite inventive with their Cloud Storage Gateway approach. Nasuni has come up with a Cloud Storage Gateway filer appliance, which can be either a physical 1U server or a VMware or Hyper-V virtual appliance sitting on-premise at the customer’s site.

The key to this is “on-premise”, which allows much faster access to data because it is locally cached in the Nasuni filer appliance itself. This Nasuni filer addresses the Cloud Storage “performance” piece, although Nasuni does not claim any performance statistics for such an implementation. The clever bit is that this addresses data or files that are transactional in nature, i.e. NFS or CIFS, serving them “locally”. (I wonder if the Nasuni filer has iSCSI as well. Hmmmm….)

In the Nasuni architecture, they “break up” their “Cloud Storage” into 2 pieces. Piece #1 sits on-premise, at the customer site, and acts as a bridge to Piece #2, which sits in a Cloud Storage service. For a simplified view, have a look at the diagram below:

 

 

Piece #1 is the component that handles the transactional traffic related to files. In the more technical diagram below, you can see that the Nasuni filer addresses the file sharing portion, using the local disks on the filer appliance as a local caching mechanism.

 

Furthermore, older file pieces are whisked away to any Cloud Storage using the Cloud Connector interface, giving the customer a sense that their storage capacity can be limitless if they want it to be (for a fee, of course). At the same time, the Nasuni filer supports thin provisioning and snapshots. How cool is that!
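Conceptually, the filer behaves like a local cache that tiers cold files off to an object store. The Python sketch below is purely my own simplified illustration of that idea (a made-up CloudGateway class, with an in-memory dict standing in for the cloud bucket), not Nasuni’s code:

import time

class CloudGateway:
    # Hypothetical illustration of a cloud storage gateway: hot files are served
    # from a local cache, cold files are tiered off to the cloud via a "connector".
    def __init__(self, cache_limit_bytes, cloud_store):
        self.cache = {}                 # filename -> (data, last_access_time)
        self.cache_limit = cache_limit_bytes
        self.cloud = cloud_store        # stand-in for S3/Azure/etc. behind the Cloud Connector

    def write(self, name, data):
        self.cache[name] = (data, time.time())
        self._evict_cold_files()

    def read(self, name):
        if name in self.cache:          # hot file: served locally at LAN speed
            data, _ = self.cache[name]
        else:                           # cold file: recalled from the cloud at WAN speed
            data = self.cloud[name]
        self.cache[name] = (data, time.time())
        self._evict_cold_files()
        return data

    def _evict_cold_files(self):
        # Push the least recently used files out to the cloud until the cache fits.
        while sum(len(d) for d, _ in self.cache.values()) > self.cache_limit:
            coldest = min(self.cache, key=lambda n: self.cache[n][1])
            data, _ = self.cache.pop(coldest)
            self.cloud[coldest] = data  # the older file pieces end up in Cloud Storage

gateway = CloudGateway(cache_limit_bytes=10_000, cloud_store={})
gateway.write("report.doc", b"x" * 8_000)
gateway.write("archive.zip", b"y" * 8_000)   # pushes the older file off to the cloud
print(gateway.read("report.doc")[:1])        # transparently recalled from the cloud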

The Cloud Storage piece (Piece #2) is used as the data container and for archiving. This component can be hosted at Amazon S3, Microsoft Azure, Rackspace Cloud Files, Nirvanix Storage Delivery Network or the Iron Mountain Archive Services Platform.

The data communication and transfer between the Nasuni filer and the cloud is encrypted, deduplicated and compressed, giving it the efficiency and security that most customers would be concerned about. The diagram below explains the data communication and transfer bit.

 

In this manner, the Nasuni filer can take on the role of traditional NAS platforms and can potentially provide a much lower total cost of ownership (TCO) in the long run, even though Nasuni does not pretend to be a NAS replacement. To me, this concept is very inventive and could potentially change the way we perceive file sharing and file servers, obscuring and blurring the concept of NAS.

Again, I would like to reiterate that Nasuni does not claim their solution is a NAS or performance-based Cloud Storage, but what they have cleverly packaged seems to be appealing to customers. Their customer base grew 78% in Q2 of 2011. It’s just too bad they are not here in Malaysia or this part of the world (yet).

IOPS in Cloud Storage? Not yet.

 

Betcha don’t encrypt your disks

At the Internet Alliance event this morning, someone from Computerworld gave me a copy of their latest issue. The headline was “Security Incidents Soar”, with the details of the half-year review by CyberSecurity Malaysia.

Typically, the usual incident list revolves around spam, intrusions, fraud, viruses and so on. However, storage always seems to be missing. As I see it, storage security doesn’t sit well with the security guys. In fact, storage is never the sexy thing; it is usually the IPS, IDS, anti-virus and firewall that get the highlights. So, when we talk about storage security, there is so little to talk about. In fact, in my almost 20 years of experience, storage security was only brought up ONCE!

In security, the most valuable asset is data, and no matter where the data goes, it always lands on …. STORAGE! That is why storage security could be one of the most overlooked pieces in security. Fortunately, SNIA already has this covered. In SNIA’s Solid State Storage Initiative (SSSI), one aspect that was worked on was Self-Encrypting Drives (SED).

SED is not new. As early as 2007, Seagate was already marketing encrypted hard disk drives. In 2009, Seagate introduced enterprise-level encrypted hard disk drives. And not surprisingly, other manufacturers followed. Today, Hitachi, Toshiba, Samsung and Western Digital all have encrypted hard disk drives.

But there were prohibitive factors that dampened the adoption of self-encrypting drives. First of all, there was the cost; they were expensive a few years ago. There was (and still is) a lack of knowledge about the differences between hardware-based Self-Encrypting Drives (SED) and software-based encryption. And as SEDs were manufactured, some vendors had proprietary implementations that did not help promote the adoption of SEDs.

As data travels from one infrastructure to another, data encryption can be implemented at different points. As the diagram below shows,

 

encryption can be put in place at the software level, the OS level, at the HBA, or in the network itself. It can also happen at the switch (network or fabric), at the storage array controller or at the hard disk level.

EMC’s multipathing software, PowerPath, has an encryption facility to ensure that data is encrypted on its way from the HBA to the EMC CLARiiON storage controllers.

The “bump-in-the-wire” appliance is a bridge device that applies encryption to the data before it reaches the storage. Recall that NetApp had a FIPS 140 certified product called Decru DataFort, which basically encrypted NAS and SAN traffic en route to the NetApp FAS storage array.

And according to SNIA SSSI member Tom Coughlin, SED makes more sense than software-based security. How does SED work?

First of all, SED works with 2 main keys:

  • Authentication Key (AK)
  • Drive Encryption Key (DEK)

The DEK is the most important component, because it is a symmetric key that encrypts and decrypts data on the HDDs or SSDs. This DEK is not for any Tom (sorry Tom), Dick and Harry. In order to gain access to the DEK, one has to be authenticated, and the authentication is completed by having the right Authentication Key (AK). Usually the AK is based on 128/256-bit AES or DES, and the DEK is of a higher bit range. The diagram below shows the AK and DEK in action:
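While the diagram shows the flow visually, here is a rough conceptual sketch of the AK/DEK relationship in Python. This is only my own illustration of the idea, with the cryptography library’s Fernet standing in for the drive’s hardware crypto engine; real SED firmware obviously does not work this way:

# Requires the third-party "cryptography" package (pip install cryptography).
import base64, hashlib, os
from cryptography.fernet import Fernet, InvalidToken

dek = Fernet.generate_key()       # DEK: the symmetric key that actually encrypts the media

salt = os.urandom(16)
def derive_ak(passphrase: str) -> bytes:
    # AK: derived from the user's credential; it only unlocks (unwraps) the DEK,
    # it never touches the user data directly.
    raw = hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 200_000)
    return base64.urlsafe_b64encode(raw)

wrapped_dek = Fernet(derive_ak("correct horse battery staple")).encrypt(dek)

media = Fernet(dek).encrypt(b"confidential payroll data")      # writing data to the "platters"

# Reading data: authenticate, unwrap the DEK, then decrypt the media.
unwrapped = Fernet(derive_ak("correct horse battery staple")).decrypt(wrapped_dek)
print(Fernet(unwrapped).decrypt(media))                        # b'confidential payroll data'

try:
    Fernet(derive_ak("wrong passphrase")).decrypt(wrapped_dek)
except InvalidToken:
    print("authentication failed -- the DEK is never released")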

Because SED encryption occurs at the drive level, it is significantly simpler to implement, with lower costs as well. For software-based encryption, one has to set up some form of security architecture; IPSec comes to mind. This is not only more complex, but also more costly to implement. And since it is software, the degree of potential compromise is higher, meaning the security model is less secure when compared to SED. The DEK of an SED does not leave the drive, and if the DEK is implemented within the disk enclosure or the security module of an SoC (System-on-Chip), this makes it even more secure than software-based encryption. Also, the DEK stays away from the CPU and memory, thus removing these components as a potential attack vector that could compromise the data on the disk drives.

Furthermore, software-based encryption takes up CPU cycles, thus slowing down overall performance. In the Tom Coughlin study, based on both SSDs and HDDs, SED outperformed software-based encryption every time. Here’s a table from that study:

Another security concern is data erasure. According to an old IBM study, about 90% of retired HDDs still have readable data on them. That means the data erasure techniques used are either not implemented properly or simply not good enough. For us in the storage industry, an effective but time-consuming technique is to overwrite the entire disk with 1s before reusing it. But to hackers, there are ways to “undelete” these bits and make the data readable again.

SED provides crypto erasure that is both effective and very quick. Since the data encryption key (DEK) is used to encrypt and decrypt the data, the DEK can be changed and renewed in a split second, making the content of the disk drive unreadable. The diagram below shows how crypto erasure works:
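In code form, the same idea looks something like this (again only my own conceptual sketch, with the cryptography package standing in for the drive’s crypto engine):

from cryptography.fernet import Fernet, InvalidToken

old_dek = Fernet.generate_key()
media = Fernet(old_dek).encrypt(b"old customer records")   # data written under the old DEK

new_dek = Fernet.generate_key()     # "crypto erase": discard the old DEK, mint a new one

try:
    Fernet(new_dek).decrypt(media)
except InvalidToken:
    print("media is now cryptographically unreadable")     # instant, no overwriting needed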

Data security is already at its highest alert and SEDs are going to be a key component in the IT infrastructure. The open, common standards are coming together, thanks to the efforts of many bodies including SNIA. At the same time, product certifications are coming up and, more importantly, the price of SEDs has come down to the level where it is almost on par with normal, non-encrypted drives.

Hackers and data thieves are getting smarter all the time and yet, the security of the most important place where the data rests is the least considered. SNIA and other bodies hope to create more awareness and seek greater adoption of self-encrypting drives. We hope you will help spread the word too. Betcha thinking twice now about encrypting the data on your disk drives.

Performance benchmarks – the games that we play

First of all, congratulations to NetApp for beating EMC Isilon in the latest SPECsfs2008 benchmark for NFS IOPS. The news is everywhere and here’s one here.

EMC Isilon was blowing its horn several months ago when it hit 1,112,705 IOPS from a 140-node S200 cluster with 3,360 disk drives and an overall response time of 2.54 msecs. Last week, NetApp became top dog, pounding its chest with 1,512,784 IOPS on 24 FAS6240 nodes with an overall response time of 1.53 msecs. There were 1,728 450GB, 15,000 RPM disk drives, and the FAS6240s were fitted with Flash Cache.

And with each benchmark that you and I have seen before and will see again, every storage vendor tries to best the others, and when they do, the horns blare, the fireworks come out and they pound their chests like Tarzan, saying “Who’s your daddy?” The euphoria usually doesn’t last long, as performance records are broken all the time.

However, the performance benchmark results are not to be taken verbatim because they are not true representations of real-life, production environments. 2 years ago, the now-defunct magazine Byte and Switch (which is now part of Network Computing) reviewed a 9-year study on file systems and storage benchmarking. In a very interesting manner, it revealed that a lot of the time, benchmark results are reduced to single graphs with little information about how the benchmark was conducted, how long it took and so on.

The paper, published by Avishay Traeger and Erez Zadok from Stony Brook University and Nikolai Joukov and Charles P. Wright from the IBM T.J. Watson Research Center and entitled “A Nine Year Study of File System and Storage Benchmarking”, studied 415 file systems from 106 published results, and the article quoted:

Based on this examination the paper makes some very interesting observations and 
conclusions that are, in many ways, very critical of the way “research” papers have 
been written about storage and file systems.

 

Therefore, the paper highlighted the way the benchmarks were done and the way the results were reported, and judging by the strong title of the online article that reviewed the study (it was titled “Lies, Damn Lies and File Systems Benchmarks”), benchmarks are not the pictures that say a thousand words.

Be it TPC-C, SPC-1 or SPECsfs benchmarks, I have gone through some interesting experiences myself, and there are certain tricks of the trade, just like in a magic show. Some of the very common ones I come across are:

  • Short stroking – a method to format a drive so that only the outer sectors of the disk platter are used to store data. This practice is done in I/O-intensive environments to increase performance.
  • Shortened test – performance tests that run for only several minutes to achieve the numbers rather than for prolonged periods (which would mimic real life)
  • Reporting aggregated numbers – note the number of nodes or controllers used to achieve the numbers. It is not ONE controller that achieves the numbers, but an aggregated performance result factored by the number of controllers

Hence, getting to the published benchmark numbers in real life is usually not practical and very expensive. But unfortunately, customers are less educated about the way benchmarks are performed and published. We, as storage professionals, have to disseminate this information.

Ok, this sounds oxymoronic, because if I were working for NetApp, why would I tell a truth that could actually hurt NetApp sales? But I don’t work for NetApp now and I think it is important for me to do my duty and share more information. Either way, many people switch jobs every now and then, so if you want to keep your reputation, be honest up front. It could save you a lot of work.

NFS version 4

NFS has been around for almost 20 years and is the de facto network file sharing protocol for Unix and Linux systems. There has been much talk about NFS, especially versions 2 and 3, and IT guys used to joke that NFS stands for “No F*king Security”.

NFS version 4 was different, borrowing ideas from Windows CIFS and Carnegie Mellon’s Andrew File System. And from its inception 11 years ago in IETF RFC 3010 (revised in 2003 with IETF RFC 3530), the notable new features of NFSv4 are:

Performance enhancement – One key enhancement is the introduction of the COMPOUND RPC procedure, which allows the NFS client to group a bunch of file operations into a single request to the NFS server. This not only reduces the network round-trip latency, but also cuts down on the chatter of many small file operations.
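To get a feel for why this matters, here is a toy Python sketch of the round-trip savings. The list of operations and the 0.5 ms round-trip time are made-up numbers for illustration, not a real NFS client:

# A toy illustration (not real NFS code) of why COMPOUND helps: the same file
# access issued as one RPC per operation versus one batched COMPOUND request.
RTT_MS = 0.5    # assumed network round-trip time per request

ops = ["LOOKUP", "GETATTR", "ACCESS", "READ"]   # typical small operations for one file access

# One request/response round trip per operation (the NFSv3-style pattern).
one_at_a_time_ms = len(ops) * RTT_MS

# NFSv4: the client packs the whole sequence into a single COMPOUND request.
compound_ms = 1 * RTT_MS

print(f"one RPC per op : {len(ops)} round trips, {one_at_a_time_ms:.1f} ms of network latency")
print(f"single COMPOUND: 1 round trip, {compound_ms:.1f} ms of network latency")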

 

Removing the multiple daemons in NFSv3 – NFSv3 uses various daemons/services and various protocols to do the work. There is portmapper (TCP/UDP 111), which provides the port numbers for the mount and NFS services. There’s the mountd service (on an arbitrary port assigned by portmapper) that does the mounting of NFS exports for the NFS clients. There’s nfsd on TCP/UDP 2049, and so on. The command ‘rpcinfo -p’ below shows all the ports and services related to NFS.

 

There are other features as well such as

Firewall friendly – The use of portmapper dishing out arbitrary ports made life difficult for the firewall. NFSv4 changed that by consolidating most of the TCP/IP services into well-known ports which the security administrator can define in the firewall.

 

Stateful – NFSv3 is stateless; it does not maintain the state of the NFS clients. NFSv4 is stateful and implements mandatory locking and delegation mechanisms. Leases for locks, granted from the servers to the clients, were introduced. A lease is a time-bounded grant of control over the state of a file, and this is implemented through locks.

Mandated Strong Security Architecture – NFSv4 requires the implementation of a strong security mechanism that is based on cryptography. Previously the strongest security flavour was AUTH_SYS, which is a level 1 clearance:

 "The AUTH_SYS security flavor uses a host-based authentication model
where the client asserts the user's authorization identities using small
integers as user and group identity representations (this is EXACTLY how NFSv3
authenticates by default). Because of the small integer authorization ID
representation, AUTH_SYS can only be used in a name space where all clients and
servers share a uidNumber and gidNumber translation service. A shared translation
service is required because uidNumbers and gidNumbers are passed in the RPC
credential; there is no negotiation of namespace in AUTH_SYS."

The NFSv4 security mechanism is based on RPCSEC_GSS, a level 6 clearance. RPCSEC_GSS is an API that is more than an authentication mechanism; it performs integrity checksums and encryption on the entire RPC request and response operations. This is further strengthened with the integration of Kerberos v5 for user authentication, quite similar to the Windows CIFS Kerberos implementation, providing a time-based ticket to gain authentication.

In addition to that, there are many other cool, new features in NFSv4. There was a further extension to NFSv4 last year, in 2010, when NFSv4.1 was added in IETF RFC 5661. As quoted in Wikipedia, NFSv4.1 “aims to provide protocol support to take advantage of clustered server deployments including the ability to provide scalable parallel access to files distributed among multiple servers (pNFS extension)”.

NFSv4 has much to offer. The future is bright.

NFS deserves more credit from guys doing virtualization

I was at the RedHat Forum last week when I chanced upon a conversation between an attendee and one of the ECS engineers. The conversation went like this:

Attendee: Is the RHEV running on SAN or NAS?

ECS Engineer: Oh, for this demo, it is running NFS, but in production, you should run iSCSI or Fibre Channel. NFS is for labs only, not good for production.

Attendee: I see … (and he went off)

I was standing next to them munching my mini-pizza, thinking to myself, “Oh, come on, NFS is better than that!”

NAS has always played the smaller brother to SAN, but usually for the wrong reasons. Perhaps it is the perception that NAS is low-end and not good enough for high-end production systems. However, this is very wrong, because NAS has been growing at a faster rate than Fibre Channel, while Fibre Channel growth has been tapering off and is possibly on the wane. And I have always said that NAS is the better suited protocol when it comes to unstructured data and files, because the NAS protocols are the new storage networking currency of Internet storage and the Cloud (this could change very soon with the REST protocol, but that’s another story). Where else can you find a protocol where sharing is key? iSCSI, even though it has been growing at a faster pace in production storage, cannot be shared easily because it is block-based.

Now back to NFS. NFS version 3 has been around for more than 15 years and has taken its share of bad raps. I agree that this protocol is still very much the one found in most NFS installations. But NFS version 4 is changing all that, taking on the better parts of the CIFS protocol, notably the equivalent of opportunistic locking, or oplocks. In addition to that, it has greatly enhanced its security, incorporating Kerberos-type authentication. As for performance, NFSv4 added the COMPOUND procedure for aggregating operations into a single request.

Today, most virtualization solutions from VMware and RedHat work with NFS natively. Note that the Windows CIFS protocol is not supported, only NFS.

This blog entry is not stating that NFS is better than iSCSI or FC, but giving NFS credit where credit is due. NFS is not inferior to these block-based protocols. In fact, there are situations where NFS is better, for instance, expanding an NFS-based datastore on the fly in a VMware implementation. I will use several performance-related examples since performance is often used as a yardstick when these protocols are compared.

In an experiment conducted by VMware based on vSphere 4.0, with all things being equal, below is a series of graphs that compares the 3 protocols (NFS, iSCSI and FC). Note that the comparison is mainly between NFS and iSCSI rather than FC, because NFS and iSCSI run on Gigabit Ethernet, whereas FC is on a different networking platform (hey, if you’ve got the money, go ahead and buy FC!).

Based on one virtual machine (VM), the read throughput statistics (higher is better) are:

 

The red circle shows that NFS is up there with iSCSI in terms of read throughput from 4K blocks to 512K blocks. As for write throughput for 1 VM, the graph is shown below:


Even though NFS suffers in write throughput for the smaller blocks, below 16KB, NFS write throughput improves over iSCSI in the 16K to 32K range and is equal in the 64K, 128K and 512K block tests.

The 2 graphs above are of a single VM. But in most real production environments, a single ESX host will run multiple VMs, and here is the throughput graph for multiple VMs.

Again, you can see that in a multiple-VM environment, NFS and iSCSI are equal in throughput, dispelling the notion that NFS is not as good in performance as iSCSI.

Oh, you might say that these are just VMs without any OSes or applications running in them. Next, I want to share with you another performance test conducted by VMware for a Microsoft Exchange environment.

The next statistics are produced from an Exchange Load Generator (popularly known as LoadGen) to simulate the load of 16,000 Exchange users running in 8 VMs. With all things being equal again, you will be surprised after you see these graphs.

The graph above shows the average send mail latency of the 3 protocols (lower is better). On the average, NFS has lower latency than iSCSI, better than what most people might think. Another graph shows the 95th percentile of send mail latency below:

 

Again, you can see that NFS’s latency is lower than iSCSI’s. Interesting, isn’t it?

What about IOPS then? In another test with an 8-hour DoubleHeavy LoadGen simulator, the IOPS graphs for all 3 protocols are shown below:

In the graph above (higher is better), NFS performed reasonably well compared to the other 2 block-based protocols, even outperforming iSCSI in this 8-hour load test. Surprising, huh?

As I have shown, NFS is not inferior to block-based protocols such as iSCSI. In fact, VMware in version 4.1 has improved all 3 storage protocols significantly, as mentioned in the VMware paper. The following is quoted in the paper for NFS and iSCSI:

  1. Using storage microbenchmarks, we observe that vSphere 4.1 NFS shows improvements in the range of 12–40% for Reads, and improvements in the range of 32–124% for Writes, over 10GbE.
  2. Using storage microbenchmarks, we observe that vSphere 4.1 Software iSCSI shows improvements in the range of 6–23% for Reads, and improvements in the range of 8–19% for Writes, over 10GbE.

The performance improvement for NFS is significant when the network infrastructure is 10GbE. The percentage jump is between 32% and 124%! That’s a whopping figure compared to iSCSI, which ranged from 8% to 19%. Since both protocols were neck and neck in version 4.0, NFS seems to be taking a bigger lead in version 4.1. With the release of VMware version 5.0 a few weeks ago, we shall know the performance of both NFS and iSCSI soon.

To be fair, NFS does take a higher CPU performance hit compared to iSCSI as the graph below shows:

Also note that the load tests are based on NFS version 3. If version 4 were used, I am sure the performance statistics above would reach a whole new plateau.

Therefore, NFS isn’t inferior at all compared to iSCSI, even in a 10GbE environment. We just have to know the facts instead of brushing off NFS.

Having fun with your storage vendor and getting the information to fit your data center

I was on my way to Singapore yesterday. At the departure lounge, I started reading “Data Center Storage” by Hubbert Smith (ISBN: 978-1439834879) and I immediately learned something very interesting. My thoughts started stirring and I decided to have a bit of fun with what I had learned from the book.

The single most significant piece of the storage solution is the hard disk drive (HDD). Regardless of SAN or NAS protocols, the data is stored on and served from the hard disk drives. And there are 4 key metrics of an HDD, which are:

  • Price
  • Performance
  • Capacity
  • Power

As storage professionals, we are often challenged to deliver the best storage solution to meet the customer’s requirements. Therefore, it is not just about providing the fastest IOPS or the best availability or the lowest price. It is about providing the best balance of the 4 key metrics above.

The 4 metrics are of little help on their own, but when they are combined in relation to each other, you as a customer can obtain some measurable ratios that will be useful for sizing a requirement, keeping the balance of the 4 key metrics better defined rather than getting fluff and BS from the storage vendor.

In the book, the following table was displayed and I found it to be extremely useful:

Key Ratios for HDDs       Ratio
Performance/Price         IOPS/$
Performance/Power         IOPS/watt
Capacity/Price            GB/$
Capacity/Power            GB/watt

The relational ratios in red are going to be useful in determining the right type of storage for the requirement, and we will come back to this later. We begin our quest to obtain the information that we want – Performance, Capacity, Price and Power.

Capacity is the easy one because the size of the HDD is a given.

IOPS for each type of HDD is also easy to obtain. See the table below:

Disk Type    RPM       IOPS Range
SATA         5,400     50-75
SATA         7,200     75-100
SAS/FC       10,000    100-125
SAS/FC       15,000    175-200
SSD          N/A       5,000-10,000

The wattage of each HDD is also quite easy to obtain. Just ask the vendor for the specifications of the HDDs.

The pricing part is where we can have a bit of fun with the storage vendor. Usually, storage vendors do not release the price of a single HDD in the quotation. The total price is lumped together with everything else, making it harder to decipher the price of the drives. So, what can the customer do?

Easy. Get 4-5 quotations from the storage vendor, each with a different type of HDD. That is the customer’s right. For example, I have created several fictitious quotations, each with a different type of HDD/SSD and pricing.

Quote #1 (SATA 7200 RPM)

 

Quote #2 (SAS 10,000 RPM)

 

Quote #3 (SAS 15,000 RPM)

 

Quote #4 (SSD)

 

From the 4 quotations, we cannot ascertain the true price of a single disk, but we can assume that the 12 units of HDDs/SSDs take up 50% of the entire quotation. With all things being equal, especially the quantity of 12, we can establish a very rough estimate of the price. Having fun making the storage vendor run around with the quotations is the added bonus.

We can then derive the following per-disk figures (rough estimates, but useful when we apply them to the key ratios above):

1TB SATA = 3333.33; 300GB 10,000 RPM SAS = 5000.00; 300GB 15,000 RPM SAS = 6250.00; 100GB SSD = 10416.66

When we juxtapose the information that we have collected, i.e. price, performance and capacity (ok, I am skipping power/watt because I am too lazy to find out), we come up with the table below:

In the boxed area, we can now easily determine which HDDs/SSDs give the best value for money, either Performance/$ or Capacity/$. The higher the key ratio, the better the value.

From this aspect, the customer can now determine methodically which type of disk to invest in, in order to get the best value.
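For those who want to punch in their own numbers, below is a minimal Python sketch of the same exercise. The per-disk prices are the rough estimates derived above, and the IOPS figures are simply the midpoints of the ranges in the earlier table, so treat the output as illustrative only:

# A minimal sketch of the key ratios, using the rough per-disk price estimates
# above (50% of each fictitious quote spread across 12 drives) and the midpoint
# of the IOPS ranges from the earlier table.
drives = {
    # drive type         (capacity_gb, est_price_per_disk, midpoint_iops)
    "1TB SATA 7.2k RPM":  (1000,        3333.33,              87),
    "300GB SAS 10k RPM":  (300,         5000.00,             112),
    "300GB SAS 15k RPM":  (300,         6250.00,             187),
    "100GB SSD":          (100,        10416.66,            7500),
}

print(f"{'Drive':<20}{'IOPS/$':>10}{'GB/$':>10}")
for name, (gb, price, iops) in drives.items():
    print(f"{name:<20}{iops / price:>10.4f}{gb / price:>10.4f}")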

This is just a very simplistic method to find the value of the storage solution to be purchased. Bear in mind that there are many other factors to consider as well, such as rack unit height, total power consumption, storage efficiency, data protection and many more.

I am not taking credit for what Hubbert Smith has proposed. All kudos to him, but I am using his method and applying it to what is relevant to us in the field.

In conclusion, the customer won’t be baffled and confused into thinking that they got the best deal at the lowest price or the fastest performance. This crude method can help turn perception into something more concrete and analytical. It’s time we, as customers, know our rights, know what we are buying into, and have a bit of fun with the storage vendor too.

What kind of IOPS and throughput do you get from RAID-5/6? – Part 2

In my previous blog entry, I mentioned the write penalty for RAID-5/6. This factor will figure heavily in the way we size the RAID-level for performance capacity planning.

It is difficult to ascertain what kind of IOPS and throughput are required for an application, especially a database, to run well with additional room to grow. A DBA or an application developer, I believe, would have adequate information to tell what number of users the application can support (both average and peak), the transactions per second (TPS), the block sizes required for logs, database files and so on.

But as we are all aware, most of the time this information is not readily available. So, coming from a storage angle, the storage administrator can advise the DBA or the application developer that the configured RAID group or volume or LUN is capable of delivering a certain number of IOPS and able to achieve a certain throughput in MB/sec. These numbers come straight off the box. Of course, other factors such as HBA speed, the FC/iSCSI configuration, the network traffic and so on will affect the overall performance delivered to the application. But we can safely inform the DBA and/or the application developer that this is what the storage delivers out of the box.

The building blocks of all storage RAID groups/volumes/LUNs are pretty much your hard disk drives (HDDs) and/or solid state drives (SSDs). The manufacturers of these disks will usually publish the IOPS and throughput of individual drives, but if this information is not available, we can estimate the IOPS of an individual HDD from its seek and latency times.

For example, if the HDD’s

average latency = 2.8 ms;          average read seek = 4.2 ms;              average write seek = 4.8 ms

then the IOPS can be calculated as

                                  1
         IOPS = ---------------------------------------
                (average latency) + (average seek time)

Therefore, from the details above (taking the average seek time as (4.2 + 4.8) / 2 = 4.5 ms),

                    1
         IOPS = -------------------  = 136.986 IOPS
                (0.0028) + (0.0045)
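If you prefer to let a script do the arithmetic, here is the same estimate as a small Python helper (my own sketch, averaging the read and write seek times as above):

# Estimate a single drive's IOPS from its average rotational latency and the
# average of its read/write seek times.
def disk_iops(avg_latency_ms, avg_read_seek_ms, avg_write_seek_ms):
    avg_seek_ms = (avg_read_seek_ms + avg_write_seek_ms) / 2
    service_time_sec = (avg_latency_ms + avg_seek_ms) / 1000.0
    return 1.0 / service_time_sec

print(f"{disk_iops(2.8, 4.2, 4.8):.3f} IOPS")   # 136.986 IOPS, as worked out above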

That’s pretty simple, right? But of course, it is easier to just accept that a certain type of disk will have a range of IOPS, as shown in the table below:

Disk Type    RPM       IOPS Range
SATA         5,400     50-75
SATA         7,200     75-100
SAS/FC       10,000    100-125
SAS/FC       15,000    175-200
SSD          N/A       5,000-10,000

The information in the table above is for reference only and is by no means very accurate, but it is good enough for us to determine the IOPS of a RAID group/volume/LUN. Let’s look at the RAID write penalty again in the table below:

RAID-level      Number of I/O for Reads    Number of I/O for Writes    RAID Write Penalty
0               1                          1                           1
1 (1+0, 0+1)    1                          2                           2
5               1                          4                           4
6               1                          6                           6

Next, we need to know the ratio of Reads vs Writes for that particular database or application. I mentioned earlier that in OLTP-type applications, we usually take a 2:1 or 3:1 ratio in favour of Reads.

To make things simpler, let’s assume we create a RAID-6 volume of 6 data disks and 2 parity disks in a RAID-6 (6+2) configuration. The disks used are 7,200 RPM SATA disks, each individually capable of about 100 IOPS. Assume we are using a ratio of 2:1 in favour of Reads, which gives us 66.666% Reads and 33.333% Writes.

Therefore, the combined raw IOPS of the 8 disks in the RAID-6 configuration is about 800 IOPS. However, because of the write penalty of RAID-6, the effective IOPS for the RAID-6 volume will be lower than that. Let’s do some calculations to see what happens:

1)  Read IOPS + Write IOPS = 800 IOPS

2)  (0.66666 x 800) + (0.33333 x 800) = 800 IOPS

3) Read IOPS will be 0.66666 x 800 = 533.328 IOPS

4) Write IOPS will be 0.33333 x 800 = 266.664 IOPS. However, since RAID-6 has a write penalty of 6, this number has to be divided by 6. 266.664/6 will be 44.444 IOPS for Writes

Therefore, what the RAID-6 volume is capable of is approximately 533 IOPS for Reads and 44 IOPS for Writes.

We have determined the IOPS for the RAID volume, but what about throughput? Throughput is determined by the block size used. Assume that our RAID-6 volume uses a 4K block size. With a combined effective IOPS of 577 (533 + 44), we multiply the IOPS by the block size:

     Throughput = 577 IOPS x 4-KB
                = 2308KB/sec

Therefore when I/O is sustained in a sequential manner, the effective throughput is 2308KB/sec.
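Putting the whole back-of-the-envelope calculation into a few lines of Python (my own sketch of the simplistic method above, using the same write penalty table and the 2:1 read/write mix):

# Raw IOPS of the RAID group, split by the read/write mix, with the write IOPS
# divided by the RAID write penalty, then throughput for a given block size.
WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-5": 4, "RAID-6": 6}

def raid_effective_iops(disks, iops_per_disk, read_fraction, raid_level):
    raw = disks * iops_per_disk
    read_iops = raw * read_fraction
    write_iops = raw * (1 - read_fraction) / WRITE_PENALTY[raid_level]
    return read_iops, write_iops

# RAID-6 (6+2) with 7,200 RPM SATA disks at ~100 IOPS each, 2:1 Reads vs Writes
reads, writes = raid_effective_iops(disks=8, iops_per_disk=100,
                                    read_fraction=2 / 3, raid_level="RAID-6")
print(f"Reads : {reads:.0f} IOPS")      # ~533 IOPS
print(f"Writes: {writes:.0f} IOPS")     # ~44 IOPS

block_size_kb = 4
print(f"Throughput: {(reads + writes) * block_size_kb:.0f} KB/sec")   # ~2311 KB/sec, in line with the 2308 KB/sec above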

On the other hand, we are often told to add more spindles to the volume to increase the IOPS. This is true, to a point, where the maximum IOPS that can be delivered tapers into a flatline because the I/O channel to the RAID volume has been saturated. Therefore, it is best to know that adding more spindles does not always equate to higher IOPS.

Performance sizing for a database or an application is both a science and an art. Mathematically, we can prove things to a certain degree of accuracy and confidence, but each storage platform is very different in the way it handles RAID. Newer storage platforms have proprietary RAID implementations, so nowadays it matters less what kind of RAID is best for the application. Vendors such as IBM, with XIV, have RAID-X, which is radical in both design and implementation, while NetApp will almost always say RAID-DP is the best no matter what, because RAID-DP is all NetApp.

So there is no right or wrong choice of RAID level for the application. But it is VERY important to know what the best practices are, and my advice to everyone is to do Proof-of-Concepts, and TEST, TEST, TEST! And ASK QUESTIONS!

SSDs coming into mainstream … be Ready!

There has been a slew of SSD news in the storage blogosphere with the big one from eBay.

eBay has just announced that it has 100TB of SSDs from Nimbus Data Systems. On top of that, OCZ, SanDisk and STEC, all major SSD manufacturers, have announced a whole lot of new products, with PCIe SSD cards leading the way. The most interesting thing is that the $/GB factor has gone down significantly, getting very close to the $/GB of spinning disks. This is indeed good news to the industry because SSDs deliver low latency, high IOPS, low power consumption and many other new benefits.

Side note: As I am beginning to understand more about SSDs, I found out that NAND flash SSDs have latency in the microseconds, compared to spinning HDDs, which have latency in the milliseconds range. In addition to that, DRAM SSDs have latency in the range of nanoseconds, which is basically memory-type access. DRAM SSDs are, of course, more expensive.

SSDs are coming into the mainstream very soon, and this will inevitably drive a new generation of applications and accelerate growth in knowledge acquisition. We are already seeing the decline of Fibre Channel disks and the rise of SAS and SATA disks, but SSDs in enterprise storage, as far as I am concerned, bring forth 2 new challenges which we, as professionals and users in the storage networking environment, must address.

These challenges can be simplified to

  1. Are we ready?
  2. Where is the new bottleneck?

To address the first challenge, we must understand the second challenge first.

In system architectures, we know of various performance bottlenecks that exist in the CPU, memory, bus, bridge, buffer, I/O devices and so on. In order to deliver the data to be processed, we have to view the data block/byte service request in its entirety.

When a user requests a file, this is a service request. The end objective is that the user is able to read and write the file he/she requested. The time taken from the beginning of the request to the end of it is known as the service time, of which latency plays a big part. We assume that the file resides on a NAS system in the network.

The request for the file begins by going through the file system layer of the host the user is accessing, then through user and kernel space, moving on through the device driver of the NIC, through the TCP/IP stack (which has its own set of buffers, overheads and so on), and across the physical wire. From there it moves on through the NAS system, with its RAID system, file system and so on, until it reaches the requested file. Note that I have shortened the entire process for simple explanation, but it shows that the service request passes through a whole lot of things in order to be completed.

Bottlenecks exist everywhere along the service request path, and the request is also subject to external factors. For a long, long time, I/O has been the biggest bottleneck in the processing of the service request because it is usually, and almost always, the slowest component in the entire scheme of things.

The introduction of SSDs will improve I/O performance tremendously, into the micro- or even nanosecond range, putting it in almost equal performance terms with the other components in the system architecture. The buses and the bridges in the computer systems could become the new locations where the bottleneck of a service request exists. Hence we have to use this understanding to change the modus operandi of existing types of applications such as databases, email servers and file servers.

The usual tried-and-tested best practices may have to be changed to adapt to the shift of the bottleneck.

So, we have to equip ourselves with what SSDs are doing and will do to the industry. We have to be ready and take advantage of this “quiet” period to learn and know more about SSD technology and what the experts are saying. I found a great website that introduces and discusses SSDs in depth. It is called StorageSearch and it is what I consider the best treasure trove on the web right now for SSD information. It is run by a gentleman named Zsolt Kerekes. Go check it out.

Yup, we must get ready for when SSDs hit the mainstream, and ride the wave.