Don’t just look at disk reliability!

I am sure that many of you in the storage networking industry can relate to this very well.

When one or two disk drives fail, the customer will usually press you for an answer, and usually this question will pop up: “How come the MTBF is 1.5 million hours but the drive(s) failed after a few months?” We also get asked, “How reliable are the disks?” and “How sure are you that the storage disks I buy will last?”

And for us in this line, we cannot deny the fact that the customer should be better informed (or at least we get cheesed off by these questions). A few blogs ago, I took the easy way out and educated the customer about MTBF (Mean Time Between Failures). That is only a quarter of the story, because MTBF alone does not determine the reliability of the storage ecosystem. The reliability of the storage ecosystem (which translates to data availability) is what the customer should be asking about, rather than spending their time venting their annoyance at you over one or two disk failures.

I also want to say a little about another disk reliability statistic called AFR. More about that later.

Let’s get a little deeper into disk MTBF. Disk MTBF is a statistically calculated, pre-production measurement. The key word here is “PRE”, meaning that THIS IS NOT A FIELD-TESTED statistic! It is a statistical projection of how long a disk device will last, not a measurement of how long it actually did.
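To put a rough number on the customer’s “1.5 million hours” question: if we assume a constant failure rate (the simple exponential model, which is an assumption on my part and not something the drive vendors promise), an MTBF figure can be turned into an implied chance of failure per year. A minimal Python sketch, where the function name `mtbf_to_afr` is made up for illustration:

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def mtbf_to_afr(mtbf_hours: float) -> float:
    """Annual failure probability implied by an MTBF.

    Assumes a constant failure rate (exponential model), which is a
    simplification and not a vendor-published method.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


afr = mtbf_to_afr(1_500_000)
print(f"Implied AFR: {afr:.2%}")  # 0.58%
print(f"Expected failures per year in 1,000 drives: {1000 * afr:.1f}")  # 5.8
```

So even at a 1.5 million hour MTBF, a site running 1,000 drives should expect about six failures a year under this model. MTBF describes a population of drives, not the lifespan of any single one.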

One thing to note is how MTBF is derived. MTBF is established before the disk drive line goes into volume production. Typically, there is a process called the Reliability Demonstration Test (RDT). RDT involves putting about 1,000 or more drives into a testing chamber and running them very hard, at elevated temperatures with 100% I/O, for about 6-8 weeks. This simulates the harshest of operating environments and, inevitably, some disk drives will fail. From these failures, the MTBF is calculated.
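The exact calculation is vendor-specific and I won’t pretend to know it, but the textbook point estimate is simply the accumulated drive-hours divided by the number of failures, scaled by an acceleration factor to account for the elevated stress. Here is a simplified sketch of that arithmetic. The failure count and the acceleration factor below are made-up numbers for illustration, and `estimate_mtbf` is a hypothetical helper, not anyone’s real tool:

```python
def estimate_mtbf(drives: int, test_hours: float, failures: int,
                  acceleration: float = 1.0) -> float:
    """Point estimate of MTBF from a stress test.

    accumulated (accelerated) drive-hours / observed failures.
    The acceleration factor is an assumed input; real tests derive it
    from temperature and workload models.
    """
    if failures <= 0:
        raise ValueError("need at least one failure for a point estimate")
    return drives * test_hours * acceleration / failures


# Hypothetical run: 1,000 drives for 6 weeks, 2 failures, acceleration factor 3
test_hours = 6 * 7 * 24  # 1,008 hours per drive
print(estimate_mtbf(1000, test_hours, failures=2, acceleration=3.0))  # 1512000.0
```

Notice that no drive in this example ran for more than about 1,000 hours, yet the result is 1.5 million hours. That is why the figure says little about how long one particular drive will live.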

An enterprise hard disk drive’s MTBF will usually be between 1.2 million and 2.0 million hours, while consumer-grade drives usually have an MTBF of about 300,000-600,000 hours. It is therefore important to educate customers, because they like to compare home office/SMB storage solutions with the enterprise storage solution you are about to propose to them.

One of the war stories I heard was from a high-definition video production house. They had a contract worth hundreds of thousands of Malaysian Ringgit from a satellite TV content provider. But being less “educated” (could also be translated to being cheapo), they decided to store their valuable video content on Buffalo NAS storage. And video production environments can be harsh. The I/O stress on the disks is strenuous, and the Buffalo NAS disks crashed. They lost all their content (I don’t know what happened to their backup), were fined hundreds of thousands of Malaysian Ringgit and had their contract terminated on the spot. This is not to say that the Buffalo NAS is a poor product, but they got the wrong product for the job. You can’t expect to race in Formula 1 with an old jalopy, can you? You’ve got to get the right solution for the job, even if it costs more.

So the moral of the story is: “Educate yourself, and be prepared to invest if the dollar value of the data is greater than what you think you might be saving in cost.”

Over the years, MTBF (even though it is still very much in use today) has become less and less useful as a reliability measurement. So, what’s better? AFR!

AFR, or Annualized Failure Rate, has been in use for almost 10 years now, and Seagate, the hard disk manufacturer, uses the AFR value heavily. AFR is the percentage of the installed base of hard disk drives that failed and were returned to the factory in a given year. This is a more realistic figure because it is a statistic from the field. The typical value for enterprise disk drives is usually between 0.7% and 1.0%, although a few years ago Google created a splash in the industry when they reported AFRs well above that in their own drive population, from about 1.7% for drives in their first year to over 8.6% for three-year-old drives. For those who would like to read Google’s paper, look for “Failure Trends in a Large Disk Drive Population” (2007).
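The arithmetic behind a field AFR is as simple as it sounds. A tiny sketch following the definition above, with made-up numbers and a hypothetical function name (`field_afr`); real vendor reporting will have more wrinkles, such as drives that were only in service for part of the year:

```python
def field_afr(failed_and_returned: int, installed_base: int) -> float:
    """Fraction of the installed base that failed and was returned in one year."""
    if installed_base <= 0:
        raise ValueError("installed base must be positive")
    return failed_and_returned / installed_base


# Hypothetical fleet: 10,000 drives in the field, 85 returned as failed this year
print(f"{field_afr(85, 10_000):.2%}")  # 0.85%
```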

Therefore AFR is a more reliable measurement of disk reliability than MTBF.

But disk reliability is just a quarter of the story. We need to be out there educating customers about the reliability of the storage ecosystem rather than of a specific component. Data availability is paramount because components will fail throughout the lifecycle of the solution. That is why there are technologies like RAID, snapshots, backup, mirroring and so on: to ensure that the data remains available so that operations and the business can continue.
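To see why “components will fail” is a certainty rather than bad luck, here is a rough sketch of the chance that at least one drive in a system fails within a year. It assumes drives fail independently, which is optimistic (drives from the same batch in the same chassis often do not), and the 1% AFR and the drive counts are illustrative numbers only:

```python
def p_any_failure(afr: float, drives: int) -> float:
    """Probability that at least one of `drives` fails in a year.

    Assumes independent failures with the same AFR for every drive,
    which is a simplification.
    """
    return 1.0 - (1.0 - afr) ** drives


# Illustrative sizes: one shelf, a few racks, a large installation
for n in (12, 100, 1000):
    print(f"{n:>5} drives: {p_any_failure(0.01, n):.1%}")
```

With a 1% AFR, a single 12-drive shelf already has better than a one-in-ten chance of losing a drive in a year, and at a thousand drives a failure is all but guaranteed. The question to ask is not whether a disk will fail, but whether the data stays available when it does.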

Ultimately, if the customer wants to use the disk MTBF against you, he’s basically shooting at you with the wrong bullet. It’s time you storage networking professionals out there educated the customers.