Data Deduplication – Dell is first and last

A very interesting report surfaced in front of me today. It is Information Week’s IT Pro ranking of Data Deduplication vendors, just made available a few weeks ago, and it is the overview of the dedupe market so far.

It surveyed over 400 IT professionals from various industries with companies ranging from less than 50 employees to over 10,000 employees and revenues of less than USD5 million to USD1 billion. Overall, it had a good mix of respondents. But the results were quite interesting.

It surveyed 2 segments

  1. Overall performance – product reliability, product performance, acquisition costs, operations costs etc.
  2. Technical features – replication, VTL, encryption, iSCSI and FCoE support etc.

When I saw the results (shown below), surprise, surprise! Here’s the overall performance survey chart:

Dell/Compellent scored the highest in this survey while EMC/Data Domain ranked the lowest. However, the difference between the first place and the last place vendor is only 4%, and this is to suggest that EMC/Data Domain was about just as good as the Dell/Compellent solution, but it scored poorly in the areas that matters most to the customer. In fact, as we drill down into the requirements of the overall performance one-by-one, as shown below,

there is little difference among the 7 vendors.

However, when it comes to Technical Features, Dell/Compellent is ranked last, the complete opposite. As you can see from the survey chart below, IBM ProtecTier, NetApp and HP are all ranked #1.

The details, as per the technical requirements of the customers, are shown below:

These figures show that the competition between the vendors is very, very stiff, with little edge difference from one to another. But what I was more interested were the following findings, because these figures tell a story.

In the survey, only 34% of the respondents say they have implemented some data deduplication solutions, while the rest are evaluating and plan to evaluation. This means that the overall market is not saturated and there is still a window of opportunity for the vendors. However, the speed of the a maturing data deduplication market, from early adopters perhaps 4-5 years ago to overall market adoption, surprised many, because the storage industry tend to be a bit less trendy than most areas of IT. With the way the rate of data deduplication is going, it will be very much a standard feature of all storage vendors in the very near future.

The second figures that is probably not-so-surprising is, for most of the customers who have already implemented the data deduplication solution, almost 99% are satisfied or somewhat satisfied with their solutions. Therefore, the likelihood of these customer switching vendors and replacing their gear is very low, perhaps partly because of the reliability of the solution as well as those products performing as they should.

The Information Week’s IT Pro survey probably reflected well of where the deduplication market is going and there isn’t much difference in terms of technical and technology features from vendor to vendor. Customer will have to choose beyond the usual technology pitch, and look for other (and perhaps more important) subtleties such as customer service, price and flexibility of doing business with. EMC/Data Domain, being king-of-the-hill, has not been the best of vendor when it comes to price, quality of post-sales support and service innovation. Let’s hope they are not like the EMC sales folks of the past, carrying the “Take it or leave it” tag when they develop their relationship with their future customers. And it will not help if word-of-mouth goes around the industry about EMC’s arrogance of their dominance. It may not be true, and let’s hope it is not true because the EMC of today has changed plenty compared to the Symmetrix days. EMC/Data Domain is now part of their Backup Recovery Service (BRS) team, and I have good friends there at EMC Malaysia and Singapore. They are good guys but remember guys, customer is still king!

Dell, new with their acquisition of Compellent and Ocarina Networks, seems very eager to win the business and kudos to them as well. In fact, I heard from a little birdie that Dell is “giving away” several units of Compellents to selected customers in Malaysia. I did not and cannot ascertain if this is true or not but if it is, that’s what I call thinking-out-of-the-box, given Dell as a late comer into the storage game. Well done!

One thing to note is that the survey took in 17 vendors, including Exagrid, Falconstor, Quantum, Sepaton and so on, but only the top-7 shown in the charts qualified.

In the end, I believe the deduplication vendors had better scramble to grab as much as they can in the coming months, because this market will be going, going, gone pretty soon with nothing much to grab after that, unless there is a disruptive innovation to the deduplication technology

Ocarina rising

After more than a year since Dell acquired Ocarina Networks, it has finally surfaced last week in the form of Dell DX Object Storage 6000G SCN (Storage Compression Node).

Ocarina is a content-aware storage optimization engine, and their solution is one of the best I have seen out there. Its unique ECOsystem technology, as described in the diagram below, is impressive.

Unlike most deduplication and compression solutions out there, Ocarina Networks solution takes storage optimization a step further.  Ocarina works at the file level and given the rise and crazy, crazy growth of unstructured files in the NAS space, the web and the clouds, storage optimization is one priority that has to be addressed immediately. It takes a 3-step process – Extract, Correlate and Optimize.

Today’s files are no longer a flat structure of a single object but more of a compounded file where many objects are amalgamated from different sources. Microsoft Office is a perfect example of this. An Excel file would consists of objects from Windows Metafile Formats, XML objects, OLE (Object Linking and Embedding) Compound Storage Objects and so on. (Note: That’s just Microsoft way of retaining monopolistic control). Similarly, a web page is a compound of XML, HTML, Flash, ASP, PHP object codes.

In Step 1, the technology takes files and breaks it down to its basic components. It is kind of like breaking apart every part of a car down to its nuts and bolt and layout every bit on the gravel porch. That is the “Extraction” process and it decodes each file to get the fundamental components of the files.

Once the compounded file object is “extracted”, identified and indexed, each fundamental object is Correlated in Step 2. The correlation is executed with the file and across files under the purview of Ocarina. Matching and duplicated objects are flagged and deduplicated. The deduplication is done at the byte-level, unlike most deduplication solutions that operate at the block-level. This deeper and more granular approach further reduces the capacity of the storage required, making Ocarina one of the most efficient storage optimization solutions currently available. That is why Ocarina can efficiently reduce the size of even zipped and highly encoded files.

It takes this storage optimization even further in Step 3. It applies content-aware compactors for each fundamental object type, uniquely compressing each object further. That means that there are specialized compactors for PDF objects, ZIP objects and so on. They even have compactors for Oil & Gas seismic files. At the time I was exposed to Ocarina Networks and evaluating it, it had about 600+ unique compactors.

After Dell bought Ocarina in July 2010, the whole Ocarina went into a stealth mode. Many already predicted that the Ocarina technology would be integrated and embedded into Dell’s primary storage solutions of Compellent and EqualLogic. It is not there yet, but will likely be soon.

Meanwhile, the first glimpse of Ocarina will be integrated as a gateway solution to Dell DX6000 Object Storage. DX Object Storage is a technology which Dell has OEMed from Caringo. DX6000 Object Storage (I did not read in depth) has the concept of the old EMC Centera, but with a much newer, and more approach based on XML and HTTP REST. It has published an open API and Dell is getting ISV partners to develop their applications to interact with the DX6000 including Commvault, EMC, Symantec, StoredIQ are some of the ISV partners working closely with Dell.

(24/10/2011: Editor note: Previously I associated Dell DX6000 Object Storage with Exanet. I was wrong and I would like to thank Jim Dtuton of Caringo for pointing out my mistake)

Ocarina’s first mission is to reduce the big, big capacities in Big Data space of the DX6000 Object Storage, and the Ocarina ECOsystem technology looks a good bet for Dell as a key technology differentiator.

Deduplication – a fancy form of Compression?

One of the things that peeved at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.

I have always placed both deduplication and compression into a bucket I called “Data Reduction“. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both attempt and succeed to reduce the capacity required to store the amount of data and this translates into benefits in storage management and network. With a smaller data set, lesser processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, but you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently, compared to its original state and size, making Disaster Recovery more possible and so on.

I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of the file, deduplication attempts to differentiate similarities (duplicates), and store one copy of that object and have others referencing to the single object. The differentiation methods commonly used are hashing and delta differential. In hashing, MD-5 and SHA-1 are the popular hashing algorithms used, while in delta differentials, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.

There are many factors involved in deduplication. It could be the types of data, the processing power required to do the deduplication task, and throughput of processing and so on and resulting in the different deduplication ratio and time required to complete the process. I am not going to delve into that as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.

Meanwhile, compression (especially most commercial compression technology) are based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real life object.

The technique works like this.

  1. When being encoded, a bit/byte or a set of bytes are compared to a “dictionary” which is a pool of  “words”  in a data structure maintained by the encoding technology
  2. If a match is found, the bit/byte or set of bytes is substituted by an “word”, usually a much shorter (hence smaller size) representation form of the bytes being encoded.
  3. As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. This is popularly known as the sliding window implementation.
  4. The end result is the data is highly encoded (heavily replaced) by “dictionary words”  and of a much smaller size.

One of the heavily implemented compression technique is based on the theory and methodology introduced by Lempel-Ziv and further enhanced by the Lempel-Ziv-Welch trio. A very good explanation of LZ method can be found here.

Both deduplication and compression have the same objective – that is to reduce the data size for more efficient storage. But both approach it from a different angle but they are by no means, exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.

Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup) because you can find plenty of duplicates while in primary data sets (the data in production), deduplication and compression are deployed, either in a singular fashion or one after another. Deduplication is usually run as Step 1 and then Compression is run in Step 2.

So far, the only one that has impressed me for the primary data reduction is Ocarina Networks, which uses a 3 step approach in dedupe, compress and using specialized compactors to reduce the data even more. I have seen the ability of Ocarina reducing Schlumberger Geoframe and Petrel seismic data to more than 50%. That was impressive!

Having my bothered state satisfied, I guess having the say of “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology” but I am not complaining as much I did before.