Praying to the hypervisor God

I was reading a great article by Frank Denneman about storage intelligence moving up the stack. It was pretty much in line with what I have been observing in the past 18 months or so, about the storage pendulum having swung back to DAS (direct attached storage). To be more precise, the DAS form factor I am referring to are physical server hardware that houses many disk drives.

Like it or not, the hypervisor has become the center of the universe in the IT space. VMware has become the indomitable force in the hypervisor technology, with Microsoft Hyper-V playing catch-up. The seismic shift of these 2 hypervisor technologies are leading storage vendors to place them on to the altar and revering them as deities. The others, with the likes of Xen and KVM, and to lesser extent Solaris Containers aren’t really worth mentioning.

This shift, as the pendulum swings from networked storage back to internal “direct-attached” storage are dictated by 4 main technology factors:

  • The x86 server architecture
  • Software-defined
  • Scale-out architecture
  • Flash-based storage technology

Anyone remember Thumper? Not the Disney character from the Bambi movie!

thumper-bambi-cartoon-character

When the SunFire X4500 (aka Thumper) was first released in (intermission: checking Wiki for the right year) in 2006, I felt that significant wound inflicted in the networked storage industry. Instead of the usual 4-8 hard disk drives in the all the industry servers at the time, the X4500 4U chassis housed 48 hard disk drives. The design and architecture were so astounding to me, I even went and bought a 1U SunFire X4150 for my personal server collection. Such was my adoration for Sun’s technology at the time.

Continue reading

Technology prowess of Riverbed SteelFusion

The Riverbed SteelFusion (aka Granite) impressed me the moment it was introduced to me 2 years ago. I remembered that genius light bulb moment well, in December 2012 to be exact, and it had left its mark on me. Like I said last week in my previous blog, the SteelFusion technology is unique in the industry so far and has differentiated itself from its WAN optimization competitors.

To further understand the ability of Riverbed SteelFusion, a deeper inspection of the technology is essential. I am fortunate to be given the opportunity to learn more about SteelFusion’s technology and here I am, sharing what I have learned.

What does the technology of SteelFusion do?

Riverbed SteelFusion takes SAN volumes from supported storage vendors in the central datacenter and projects the storage volumes (aka LUNs)to applications and hosts at the remote branches. The technology requires a paired relationship between SteelFusion Core (in the centralized datacenter) and SteelFusion Edge (at the branch). Both SteelFusion Core and Edge are fronted respectively by the Riverbed SteelHead WAN optimization device, to deliver the performance required.

The diagram below gives an overview of how the entire SteelFusion network architecture is like:

Riverbed SteelFusion Overall Solution 2 Continue reading

Convergence data strategy should not forget the branches

The word “CONVERGENCE” is boiling over as the IT industry goes gaga over darlings like Simplivity and Nutanix, and the hyper-convergence market. Yet, if we take a step back and remove our emotional attachment from the frenzy, we realize that the application and implementation of hyper-convergence technologies forgot one crucial elementThe other people and the other offices!

ROBOs (remote offices branch offices) are part of the organization, and often they are given the shorter end of the straw. ROBOs are like the family’s black sheeps. You know they are there but there is little mention of them most of the time.

Of course, through the decades, there are efforts to consolidate the organization’s circle to include ROBOs but somehow, technology was lacking. FTP used to be a popular but crude technology that binds the branch offices and the headquarter’s operations and data services. FTP is still used today, in countries where network bandwidth costs a premium. Data cloud services are beginning to appear of part of the organization’s outreaching strategy to include ROBOs but the fear of security weaknesses, data breaches and misuses is always there. Often, concerns of the weaknesses of the cloud overcome whatever bold strategies concocted and designed.

For those organizations in between, WAN acceleration/optimization techonolgy is another option. Companies like Riverbed, Silverpeak, F5 and Ipanema have addressed the ROBOs data strategy market well several years ago, but the demand for greater data consolidation and centralization, tighter and more effective data management and data control to meet the data compliance and data governance requirements, has grown much more sophisticated and advanced. Continue reading

SMB on steroids but CIFS lord isn’t pleased

I admit it!

I am one of the guilty parties who continues to use CIFS (Common Internet File System) to represent the Windows file sharing protocol. And a lot of vendors continue to use the “CIFS” word loosely without knowing that it was a something from a bygone era. One of my friends even pronounced it as “See Fist“, which sounded even funnier when he said it. (This is for you Adrian M!)

And we couldn’t be more wrong because we shouldn’t be using the CIFS word anymore. It is so 90’s man! And the tell-tale signs have already been there but most of us chose to ignore it with gusto. But a recent SNIA Webinar titled “SMB 3.0 – New opportunities for Windows Environment” aims to dispel our incompetence and change our CIFS-venture to the correct word – SMB (Server Message Block).

A selfie photo of Dennis Chapman, Senior Technical Director for Microsoft Solutions at NetApp from the SNIA webinar slides above, wants to inform all of us that … SMB History Continue reading

Xtreme future?

EMC acquisition of XtremIO sent shockwaves across the industry. The news of the acquisition, reported costing EMC USD$430 million can be found here, here and here.

The news of EMC’s would be acquisition a few weeks ago was an open secret and rumour has it that NetApp was eyeing XtremIO as well. Looks like EMC has beaten NetApp to it yet again.

The interesting part was of course, the price. USD$430 million is a very high price to pay for a stealthy, 2-year old company which has 2 rounds of funding totaling USD$25 million. Why such a large amount?

XtremIO has a talented team of engineers; the notable ones being Yaron Segev and Shahar Frank. They have their background in InfiniBand, and Shahar Frank was the chief architect of Exanet scale-out NAS (which was acquired by Dell). However, as quoted by 451Group, XtremeIO is building an all-flash SAN array that “provides consistently high performance, high levels of flash endurance, and advanced functionality around thin provisioning, de-dupe and space-efficient snapshots“.

Furthermore, XtremeIO has developed a real-time inline deduplication engine that does not degrade performance. It does this by spreading the write I/Os over the entire array. There is little information about this deduplication engine, but I bet XtremIO has developed a real-time, inherent deduplication file system that spreads all the I/Os to balance the wear-leveling as well as having scaling performance. I bet XtremIO will dedupe everything that it stores, has a B+ tree, copy-on-write file system with a super-duper efficient hashing algorithm for address mapping (pointers) with this deduplication file system. Ok, ok, I am getting carried away here, because it is likely that I will be wrong, but I can imagine, can’t I? Continue reading

NFO for DFR

It has not caused severe pain yet but it will. Storage is cheap but as capacity grows, it will eventually hit a limit that makes storage difficult to maintain from a cost perspective.

I wrote about the lack of attention of primary storage deduplication solutions in the local industry. Perhaps deduplication has matured to a point that it has become a no-brainer or perhaps customers are already getting sick and tired of the word “dedupe”. Either way, we should not be distracted from the fact that data footprint reduction (DFR) in a generic sense or storage efficiency as a fancy marketing term, must be applied somewhere to slow down the purchase of storage capacity.

Storage is getting fatter, and storage vendors’ revenue is getting fatter along with it. While this is good for the pockets of vendors, the customers have to face higher costs associated with

  • Power, Cooling and Floorspace
  • Administration and management
  • Bandwidth
  • Resource utilization

All these are not prudent storage management practices, because fat storage is bad, just like human beings getting fatter. Similarly, storage must go on a diet and deduplication is one of the few solutions out there. However, I have spoken out that deduplication is just shrinking the container that holds the bits of data, completely unaware of what the content is. Deduplication does not shrink the data itself, and if the occurrence of the data is high, deduplication does not help in reducing the storage capacity. There is no advantage unless the data footprint reduction (DFR) technology is content aware. (Note that I am using DFS as a generic term rather than data deduplication. The reason is obvious.)

That is why data deduplication technologies does not work well with seismic files or encoded video files, because the files are already highly optimized. But there is a technology that can look deeper into such unstructured files and produce storage capacity reduction with specific algorithms for specific type of files and file objects. That technology, I believe, is the truest form of data footprint reduction and it is called Native Format Optimization (NFO).

I want to relate an old story I had experienced when I brought an EMC BURA (Back Up Recovery & Archive – a precursor to its present BRS division) senior manager to see a highly respected technical manager in Schlumberger in Malaysia a few years back. Schlumberger is the world’s largest oilfield services company and provides seismic analysis and interpretation software and seismic files are highly encoded and compressed.

As usual, the senior manager being a typical sales guy started blabbering how great Data Domain (this was just after the EMC acquisition) was, and how it can dedupe any kind of files giving 20:1 (exaggerated to 500:1 to certain text files), even for seismic files. I was signalling to the EMC senior manager to stop his bullsh*t, but he went on and on. In the end, the Schlumberger technical manager politely told the EMC senior manager to shut up, because he has little understand of what seismic files are like.

Now, back to Native Format Optimization (NFO) technology. In a nutshell, NFO plays trick with our human visual system. The goal is to reduce the size of unstructured files without reducing the visual quality of the images (text, texture, colour, resolution, depth, hue, contrast, etc) of the files. 

Have a look at these 2 files. One is optimized with NFO and one is un-optimized. Can you tell the difference?

 

The human visual system is known to be:

  • Less sensitive to high frequency of colour variation
  • More sensitive to brightness than colour variation
  • Less sensitive to background colour in lower resolution
  • More sensitive to a picture’s motion than picture’s texture

Therefore, the eyes perceive an image based on mostly the lowest quality baseline. I got this information from George Crump’s Storage Switzerland’s article.

Because NFO is already in its native form, the files does not need to be rehydrated like deduped files.

The capacity reduction savings is tremendous and because NFO approach is content aware, the benefits translates to higher cost savings in

  • Reduction of power, cooling and floorspace
  • Reduction in data management and administration tasks, especially backup
  • Improved bandwidth and improved disaster recovery
  • Higher performance
  • Delayed storage capacity purchase
  • many more

After Ocarina acquisition by Dell in 2010, a search on the web revealed that probably only one vendor in Europe has boldly continued to enhance NFO technology in their products. The company is balesio and you can read about their NFO technology here.

 

Primary Dedupe where are you?

I am a bit surprised that primary storage deduplication has not taken off in a big way, unlike the times when the buzz of deduplication first came into being about 4 years ago.

When the first deduplication solutions first came out, it was particularly aimed at the backup data space. It is now more popularly known as secondary data deduplication, the technology has reduced the inefficiencies of backup and helped sparked the frenzy of adulation of companies like Data Domain, Exagrid, Sepaton and Quantum a few years ago. The software vendors were not left out either. Symantec, Commvault, and everyone else in town had data deduplication for backup and archiving.

It was no surprise that EMC battled NetApp and finally won the rights to acquire Data Domain for USD$2.4 billion in 2009. Today, in my opinion, the landscape of secondary data deduplication has pretty much settled and matured. Practically everyone has some sort of secondary data deduplication technology or solution in place.

But then the talk of primary data deduplication hardly cause a ripple when compared a few years ago, especially here in Malaysia. Yeah, the IT crowd is pretty fickle that way because most tend to follow the trend of the moment. Last year was Cloud Computing and now the big buzz word is Big Data.

We are here to look at technologies to solve problems, folks, and primary data deduplication technology solutions should be considered in any IT planning. And it is our job as storage networking professionals to continue to advise customers about what is relevant to their business and addressing their pain points.

I get a bit cheesed off that companies like EMC, or HDS continue to spend their marketing dollars on hyping the trends of the moment rather than using some of their funds to promote good technologies such as primary data deduplication that solve real life problems. The same goes for most IT magazines, publications and other communications mediums, rarely giving space to technologies that solves problems on the ground, and just harping on hypes, fuzz and buzz. It gets a bit too ordinary (and mundane) when they are trying too hard to be extraordinary because everyone is basically talking about the same freaking thing at the same time, over and over again. (Hmmm … I think I am speaking off topic now .. I better shut up!)

We are facing an avalanche of data. The other day, the CEO of Nexenta used the word “data tsunami” but whatever terms used do not matter. There is too much data. Secondary data deduplication solved one part of the problem and now it’s time to talk about the other part, which is data in primary storage, hence primary data deduplication.

What is out there?  Who’s doing what in term of primary data deduplication?

NetApp has their A-SIS (now NetApp Dedupe) for years and they are good in my books. They talk to customers about the benefits of deduplication on their FAS filers. (Side note: I am seeing more benefits of using data compression in primary storage but I am not going to there in this entry). EMC has primary data deduplication in their Celerra years ago but they hardly talk much about it. It’s on their VNX as well but again, nobody in EMC ever speak about their primary deduplication feature.

I have always loved Ocarina Networks ECO technology and Dell don’t give much hoot about Ocarina since the acquisition in  2010. The technology surfaced a few months ago in Dell DX6000G Storage Compression Node for its Object Storage Platform, but then again, all Dell talks about is their Fluid Data Architecture from the Compellent division. Hey Dell, you guys are so one-dimensional! Ocarina is a wonderful gem in their jewel case, and yet all their storage guys talk about are Compellent  and EqualLogic.

Moving on … I ought to knock Oracle on the head too. ZFS has great data deduplication technology that is meant for primary data and a couple of years back, Greenbytes took that and made a solution out of it. I don’t follow what Greenbytes is doing nowadays but I do hope that the big wave of primary data deduplication will rise for companies such as Greenbytes to take off in a big way. No thanks to Oracle for ignoring another gem in ZFS and wasting their resources on pre-sales (in Malaysia) and partners (in Malaysia) that hardly know much about the immense power of ZFS.

But an unexpected source coming from Microsoft could help trigger greater interest in primary data deduplication. I have just read that the next version of Windows Server OS will have primary data deduplication integrated into NTFS. The feature will be available in Windows 8 and the architectural view is shown below:

The primary data deduplication in NTFS will be a feature add-on for Windows Server users. It is implemented as a filter driver on a per volume basis, with each volume a complete, self describing unit. It is cluster aware, and fully crash consistent on all operations.

The technology is Microsoft’s own technology, built from scratch and will be working to position Hyper-V as an strong enterprise choice in its battle for the server virtualization space with VMware. Mind you, VMware already has a big, big lead and this is just something that Microsoft must do-or-die to keep Hyper-V playing catch-up. Otherwise, the gap between Microsoft and VMware in the server virtualization space will be even greater.

I don’t have the full details of this but I read that the NTFS primary deduplication chunk sizes will be between 32KB to 128KB and it will be post-processing.

With Microsoft introducing their technology soon, I hope primary data deduplication will get some deserving accolades because I think most companies are really not doing justice to the great technologies that they have in their jewel cases. And I hope Microsoft, with all its marketing savviness and adeptness, will do some justice to a technology that solves real life’s data problems.

I bid you good luck – Primary Data Deduplication! You deserved better.

Data Deduplication – Dell is first and last

A very interesting report surfaced in front of me today. It is Information Week’s IT Pro ranking of Data Deduplication vendors, just made available a few weeks ago, and it is the overview of the dedupe market so far.

It surveyed over 400 IT professionals from various industries with companies ranging from less than 50 employees to over 10,000 employees and revenues of less than USD5 million to USD1 billion. Overall, it had a good mix of respondents. But the results were quite interesting.

It surveyed 2 segments

  1. Overall performance – product reliability, product performance, acquisition costs, operations costs etc.
  2. Technical features – replication, VTL, encryption, iSCSI and FCoE support etc.

When I saw the results (shown below), surprise, surprise! Here’s the overall performance survey chart:

Dell/Compellent scored the highest in this survey while EMC/Data Domain ranked the lowest. However, the difference between the first place and the last place vendor is only 4%, and this is to suggest that EMC/Data Domain was about just as good as the Dell/Compellent solution, but it scored poorly in the areas that matters most to the customer. In fact, as we drill down into the requirements of the overall performance one-by-one, as shown below,

there is little difference among the 7 vendors.

However, when it comes to Technical Features, Dell/Compellent is ranked last, the complete opposite. As you can see from the survey chart below, IBM ProtecTier, NetApp and HP are all ranked #1.

The details, as per the technical requirements of the customers, are shown below:

These figures show that the competition between the vendors is very, very stiff, with little edge difference from one to another. But what I was more interested were the following findings, because these figures tell a story.

In the survey, only 34% of the respondents say they have implemented some data deduplication solutions, while the rest are evaluating and plan to evaluation. This means that the overall market is not saturated and there is still a window of opportunity for the vendors. However, the speed of the a maturing data deduplication market, from early adopters perhaps 4-5 years ago to overall market adoption, surprised many, because the storage industry tend to be a bit less trendy than most areas of IT. With the way the rate of data deduplication is going, it will be very much a standard feature of all storage vendors in the very near future.

The second figures that is probably not-so-surprising is, for most of the customers who have already implemented the data deduplication solution, almost 99% are satisfied or somewhat satisfied with their solutions. Therefore, the likelihood of these customer switching vendors and replacing their gear is very low, perhaps partly because of the reliability of the solution as well as those products performing as they should.

The Information Week’s IT Pro survey probably reflected well of where the deduplication market is going and there isn’t much difference in terms of technical and technology features from vendor to vendor. Customer will have to choose beyond the usual technology pitch, and look for other (and perhaps more important) subtleties such as customer service, price and flexibility of doing business with. EMC/Data Domain, being king-of-the-hill, has not been the best of vendor when it comes to price, quality of post-sales support and service innovation. Let’s hope they are not like the EMC sales folks of the past, carrying the “Take it or leave it” tag when they develop their relationship with their future customers. And it will not help if word-of-mouth goes around the industry about EMC’s arrogance of their dominance. It may not be true, and let’s hope it is not true because the EMC of today has changed plenty compared to the Symmetrix days. EMC/Data Domain is now part of their Backup Recovery Service (BRS) team, and I have good friends there at EMC Malaysia and Singapore. They are good guys but remember guys, customer is still king!

Dell, new with their acquisition of Compellent and Ocarina Networks, seems very eager to win the business and kudos to them as well. In fact, I heard from a little birdie that Dell is “giving away” several units of Compellents to selected customers in Malaysia. I did not and cannot ascertain if this is true or not but if it is, that’s what I call thinking-out-of-the-box, given Dell as a late comer into the storage game. Well done!

One thing to note is that the survey took in 17 vendors, including Exagrid, Falconstor, Quantum, Sepaton and so on, but only the top-7 shown in the charts qualified.

In the end, I believe the deduplication vendors had better scramble to grab as much as they can in the coming months, because this market will be going, going, gone pretty soon with nothing much to grab after that, unless there is a disruptive innovation to the deduplication technology

HP StoreOnce – Further Depth

I promised last week I will look deeper into HP StoreOnce technology and I did. As I mentioned in my previous blog, HP StoreOnce technology now embedded in its D2D series of secondary, target backup devices that does the job with no fuss and no fancy bells and whistles.

Here’s the lineup of the present HP D2D solutions.

 

HP Malaysia has constantly reminded me that their D2D deduplication solution is much more price competitive than their competitors and this is something you, the readers, have to find out on your own. But I do believe that they are. Unfortunately they did not have the first mover’s advantage when Data Domain took the industry by storm in 2009, since HP StoreOnce was only launched with much fanfare last year in June 2010. Despite that, there still plenty of room in the IT market to grow, especially in HP’s huge set of customers.

Without the first movers advantage, HP StoreOnce has to differentiate itself from the existing competitors such as EMC Data Domain and Quantum. Labeling their deduplication technology as version 2.0 (whereas the competitors are still at “Version 1.0″?), HP StoreOnce banks on 3 key technologies. They are

  • Sparse Indexing
  • Intelligent Block Size Management
  • Reduction in Disk Fragmentation

Out of these 3, sparse indexing is the most interesting but I will save the best from last. Let’s start with Intelligent Block Size Management.

HP StoreOnce uses a variable chunking method with a smaller granularity of 4K in size and this is managed intelligently, thus achieving a higher deduplication ratio compared to its competitors which either uses a fixed chunking method or with a variable chunking method of larger block sizes in the range of 8K to 32K. The HP Lab’s testing reveals that the space savings was significant when compared with others.

Below are a set of results for a PowerPoint presentation and you can see for yourself.

 

(NOTE: Please note that the savings/deduplication ratio can be very different and can range from good to bad for different types of data. Video and images files are highly encoded. Seismic and geo-mapping files are highly compressed. It is very likely that most deduplication solutions cannot achieve a high percentage with these types of files)

Point #2 talks about Reduction in Disk Fragmentation. The inherent benefits from Intelligent Block Size Management brings about the Reduction in Disk Fragmentation. The smaller chunks means lesser space wastage, especially when the block size is 4K or lower. HP StoreOnce also uses an intelligent algorithm to place the blocks that are perceived to be related close to one another. Hence this “locality” presence helps and the retrieval and restore process will be faster and more efficient.

Sparse Indexing is where HP StoreOnce touts to be a game changer. Today’s data is already as massive as a mountain, and it’s going to get bigger and growing faster. Using “Version 1.0″ type of deduplication, the hashes created are stored in either memory or on disks. However, the massive data sets (especially unstructured data) are already producing massive amounts of hashes. Hashes are used to identify unique data blocks but the avalanche of unstructured data means that most deduplication solutions are generating more and more hashes, making most Version 1.0s hashes sluggish and difficult to retrieve.

Sparse Indexing addresses this hash problem (by the way, HP StoreOnce uses SHA-1 hash) by intelligently sampling a small chunks and creating a very fast index lookup mechanism that stays in the system’s memory all the time. As the engineers at HP Labs put it

Instead of holding every index item in RAM ready for comparison,
the HP team keeps just one in every hundred or so items in RAM
and puts the rest onto a hard drive. Duplicate data almost
always arrives in bursts. In other words, if one chunk of the
arriving stream is a duplicate, it is very likely that many
following chunks are duplicates. Sparse indexing takes advantage
of this phenomenon by storing the sequence of hashes of the
stored chunks next to each other on disk. As a result, a ‘hit’
in the sample RAM index can direct the system to an area of
the disk where many duplicates are likely to be found.

Sparse Indexing is not unique in the industry, but the engineers at HP Labs have put their thinking hats on and applied it to improve the search and looking up of the hashes in the StoreOnce deduplication technology.

Further savings are also achieved when the deduped data is compressed with the LZ (Lempel-Ziv) compression method before it is stored into the disks.

The HP StoreOnce technology is 100% fully concocted in the renown HP Labs and according to sources, this technology will indeed permeate across all HP StorageWorks (HP has since renamed it to HP Storage) line. With this strategy, HP hopes to address the “fragmented and complicated” (as quoted by HP) deduplication and data protection strategy across the enterprise. By “fragmented and complicated”, they mean that the deduplicated data constant has to be rehydrated and deduped again as the data moves across different IT devices and functions.

In a perfect world, HP wants their StoreOnce technology to be like the diagram below.

 

However, one very interesting fact that I found was HP does not believe that primary storage deduplication is a good idea. They claim that it complicates the whole thing. Whether HP likes it or not, NetApp has been dishing out primary storage deduplication for several years now and you don’t see their customers unhappy with NetApp about this feature.

In one of the HP Business whitepapers I read, one of the takeaways was

 

I was like, “Whoa! What’s this?”. I felt bemused about what was mentioned in the whitepaper. After all the best claims of the HP StoreOnce technology, I can’t help but to think that this could be a banana skin on the pavement for HP.


Deduplication – a fancy form of Compression?

One of the things that peeved at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.

I have always placed both deduplication and compression into a bucket I called “Data Reduction“. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both attempt and succeed to reduce the capacity required to store the amount of data and this translates into benefits in storage management and network. With a smaller data set, lesser processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, but you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently, compared to its original state and size, making Disaster Recovery more possible and so on.

I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of the file, deduplication attempts to differentiate similarities (duplicates), and store one copy of that object and have others referencing to the single object. The differentiation methods commonly used are hashing and delta differential. In hashing, MD-5 and SHA-1 are the popular hashing algorithms used, while in delta differentials, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.

There are many factors involved in deduplication. It could be the types of data, the processing power required to do the deduplication task, and throughput of processing and so on and resulting in the different deduplication ratio and time required to complete the process. I am not going to delve into that as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.

Meanwhile, compression (especially most commercial compression technology) are based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real life object.

The technique works like this.

  1. When being encoded, a bit/byte or a set of bytes are compared to a “dictionary” which is a pool of  “words”  in a data structure maintained by the encoding technology
  2. If a match is found, the bit/byte or set of bytes is substituted by an “word”, usually a much shorter (hence smaller size) representation form of the bytes being encoded.
  3. As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. This is popularly known as the sliding window implementation.
  4. The end result is the data is highly encoded (heavily replaced) by “dictionary words”  and of a much smaller size.

One of the heavily implemented compression technique is based on the theory and methodology introduced by Lempel-Ziv and further enhanced by the Lempel-Ziv-Welch trio. A very good explanation of LZ method can be found here.

Both deduplication and compression have the same objective – that is to reduce the data size for more efficient storage. But both approach it from a different angle but they are by no means, exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.

Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup) because you can find plenty of duplicates while in primary data sets (the data in production), deduplication and compression are deployed, either in a singular fashion or one after another. Deduplication is usually run as Step 1 and then Compression is run in Step 2.

So far, the only one that has impressed me for the primary data reduction is Ocarina Networks, which uses a 3 step approach in dedupe, compress and using specialized compactors to reduce the data even more. I have seen the ability of Ocarina reducing Schlumberger Geoframe and Petrel seismic data to more than 50%. That was impressive!

Having my bothered state satisfied, I guess having the say of “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology” but I am not complaining as much I did before.