I like LTFS (Linear Tape File System). I was hoping it would take off but it has not. And looking at its future, its significance is becoming less and less relevant. I look if Cloud has been a factor in the possible demise of LTFS in the next few years.
What is LTFS?
In a nutshell, Linear Tape File System makes LTO tapes look like a disk with a file system. It takes a tape and divides it into 2 partitions:
Index Partition (XML Index Schema with file names, metadata and attributes details)
Data Partition (where the data resides)
Diagram from https://www.snia.org/sites/default/orig/SDC2011/presentations/tuesday/DavidPease_LinearTape_File_System.pdf
It has a File System module which is implemented in supported OS of Unix/Linux, MacOS and Windows. And the mounted file system “tape partition” shows up as a drive or device.
There were many attempts to kill off tapes and so far, none has been successful.
The flash frenzy has reached its zenith in 2016. We now no longer are interested in listening to storage technology vendors touting the power of solid state storage (NAND Flash included) over spinning drives.
The capacity of 3D NAND Flash SSDs has reached a whopping 15.3TB (that is even bigger than the 12TB 7200RPM HDDs of today), and with deduplication and compression, the storage efficiency has reached a conservative 4:1 or 5:1. Effective capacity of most mid-end storage arrays can easily reach 1-2 Petabytes.
And flash and hybrid platforms have reached maturity in these few short years. So what is next?
The landscape has obviously changed. The performance landscape, the capacity landscape and all related to the storage data points have changed. And the speed of SSDs together with the up-and-coming NVMe and NVDIMM technology in new storage array controllers are also shifting the data bottlenecks to another part of the architecture. The development of I/O communications and interfaces has to change as well, to take advantage of the asynchronous I/Os in storage tiering and caching using NAND Flash.
With this mature and well understood landscape, it is time to take Flash to the next level. This next level comes in the form of an exciting end-user conference in Singapore on 25th April 2017. It is called FlashForward.
The 2016 FlashForward event in Europe has already garnered great support from the cream of the storage technologists around the world, and had fantastic feedbacks from the end-user attendees. That FlashForward event has also seen the birth of an international business and technology exchange in its inaugural introduction. Yes, it is time to learn from the field experts, and it is time to build on the Flash Platform for new Data Services.
From the sponsorship package brochure I have received, it is definitely an event not to be missed.
The FlashForward Conference in Singapore is exquisitely procured by Evito Ltd, under the stewardship of Mr. Paul Talbut. Paul is a very seasoned veteran in the global circuit as an SNIA director of several initiatives. He has been immensely involved in the development of several SNIA chapters around the world, including South Asia, Malaysia, India, China, and even Brazil. He also leads by example with the SNIA Global Steering Committee (GSC); he is the SNIA Global Education Director and at one time, SNIA DPCO (Data Protection & Capacity Optimization) global proctor.
I have had the honour working with Paul for almost 8 years now, and I am sure he will lead the FlashForward Conference with valuable insights and experiences.
This is probably the greatest period for the industry and end users to get involved in the FlashForward Conference. For one, it is endorsed by SNIA, the vendor-neutral association which has been the growth beacon of the storage networking industry.
Secondly, it is the perfect opportunity for technology vendors to build their mindshare with end users and customers. And with the endorsement of the independent field experts and technology practitioners, end users would have a field day garnering approvals for their decisions, as well as learning the best practices to build upon the Flash technology they have implemented in their data center space.
The sponsorship packages are listed below, and I do encourage technology vendors, especially the All-Flash vendors to use the FlashForward conference as a platform to build their mindshare, and most of all, their branding. Continue reading →
The Riverbed SteelFusion (aka Granite) impressed me the moment it was introduced to me 2 years ago. I remembered that genius light bulb moment well, in December 2012 to be exact, and it had left its mark on me. Like I said last week in my previous blog, the SteelFusion technology is unique in the industry so far and has differentiated itself from its WAN optimization competitors.
To further understand the ability of Riverbed SteelFusion, a deeper inspection of the technology is essential. I am fortunate to be given the opportunity to learn more about SteelFusion’s technology and here I am, sharing what I have learned.
What does the technology of SteelFusion do?
Riverbed SteelFusion takes SAN volumes from supported storage vendors in the central datacenter and projects the storage volumes (aka LUNs)to applications and hosts at the remote branches. The technology requires a paired relationship between SteelFusion Core (in the centralized datacenter) and SteelFusion Edge (at the branch). Both SteelFusion Core and Edge are fronted respectively by the Riverbed SteelHead WAN optimization device, to deliver the performance required.
The diagram below gives an overview of how the entire SteelFusion network architecture is like:
The word “CONVERGENCE” is boiling over as the IT industry goes gaga over darlings like Simplivity and Nutanix, and the hyper-convergence market. Yet, if we take a step back and remove our emotional attachment from the frenzy, we realize that the application and implementation of hyper-convergence technologies forgot one crucial element – The other people and the other offices!
ROBOs (remote offices branch offices) are part of the organization, and often they are given the shorter end of the straw. ROBOs are like the family’s black sheeps. You know they are there but there is little mention of them most of the time.
Of course, through the decades, there are efforts to consolidate the organization’s circle to include ROBOs but somehow, technology was lacking. FTP used to be a popular but crude technology that binds the branch offices and the headquarter’s operations and data services. FTP is still used today, in countries where network bandwidth costs a premium. Data cloud services are beginning to appear of part of the organization’s outreaching strategy to include ROBOs but the fear of security weaknesses, data breaches and misuses is always there. Often, concerns of the weaknesses of the cloud overcome whatever bold strategies concocted and designed.
For those organizations in between, WAN acceleration/optimization techonolgy is another option. Companies like Riverbed, Silverpeak, F5 and Ipanema have addressed the ROBOs data strategy market well several years ago, but the demand for greater data consolidation and centralization, tighter and more effective data management and data control to meet the data compliance and data governance requirements, has grown much more sophisticated and advanced. Continue reading →
EMC acquisition of XtremIO sent shockwaves across the industry. The news of the acquisition, reported costing EMC USD$430 million can be found here, here and here.
The news of EMC’s would be acquisition a few weeks ago was an open secret and rumour has it that NetApp was eyeing XtremIO as well. Looks like EMC has beaten NetApp to it yet again.
The interesting part was of course, the price. USD$430 million is a very high price to pay for a stealthy, 2-year old company which has 2 rounds of funding totaling USD$25 million. Why such a large amount?
XtremIO has a talented team of engineers; the notable ones being Yaron Segev and Shahar Frank. They have their background in InfiniBand, and Shahar Frank was the chief architect of Exanet scale-out NAS (which was acquired by Dell). However, as quoted by 451Group, XtremeIO is building an all-flash SAN array that “provides consistently high performance, high levels of flash endurance, and advanced functionality around thin provisioning, de-dupe and space-efficient snapshots“.
Furthermore, XtremeIO has developed a real-time inline deduplication engine that does not degrade performance. It does this by spreading the write I/Os over the entire array. There is little information about this deduplication engine, but I bet XtremIO has developed a real-time, inherent deduplication file system that spreads all the I/Os to balance the wear-leveling as well as having scaling performance. I bet XtremIO will dedupe everything that it stores, has a B+ tree, copy-on-write file system with a super-duper efficient hashing algorithm for address mapping (pointers) with this deduplication file system. Ok, ok, I am getting carried away here, because it is likely that I will be wrong, but I can imagine, can’t I? Continue reading →
It has not caused severe pain yet but it will. Storage is cheap but as capacity grows, it will eventually hit a limit that makes storage difficult to maintain from a cost perspective.
I wrote about the lack of attention of primary storage deduplication solutions in the local industry. Perhaps deduplication has matured to a point that it has become a no-brainer or perhaps customers are already getting sick and tired of the word “dedupe”. Either way, we should not be distracted from the fact that data footprint reduction (DFR) in a generic sense or storage efficiency as a fancy marketing term, must be applied somewhere to slow down the purchase of storage capacity.
Storage is getting fatter, and storage vendors’ revenue is getting fatter along with it. While this is good for the pockets of vendors, the customers have to face higher costs associated with
Power, Cooling and Floorspace
Administration and management
All these are not prudent storage management practices, because fat storage is bad, just like human beings getting fatter. Similarly, storage must go on a diet and deduplication is one of the few solutions out there. However, I have spoken out that deduplication is just shrinking the container that holds the bits of data, completely unaware of what the content is. Deduplication does not shrink the data itself, and if the occurrence of the data is high, deduplication does not help in reducing the storage capacity. There is no advantage unless the data footprint reduction (DFR) technology is content aware. (Note that I am using DFS as a generic term rather than data deduplication. The reason is obvious.)
That is why data deduplication technologies does not work well with seismic files or encoded video files, because the files are already highly optimized. But there is a technology that can look deeper into such unstructured files and produce storage capacity reduction with specific algorithms for specific type of files and file objects. That technology, I believe, is the truest form of data footprint reduction and it is called Native Format Optimization (NFO).
I want to relate an old story I had experienced when I brought an EMC BURA (Back Up Recovery & Archive – a precursor to its present BRS division) senior manager to see a highly respected technical manager in Schlumberger in Malaysia a few years back. Schlumberger is the world’s largest oilfield services company and provides seismic analysis and interpretation software and seismic files are highly encoded and compressed.
As usual, the senior manager being a typical sales guy started blabbering how great Data Domain (this was just after the EMC acquisition) was, and how it can dedupe any kind of files giving 20:1 (exaggerated to 500:1 to certain text files), even for seismic files. I was signalling to the EMC senior manager to stop his bullsh*t, but he went on and on. In the end, the Schlumberger technical manager politely told the EMC senior manager to shut up, because he has little understand of what seismic files are like.
Now, back to Native Format Optimization (NFO) technology. In a nutshell, NFO plays trick with our human visual system. The goal is to reduce the size of unstructured files without reducing the visual quality of the images (text, texture, colour, resolution, depth, hue, contrast, etc) of the files.
Have a look at these 2 files. One is optimized with NFO and one is un-optimized. Can you tell the difference?
The human visual system is known to be:
Less sensitive to high frequency of colour variation
More sensitive to brightness than colour variation
Less sensitive to background colour in lower resolution
More sensitive to a picture’s motion than picture’s texture
Because NFO is already in its native form, the files does not need to be rehydrated like deduped files.
The capacity reduction savings is tremendous and because NFO approach is content aware, the benefits translates to higher cost savings in
Reduction of power, cooling and floorspace
Reduction in data management and administration tasks, especially backup
Improved bandwidth and improved disaster recovery
Delayed storage capacity purchase
After Ocarina acquisition by Dell in 2010, a search on the web revealed that probably only one vendor in Europe has boldly continued to enhance NFO technology in their products. The company is balesio and you can read about their NFO technology here.
The next all-Flash product in my review list is SolidFire. Immediately, the niche that SolidFire is trying to carve out is obvious. It’s not for regular commercial customers. It is meant for Cloud Service Providers, because the features and the technology that they have innovated are quite cloud-intended.
Are they solid (pun intended)? Well, if they have managed to secure a Series B funding of USD$25 million (total of USD$37 million overall) from VCs such as NEA and Valhalla, and also angel investors such as Frank Slootman (ex-Data Domain CEO) and Greg Papadopoulus(ex-Sun Microsystems CTO), then obviously there is something more than meets the eye.
The one thing I got while looking up SolidFire is there is probably a lot of technology and innovation behind their Nodes and their Element OS. They hold their cards very, very close to their chest, and I couldn’t not get much good technology related information from their website or in Google. But here’s a look of how the SolidFire is like:
The SolidFire only has one product model, and that is the 1U SF3010. The SF3010 has 10 x 2.5″ 300GB SSDs giving it a raw total of 3TB per 1U. The minimum configuration is 3 nodes, and it scales to 100 nodes. The reason for starting with 3 nodes is of course, for redundancy. Each SF3010 node has 8GB NVRAM and 72GB RAM and sports 2 x 10GbE ports for iSCSI connectivity, especially when the core engineering talents were from LeftHand Networks. LeftHand Networks product is now HP P4000. There is no Fibre Channel or NAS front end to the applications.
Each node runs 2 x Intel Xeon 2.4GHz 6-core CPUs. The 1U height is important to the cloud provider, as the price of floor space is an important consideration.
Aside from the SF3010 storage nodes, the other important ingredient is their SolidFire Element OS.
Cloud storage needs to be available. The SolidFire Helix Self-Healing data protection is a feature that is capable of handling multiple concurrent failures across all levels of their storage. Data blocks are replicated randomly but intelligently across all storage nodes to ensure that the failure or disruption of access to a particular data block is circumvented with another copy of the data block somewhere else within the cluster. The idea is not new, but effective because solutions such as EMC Centera and IBM XIV employ this idea in their data availability. But still, the ability for self-healing ensures a very highly available storage where data is always available.
To address the efficiency of storage, having 3TB raw in the SF3010 is definitely not sufficient. Therefore, the Element OS always have thin provision, real-time compression and in-line deduplication turned on. These features cannot be turned off and operate at a fine-grained 4K blocks. Also important is the intelligence to reclaim of zeroed blocks, no-reservation, and no data movement in these innovations. This means that there will be no I/O impact, as claimed by SolidFire.
But the one feature that differentiates SolidFire when targeting storage for Cloud Service Providers is their guaranteed volume level Quality of Service (QOS). This is important and SolidFire has positioned their QOS settings into an advantage. As best practice, Cloud Service Providers should always leverage the QOS functionality to improve their storage utilization
The QOS has:
Minimum IOPS – Lower IOPS means lower performance priority (makes good sense)
Burst IOPS – for those performance spikes moments
Maximum and Burst MB/sec
The combination of QOS and storage capacity efficiency gives SolidFire the edge when cloud providers can scale both performance and capacity in a more balanced manner, something that is not so simple with traditional storage vendors that relies on lots of spindles to achieve IOPS performance sacrificing capacity in the process. But then again, with SSDs, the IOPS are plenty (for now). SolidFire does not boast performance numbers of millions of IOPS or having throughput into the tens of Gigabytes like Violin, Virident or Kaminario, but what they want to be recognized as the cloud storage as it should be in a cloud service provider environment.
SolidFire calls this Performance Virtualization. Just as we would get to carve our storage volumes from a capacity pool, SolidFire allows different performance profiles to be carved out from the performance pool. This gives SolidFire the ability to mix storage capacity and storage performance in a seemingly independent manner, customizing the type of storage bundling required of cloud storage.
In fact, SolidFire only claims 50,000 IOPS per storage node (including the IOPS means for replicating data blocks). Together with their native multi-tenancy capability, the 50,000 or so IOPS will align well with many virtualized applications, rather than focusing on a 10x performance improvement on a single applications. Their approach is more about a more balanced and spread-out I/O architecture for cloud service providers and the applications that they service.
Their management is also targeted to the cloud. It has a REST API that integrates easily into OpenStack, Citrix CloudStack and VMware vCloud Director. This seamless and easy integration, is more relevant because the CSPs already have their own management tools. That is why SolidFire API is a REST-ready, integration ready to do just that.
The power of the SolidFire API is probably overlooked by storage professionals trained in the traditional manner. But what SolidFire API has done is to provide the full (I mean FULL) capability of the management and provisioning of the SolidFire storage. Fronting the API with REST means that it is real easy to integrate with existing CSP management interface.
Together with the Storage Nodes and the Element OS, the whole package is aimed towards a more significant storage platform for Cloud Service Providers(CSPs). Storage has always been a tricky component in Cloud Computing (despite what all the storage vendors might claim), but SolidFire touts that their solution focuses on what matters most for CSPs.
CSPs would want to maximize their investment without losing their edge in the cloud offerings to their customers. SolidFire lists their benefits in these 3 areas:
The edge in cloud storage is definitely solid for SolidFire. Their ability to leverage on their position and steering away from other all-Flash vendors’ battlezone could all make sense, as they aim to gain market share in the Cloud Service Provider space. I only wish they can share more about their technology online.
Fortunately, I found a video by SolidFire’s CEO, Dave Wright which gives a great insight about SolidFire’s technology. Have a look (it’s almost 2 hour long):
[2 hours later]: Phew, I just finished the video above and the technology is solid. Just to summarize,
No RAID (which is a Godsend for service providers)
Aiming for USD5.00 or less per Gigabyte (a good number!)
General availability in Q1 2012
Lots of confidence about the superiority of their technology, as portrayed by their CEO, Dave Wright.
Nowadays, the capacity of the hard disk drives (HDDs) are really big. 3TB is out and 4TB is in the horizon. What’s next?
For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient and with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.
If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, it will also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID it with RAID-0 (not a good idea but to save costs, most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about lack of storage experience.
And RAID isn’t really keeping up with the tremendous growth of HDD’s capacity as well. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue provide the LUN or volume reliability and data availability because it just takes too damn long to rebuild the volume after the failure of a disk.
Back in the days where HDDs were less than 500GB, RAID-5 would still hold up but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.
Experts have been speaking about parity-declustering, but that’s something that a few vendors have right now. Panasas, founded by one of forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Cargenie-Mellon University’s Parallel Data Lab (PDL) presented a paper about parity-declustering more than 10 years ago.
Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.
Here are a few ways you can consider to put your storage on a diet.
Tapes and SSDs
To me, compression has not taken the storage world by storm. But then again, there aren’t many vendors that tout compression as a feature for storage optimization. Most of them rather prefer to push the darling of data reduction, data deduplication, as the main feature for save more space. Theoretically, data deduplication makes more sense when the data is inactive, and has high occurrence of duplicated data. That is why secondary storage such as backup deduplication targets like Data Domain, HP StoreOnce, Quantum DXi can publish 20:1 rates and over time, that rate can get even higher.
NetApp also has been pushing their A-SIS data deduplication on primary storage. Yes, it helps with the storage savings in primary but when the need for higher data transfer rates and time to access “manipulated” data (deduped or compressed), it is likely that compression is a better choice for primary, active data.
So who has compression? NetApp ONTAP 8.0.1 has compression now and IBM with its Storewize V7000 started as a compression device. Read about IBM Storewize in my blog here. Dell has Ocarina Networks, which was recently unleashed. I am a big fan of Ocarina Networks and I wrote about the technology in my previous blog. EMC, during the Celerra days of DART has compression but I don’t hear much about it in their VNX. Compression is there, believe me, embedded all the loads of EMC marketing.
Thin Provisioning is now a must-have and standard feature of all storage vendors. What is Thin Provisioning? The diagram below shows you:
In the past, storage systems aren’t so intelligent. You ask for 10TB, you are given 10TB and that 10TB is “deducted” from the storage capacity. That leads to wastage and storage inefficiencies. Today, Thin Provisioning will give you 10TB but storage capacity is consumed as it is being used. The capacity is not pre-allocated as in the past. Thin provisioning is a great diet pill for bloated storage projects.
Another up and coming feature is storage tiering. Storage tiering, when associated to storage optimization, should include hierarchical storage management (HSM) and tape-out as well. Storage optimization solutions should not offer only in the storage array itself. Storage tiering within the storage array is available with most vendors – IBM EasyTier, EMC FAST2, Dell Fluid Data Management and many others. But what about data being moved out of the storage array? What about reducing the capacity of the data online or near-line? Why not put them offline if there isn’t a need for it?
I term this as Active Archiving, something I learned while I was at EMC. Here’s a look at EMC’s style of Active Archiving:
Active Archiving promotes the concept of data archiving and is not unique only to EMC. Almost all storage vendors, either natively or with 3rd party vendors, can perform fairly efficient data archiving in one way or another. One of the software that I liked (and not unique!) is Quantum Stornext. Here’s a video of how Quantum Stornext helps reduce the fat of the storage.
With the single-copy sharing feature of Quantum Stornext to multiple disparate OSes, there are lesser duplicate files in storage as well.
Tapes have been getting a bad name in the past few years. It has been repositioned and repurposed as an archive medium rather than a backup medium. But tape is the greenest and most powerful storage diet pill around. And we should not be discount tapes because tapes are fighting back. Pretty soon you will be hearing about Linear Tape File System (LTFS). In a nutshell, Linear Tape File System (LTFS) allows you to use the tape almost as if it were a hard disk. You can drag and drop files from your server to the tape, see the list of saved files using a standard operating system directory (no backup software catalog needed), and use point and click to restore. How cool is that!
And Solid State Drives (SSDs) makes sense as well.
There are times that we need IOPS and using spinning drives, we have to set up many disk spindles to achieve the IOPS that we want. For example, using the diagram below from the godfather of storage, Greg Schulz,
The set of 16 spinning HDD drives on the left can only deliver 3,520 IOPS. The problem is, we have wasted a lot of disk space, as seen in the diagram below. This design, which most customer would be accustomed to, may look cheaper but in actual fact, is NOT.
If the price of a Fibre Channel HDD is RM2,000, the total of 16 would make up RM32,000.00. That is not inclusive of additional power and cooling and rack space and also the data management costs. Assuming the SSDs costs 5 times more than the Fibre Channel HDD. SSDs are capable of delivering very high IOPS. Here I am putting a modest 5,000 IOPS per SSDs. With just 2 SSDs (as the right design suggests), the total costs is only RM20,000. It has greater performance room to grow, and also savings in data management, power and cooling.
Folks, consider SSDs as part of your storage diet plan.
All these features are available, in whole or in part, and they are part of the storage technology offerings that is out there. With all these being said, are you doing something about it? Get off your lazy bum and start managing your storage and put your storage on a diet!!!
I promised last week I will look deeper into HP StoreOnce technology and I did. As I mentioned in my previous blog, HP StoreOnce technology now embedded in its D2D series of secondary, target backup devices that does the job with no fuss and no fancy bells and whistles.
Here’s the lineup of the present HP D2D solutions.
HP Malaysia has constantly reminded me that their D2D deduplication solution is much more price competitive than their competitors and this is something you, the readers, have to find out on your own. But I do believe that they are. Unfortunately they did not have the first mover’s advantage when Data Domain took the industry by storm in 2009, since HP StoreOnce was only launched with much fanfare last year in June 2010. Despite that, there still plenty of room in the IT market to grow, especially in HP’s huge set of customers.
Without the first movers advantage, HP StoreOnce has to differentiate itself from the existing competitors such as EMC Data Domain and Quantum. Labeling their deduplication technology as version 2.0 (whereas the competitors are still at “Version 1.0”?), HP StoreOnce banks on 3 key technologies. They are
Intelligent Block Size Management
Reduction in Disk Fragmentation
Out of these 3, sparse indexing is the most interesting but I will save the best from last. Let’s start with Intelligent Block Size Management.
HP StoreOnce uses a variable chunking method with a smaller granularity of 4K in size and this is managed intelligently, thus achieving a higher deduplication ratio compared to its competitors which either uses a fixed chunking method or with a variable chunking method of larger block sizes in the range of 8K to 32K. The HP Lab’s testing reveals that the space savings was significant when compared with others.
Below are a set of results for a PowerPoint presentation and you can see for yourself.
(NOTE: Please note that the savings/deduplication ratio can be very different and can range from good to bad for different types of data. Video and images files are highly encoded. Seismic and geo-mapping files are highly compressed. It is very likely that most deduplication solutions cannot achieve a high percentage with these types of files)
Point #2 talks about Reduction in Disk Fragmentation. The inherent benefits from Intelligent Block Size Management brings about the Reduction in Disk Fragmentation. The smaller chunks means lesser space wastage, especially when the block size is 4K or lower. HP StoreOnce also uses an intelligent algorithm to place the blocks that are perceived to be related close to one another. Hence this “locality” presence helps and the retrieval and restore process will be faster and more efficient.
Sparse Indexing is where HP StoreOnce touts to be a game changer. Today’s data is already as massive as a mountain, and it’s going to get bigger and growing faster. Using “Version 1.0” type of deduplication, the hashes created are stored in either memory or on disks. However, the massive data sets (especially unstructured data) are already producing massive amounts of hashes. Hashes are used to identify unique data blocks but the avalanche of unstructured data means that most deduplication solutions are generating more and more hashes, making most Version 1.0s hashes sluggish and difficult to retrieve.
Sparse Indexing addresses this hash problem (by the way, HP StoreOnce uses SHA-1 hash) by intelligently sampling a small chunks and creating a very fast index lookup mechanism that stays in the system’s memory all the time. As the engineers at HP Labs put it
Instead of holding every index item in RAM ready for comparison,
the HP team keeps just one in every hundred or so items in RAM
and puts the rest onto a hard drive. Duplicate data almost
always arrives in bursts. In other words, if one chunk of the
arriving stream is a duplicate, it is very likely that many
following chunks are duplicates. Sparse indexing takes advantage
of this phenomenon by storing the sequence of hashes of the
stored chunks next to each other on disk. As a result, a ‘hit’
in the sample RAM index can direct the system to an area of
the disk where many duplicates are likely to be found.
Sparse Indexing is not unique in the industry, but the engineers at HP Labs have put their thinking hats on and applied it to improve the search and looking up of the hashes in the StoreOnce deduplication technology.
Further savings are also achieved when the deduped data is compressed with the LZ (Lempel-Ziv) compression method before it is stored into the disks.
The HP StoreOnce technology is 100% fully concocted in the renown HP Labs and according to sources, this technology will indeed permeate across all HP StorageWorks (HP has since renamed it to HP Storage) line. With this strategy, HP hopes to address the “fragmented and complicated” (as quoted by HP) deduplication and data protection strategy across the enterprise. By “fragmented and complicated”, they mean that the deduplicated data constant has to be rehydrated and deduped again as the data moves across different IT devices and functions.
In a perfect world, HP wants their StoreOnce technology to be like the diagram below.
However, one very interesting fact that I found was HP does not believe that primary storage deduplication is a good idea. They claim that it complicates the whole thing. Whether HP likes it or not, NetApp has been dishing out primary storage deduplication for several years now and you don’t see their customers unhappy with NetApp about this feature.
I was like, “Whoa! What’s this?”. I felt bemused about what was mentioned in the whitepaper. After all the best claims of the HP StoreOnce technology, I can’t help but to think that this could be a banana skin on the pavement for HP.
One of the things that peeved at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.
I have always placed both deduplication and compression into a bucket I called “Data Reduction“. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both attempt and succeed to reduce the capacity required to store the amount of data and this translates into benefits in storage management and network. With a smaller data set, lesser processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, but you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently, compared to its original state and size, making Disaster Recovery more possible and so on.
I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of the file, deduplication attempts to differentiate similarities (duplicates), and store one copy of that object and have others referencing to the single object. The differentiation methods commonly used are hashing and delta differential. In hashing, MD-5 and SHA-1 are the popular hashing algorithms used, while in delta differentials, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.
There are many factors involved in deduplication. It could be the types of data, the processing power required to do the deduplication task, and throughput of processing and so on and resulting in the different deduplication ratio and time required to complete the process. I am not going to delve into that as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.
Meanwhile, compression (especially most commercial compression technology) are based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real life object.
The technique works like this.
When being encoded, a bit/byte or a set of bytes are compared to a “dictionary” which is a pool of “words” in a data structure maintained by the encoding technology
If a match is found, the bit/byte or set of bytes is substituted by an “word”, usually a much shorter (hence smaller size) representation form of the bytes being encoded.
As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. This is popularly known as the sliding window implementation.
The end result is the data is highly encoded (heavily replaced) by “dictionary words” and of a much smaller size.
One of the heavily implemented compression technique is based on the theory and methodology introduced by Lempel-Ziv and further enhanced by the Lempel-Ziv-Welch trio. A very good explanation of LZ method can be found here.
Both deduplication and compression have the same objective – that is to reduce the data size for more efficient storage. But both approach it from a different angle but they are by no means, exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.
Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup) because you can find plenty of duplicates while in primary data sets (the data in production), deduplication and compression are deployed, either in a singular fashion or one after another. Deduplication is usually run as Step 1 and then Compression is run in Step 2.
So far, the only one that has impressed me for the primary data reduction is Ocarina Networks, which uses a 3 step approach in dedupe, compress and using specialized compactors to reduce the data even more. I have seen the ability of Ocarina reducing Schlumberger Geoframe and Petrel seismic data to more than 50%. That was impressive!
Having my bothered state satisfied, I guess having the say of “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology” but I am not complaining as much I did before.