I find the terminology of WORM (Write Once Read Many) coming back into the IT speak in recent years. In the era of rip and burn, WORM was a natural thing where many of us “youngsters” used to copy files to a blank CD or DVD. I got know about how WORM worked when I learned that the laser in the CD burning process alters the chemical compound in a segment on the plastic disc of the CD, rendering the “burned” segment unwritable once it was written but it could be read many times.
At the enterprise level, I got to know about WORM while working with tape drives and tape libraries in the mid-90s. The objective of WORM is to save and archive the data and files in a non-rewritable format for compliance reasons. And it was the data compliance and data protection parts that got me interested into data management. WORM is a big deal in many heavily regulated industries such as finance and banking, insurance, oil and gas, transportation and more.
Obviously things have changed. WORM, while very much alive in the ageless tape industry, has another up-and-coming medium in Object Storage. The new generation of data infrastructure and data management specialists are starting to take notice.
Worm Storage – Image from Hubstor (https://www.hubstor.net/blog/write-read-many-worm-compliant-storage/)
I take this opportunity to take MinIOobject storage for a spin in creating WORM buckets which can be easily architected as data compliance repositories with many applications across regulated industries. Here are some relevant steps.
Last week was World Backup Day. It is on March 31st every year so that you don’t lose your data and become an April’s Fool the next day.
Amidst the growing awareness of the importance of backup, no thanks to the ever growing destructive nature of ransomware, it is important to look into other aspects of data protection – both a data backup/recovery and a data security – point of view as well.
The A-B-C is to look at the production dataset and decide if the data should be stored in the Tier 1 storage. In most cases, the data becomes less active and these datasets may be good candidates to be archived. Once archived, the production dataset is smaller and data backup operations become lighter, faster and have positive causation as well.
Air gaps have returned to prominence since the heightened threats on data in recent years. The threats have pushed organizations to consider doing data offsite and offline with air gaps. Cost considerations and speed of recovery can be of concerns, and logical air gaps are also gaining style as an acceptable extra layer of data. protection.
Backup is not total Data Protection cyberdefence
If we view data protection more holistically and comprehensively, backup (and recovery) is not the total data protection solution. We must ignore the fancy rhetorics of the technology marketers that backup is the solution to ensure data protection because there is much more than that.
My Sunday morning was muddled 2 weeks ago. There was a frenetic call from someone whom I knew a while back and he needed some advice. Turned out that his company’s files were encrypted and the “backups” (more on this later) were gone. With some detective work, I found that their files were stored in a Synology® NAS, often accessed via QuickConnect remotely, and “backed up” to Microsoft® Azure. I put “Backup” in inverted commas because their definition of “backup” was using Synology®’s Cloud Sync to Azure. It is not a true backup but a file synchronization service that often mislabeled as a data protection backup service.
All of his company’s projects files were encrypted and there were no backups to recover from. It was a typical ransomware cluster F crime scene.
I would have gloated because many of small medium businesses like his take a very poor and lackadaisical attitude towards good data management practices. No use crying over spilled milk when prevention is better than cure. But instead of investing early in the prevention, the cure would likely be 3x more expensive. And in this case, he wanted to use Deloitte® recovery services, which I did not know existed. Good luck with the recovery was all I said to him after my Sunday morning was made topsy turvy of sorts.
NAS is the ransomware goldmine
I have said it before and I am saying it again. NAS devices, especially the consumer and prosumer brands, are easy pickings because there was little attention paid to implement a good data management practice either by the respective vendor or the end users themselves. 2 years ago I was already seeing a consistent pattern of the heightened ransomware attacks on NAS devices, especially the NAS devices that proliferated the small medium businesses market segment.
The WFH (work from home) practice trigged by the Covid-19 pandemic has made NAS devices essential for businesses. NAS are the workhorses of many businesses after all. The ease of connecting from anywhere with features similar to the Synology® QuickConnect I mentioned earlier, or through VPNs (virtual private networks), or a self created port forwarding (for those who wants to save a quick buck [ sarcasm ]), opened the doors to bad actors and easy ransomware incursions. Good data management practices are often sidestepped or ignored in exchange for simplicity, convenience, and trying to save foolish dollars.Until ….
I mentioned that long term digital data preservation is a segment within the data lifecycle which has merits and prominence. SNIA® has proved that this is a strong growing market segment through its 2007 and 2017 “100 Year Archive” surveys, respectively. 3 critical challenges of this long, long-term digital data preservation is to keep the archives
For the longest time, tape technology has been the king of the hill for digital data preservation. The technology is cheap, mature, and many enterprises has built their long term strategy around it. And the pulse in the tape technology market is still very healthy.
The challenges of tape remain. Every 5 years or so, companies have to consider moving the data on the existing tape technology to the next generation. It is widely known that LTO can read tapes of the previous 2 generations, and write to it a generation before. The tape transcription process of migrating digital data for the sake of data preservation is bad because it affects the structural integrity and quality of the content of the data.
In my times covering the Oil & Gas subsurface data management, I have seen NOCs (national oil companies) with 500,000 tapes of all generations, from 1/2″ to DDS, DAT to SDLT, 3590 to LTO 1-7. And millions are spent to transcribe these tapes every few years and we have folks like Katalyst DM, Troika and more hovering this landscape for their fill.
One of the historical feats which had me mesmerized for a long time was the 14-year journey China’s imperial treasures took to escape the Japanese invasion in the early 1930s, sandwiched between rebellions and civil wars in China. More than 20,000 pieces of the imperial treasures took a perilous journey to the west and back again. Divided into 3 routes over a decade and four years, not a single piece of treasure was broken or lost. All in the name of preservation.
Digital data preservation is on another end of the data lifecycle spectrum. More often than not, it is not the part that many pay attention to. In the past 2 decades, digital data has grown so much that it is now paramount to keep the data forever. Mind you, this is not the data hoarding kind but to preserve the knowledge and wisdom which is in the digital content of the data.
SNIA (Storage Networking Industry Association) conducted 2 surveys – one in 2007 and another in 2017 – called the 100 Year Archive, and found that the requirement for preserving digital data has grown multiple folds over the 10 years. In the end, the final goal is to ensure that the perpetual digital contents are
All at an affordable cost. Therefore, SNIA has the vision that the digital content must transcend beyond the storage medium, the storage system and the technology that holds it.
We are all familiar with the concept of data archiving. Passive data gets archived from production storage and are migrated to a slower and often, cheaper storage medium such tapes or SATA disks. Hence the terms nearline and offline data are created. With that, IT constantly reminds users that the archived data is infrequently accessed, and therefore, they have to accept the slower access to passive, archived data.
The business conditions have certainly changed, because the need for data to be 100% online is becoming more relevant. The new competitive nature of businesses dictates that data must be at the fingertips, because speed and agility are the new competitive advantage. Often the total amount of data, production and archived data, is into hundred of TBs, even into PetaBytes!
The industries I am familiar with – Oil & Gas, and Media & Entertainment – are facing this situation. These industries have a deluge of files, and unstructured data in its archive, and much of it dormant, inactive and sitting on old tapes of a bygone era. Yet, these files and unstructured data have the most potential to be explored, mined and analyzed to realize its value to the organization. In short, the archived data and files must be democratized!
The flip side is, when the archived files and unstructured data are coupled with a slow access interface or unreliable storage infrastructure, the value of archived data is downgraded because of the aggravated interaction between access and applications and business requirements.How would organizations value archived data more if the access path to the archived data is so damn hard???!!!
An interesting solution fell upon my lap some months ago, and putting A and B together (A + B), I believe the access path to archived data can be unbelievably of high performance, simple, transparent and most importantly, remove the BLOODY PAIN of FILE AND DATA MIGRATION!For storage administrators and engineers familiar with data migration, especially if the size of the migration is into hundreds of TBs or even PBs, you know what I mean!
I have known this solution for some time now, because I have been avidly following its development after its founders left NetApp following their Spinnaker venture to start Avere Systems.
I was in Singapore last week attending the Cloud Infrastructure Services course.
In the class, one of the foundation components of Cloud Computing is of course, storage. As the students and the instructor talked about Storage, one very interesting argument surfaced. It revolved around the storage, if it was offered on the cloud. A lot of people assumed that Cloud Storage would be for their databases, and their virtual machines, which of course, is true when the communication between the applications, virtual machines and databases are in the local area network of the Cloud Service Provider (CSP).
However, if the storage is offered through the cloud to applications that are sitting on-premise in the customer’s server room, then we have to think twice of how we perceive Cloud Storage. In this aspect, the Cloud Storage offered by the CSP is a Infrastructure-as-a-Service (IaaS), where the key service is Storage. We have to differentiate that this Storage functions as a data container, and usually not for I/O performance reasons.
Though this concept probably will be easily understood by storage professionals like us, this can cause a bit confusion for someone new to the concept of Cloud Computing and Cloud Storage. This confusion, unfortunately, is caused by many of us who are vendors or solution providers, or even publications and magazines. We are responsible to disseminate correct information to customers, but due to our lack of knowledge and experience in this extremely new market of Cloud Storage, we have created the FUDs (Fear, Uncertainty and Doubt) and hype.
Therefore, it is the duty of this blogger to clear the vapourware, and hopefully pass on the right information to accelerate the adoption of Cloud Storage in the near future. At this moment, given the various factors such as network costs, high network latency and lack of key network technologies similar to LAN in Cloud Computing, Cloud Storage is, most of the time, for data storage containership and archiving only. And there are no IOPS or any performance related statistics related to Cloud Storage. If any engineer or vendor tells you that they have the fastest Cloud Storage in the industry, do me a favour. Give him/her a knock on the head for me!
Of course, as technologies evolve, this could change in the near future. For now, Cloud Storage is a container, NOT a high performance storage in the cloud. It is usually not meant for transactional data. There are many vendors in the Cloud Storage space from real CSPs to storage companies offering re-packaged storage boxes that are “cloud-ready”. A good example of a CSP offering Cloud Storage is Amazon S3 (Simple Storage Service). And storage vendors such as EMC and HDS are repackaging and rebranding their storage technologies as object storage, ready for the cloud. EMC Atmos is really a repackaged and rebranded Centera, with some slight modifications, while HDS , using their Archiving solution, has HCP (aka HCAP). There’s nothing wrong with what EMC and HDS have done, but before the overhyping of the world of Cloud Computing, these platforms were meant for immutable data archiving reasons. Just thought you should know.
One particular company that captured my imagination and addresses the storage performance portion is Nasuni. Of course, they are quite inventive with the Cloud Storage Gateway approach. Nasuni comes up with a Cloud Storage Gateway filer appliance, which can be either a physical 1U server or as a VMware or Hyper-V virtual appliance sitting on-premise at the customer’s site.
The key to this is “on-premise”, which allows access to data much faster because they are locally-cached in the Nasuni filer appliance itself. This Nasuni filer piece addresses the Cloud Storage “performance” piece but Nasuni do not claim any performance statistics with such implementation. The clever bit is that this addresses data or files that are transactional in nature, i.e. NFS or CIFS, to serve data or files “locally”. (I wonder if Nasuni filer has iSCSI as well. Hmmmm….)
In the Nasuni architecture, they “break up” their “Cloud Storage” into 2 pieces. Piece #1 sits on-premise, at the customer site, and acts as a bridge to the Piece #2, that is sitting in a Cloud Storage. From a simplified view, have a look at the diagram below:
Piece #1 is the component that handles some of the transactional traffic related to files. In a more technical diagram below, you can see that the Nasuni filer addresses the file sharing portion, using the local disks on the filer appliance as a local caching mechanism.
Furthermore, older file pieces are whiffed away to the any Cloud Storage using the Cloud Connector interface, hence giving the customer a sense that their storage capacity needs can be limitless if they want to (for a fee, of course). At the same time, the Nasuni filer support thin provisioning and snapshots. How cool is that!
The Cloud Storage piece (Piece #2) is used for the data container and archiving reasons. This component can be sitting and hosted at Amazon S3, Microsoft Azure, Rackspace Cloud Files, Nirvanix Storage Delivery Network and Iron Mountain Archive Services Platform.
The data communication and transfer between the Nasuni filer is secure, encrypted, deduplication and compressed, giving it the efficiency and security that most customers would be concerned about. The diagram below explains the dat communication and data transfer bit.
In this manner, the Nasuni filer can replace traditional NAS platforms and can potentially provide a much lower total cost of ownership (TCO) in the long run. Nasuni does not pretend to be a NAS replacement. To me, this concept is very inventive and could potentially change the way we perceive file sharing and file server, obscuring and blurring concept of NAS.
Again, I would like to reiterate that Nasuni does not attempt to say their solution is a NAS or a performance-based Cloud Storage but what they have cleverly packaged seems to be appealing to customers. Their customer base has grown 78% in Q2 of 2011. It’s just too bad they are not here in Malaysia or this part of the world (yet).
Nowadays, the capacity of the hard disk drives (HDDs) are really big. 3TB is out and 4TB is in the horizon. What’s next?
For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient and with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.
If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, it will also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID it with RAID-0 (not a good idea but to save costs, most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about lack of storage experience.
And RAID isn’t really keeping up with the tremendous growth of HDD’s capacity as well. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue provide the LUN or volume reliability and data availability because it just takes too damn long to rebuild the volume after the failure of a disk.
Back in the days where HDDs were less than 500GB, RAID-5 would still hold up but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.
Experts have been speaking about parity-declustering, but that’s something that a few vendors have right now. Panasas, founded by one of forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Cargenie-Mellon University’s Parallel Data Lab (PDL) presented a paper about parity-declustering more than 10 years ago.
Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.
Here are a few ways you can consider to put your storage on a diet.
Tapes and SSDs
To me, compression has not taken the storage world by storm. But then again, there aren’t many vendors that tout compression as a feature for storage optimization. Most of them rather prefer to push the darling of data reduction, data deduplication, as the main feature for save more space. Theoretically, data deduplication makes more sense when the data is inactive, and has high occurrence of duplicated data. That is why secondary storage such as backup deduplication targets like Data Domain, HP StoreOnce, Quantum DXi can publish 20:1 rates and over time, that rate can get even higher.
NetApp also has been pushing their A-SIS data deduplication on primary storage. Yes, it helps with the storage savings in primary but when the need for higher data transfer rates and time to access “manipulated” data (deduped or compressed), it is likely that compression is a better choice for primary, active data.
So who has compression? NetApp ONTAP 8.0.1 has compression now and IBM with its Storewize V7000 started as a compression device. Read about IBM Storewize in my blog here. Dell has Ocarina Networks, which was recently unleashed. I am a big fan of Ocarina Networks and I wrote about the technology in my previous blog. EMC, during the Celerra days of DART has compression but I don’t hear much about it in their VNX. Compression is there, believe me, embedded all the loads of EMC marketing.
Thin Provisioning is now a must-have and standard feature of all storage vendors. What is Thin Provisioning? The diagram below shows you:
In the past, storage systems aren’t so intelligent. You ask for 10TB, you are given 10TB and that 10TB is “deducted” from the storage capacity. That leads to wastage and storage inefficiencies. Today, Thin Provisioning will give you 10TB but storage capacity is consumed as it is being used. The capacity is not pre-allocated as in the past. Thin provisioning is a great diet pill for bloated storage projects.
Another up and coming feature is storage tiering. Storage tiering, when associated to storage optimization, should include hierarchical storage management (HSM) and tape-out as well. Storage optimization solutions should not offer only in the storage array itself. Storage tiering within the storage array is available with most vendors – IBM EasyTier, EMC FAST2, Dell Fluid Data Management and many others. But what about data being moved out of the storage array? What about reducing the capacity of the data online or near-line? Why not put them offline if there isn’t a need for it?
I term this as Active Archiving, something I learned while I was at EMC. Here’s a look at EMC’s style of Active Archiving:
Active Archiving promotes the concept of data archiving and is not unique only to EMC. Almost all storage vendors, either natively or with 3rd party vendors, can perform fairly efficient data archiving in one way or another. One of the software that I liked (and not unique!) is Quantum Stornext. Here’s a video of how Quantum Stornext helps reduce the fat of the storage.
With the single-copy sharing feature of Quantum Stornext to multiple disparate OSes, there are lesser duplicate files in storage as well.
Tapes have been getting a bad name in the past few years. It has been repositioned and repurposed as an archive medium rather than a backup medium. But tape is the greenest and most powerful storage diet pill around. And we should not be discount tapes because tapes are fighting back. Pretty soon you will be hearing about Linear Tape File System (LTFS). In a nutshell, Linear Tape File System (LTFS) allows you to use the tape almost as if it were a hard disk. You can drag and drop files from your server to the tape, see the list of saved files using a standard operating system directory (no backup software catalog needed), and use point and click to restore. How cool is that!
And Solid State Drives (SSDs) makes sense as well.
There are times that we need IOPS and using spinning drives, we have to set up many disk spindles to achieve the IOPS that we want. For example, using the diagram below from the godfather of storage, Greg Schulz,
The set of 16 spinning HDD drives on the left can only deliver 3,520 IOPS. The problem is, we have wasted a lot of disk space, as seen in the diagram below. This design, which most customer would be accustomed to, may look cheaper but in actual fact, is NOT.
If the price of a Fibre Channel HDD is RM2,000, the total of 16 would make up RM32,000.00. That is not inclusive of additional power and cooling and rack space and also the data management costs. Assuming the SSDs costs 5 times more than the Fibre Channel HDD. SSDs are capable of delivering very high IOPS. Here I am putting a modest 5,000 IOPS per SSDs. With just 2 SSDs (as the right design suggests), the total costs is only RM20,000. It has greater performance room to grow, and also savings in data management, power and cooling.
Folks, consider SSDs as part of your storage diet plan.
All these features are available, in whole or in part, and they are part of the storage technology offerings that is out there. With all these being said, are you doing something about it? Get off your lazy bum and start managing your storage and put your storage on a diet!!!