Can snapshots replace traditional backups?

Backup is a necessary evil. In IT, every operator, administrator, engineer, manager, and C-level executive knows that you have got to have backup. When it comes to the protection of data and information in a business, backup is the only way.

Backup has also become the bane of IT operations. Every product out there in the market is trying to cram as much production data into the backup as possible, just to fit into the backup window. We only have 24 hours in a day, so there is no way to stretch the backup window. The only way to cope is one of the following:

  • You reduce the size of the primary data to be backed up – think compression, deduplication, archiving
  • You replicate the primary data to a secondary device and back up the secondary device – which is ironic because when you replicate, you are creating a copy of the primary data, which technically is a backup. So you are technically backing up a backup
  • You speed up the transfer of primary data to the backup device
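To put rough numbers on the backup window, here is a back-of-the-envelope sketch. The function name, the 10TB data set and the 100MB/s throughput are all made-up figures for illustration, and the model ignores real-world overheads such as small files and tape repositioning. It only covers the first and third options above (reducing the data, speeding up the transfer).

```python
def backup_hours(data_tb, throughput_mb_s, reduction_ratio=1.0):
    """Hours needed to move data_tb terabytes at throughput_mb_s MB/s.

    reduction_ratio models compression/dedupe/archiving: 4.0 means only
    a quarter of the original data actually has to be transferred.
    """
    megabytes = data_tb * 1_000_000 / reduction_ratio
    return megabytes / throughput_mb_s / 3600


# Hypothetical figures, purely for illustration
baseline = backup_hours(10, 100)        # no reduction, 100 MB/s
reduced = backup_hours(10, 100, 4.0)    # 4:1 data reduction
faster = backup_hours(10, 400)          # 4x faster transfer

for label, hours in [("baseline", baseline), ("reduced", reduced), ("faster", faster)]:
    print(f"{label}: {hours:.1f} hours")
```

With these assumed numbers the baseline job takes longer than a full day, while either a 4:1 reduction or a 4x faster pipe brings it down to roughly 7 hours.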

Either way, IT operations are trying to overcome the challenges of the backup window. And the whole purpose of backup is to be cocksure that data can be restored when it comes to recovery. It’s like insurance. You pay the premium so that you are able to use the insurance facility to recover in times of need. We have heard that analogy many times before.

On the flip side of the coin, a snapshot is also a backup. Snapshots are point-in-time copies of the primary data and, many a time, snapshots are taken and then used as the source of a “true” backup to a secondary device, be it disk-based or tape-based. However, snapshots have suffered from the perception that they are a pseudo-backup, at least until the last couple of years.

Here is some food for thought …

WHAT IF we eliminate backing up data to a secondary device?

WHAT IF IT operations are ready to embrace snapshots as the true backup?

WHAT IF we rely on snapshots for backup and replicated snapshots for disaster recovery?

First of all, it will solve the perennial issues of backup to a “secondary device”. The operative word here is the “secondary device”, because that secondary device is usually external to the primary storage.

Tape subsystems and tape are constantly being blamed as the culprits behind missed backup windows. Duplicate after duplicate of the same set of files in every backup set triggered the adoption of deduplication solutions from Data Domain, Avamar, PureDisk, ExaGrid, Quantum and so on. Networks are also blamed because network backup runs through the LAN. LAN-free backup uses another conduit, usually Fibre Channel, to transport data to the secondary device.

If we eliminate the “secondary device” and perform backup in the primary storage itself, then networks are no longer part of the backup. There is no need for deduplication because the data could already have been deduplicated and compressed in the primary storage.

Note that what I have suggested is to back up, compress and dedupe, AND also restore from the primary storage. There is no secondary storage device for backup, compression, dedupe and restore.

Wouldn’t that paint a better way of doing backup?

Snapshots will be the only mechanism for backup. Snapshots are quick, usually taking minutes and some only seconds. Most snapshot implementations today are space efficient, consuming storage only for delta changes. The primary device will compress and dedupe, depending on the data’s characteristics.
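To show why a snapshot only consumes storage for delta changes, here is a toy copy-on-write sketch. The Volume class and its method names are hypothetical and do not represent any vendor's implementation; real arrays track this in block maps and metadata on disk.

```python
class Volume:
    """Toy copy-on-write volume: a snapshot stores only overwritten blocks."""

    def __init__(self):
        self.blocks = {}      # live data: block number -> bytes
        self.snapshots = []   # per snapshot: block number -> preserved old data

    def snapshot(self):
        # Taking a snapshot copies no data; it just opens an empty delta.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        # Before the first overwrite after a snapshot, preserve the old block.
        if self.snapshots:
            latest = self.snapshots[-1]
            if block_no not in latest:
                latest[block_no] = self.blocks.get(block_no)
        self.blocks[block_no] = data

    def read_snapshot(self, snap_id, block_no):
        # The oldest preserved copy at or after snap_id is the snapshot's view;
        # if the block was never overwritten, the live block is still valid.
        for delta in self.snapshots[snap_id:]:
            if block_no in delta:
                return delta[block_no]
        return self.blocks.get(block_no)


vol = Volume()
for n in range(100):
    vol.write(n, b"original")

snap = vol.snapshot()
vol.write(5, b"changed")

print(len(vol.snapshots[snap]))      # blocks held by the snapshot
print(vol.read_snapshot(snap, 5))    # point-in-time view of block 5
print(vol.blocks[5])                 # live view of block 5
```

In this sketch, 100 blocks are live but the snapshot holds just the single block that changed after it was taken.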

For DR, snapshots are shipped to remote storage of equal prowess at the DR site, where the snapshot can be rebuilt and kept in a ready mode to become primary data when required. NetApp SnapVault is one example. ZFS snapshot replication is another.

And when it comes to recovery, quick restores of primary data will be from snapshots. If the primary storage goes down, clients and host initiators can be rerouted quickly to the DR device for services to resume.

I believe that with the convergence of multi-core processing power, 10GbE networks, SSDs and very large capacity drives, we could be seeing a shift in the backup design model and possibly the entire IT landscape. Snapshots could very likely replace traditional backup in the near future, and the secondary device may be a thing of the past.

Solid State Drives … are they reliable?

There have been a lot of questions about Solid State Drives (SSDs), also called Enterprise Flash Drives (EFDs) by some vendors. Are they less reliable than our 10K or 15K RPM hard disk drives (HDDs)? I was asked this question on stage while I was presenting on the topic of Green Storage 3 weeks ago.

Well, the usual answer from the typical techie is … “It depends”.

We all fear the unknown and, given the limited knowledge we have about SSDs (they are fairly new in the enterprise storage market), we tend to be drawn more to the negatives than the positives of what SSDs are and what they can be. I, for one, believe that SSDs have more positives and over time, we will grow to accept that this is all part of the IT evolution. IT has always evolved into something better, stronger, faster, more reliable and so on. As famously quoted by Jeff Goldblum’s character Dr. Ian Malcolm, in the movie Jurassic Park I, “Life finds a way …”, IT will always find a way to be just that.

SSDs are typically categorized into MLC (multi-level cell) and SLC (single-level cell) types. They have a fairly predictable life expectancy, ranging from tens of thousands of writes to more than a million writes per drive. This, by no means, is a measure of the reliability of SSDs versus HDDs. However, SSD controllers and drives employ various techniques to enhance the durability of the drives. A common method is to spread the I/O across the blocks of the drive, adapting to the I/O usage pattern. This prolongs the lifespan of the blocks (and consequently of the drive itself) and also ensures that performance does not lag, since the I/O is more “spread out” across the drive. This is known as a “wear-leveling” algorithm.
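The idea behind wear-leveling can be sketched in a few lines. This is a toy model, not how any particular controller works: the function name and the block and write counts are hypothetical, and real firmware also has to deal with logical-to-physical mapping, garbage collection and rarely-changed data. The sketch simply sends each write to the least-worn block and compares that against hammering a single block.

```python
import heapq


def wear_after_writes(num_blocks, num_writes, leveling):
    """Return per-block erase counts after num_writes rewrites of one hot page."""
    erase_counts = [0] * num_blocks
    if not leveling:
        # Naive case: the hot page always lands on the same physical block.
        erase_counts[0] = num_writes
        return erase_counts

    # Wear-leveled case: always pick the block with the fewest erases so far.
    heap = [(0, block) for block in range(num_blocks)]
    heapq.heapify(heap)
    for _ in range(num_writes):
        count, block = heapq.heappop(heap)
        erase_counts[block] = count + 1
        heapq.heappush(heap, (count + 1, block))
    return erase_counts


naive = wear_after_writes(8, 8000, leveling=False)
leveled = wear_after_writes(8, 8000, leveling=True)
print("worst block without leveling:", max(naive))
print("worst block with leveling:   ", max(leveled))
```

The same number of writes is absorbed in both cases, but with leveling the most-worn block sees only a fraction of the erase cycles, which is what stretches the life of the drive.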

Most SSDs proposed by enterprise storage vendors are MLC-based, to meet the market’s price demands per IOPS and per GB, because SLC drives are definitely more expensive for their higher durability. MLCs also have a higher BER (bit error rate), and it is known that MLCs have 1 bit error per 10,000 writes while SLCs have 1 bit error per 100,000 writes.

But the advantages of SSDs clearly outweigh those of HDDs. Fast access (much lower latency) is one of the main advantages. Higher IOPS is another one. SSDs can provide from several thousand IOPS to more than 1 million IOPS, far beyond enterprise HDDs. A typical 7,200 RPM SATA drive delivers fewer than 120 IOPS while a 15,000 RPM Fibre Channel or SAS drive ranges from 130 to 200 IOPS. That IOPS advantage is definitely a vast differentiator when comparing SSDs and HDDs.
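Why are HDDs stuck in the low hundreds of IOPS? A common rule of thumb estimates random IOPS from average seek time plus average rotational latency (half a revolution). The sketch below uses that rule of thumb; the seek times are assumed, typical-looking values and not figures from any datasheet.

```python
def hdd_iops(rpm, avg_seek_ms):
    """Rule-of-thumb random IOPS for a spinning disk."""
    # On average the platter must turn half a revolution to reach the data.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + rotational_latency_ms)


# Assumed average seek times, for illustration only
sata_7200 = hdd_iops(7_200, avg_seek_ms=8.5)
sas_15k = hdd_iops(15_000, avg_seek_ms=3.5)

print(f"7,200 RPM drive:  about {sata_7200:.0f} IOPS")
print(f"15,000 RPM drive: about {sas_15k:.0f} IOPS")
```

Under these assumptions the estimates fall in line with the ranges quoted above. An SSD has no seek and no rotation, which is where the orders-of-magnitude difference comes from.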

We are also seeing both drive-format and card-format SSDs in the market. The drive-format types typically come in 2.5″ and 3.5″ profiles and they tend to fit into enterprise storage systems as “disk drives”. They are known to provide capacity. On the other hand, there are also card-format SSDs that come as PCIe cards inserted into host systems. These tend to address the performance requirements of systems and applications. The well-known PCIe vendors are Fusion-io, which is in the high-end performance market, and NetApp, which peddles the PAM (Performance Acceleration Module) card in its filers. The PAM card has been renamed FlashCache. Rumour has it that EMC will be coming out with a similar solution soon.

Another point to note is that SSDs can be read-biased or write-biased. Most SSDs in the market tend to be more read-biased, published with high read IOPS, not write IOPS. Therefore, we have to be prudent and know what is out there. This means that some solutions, such as the NetApp FlashCache, are more suitable for read-heavy I/O than for write I/O. The FlashCache addresses a large segment of the enterprise market because most applications are heavier on reads than writes.

SSDs have been positioned as the Tier 0 layer in the Automated Storage Tiering segment of Enterprise Storage. Vendors such as Dell Compellent, HP 3PAR and also EMC with FAST2 position themselves with enhanced tiering techniques to automate LUN and sub-LUN tiering, and customers have been lapping up this feature like little puppies.

However, an up-and-coming segment for SSD usage is positioning the SSDs as extended read or write cache for the existing memory of the systems. NetApp’s FlashCache is a PCIe solution that is basically an extended read cache. An interesting feature of Oracle Solaris ZFS called Hybrid Storage Pool allows the creation of read and write caches using SSDs. The Sun fellas even came up with cool names – ReadZilla and LogZilla – for these Hybrid Storage Pool features.

Basically, I have poured out what I know about SSDs (so far) and I intend to learn more about them. SNIA (Storage Networking Industry Association) has a Technical Working Group for Solid State Storage. I advise readers to check it out.

Silent Data Corruption (SDC) … it’s more prevalent than you think

Have you heard about Silent Data Corruption (SDC)? It’s everywhere and yet in the storage networking world, you can hardly find a storage vendor talking about it.

I did a paper for MNCC (Malaysian National Computer Confederation) a few years ago and one of the examples I used was what they found at CERN. CERN, the European Organization for Nuclear Research, published a paper in 2007 describing the issue of SDC. Later, in 2008, they found that approximately 38,000 files were corrupted in the 15,000TB of data they had generated. SDC is therefore very real, and yet among the people in the storage networking industry, where data matters the most, it is one of the least talked about issues.

What is Silent Data Corruption? Every computer component that we use is NOT perfect. It could be the memory; it could be the network interface cards (NICs); it could be the hard disk; it could also be the bus, the file system, the data block structure. Any computer component, whether hardware or software, that deals with bits of data is subject to SDC.

Data corruption happens all the time. It is when a bit or a set of bits is changed unintentionally due to various reasons. Some of the reasons are listed below:

  • Hardware errors
  • Data transfer noise
  • Electromagnetic Interference (EMI)
  • Firmware bugs
  • Software bugs
  • Poor electrical current distribution
  • Many more …

And that is why there are published statistics for some hardware components such as memory, NICs and hard disks, and even for protocols such as Fibre Channel. These published statistics talk about BER, or bit error rate, which is the occurrence of one erroneous bit in every billion or trillion bits transferred or processed.
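A quick bit of arithmetic shows what such a BER means at today's data volumes. The two rates below simply mirror the “one in a billion” and “one in a trillion” phrasing above; they are not quoted from any specific datasheet, and the function name is made up for this sketch.

```python
def expected_bit_errors(bytes_transferred, ber):
    """Expected number of flipped bits for a given bit error rate."""
    return bytes_transferred * 8 * ber


ONE_TB = 10**12  # bytes

per_billion = expected_bit_errors(ONE_TB, 1e-9)    # 1 error per billion bits
per_trillion = expected_bit_errors(ONE_TB, 1e-12)  # 1 error per trillion bits

print(f"BER 1e-9:  about {per_billion:.0f} bit errors per TB")
print(f"BER 1e-12: about {per_trillion:.0f} bit errors per TB")
```

Even at one error per trillion bits, every terabyte moved can be expected to pick up a handful of bad bits somewhere along the way, which is why the detection mechanisms matter.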

And it is also why there are inherent mechanisms within these channels to detect data corruption. We see them all the time in things such as checksums (CRC32, SHA1, MD5 …), parity and ECC (error-correcting code). Because these mechanisms can detect corruption, we see errors and warnings about it.
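Here is a minimal sketch of checksum-based detection using the CRC32 function in Python's standard zlib module. The helper names and the sample data are made up for illustration. A checksum is recorded when the data is written, and a later mismatch exposes the flipped bit.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Recompute the CRC32 and compare it with the stored value."""
    return zlib.crc32(data) == expected_crc


original = b"storage networking" * 100
stored_crc = zlib.crc32(original)       # recorded at write time

corrupted = flip_bit(original, 1234)    # simulate one flipped bit

print(is_intact(original, stored_crc))   # clean data passes the check
print(is_intact(corrupted, stored_crc))  # the bit flip is caught
```

The catch is that this only helps if something actually recomputes and compares the checksum when the data is read back. Where nothing checks, the corruption goes unnoticed.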

However, SILENT data corruption does not appear as errors and warnings, and it does OCCUR! And this problem is getting more and more prevalent in modern day disk drives, especially solid state drives (SSDs). As the disk manufacturers come out with more compact, higher capacity and higher performance drives, the cell geometry of SSDs is becoming smaller and smaller. This means each cell has a smaller area to contain the electrical charge and maintain the bit value, either a 0 or a 1. At the same time, the smaller cell is more sensitive and susceptible to noise, electrical charge leakage and interference from nearby cells, as some SSDs have different power modes to address green requirements.

When such things happen, a 0 can look like a 1 or vice versa and if the error is undetected, this becomes silent data corruption.

Most common storage networking technologies, such as RAID or file systems, were introduced during the 80’s or 90’s when disks were 9GB, 18GB and so on, and FastEthernet was the standard for networking. Things have changed at a very fast pace, and data growth has been phenomenal. We need to look at storage vendors’ technology more objectively now and go more in-depth into issues such as SDC.

SDC is very real, and until we learn and equip ourselves with the knowledge, we should not simply take what vendors say at face value. Find out … and be in control of what you are putting into your IT environment.

Virtualization and cloud aren’t what they are without storage

I was chatting with a friend yesterday and we were discussing virtualization and cloud, the biggest things that are happening in the IT industry right now. We were talking about the arrival of VMware vSphere 5 and the cool stuff VMware is bringing into the game, pushing the technology juggernaut farther and farther ahead of its rivals Hyper-V, Xen and VirtualBox.

And in the technology section of the newspaper yesterday, I saw news of the Jaring OneCloud offering, and that one of the local IT players had just brought in Joyent. Fantastic stuff! But for us in IT, we have been inundated with cloud, cloud and more cloud. The hype, the buzz and the reality. It’s all there, but back to our conversation. We realized that virtualization and cloud aren’t much without storage, the cornerstone of virtualization and cloud. And in the storage networking layer, there is the data management piece, the information infrastructure piece and so on, and yet … why are there so few storage networking professionals out there in our IT scene?

I have been lamenting this for a long time because we have been facing this problem for a long time. We are facing a shortage of qualified and well experienced storage networking professionals. There are plenty of jobs out there but not enough resources to meet the demand. As SNIA Malaysia Chairman, it is my duty to work with my committee members from HP, IBM, EMC, NetApp, Symantec and Cisco to create awareness and, more importantly, the passion to bring the voice of local storage networking professionals together. It has been challenging but my advice to all those people out there is “Why be ordinary when you can become extraordinary?”

We have to make others realize that storage networking is what makes virtualization and cloud happen. Join us at SNIA Malaysia and be part of something extraordinary. Storage networking IS the foundation of virtualization and cloud. You can’t exclude it.

10Gigabit Ethernet will rule

As for what the next generation of storage networks will look like, 10Gigabit Ethernet (10GbE) is definitely the strongest candidate for the storage network. This is made possible by key enhancements to Ethernet that give it greater reliability and performance. These enhancements go by several names, such as Data Center Ethernet (a term coined by Cisco) and Converged Enhanced Ethernet (CEE). But probably the more widely used term is DCB, or Data Center Bridging.

Ethernet, so far, has never failed to deliver and as far as I am concerned, Ethernet will rule for the next 10 years or more. Ethernet has evolved through several generations, from Ethernet running at 10Mbits/sec to FastEthernet, then Gigabit Ethernet and now 10Gigabit Ethernet. Pretty soon, it will be looking at 40Gbits/sec and 100Gbits/sec. It is a tremendous protocol, able to evolve and adapt to modern data networks.

Before 10GbE, the delivery of packets was on a best-effort basis. But today’s networks demand scalability, security, performance and, most of all, reliability. Since the advent of DCB, 10GbE is fortified with these key technologies:

  • 10GBASE-T – using Cat 6/6A cabling standards, 10GBASE-T delivers low cost, simple UTP (unshielded twisted pair) networking to the masses
  • iWARP – Support for iWARP is crucial for RDMA (Remote Direct Memory Access). RDMA, in a nutshell, reduces the overhead of the typical networking buffer-to-buffer copy by bypassing these bottlenecks and placing the data directly into the memory of the requesting node.
  • Low latency cut-through switching at Layer 2, by reading just the header of the packet instead of the entire length of the packet. The information contained in the header of the packet is sufficient for the switch to make a switching/forwarding decision
  • Energy efficiency, by introducing a low power idle state and other implementations which make power consumption more proportional to the network utilization rate
  • Congestion notification and priority-based pause frames, which handle 8 different classes of traffic to ensure lossless network delivery
  • Shortest path adaptive routing protocol for Ethernet forwarding. TRILL (Transparent Interconnection of Lots of Links) is one of the implementations. Lately OpenFlow has been jumping onto the bandwagon as a viable option but I need to check out OpenFlow support with 10GbE and DCB.
  • FCoE (Fibre Channel over Ethernet) is all the rage these days and 10GbE has the ability to carry Fibre Channel traffic. This has sparked an initial frenzy among storage vendors.
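To get a feel for the cut-through switching item in the list above, here is a rough latency comparison. It is only a sketch: it counts serialization delay alone (the time to clock bits in at line rate), ignores lookup time, queuing and the preamble, and assumes a 1,500-byte frame and a 14-byte Ethernet header.

```python
LINK_BPS = 10e9  # 10GbE line rate in bits per second


def serialization_delay_us(num_bytes, link_bps=LINK_BPS):
    """Microseconds needed to receive num_bytes at the given line rate."""
    return num_bytes * 8 / link_bps * 1e6


FRAME_BYTES = 1500   # assumed frame size
HEADER_BYTES = 14    # destination MAC + source MAC + EtherType

# Store-and-forward waits for the whole frame before forwarding it.
store_and_forward_us = serialization_delay_us(FRAME_BYTES)
# Cut-through starts forwarding once the header has arrived.
cut_through_us = serialization_delay_us(HEADER_BYTES)

print(f"store-and-forward: {store_and_forward_us:.4f} us per hop")
print(f"cut-through:       {cut_through_us:.4f} us per hop")
```

Under these assumptions the per-hop wait drops from about a microsecond to about a hundredth of one, and the saving repeats at every switch in the path.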

Of course, last but not least, we are already seeing the sunset of Fibre Channel. While 8Gbps FC has been out for a while, its adoption rate seems to have stalled. Many vendors and customers are at the 4Gbps range, playing a wait-and-see game. 16Gbps FC has been talked about but it seems that all the fireworks are with 10Gigabit Ethernet right now. It will rule …

Dell acquiring Force10

What do you think of Dell acquiring Force10? My first reaction was surprise. I was very surprised.

I was in the middle of a conversation with a friend when the RSS feed popped up in front of me – “Dell acquiring Force10”! I cut that conversation short to read the rest of the details … wow, that’s a good buy!

With all the rumors flying around that Brocade was the most obvious choice, Force10 came out of the blue for me. As the euphoria settled down, I thought Dell had made a very smart move. Brocade, unfortunately, is still pretty much a Fibre Channel company, with 75% of its business relying heavily on Fibre Channel and FCoE. Even though Brocade has Foundry now, Brocade has not strongly asserted itself as a front runner and innovator in 10Gigabit Ethernet.

Meanwhile, Force10 has been an up-and-coming force (pun intended) to be reckoned with, strengthening its position as a 10GbE player in the market. And with 10GbE now, and 40GbE or 100GbE coming in the next 2-3 years, Force10 will be riding the wave of the future. Dell can only benefit from that momentum.

Dell has been very, very aggressive in pushing itself into the enterprise storage space. From its acquisition of EqualLogic in 2007, to Exanet, Ocarina and Compellent last year, there is no doubt that Dell wants this space badly.

The first challenge for Dell is to put its story together and convince customers that it is no longer Dell, the PC/laptop direct seller, but a formidable company capable of providing enterprise solutions, services and support.

The second challenge, and an even bigger one, is Dell itself: its culture and the need to change its mindset. The game has changed; the rules have changed. The enterprise is a totally different ballgame. Is Dell ready? Is Dell ready to change itself?

May the Force(10) be with Dell!

Going Ga Ga over Storage Networking

Before you start thinking that I am ripping off Lady Gaga, this blog’s name of “Storage Gaga” is NOT from Lady Gaga. It’s from Queen’s song “Radio Ga Ga”, which I happened to be listening to in my car.

Why Ga Ga? Ga Ga in the Free Dictionary (link: http://www.thefreedictionary.com/gaga) means crazy over something (at least one of the meanings anyway). That’s what I am. Since leaving my last job – which was on Tuesday (July 19th 2011) this week – I want to do more for storage networking and data management. I want to share things I find out, information that I have learned and so on.

So watch this space for more info … more on the way.

p/s. This rainy morning, I am going to arrange and organize all my computer books. It’s going to be fun!