“I want to put in my own hard disk”

“I want to put in my own hard disk.”

If a customer ever utters that sentence, it will trigger a storage vendor meltdown. Panic buttons, alarm bells, and everything else that will lead a salesman to go berserk. That’s a big NO, NO!

For decades, storage vendors have relied on proprietary hardware to keep customers in line and to keep them signing hefty maintenance contracts until the next tech refresh. The maintenance contract, with support, software upgrades and hardware spares replacement, defines the storage networking industry that we are in. Even as some vendors have commoditized their hardware on x86 platforms and on standard enterprise hard disk drives (HDDs), NICs and HBAs, the openness and cost savings of that commodity hardware are usually not passed on to the customers.

It is easy to explain to customers that keeping their enterprise data on reliable, high-performance storage hardware with performance optimization and special firmware is paramount, and that any unwarranted and unvalidated hardware would put the customer’s data at high risk.

There is a choice now. The ripple of enterprise-grade, open storage kernel and file system has just started its first ring, and we hope that this small ripple will reverberate across the storage industry in the next few years.


ARC reactor also caches?

The fictional arc reactor in Iron Man’s suit was the epitome of coolness for us geeks. In the latest edition of Oracle Magazine, Iron Man is on the cover, as well as the other 5 Avengers in a limited edition series (see below).

Just about the same time, I was reading up on the ARC (Adaptive Replacement Cache) that is adopted in ZFS. I am learning in depth how ZFS caching works as opposed to the more popular LRU (Least Recently Used) caching algorithm that is used in most storage cache memory. Having said that, most storage vendors employ a modified LRU algorithm, with the intention of keeping the most recently accessed pages in memory for as long as possible. This is true in NetApp’s Data ONTAP (maybe not ONTAP GX, in which I have little experience) and EMC’s FLARE OE. ONTAP goes further by keeping the most frequently accessed pages permanently in memory. EMC folks would probably frame this in terms of locality of reference, although strictly speaking recency of access is temporal locality, while nearness of address is spatial locality.
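To make the comparison a little more concrete, here is a minimal LRU sketch (in TypeScript, purely my own illustration and not any vendor’s code) of the plain recency-only policy that ARC and the modified LRU variants try to improve upon:

```typescript
// Minimal LRU cache sketch. A Map in JavaScript/TypeScript preserves insertion
// order, so re-inserting a key moves it to the "most recently used" end.
class LRUCache<K, V> {
  private entries = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined; // cache miss
    this.entries.delete(key);                  // refresh recency
    this.entries.set(key, value);
    return value;
  }

  put(key: K, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (the first key in the Map).
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }
}

// One large sequential scan will flush a pure LRU cache like this one; ARC
// resists that by balancing a "recently used" list against a "frequently used"
// list and adapting the split between them as the workload changes.
const cache = new LRUCache<string, Buffer>(4096); // e.g. block address -> cached page
```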

Why is ZFS using ARC, and what is ARC?

Joy(ent) to the World

When someone as important and as prominent as Jason Hoffman reads and follows your blog, you tend to stand up and take notice. When I found out last week that Jason Hoffman, Founder and CTO of Joyent, was doing just that, I was deeply honoured and elated.

My Asian values started kicking in and I felt that I should reciprocate his gracious visits with a piece on Joyent. I have known about Joyent for a while, thanks to Bryan Cantrill, its VP of Engineering, because I am bloody impressed with his work on DTrace. And I have followed Joyent’s announcements every now and then, even recommending a job posted on Joyent’s website, for a Service Delivery Manager in Asia Pacific, to my buddy a couple of months ago. He’s one of the best Solaris engineers I have ever worked with, but the problem with techies is that they tend to wait for everything to fall into place before they do the next thing. Too methodical!

I took some time over the weekend to understand a bit more about Joyent and their solution offerings. They are doing some mighty cool stuff and if you are a Unix/Linux buff/bigot like me, you would be damn impressed. For those people who have experienced Unix, and especially Solaris, there is an unexplained element that describes the fire and the passion of such a techie. I was feeling all the good vibes all over again.

Unfortunately, Joyent is not well known in this part of the world, but I am well aware of their partnership with a local company called XyBase, announced in June last year. XyBase, through its vehicle called Anise Asia, entered into the partnership to resell Joyent’s SmartDataCenter solution. For those who have worked with XyBase in Malaysia, let’s not go there. 😉

Enough chitter-chatter! What’s Joyent about?

Well, for Malaysian IT followers, we are practically drowning in VMware. VMware does a seminar every 1.5 months or so, and they get invited to other vendors’ events ever so frequently as well. My buddy, Mr. Ong Kok Leong, who was an early employee of VMware Malaysia, has been elevated to superstardom, thanks to his presence in everything VMware. It’s a good thing, and kudos to VMware for taking advantage of their first-to-market, super gung-ho approach in the last 3 years or so. They have built a sizable lead in the local market, and competitors like Citrix Xen and Microsoft Hyper-V are being left in the dust. I believe only RedHat’s KVM is making a bit of a dent, but they are primarily confined to their own RedHat space. Furthermore, most of VMware’s competitors do not have a strong portfolio and a complete software stack to challenge VMware and what they have been churning out.

Here’s my take … consider Joyent, because I see Joyent having a very, very strong portfolio to give VMware a run for its money. Publicly listed VMware has deep pockets to continue their marketing blitz, and because of where they are right now, they have gotten very pricey and complicated. This blogger intends to level the playing field a bit by sharing more about Joyent and their solutions.

I see Joyent having 4 very strong technologies that differentiate them from the others. These technologies (in no particular order) are:

  • node.js
  • ZFS
  • DTrace
  • KVM

These technologies have been proven in the field because Joyent has been deploying them, stress testing them and improving on them in their own cloud offering, Joyent Cloud, for the last few years. This is true “eating your own dogfood” and putting your money where your mouth is. It is a very important consideration when building a Cloud Computing offering, especially in the public cloud space. You need something that is proven, and Joyent Cloud is a testament to Joyent’s technology.

So let’s start with a diagram of the Joyent Cloud Software Stack.

 

Key to the performance of Joyent Cloud is node.js.

node.js, as described on its website, is “a platform built on Chrome’s JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.” The key to this is being event-driven and asynchronous, and cloud solutions developed using node.js are able to go faster, scale bigger and respond better. The event-driven model follows a programming approach in which the flow of the program is determined by events as they occur.

A simple analogy is when you (in Malaysia) are at McDonald’s. In the past, the McDonald’s staff would take and fulfill your order before serving the next customer, and so on. That was the flow of the past. Some time last year, McDonald’s decided that their front staff would take your order, send you to a queue and then take the order of the next customer. The back-end support staff would then fulfill your order, putting that burger and drink on your tray. That is why they are able to serve (take your money) faster and get more things done. This is what I understand about event-driven, when it is applied in a programming context.
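In code, that event-driven flow looks something like this minimal sketch (TypeScript on node.js, purely my own illustration; menu.json is a made-up file name): take the order, register a callback, and move on to the next customer without waiting.

```typescript
// Minimal event-driven HTTP server sketch using node.js core modules.
import * as http from 'http';
import * as fs from 'fs';

const server = http.createServer((req, res) => {
  // Take the order: kick off a non-blocking read and register a callback.
  fs.readFile('./menu.json', (err, data) => {
    if (err) {
      res.writeHead(500);
      res.end('kitchen problem');
      return;
    }
    // The "back-end staff" (the I/O subsystem) has fulfilled the order.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(data);
  });
  // No waiting here: the event loop is already free to take the next order.
});

server.listen(8080, () => console.log('serving on port 8080'));
```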

node.js has been touted as the new “Ruby on Rails” and it is all about low latency and concurrency in applications, especially cloud applications. Here’s a video introducing node.js by Joyent’s very own Ryan Dahl, the creator of node.js.

Besides performance, you would also need a strong and robust file system to ensure security, data integrity and protection of data as it scales. ZFS is a 128-bit enterprise file system that was developed at Sun more than 10 years ago, and I am a big admirer of the ZFS technology. I have written about ZFS in the past, comparing it with NetApp’s Data ONTAP, and have also written about ZFS’s self-healing properties in dealing with Silent Data Corruption. In fact, my buddy (him being the more technical one) and I have been developing storage solutions with ZFS.

Cloud Computing is complex and you have to know what’s happening in the Cloud. For the Cloud Service Provider, they must know the real-time behaviour of the cloud properties. It could be for performance, resource consumption and contention, bottlenecks, applications characteristics, and even for finding the problems as quickly as possible. For the customers, they must have the ability to monitor, understand and report what they are consuming and using in the Cloud.

The regularly used buzzword is Analytics, and DTrace is the framework developed for Cloud Analytics. When it comes to analytics, nothing comes close to what DTrace can do. Most vendors (including VMware) will provide APIs for 3rd-party ISVs to develop cloud analytics, but nothing beats having the creator of the cloud technology give you the tools that they use internally. That is what Joyent is giving to the customer: DTrace, a tool that they use themselves internally. Here’s a screenshot of DTrace in action for Joyent’s SmartDataCenter.

 

I have always said that you have got to see it to know it. Cloud visibility is crucial for the optimal operational efficiency of the cloud.

Joyent already has Solaris Zones technology in its offering, but the missing piece was a bare-metal hypervisor, and last year Joyent added that final piece. KVM (Kernel-based Virtual Machine) was ported by Joyent, and KVM is more secure and faster than the traditional VMware approach, which relies on binary translation. KVM means that the virtualization kernel has direct interaction and communication with native x86 virtualization on processors that support the hardware virtualization extensions. There is a whole religious debate about native virtualization, paravirtualization and binary translation on the web. You can read one here, and as I said, KVM is native virtualization.

There is lots more to know about Joyent, but you have got to spend some time to learn about it. It is not well known (yet) in this part of the world, and my intention in this blog entry is to disseminate information so that you, the readers, are not drawn into one thing only.

There are choices, and in the virtualization space it is just not always about VMware. VMware deserves to be where they are, but when one comes into power (like VMware), one tends to become less friendly to work with. A customer should not be subjected to this new order of oppression, because businesses exist because there are customers. And as customers, there are always choices, and Joyent is one good choice.

Primary Dedupe, where are you?

I am a bit surprised that primary storage deduplication has not taken off in a big way, unlike the times when the buzz of deduplication first came into being about 4 years ago.

When the first deduplication solutions came out, they were aimed particularly at the backup data space. Now more popularly known as secondary data deduplication, the technology has reduced the inefficiencies of backup and helped spark the frenzy of adulation for companies like Data Domain, Exagrid, Sepaton and Quantum a few years ago. The software vendors were not left out either. Symantec, Commvault, and everyone else in town had data deduplication for backup and archiving.

It was no surprise that EMC battled NetApp and finally won the rights to acquire Data Domain for USD$2.4 billion in 2009. Today, in my opinion, the landscape of secondary data deduplication has pretty much settled and matured. Practically everyone has some sort of secondary data deduplication technology or solution in place.

But the talk of primary data deduplication hardly causes a ripple compared with a few years ago, especially here in Malaysia. Yeah, the IT crowd is pretty fickle that way, because most tend to follow the trend of the moment. Last year it was Cloud Computing and now the big buzzword is Big Data.

We are here to look at technologies to solve problems, folks, and primary data deduplication technology solutions should be considered in any IT planning. It is our job as storage networking professionals to continue to advise customers about what is relevant to their business and to address their pain points.

I get a bit cheesed off that companies like EMC or HDS continue to spend their marketing dollars on hyping the trends of the moment rather than using some of their funds to promote good technologies, such as primary data deduplication, that solve real-life problems. The same goes for most IT magazines, publications and other communication media, which rarely give space to technologies that solve problems on the ground and just harp on hype, fuzz and buzz. It gets a bit too ordinary (and mundane) when they are trying too hard to be extraordinary, because everyone is basically talking about the same freaking thing at the same time, over and over again. (Hmmm … I think I am speaking off topic now … I better shut up!)

We are facing an avalanche of data. The other day, the CEO of Nexenta used the term “data tsunami”, but whatever term is used does not matter. There is too much data. Secondary data deduplication solved one part of the problem, and now it’s time to talk about the other part, which is data in primary storage, hence primary data deduplication.

What is out there? Who’s doing what in terms of primary data deduplication?

NetApp has had A-SIS (now NetApp Dedupe) for years and it is good in my books. They talk to customers about the benefits of deduplication on their FAS filers. (Side note: I am seeing more benefits of using data compression in primary storage, but I am not going to go there in this entry.) EMC put primary data deduplication into their Celerra years ago, but they hardly talk much about it. It’s on their VNX as well, but again, nobody in EMC ever speaks about their primary deduplication feature.

I have always loved Ocarina Networks’ ECO technology, and Dell hasn’t given much of a hoot about Ocarina since the acquisition in 2010. The technology surfaced a few months ago in the Dell DX6000G Storage Compression Node for its Object Storage Platform, but then again, all Dell talks about is their Fluid Data Architecture from the Compellent division. Hey Dell, you guys are so one-dimensional! Ocarina is a wonderful gem in their jewel case, and yet all their storage guys talk about are Compellent and EqualLogic.

Moving on … I ought to knock Oracle on the head too. ZFS has great data deduplication technology that is meant for primary data, and a couple of years back GreenBytes took that and made a solution out of it. I don’t follow what GreenBytes is doing nowadays, but I do hope that the big wave of primary data deduplication will rise for companies such as GreenBytes to take off in a big way. No thanks to Oracle for ignoring another gem in ZFS and wasting their resources on pre-sales (in Malaysia) and partners (in Malaysia) that hardly know much about the immense power of ZFS.

But an unexpected source, Microsoft, could help trigger greater interest in primary data deduplication. I have just read that the next version of the Windows Server OS will have primary data deduplication integrated into NTFS. The feature will be available in Windows Server 8, and the architectural view is shown below:

The primary data deduplication in NTFS will be a feature add-on for Windows Server users. It is implemented as a filter driver on a per-volume basis, with each volume a complete, self-describing unit. It is cluster-aware, and fully crash-consistent on all operations.

The technology is Microsoft’s own, built from scratch, and Microsoft will be using it to position Hyper-V as a strong enterprise choice in its battle with VMware for the server virtualization space. Mind you, VMware already has a big, big lead, and this is something Microsoft must do, do-or-die, to keep Hyper-V from falling further behind. Otherwise, the gap between Microsoft and VMware in the server virtualization space will grow even greater.

I don’t have the full details of this, but I read that the NTFS primary deduplication chunk sizes will be between 32KB and 128KB, and that it will be post-processing.
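To illustrate the general idea of chunk-based, post-process deduplication (this is purely my own sketch of the concept in TypeScript, not Microsoft’s implementation; the fixed 64KB chunk size and the in-memory chunk store are simplifying assumptions):

```typescript
// A rough sketch of post-process, chunk-based deduplication: split data into
// chunks, fingerprint each chunk, store unique chunks once and keep references.
import { createHash } from 'crypto';

const CHUNK_SIZE = 64 * 1024; // assumed fixed size, within the 32KB-128KB range

interface DedupedFile {
  chunkRefs: string[]; // ordered list of chunk fingerprints
}

const chunkStore = new Map<string, Buffer>(); // fingerprint -> unique chunk

function dedupe(data: Buffer): DedupedFile {
  const chunkRefs: string[] = [];
  for (let offset = 0; offset < data.length; offset += CHUNK_SIZE) {
    const chunk = data.subarray(offset, offset + CHUNK_SIZE);
    const fingerprint = createHash('sha256').update(chunk).digest('hex');
    if (!chunkStore.has(fingerprint)) {
      chunkStore.set(fingerprint, Buffer.from(chunk)); // store the unique chunk once
    }
    chunkRefs.push(fingerprint); // duplicate chunks become cheap references
  }
  return { chunkRefs };
}

function rehydrate(file: DedupedFile): Buffer {
  // A read follows the references back to the unique chunks.
  return Buffer.concat(file.chunkRefs.map((ref) => chunkStore.get(ref)!));
}
```

Being post-process, this kind of scan runs over data that has already been written to the volume, which is why it does not sit in the latency path of the original write.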

With Microsoft introducing their technology soon, I hope primary data deduplication will get some deserved accolades, because I think most companies are really not doing justice to the great technologies that they have in their jewel cases. And I hope Microsoft, with all its marketing savviness and adeptness, will do some justice to a technology that solves real-life data problems.

I bid you good luck, Primary Data Deduplication! You deserve better.

Phoenix rising from OpenSolaris ashes

I got a little nostalgic over the weekend. As I was working on Solaris 11 x86 over the past few weeks, I got a little bit peeved about how much Oracle has changed the OS.

Commands like ifconfig do not appear to be very functional anymore; instead, ipadm has taken over most of the configuration options. And when I was working with Jumpstart (damn!), it does not work the way that I knew anymore. Now AI (Automated Install) has taken over Jumpstart and I have to relearn the whole whatchamacallit. Dang!

I remember the day when Solaris x86 first came out in the early 90s. I was ecstatic because I could finally test and run Solaris on the x86 platform. I could get things running at home and have fun with it. Drivers were limited then (and still are, though things have gotten much better), but I was happily hacking away together with other Linux distros as the open source revolution was just beginning. After I joined NetApp, things started to change and I abandoned Solaris in favour of Linux, as my job, as well as my interest, was on Linux, especially RedHat. I eventually got my RHCE and completely lost touch with Solaris. By 2005, when OpenSolaris was announced under the CDDL (Common Development and Distribution License), I was no longer well versed in the developments of Solaris and OpenSolaris.

Enough about my nostalgia because I am beginning to see a young phoenix (a mythical firebird) rising from the mess of what Oracle did with OpenSolaris! Since Oracle purchased Sun in 2010, Oracle has practically burned OpenSolaris to ashes. On August 13 2010, Oracle announced the end of OpenSolaris in an internal memo and it read:

Solaris Engineering,

Today we are announcing a set of decisions regarding the path to
Solaris 11, and answering key pending questions on open source, open
development, software and binary licenses, and how developers and
early adopters will be able to use Solaris 11 technology before its
release in 2011.

As you all know, the term “OpenSolaris” has been used colloquially to
refer to any or all of a collection of source code, a development
model, a web site, a logo, a binary release, a source license, a
community, and many other related things. So it’s taken a while to go
over each issue from an organizational and business perspective, and
align on the correct next step. Therefore, please take the time to
read all of the detail here carefully. We’ll discuss our strategy
first, and then the decisions and changes to our policies and
processes that implement that strategy.

If you want the entire memo (and all the fa-lah-lah that goes with it), go to Steven Stallion’s blog. Incidentally Steven Stallion was the OpenSolaris kernel developer who leaked the memo into the open.

It became pretty obvious that Oracle’s business-suit culture and “is this going to make money?” ways were suffocating the talents and innovations of the Sun engineering tribes. Some of the high-profile leavers were James Gosling (father of Java) and Jeff Bonwick (father of RAID-Z, who led the ZFS development team at Sun). And there was an exodus of many top talents within 90-120 days of the Oracle acquisition.

The key technologies that went into OpenSolaris (and Solaris) were slowly but surely deprived of their inventors’ and maintainers’ nourishment. These technologies were:

  • ZFS (Project Pacific)
  • DTrace
  • Zones (aka Solaris Containers, aka Project Kevlar)
  • Fault Management Architecture (FMA)
  • Service Management Facility (SMF)
  • Advanced Network Virtualization (Project Crossbow)
  • Least-privilege

and many more. Some of these technologies were already open under the CDDL license, but some were still very much proprietary to Sun (I mean, Oracle). It was difficult to use what was available under the OpenSolaris CDDL license to rebuild again, especially when the inventors, talents and maintainers are now all scattered across companies like Delphix, Nexenta, GreenBytes, Joyent and so on.

At the end of last year, shortly before Solaris 11 was announced by Oracle, the people who are passionate about OpenSolaris (and Solaris) got together in full force again. Dubbed “Project Illumos”, the key people who had developed for Sun convened to build a new open-source, Solaris-based operating environment. The proprietary bits that are closely guarded by Oracle are going to be either rebuilt from scratch or ported from BSD into the last OpenSolaris kernel before Oracle killed it. That kernel was Solaris Nevada, which was supposed to be the successor of Solaris 10.

The Illumos team already has a bootable and working operating environment, and new developments are going on at a frantic pace. In the words of Bryan Cantrill (father of DTrace), now VP of Engineering at Joyent:

“illumos was not designed to be a fork, but rather an entirely open downstream repository of OpenSolaris”

And the talents congregating to the Illumos project (like moths to a flame) are super-stellar. Just have a look at this list:

  • ZFS –> Matt Ahrens, Eric Schrock, George Wilson, Adam Leventhal, Bill Pijewski and Brendan Gregg
  • SMF –> Dan McDonald and Sumit Gupta
  • DTrace –> Bryan Cantrill, Adam Leventhal, Brendan Gregg, Eric Schrock, Dave Pacheco
  • Zones & Jumpstart –> Jerry Jelinek
  • and many, many more.

KVM (the Linux kernel-based virtual machine) is being added into the Illumos operating environment, giving it the final piece of the puzzle.

I cannot help but feel extremely proud that OpenSolaris (and Solaris) is not dead yet; it is alive and rising. Oracle cannot lay claim to the source code and the rights of Illumos (according to Bryan Cantrill) without itself abiding by the CDDL licensing and distribution scheme that it had killed off a year ago.

And this is indeed the young phoenix rising!

ONTAP vs ZFS

I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!

I have been studying both of these similar Copy-on-Write (COW) file systems at the data structure level for a while now, and I strongly believe ZFS is a better implementation of the COW file system (also known as a “shadow-paging” file system) than WAFL. How are they similar and how are they different? The angle we are looking at is not performance, but resiliency and reliability.

(Note: btrfs or “Butter File System” is another up-and-coming COW file system under GPL license and is likely to be the default file system for the coming Fedora 16)

In Computer Science, COW file systems are tree-like data structures, as shown below. They are different from the traditional Berkeley Fast File System data structure, also shown below:

As some of you may know, the Berkeley Fast File System is the foundation of the update-in-place school of design found in modern-day file systems such as Linux ext2/3/4, Veritas VxFS and Windows NTFS.

COW file system is another school of thought and this type of file system is designed in a tree-like data structure.

In a COW file system, or more rightly named a shadow-paging file system, the original node of the data block is never modified in place. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node, and that parent node is linked to a higher parent node, and so on all the way up to the top-most root node, each parent and higher parent node is copied and modified as the change traverses through the tree, ending at the root node.

The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.

 

As each data block of either the leaf node (the last node in the tree) or the parent nodes is modified, pointers to either the original data blocks or the copied data blocks are updated accordingly relative to the original tree structure, until the last root node at the top of the shadow tree is modified. Then, the COW file system commit is considered complete. Take note that the entire process of changing pointers and modifying copies of the nodes of the data blocks is done in a single I/O.
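Here is a rough sketch (my own simplification, in TypeScript) of that shadow-paging write path: the leaf is never touched in place, copies ripple up through its parents, and only the final root swap makes the new tree live.

```typescript
// Shadow-paging sketch: a write creates copies of the leaf and its ancestors;
// the original tree stays intact until the root pointer is swapped.
interface TreeNode {
  data?: Buffer;         // leaf payload
  children?: TreeNode[]; // interior node pointers
}

function copyOnWrite(node: TreeNode, path: number[], newData: Buffer): TreeNode {
  if (path.length === 0) {
    return { data: newData };                  // shadow copy of the leaf
  }
  const [index, ...rest] = path;
  const children = [...(node.children ?? [])]; // copy this node's pointer list
  children[index] = copyOnWrite(children[index], rest, newData);
  return { children };                         // shadow copy of this parent
}

// The commit is just swapping the root pointer (the uberblock / fsinfo analogue).
// The old root still describes the old tree, which is exactly what a snapshot keeps.
let liveRoot: TreeNode = { children: [{ data: Buffer.from('old') }] };
const snapshotRoot = liveRoot;                 // snapshot = the old root pointer
liveRoot = copyOnWrite(liveRoot, [0], Buffer.from('new'));
```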

The root at the top is called the uberblock in ZFS and fsinfo in WAFL. Because an exact shadow of the tree-like file system is created when the data blocks are modified, this also gives birth to how snapshots are created in a COW file system. It’s all about pointers, baby!

Here’s how it looks with the original data tree and the snapshot data tree once the shadow-paging modifications are complete.

 

However, there are a few key features from the data integrity and reliability point of view where ZFS is better than WAFL. Let me share that with you.

In a nutshell, ZFS is a layered architecture that looks like this:

The Data Management Unit (DMU) layer is one implementation that ensures stronger data integrity. The DMU maintains a checksum on the data in each data block by storing the checksum in the parent block. Thus, if something is messed up in the data block (possibly by Silent Data Corruption), the checksum in the parent block will be able to detect it and also repair the data corruption if there is sufficient data redundancy information in the data tree.
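A minimal sketch of that checksum-in-the-parent idea (my own illustration in TypeScript, not the actual DMU code; the block-reading callbacks are assumptions):

```typescript
// The parent's block pointer carries the child's checksum, so a read can detect
// a child block that was silently corrupted on disk and try to heal it.
import { createHash } from 'crypto';

interface BlockPointer {
  readBlock: () => Buffer; // however the child block is fetched from disk
  checksum: string;        // checksum of the child, stored with the parent
}

function checksumOf(block: Buffer): string {
  return createHash('sha256').update(block).digest('hex');
}

function readVerified(ptr: BlockPointer, readRedundantCopy?: () => Buffer): Buffer {
  const block = ptr.readBlock();
  if (checksumOf(block) === ptr.checksum) return block; // block is intact
  // Silent corruption detected: self-heal from redundancy, if any is available.
  if (readRedundantCopy) {
    const copy = readRedundantCopy();
    if (checksumOf(copy) === ptr.checksum) return copy; // healed
  }
  throw new Error('checksum mismatch: unrecoverable silent data corruption');
}
```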

WAFL will not be able to detect such data corruption, because its checksum is applied at the disk block level and the parity derived during the RAID-DP write does not flag such a discrepancy. An old set of slides I found portrayed this comparison as shown below.

 

Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks store up to 3 copies of the metadata, and this allows the recovery of lost metadata even if 2 copies of the metadata are lost.

Therefore, the ability of ZFS to survive data corruption and metadata loss is stronger when compared to WAFL. This is not to discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern-day file systems.

There are many other features within ZFS that improve upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is a superset implementation of traditional RAID-5, but with a different twist. Instead of using a fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2 uses a dynamic, variable stripe width. This addresses the parity RAID-4/5 “write hole” flaw, where a crash in the middle of a partial-stripe update can leave the data and its parity inconsistent. RAID-Z/Z2 avoids this by always writing full stripes of variable width. Parity can be calculated and assigned for any stripe width, as shown below.
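A tiny sketch of the variable stripe width idea (my own simplification in TypeScript): whatever number of data blocks a write needs, the parity is computed over exactly those blocks, so no partial stripe is ever left behind.

```typescript
// Parity over a variable-width stripe: XOR of however many data blocks the
// write happens to contain, so every write is a complete stripe of its own.
function xorParity(blocks: Buffer[]): Buffer {
  const parity = Buffer.alloc(blocks[0].length);
  for (const block of blocks) {
    for (let i = 0; i < parity.length; i++) {
      parity[i] ^= block[i];
    }
  }
  return parity;
}

// A 3-block write and a 7-block write each get their own full stripe plus parity.
const smallWrite = [Buffer.alloc(4096, 1), Buffer.alloc(4096, 2), Buffer.alloc(4096, 3)];
const smallStripe = { data: smallWrite, parity: xorParity(smallWrite) };
```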

 

Other really cool stuff includes the Hybrid Storage Pool and the ability to create software-based caching using fast devices such as SSDs. This approach of creating ReadZilla (read caching) and LogZilla (write caching) devices eliminates the need for the proprietary NVRAM implemented in NetApp’s WAFL.

The only problem is, despite the super cool features of ZFS, most Oracle (not Sun) salespeople do not have much of a clue about how to sell ZFS storage. NetApp, with its well-trained and well-tuned sales force, is beating Oracle to a pulp.

Copy-on-Write and SSDs – A better match than other file systems?

We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems are to ensure a few key features:

  • Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
  • Simple naming convention for files
  • Performance in storing and retrieving files – hence our write and read I/Os
  • Resilience in restoring full or part of a file when there are discrepancies

In file system performance design, one of the most important factors is locality. By locality, I mean that the data blocks of a particular file should be as near to each other as possible. Hence, most file system designs originating from the Berkeley Fast File System (BFFS) require the file system to seek to the data block to be modified in order to preserve locality, i.e. you try not to split up the contiguity of the data blocks. The seek to find the required data block takes time, but you are compensated with faster reads, because the read-ahead feature allows you to read extra blocks in anticipation that the data blocks are related.

In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present, because the newly modified block is written somewhere else, not at the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminates the seek time and also skips the read-modify-write action at the original location of the data block. Therefore, writes are likely to be faster.

However, the read portion will be slower, because if you want to read a file, the file system has to go around looking for the data blocks since it lacks locality. Therefore, as the COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make their file systems some of the better storage file systems in the market.

So, that’s Copy-on-Write file systems. But what about SSDs?

Solid State Drives (SSDs) will make enemies with file systems that tend to prefer locality. Remember that some file systems prefer their data blocks to be contiguous? Well, SSDs employ “wear-leveling” and require writes to be spread out as much as possible across the SSD device to prolong the life of the device and reduce “wear-and-tear”. That’s not good news, because SSDs have just told the file systems, “I don’t like locality and I will spread out the data blocks”.

NAND Flash SSDs (the common ones we find in the market, not DRAM-based SSDs) are funny creatures. When you write to SSDs, you must ERASE first, then WRITE AGAIN to the SSDs. This is the part that creates the wear-and-tear of the device. What I mean by ERASE first, WRITE AGAIN is described below:

  • Writing 1 –> 0 (OK, no problem)
  • Writing 0 –> 1 (not OK, because NAND Flash can’t do that)

So, what does the SSD do? It ERASES the whole erase block, resetting all of its cells to 1s, and then programs some of them back to 0s. Crazy, isn’t it? The firmware in the SSD controller will also spread the erase-and-then-write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.
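Here is a crude model of that erase-then-program behaviour and the simple wear-leveling decision (my own illustration in TypeScript; the page size, block size and the least-worn-block policy are assumptions, not any controller’s actual firmware):

```typescript
// Toy model of NAND erase blocks: an erase resets every cell to 1s (0xff) and
// wears the block a little; wear-leveling steers writes to the least-worn block.
interface EraseBlock {
  pages: Buffer[];
  eraseCount: number;
}

const PAGE_SIZE = 4096;     // assumed page size
const PAGES_PER_BLOCK = 64; // assumed pages per erase block

function eraseBlock(block: EraseBlock): void {
  block.pages = Array.from({ length: PAGES_PER_BLOCK }, () =>
    Buffer.alloc(PAGE_SIZE, 0xff) // all cells back to 1s, ready to be programmed to 0s
  );
  block.eraseCount++;             // each erase is a little more wear-and-tear
}

function pickBlockForWrite(blocks: EraseBlock[]): EraseBlock {
  // Spread the erase/program cycles across the whole device rather than
  // hammering one spot: choose the block that has been erased the least.
  return blocks.reduce((least, b) => (b.eraseCount < least.eraseCount ? b : least));
}
```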

Since SSDs shun locality and avoid keeping data blocks nearby, and Copy-on-Write file systems are already doing this because it is their nature to write new data blocks somewhere else, the combination of a COW file system and SSDs seems like a very good fit. It even looks symbiotic, because it is a case of “I help you; and you help me”.

From this perspective, the benefits of combining COW file systems and SSDs extend beyond the resiliency of the SSD device to performance as well. Since the data blocks are spread out at different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Makes sense, doesn’t it?

I have not studied how other file systems behave with SSDs, but it is pretty clear that Copy-on-Write file systems work well with Solid State Drives. Have a good week ahead :-)!

Snapshots? Don’t have a C-O-W about it!

Unfortunately, I am having a COW about it!

Snapshots are the inherent offspring of the copy-on-write technique used in shadow-paging filesystems. NetApp’s WAFL and Oracle Solaris ZFS are commercial implementations of shadow-paging filesystems and they are typically promoted as Copy-on-Write filesystems.

As we may already know, snapshots are point-in-time copies of the active file system in the storage world. They perform a quick backup of the active file system by making a copy of the block addresses (pointers) of the filesystem and then updating the pointer maps to the inodes in the fsinfo root inode of the WAFL filesystem for new changes after the snapshot has been taken. The equivalent of fsinfo is the uberblock in the ZFS filesystem.

However, contrary to popular belief, the snapshots from WAFL and ZFS are not copy-on-write implementations even though the shadow paging filesystem tree employs the copy-on-write technique.

Consider this for a while when a snapshot is being taken … Copy - On - Write. If the definition is (1) Copy then (2) Write, this means that there are several steps to perform a copy-on-write snapshot. The filesystem has to make a copy of the original data block (1 x Read I/O), then write the original data block to a new location (1 x Write I/O), and then write the new data block to the location of the original data block (1 x Write I/O).

This is a 3-step process that can be summarized as

  1. Read location of original data block (1 x Read I/O)
  2. Copy this data block to new unused location (1 x Write I/O)
  3. Write the new and modified data block to the location of original data block (1 x Write I/O)

This implementation IS the copy-on-write technique for snapshots, but NetApp and possibly the Oracle guys have been saying for years that their snapshots are based on copy-on-write. This is pretty much a misnomer that needs to be corrected. EMC, in its SnapSure and SnapView implementations, calls this technique Copy-on-First-Write (COFW), probably to avoid the confusion. The data blocks are copied to a savvol, a separate location that stores the changes of the snapshots and defaults to 10% of the total capacity of their storage solutions.

As you have seen, this method is a 3 x I/O operation and it is an expensive solution. Therefore, when we compare the speed of NetApp/ZFS snapshots to EMC’s snapshots, the EMC COFW snapshot technique will be a tad slower.

However, this method has one superior advantage over the NetApp/ZFS snapshot technique. The data blocks in the active filesystem are almost always laid out in a more contiguous fashion, resulting in a more consistent read performance throughout the life of the active file system.

Below is a diagram of how copy-on-write snapshots are implemented:

 

What is NetApp/ZFS’s snapshot method then?

It is known as Redirect-on-Write. Using the same breakdown … REDIRECT - ON - WRITE. When a data block is about to be modified, the original data block is read (1 x Read I/O) and then the modified data block is written to a new location (1 x Write I/O). The active file system then updates the filesystem tree and its inode address to reflect the location of the new data block. The original data block remains unchanged.

In summary,

  1. Read location of original data block (1 x Read I/O)
  2. Write modified data block to new location (1 x Write I/O)

The Redirect-on-Write method results in 1 less Write I/O, making snapshot creation faster. This is the NetApp/ZFS method, and it is superior when compared to the Copy-on-Write snapshot technique discussed earlier.
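Counting the I/Os of the two techniques side by side, as a small sketch (my own illustration of the steps described above, in TypeScript):

```typescript
// Per first-overwrite of a block after a snapshot is taken:
// copy-on-first-write costs 1 read + 2 writes, redirect-on-write costs 1 read + 1 write.
interface IOCount { reads: number; writes: number; }

function copyOnFirstWrite(): IOCount {
  return {
    reads: 1,  // read the original data block
    writes: 2, // copy the original block to the save area (savvol),
               // then write the modified block in place, keeping the layout contiguous
  };
}

function redirectOnWrite(): IOCount {
  return {
    reads: 1,  // read the original data block
    writes: 1, // write the modified block to a new location; the original is untouched
  };
}
```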

However, as the life of the filesystem progresses, fragmentation and holes will cause the performance of the active filesystem to degrade. The reason is that most related data blocks are no longer contiguous, and the active file system will be busy seeking the scattered data blocks across the volume. A fragmented filesystem would have to be “cleaned and reorganized” to regain its performance lustre.

Another unwanted problem with the Redirect-on-Write snapshot technique is that the snapshots reside within the same boundary as the active filesystem. Over time, the capacity consumed by the snapshots could overwhelm the active filesystem if their recycle schedule is left unchecked.

I guess this is a case of “SUFFER NOW/ENJOY LATER” or “ENJOY NOW/SUFFER LATER”. We have to make a conscious effort to understand what snapshots are all about.

Solid State Drives … are they reliable?

There have been a lot of questions about Solid State Drives (SSDs), aka Enterprise Flash Drives (EFDs) as some vendors call them. Are they less reliable than our 10K or 15K RPM hard disk drives (HDDs)? I was asked this question on stage when I was presenting the topic of Green Storage 3 weeks ago.

Well, the usual answer from the typical techie is … “It depends”.

We all fear the unknown, and given the limited knowledge we have about SSDs (they are fairly new in the enterprise storage market), we tend to be drawn more to the negatives than the positives of what SSDs are and what they can be. I, for one, believe that SSDs have more positives, and over time we will grow to accept that this is all part of the IT evolution. IT has always evolved into something better, stronger, faster, more reliable and so on. As Jeff Goldblum’s character Dr. Ian Malcolm famously said in the movie Jurassic Park, “Life finds a way …”, and IT will always find a way to be just that.

SSDs are typically categorized into MLCs (multi-level cells) and SLCs (single-level cells). They have a typically predictable life expectancy, ranging from tens of thousands of writes to more than a million writes per drive. This, by no means, is a measure of the reliability of SSDs versus HDDs. However, SSD controllers and drives employ various techniques to enhance the durability of the drives. A common method is to balance the I/O accesses across the disk blocks to adapt to the I/O usage patterns, which can prolong the lifespan of the disk blocks (and subsequently the drive itself) and also ensure that the performance of the drive does not lag, since the I/O is more “spread out” across the drive. This is known as the “wear-leveling” algorithm.

Most SSDs proposed by enterprise storage vendors are MLCs, to meet the market demand on price per IOPS and per GB, because SLCs are definitely more expensive for their higher durability. MLCs also have a higher BER (bit-error rate); it is said that MLCs see about 1 bit error per 10,000 writes while SLCs see about 1 per 100,000 writes.

But the advantages of SSDs clearly outweigh those of HDDs. Fast access (much lower latency) is one of the main advantages. Higher IOPS is another one. SSDs can provide from several thousand IOPS to more than 1 million IOPS, compared to enterprise HDDs. A typical 7,200 RPM SATA drive has fewer than 120 IOPS, while a 15,000 RPM Fibre Channel or SAS drive ranges from 130-200 IOPS. That IOPS advantage is definitely a vast differentiator when comparing SSDs and HDDs.
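A quick back-of-the-envelope calculation using the figures quoted above (the 20,000 IOPS target and the single-SSD figure are assumptions for illustration, picked from within the ranges mentioned):

```typescript
// Rough spindle-count arithmetic for the IOPS gap between HDDs and SSDs.
const iopsPer15kHdd = 180; // roughly the 130-200 IOPS quoted for a 15K FC/SAS drive
const iopsPerSsd = 20000;  // an assumed, conservative enterprise SSD figure

const targetIops = 20000;
const hddsNeeded = Math.ceil(targetIops / iopsPer15kHdd); // ~112 spindles
const ssdsNeeded = Math.ceil(targetIops / iopsPerSsd);    // 1 drive

console.log(`${targetIops} IOPS needs ~${hddsNeeded} x 15K HDDs or ${ssdsNeeded} SSD`);
```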

We are also seeing both drive-format and card-format SSDs in the market. The drive-format types typically come in 2.5″ and 3.5″ profiles, and they tend to fit into enterprise storage systems as “disk drives”. They are known for providing capacity. On the other hand, there are also card-format SSDs that fit into a PCIe card inserted into host systems. These tend to address the performance requirements of systems and applications. The well-known PCIe vendors are Fusion-io, which is in the high-end performance market, and NetApp, which peddles the PAM (Performance Acceleration Module) card in its filers. The PAM card has since been renamed Flash Cache. Rumour has it that EMC will be coming out with a similar solution soon.

Another thing to note is that SSDs can be read-biased or write-biased. Most SSDs in the market tend to be more read-biased, published with high read IOPS rather than write IOPS. Therefore, we have to be prudent and know what is out there. This means that some solutions, such as the NetApp Flash Cache, are more suitable for heavy read I/O rather than write I/O. Flash Cache addresses a large segment of the enterprise market because most applications are heavier on reads than writes.

SSDs have been positioned as the Tier 0 layer in the Automated Storage Tiering segment of Enterprise Storage. Vendors such as Dell Compellent, HP 3PAR and EMC (with FAST2) position themselves with enhanced tiering techniques to automate LUN and sub-LUN tiering, and customers have been lapping up this feature like little puppies.

However, an up-and-coming segment for SSD usage is positioning SSDs as extended read or write caches to the existing memory of the systems. NetApp’s Flash Cache is a PCIe solution that is basically an extended read cache. An interesting feature of Oracle Solaris ZFS, called the Hybrid Storage Pool, allows the creation of read and write caches using SSDs. The Sun fellas even came up with cool names – ReadZilla and LogZilla – for these Hybrid Storage Pool features.

Basically, I have poured out what I know about SSDs (so far) and I intend to learn more about it. SNIA (Storage Networking Industry Association) has a Technical Working Group for Solid State Storage. I advise the readers to check it out.