A little yellow elephant

By now, I believe most of you in the storage networking world would have heard of Hadoop. Hadoop was created by Doug Cutting while he and his team were working on an open source web search engine called Nutch. The easily recognized little yellow elephant, Hadoop, was named after Doug Cutting's son's toy elephant, which he made the project's mascot. Pretty cool!

And today, Hadoop has become THE platform for Big Data applications. Why?

As I have mentioned before, everything that we do or don't do generates data, either as a direct or an indirect product. I am blogging right now and I am creating data. I was in Singapore the whole of this week, and everywhere I went in the MRT stations, I was being watched by the video cameras mounted there. A new friend in class said that Singapore is the second most "watched" city after London, with video cameras mounted everywhere, either discreetly or in plain sight. And that's just video data. There are plenty of other human activities that generate tons and tons of data.

The IDC Digital Universe Report for 2011 said that we will have generated 1.8ZB (zettabytes) of data this year alone. I mentioned in my previous blog that this is a gold mine, and companies are scrambling to tap into this massive amount of data. Extracting valuable information to anticipate the next trend or predict the next evolution in human preference is akin to the Gold Rush in the wild, wild west of the late 19th century. Folks, Big Data is going to be this generation's "Digital Gold Rush".

Sieving, filtering and processing gazillions of data points (more unstructured than structured) will not work in rigidly defined, well-formatted relational databases. The data model of relational databases simply breaks down. Of course, there are different schools of thought on data models, but the Hadoop model seems to be gaining momentum and mind share among data scientists. That is because of Hadoop's capability to deal with massive unstructured data, processing it and producing results in a short amount of time.

One way to process this pool of massive data is parallel programming. In parallel programming, multi-threading is commonly deployed to achieve the desired performance, but implementing multi-threading is difficult. Developers often have to deal with LWPs (lightweight processes), semaphores, shared memory, mutex (mutually exclusive) locking and so on. This style of programming works on shared, mutable state, and the same programming expression can produce different results depending on that state.

Hadoop belongs to another school of programming known as functional programming, where the notion of different states on shared data is removed. With that, the dependency on state is also removed, resulting in a much easier and simpler parallel programming implementation. Hadoop borrows ideas from the MapReduce software framework made well known by Google, and from the Google File System.
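To make the contrast concrete, here is a minimal sketch in Python (purely illustrative, not Hadoop code): the first half mutates a shared counter from several threads, where the result depends on how the threads interleave; the second half expresses the same tally as side-effect-free map and reduce steps.

```python
from functools import reduce
from threading import Thread

# Shared-state style: several threads update one mutable counter.
# Correctness depends on locking; without it, the interleaving of
# threads (the "state") can change the outcome.
counter = {"events": 0}

def worker(n):
    for _ in range(n):
        counter["events"] += 1   # unsynchronized read-modify-write on shared data

threads = [Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter["events"])         # may not be 400000 without a lock

# Functional style: no shared mutable state, just pure functions.
# The same inputs always give the same output, so the work can be
# split across workers without locks, semaphores or shared memory.
chunks = [[1] * 100_000 for _ in range(4)]
partials = map(sum, chunks)                  # "map" each chunk independently
total = reduce(lambda a, b: a + b, partials) # "reduce" the partial results
print(total)                                 # always 400000
```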

Before we get to know Hadoop, we must know MapReduce. MapReduce is a framework which allows very large data sets to be processed by a very large set of computer nodes in a cluster. Typically, the computational processing is executed in a distributed fashion, spread across many computer nodes, and the final results are consolidated from the sub-results of these distributed processing nodes.

According to Wikipedia, the 2 key functions of MapReduce are map() and reduce(). That's pretty obvious. The extract below was taken from the Wikipedia definition and explains both functions very well.

“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
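A word count is the canonical example. The sketch below is plain Python rather than Hadoop's actual Java API, but it mimics the three phases: map emits (word, 1) pairs, a shuffle groups the pairs by key, and reduce sums the values for each key.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the values for one key into the final answer for that key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```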

The diagram below should further simplify the concept of MapReduce for readers.

 

Hadoop is one of the open-source implementations of MapReduce. It is a project of the Apache Software Foundation, and it has sparked a brand-new niche of data search, data management and data science. The diagram below will allow our readers to juxtapose MapReduce and Hadoop and compare them in the simplest fashion.

Hadoop's primary development platform is Java. Hadoop's architecture consists mainly of 2 components – Hadoop Common and a Hadoop-compatible file system, as shown in the diagram below.

The Hadoop MapReduce layer above is the file/object access interface to the Hadoop-compatible file system below. HDFS, the Hadoop Distributed File System, is just one of a few Hadoop-compatible file systems. Other file systems include the following (a small sketch of how a path's scheme selects among them comes after the list):

  • Amazon S3 File System, targeted at clusters hosted on Amazon's Infrastructure-as-a-Service (IaaS) cloud platform
  • CloudStore – a similar Hadoop-like implementation using C++ and also inspired by Google File System
  • FTP file systems
  • HTTP and HTTPS read-only file systems
  • Any file systems accessible with the file:// URL nomenclature
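Conceptually, Hadoop picks the file system implementation from the scheme portion of the path (hdfs://, s3://, file:// and so on). The toy dispatcher below is my own illustration of that idea, not Hadoop code, and the handler functions are invented placeholders.

```python
from urllib.parse import urlparse

# Hypothetical handlers standing in for real file system drivers.
def read_hdfs(path):  return f"reading {path} from HDFS"
def read_s3(path):    return f"reading {path} from Amazon S3"
def read_local(path): return f"reading {path} from the local file system"

HANDLERS = {"hdfs": read_hdfs, "s3": read_s3, "file": read_local}

def open_path(url):
    # The scheme in the URL decides which "file system" handles the request.
    scheme = urlparse(url).scheme or "file"
    return HANDLERS[scheme](url)

print(open_path("hdfs://namenode:8020/data/logs/part-00000"))
print(open_path("s3://my-bucket/data/logs/part-00000"))
print(open_path("file:///tmp/part-00000"))
```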

But the main engine of Hadoop is in the MapReduce layer. The 2 core components in this layer are the JobTracker and the TaskTracker. Both have their own individual roles to play and, collectively, they are key cogs in the Hadoop distributed data processing model.

Below is an extract I picked up from Wikipedia.

Client applications submit MapReduce jobs to the JobTracker. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser. Jetty is a Java-based HTTP server, among other things.

The JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Scheduling

By default Hadoop uses first-in, first-out (FIFO) scheduling, with 5 optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS (Quality of Service) for production jobs. The fair scheduler has three basic concepts.

  1. Jobs are grouped into Pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
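As a rough illustration of the fair-share idea (my own simplification, not the actual Hadoop scheduler code), the sketch below gives each pool its guaranteed minimum number of slots and splits the excess capacity evenly.

```python
def fair_share(total_slots, pools):
    """pools maps pool name -> guaranteed minimum slots.
    Returns the number of slots each pool ends up with."""
    allocation = dict(pools)                      # start from the guaranteed minimums
    leftover = total_slots - sum(pools.values())  # excess capacity to split
    per_pool = leftover // len(pools)             # split evenly (remainder ignored here)
    for name in allocation:
        allocation[name] += per_pool
    return allocation

# A 100-slot cluster with three pools: production gets a large minimum,
# ad-hoc and the default pool get small ones; the excess is split evenly.
print(fair_share(100, {"production": 40, "adhoc": 10, "default": 5}))
# {'production': 55, 'adhoc': 25, 'default': 20}
```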

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.

  • Jobs are submitted into queues.
  • Queues are allocated a fraction of the total resource capacity.
  • Free resources are allocated to queues beyond their total capacity.
  • Within a queue a job with a high level of priority will have access to the queue’s resources.

I took most of the extracts above from Wikipedia, and I don't claim to be a knowledgeable person on Hadoop. All the credit goes to the Wikipedia editors for putting Hadoop in layman's terms.

Hadoop has certainly won the hearts of the new digital gold rush, Big Data, and is slowly becoming a force to be reckoned with among data scientists. Hadoop implementations are powering new frontiers in processing and mining the ever-growing data capacity, giving solution providers a simple programming methodology and data model to gain more insights into the vast seas of data and information.

Hadoop has many fans, and it is slowly becoming the data platform for large companies such as Yahoo!, Facebook, IBM, Amazon, Apple, eBay and many more. Facebook even claims to have the largest Hadoop cluster in the world, growing to 30PB in July of 2011.

This little yellow elephant is going places and one to watch out for.

btrfs – a better butter from a better COW?

There's been a lot of chatter about the default file system in the recently released Fedora 16. Fedora 16 was released on November 8, 2011. For months, there were rumours that the default file system for Fedora 16 would be btrfs (B-tree file system, better known as Butter FS). But after the release, the default file system of Fedora 16 is still ext4, much to the dismay of many. btrfs holds a lot of potential because, in the space of less than 4 years, it has moved up the hierarchy to be one of the top file systems in the Linux ecosystem.

File systems in Linux have not had a knight in shining armour for a long time. The file system kings of the Linux world were ext2/3 for the RedHat and Debian flavoured distros and reiserfs for the SuSE flavoured distros. ext4 is now the default in most Linux distros, but the concept behind ext2/3/4 has not changed much since its inception decades ago.

At the same time, reiserfs had a lot of promise as well, but its development and progress lost their lustre after its principal developer, Hans Reiser, was convicted of the murder of his wife a few years ago. If you are KPC (Malaysian Chinese colloquialism, meaning busybody), you can read the news here.

btrfs is going to be the new generation of file systems for Linux, and even Ted Ts'o, the CTO of the Linux Foundation and principal developer of ext4, admitted that he believed btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management".

For those who have studied computer science, the B-Tree is a data structure that is used in databases and file systems. The B-Tree is an excellent data structure for storing billions and billions of objects/data and is able to provide fast data retrieval in logarithmic time. B-Tree implementations are already present in file systems such as JFS, XFS and ReiserFS. However, these file systems are not shadow-paging file systems (popularly known as copy-on-write file systems).

You see, the B-Tree, in its native form of implementation, is very incompatible with COW file systems. In fact, the combination was thought to be impossible, until someone by the name of Ohad Rodeh came along. He presented a paper at USENIX FAST '07 which described the challenges of combining the B-Tree concept with shadow-paging file systems. The solution, as he described it, was to rework how keys are inserted into and removed from the tree structure and to remove the dependency on intra-leaf linking.

Chris Mason, one of the developers of reiserfs, took Ohad’s idea and created a shadow-paging file system based on the B-Tree idea.
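To give a feel for shadow paging (copy-on-write), here is a deliberately simplified sketch of my own. It uses a plain binary search tree instead of a real B-tree, but the key idea carries over: an update never modifies a node in place; it copies the nodes on the path from the root down to the change, so the old version of the tree stays intact and untouched subtrees are shared.

```python
from collections import namedtuple

# An immutable tree node: updates create new nodes instead of mutating old ones.
Node = namedtuple("Node", ["key", "value", "left", "right"])

def insert(node, key, value):
    """Return the root of a NEW tree that shares unchanged subtrees with the old one."""
    if node is None:
        return Node(key, value, None, None)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)   # overwrite: copy this node only

v1 = insert(insert(insert(None, 20, "a"), 10, "b"), 30, "c")
v2 = insert(v1, 10, "B")                 # copy-on-write update of key 10

print(v1.left.value, v2.left.value)      # 'b' 'B'  -> the old version still exists
print(v1.right is v2.right)              # True     -> untouched subtrees are shared
```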

Traditional file systems tend to follow the idea of the Berkeley Fast File System. Each cylinder group has its own inodes, bitmaps and disk blocks. The used space in one cylinder group cannot be shared with another cylinder group, resulting in wastage. At the same time, performance can be an issue as the disk heads frequently have to seek back to the inodes to find out where the next used or free blocks are. In a way, it looks like the diagram below.

Chris Mason's idea for btrfs made the file system look like this:

Today, a quick check of the btrfs wiki page shows that the main btrfs features available at the moment include:

  • Extent based file storage
  • 2^64 byte == 16 EiB maximum file size
  • Space-efficient packing of small files
  • Space-efficient indexed directories
  • Dynamic inode allocation
  • Writable snapshots, read-only snapshots
  • Subvolumes (separate internal filesystem roots)
  • Checksums on data and metadata
  • Compression (gzip and LZO)
  • Integrated multiple device support
  • RAID-0, RAID-1 and RAID-10 implementations
  • Efficient incremental backup
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Online filesystem defragmentation

And Chris Mason and his team still have plenty more to offer for btrfs. It will likely be the default file system in Fedora 17 and, at the rate it is going, could be the file system of choice for RedHat, Debian and SuSE distros very soon.

There have been a lot of comparisons between btrfs and ZFS, since both are part of Oracle now. ZFS is obviously a much more mature file system, with more enterprise features and greater robustness (incidentally, ZFS just celebrated its 10th birthday on Halloween 2011 – see Matt Ahrens' blog), while btrfs is the rising star in the Linux world. But at this moment, the 2 file systems are set apart in their market positioning and deployment.

ZFS is released under the CDDL (Common Development and Distribution License), while btrfs is released under the GNU GPL, an open source license. There are controversies surrounding the CDDL licensing scheme, which is incompatible with the GNU GPL scheme.

Oracle can count itself very lucky to have 2 of the most promising and prominent COW file systems around. It will be interesting to see what Oracle will do next. As a proponent of innovation, community and sharing, I sincerely hope that both file systems will continue to thrive in Oracle's brutal, sales-driven organization. We certainly don't want to see the kind of dual-ownership controversies that surround Oracle and MySQL, which could come to a head in 2015.

NetApp SPECSfs record broken in 13 days


Thanks to my buddy Chew Boon of HDS, who put me on alert about the new leader of the SPECsfs benchmark results. NetApp's "world record" has been broken, 13 days later, by Avere Systems.

Avere has posted the result of 1,564,404 NFS ops/sec with an ORT (overall response time) of 0.99 msec. This benchmark was done by 44 nodes, using 6.808 TB of memory, with 800 HDDs.

Earlier this month, NetApp touted fantastic results and quickly came out with a TR comparing their solution with EMC Isilon. Here’s a table of the comparison

 

But those numbers were quickly made irrelevant by the Avere FXT, and Avere claims the world record title with the "smallest footprint ever". Here's a comparison in Avere's blog, with some photos to boot.

 

For the details of the benchmark, click here. And the news from PR Newswire.

If you have not heard of Avere, they are basically the core team of Spinnaker. NetApp acquired Spinnaker in 2003 to create its clustered file system from the Spinnaker technology. The development and integration of Spinnaker into NetApp's Data ONTAP took years and was buggy, and this gave legroom to competitors like Isilon to take market share in the clustered NAS/scale-out NAS landscape.

Meanwhile, NetApp did finally come good with the Spinnaker technology, and with ONTAP 8.0.1 and 8.1, the code of both platforms merged into one.

The Spinnaker team went on to develop a new technology called the "A-3 Architecture" (shown below) and positioned itself as a NAS accelerator.

avere-nas-1

The company has had 2 series of funding and now has high-performance systems to compete with the big boys. The name Avere Systems is still pretty much unknown in this part of the world, and this "world record" will help position them more strongly.

But as I have said before, benchmarks are just a way to gain bigger bragging rights. It is a game of leapfrogging, and pretty soon this Avere record will be broken. It is nice while it lasts.

The future is intelligent objects

We are used to the block-based approach and also the file-based approach to data. The 2 diagrams below show the basics of how we access data in both block-based and file-based storage on the storage device.

 

For block-based storage, the blocks are merely stored as arrays of unrelated contiguous blocks. For file-based storage, as seen below,

 

there is another layer of abstraction, and this is called the file system. If you look at both diagrams above, there are some random numbers in light blue, and these represent the storage device's – the hard disk drive's – export of "containers" to the file system or the application that is accessing the storage device. This is usually LBA (Logical Block Addressing), which is basically an addressing scheme that defines locations on the hard disk drive. LBA tells you where the data is stored. For more information about LBA, check out this Wikipedia definition. But the whole idea is that LBA is dumb. It is pretty much static and is exported to file systems and applications so that these guys can do something with it.
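As a small worked example of how simple-minded that addressing really is, the classic conversion between cylinder/head/sector (CHS) geometry and an LBA is pure arithmetic; the drive exports block numbers and nothing about what the data inside them means. The geometry values below are hypothetical.

```python
def chs_to_lba(c, h, s, heads_per_cylinder, sectors_per_track):
    # Classic CHS-to-LBA formula: sectors are numbered from 1, hence the (s - 1).
    return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1)

def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    # Reverse the arithmetic to get back the cylinder, head and sector.
    c, rem = divmod(lba, heads_per_cylinder * sectors_per_track)
    h, s0 = divmod(rem, sectors_per_track)
    return c, h, s0 + 1

# A hypothetical old-style geometry: 16 heads, 63 sectors per track.
lba = chs_to_lba(c=2, h=5, s=9, heads_per_cylinder=16, sectors_per_track=63)
print(lba)                        # 2339
print(lba_to_chs(lba, 16, 63))    # (2, 5, 9)
```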

Something has been brewing in the background since 1994, and it is one of many efforts to make storage devices intelligent. This new object-based interface was part of a research project done at Carnegie Mellon University (CMU). Initially it was known as Network Attached Secure Disks (NASD), but it eventually made its way to a working group in SNIA, which developed it into an ANSI T10 INCITS standard. ANSI T10 is the guardian of all SCSI standards. The result is called Object Storage Device (OSD). The SCSI architecture diagram below shows the layer where OSD resides.

 

The motivation for this is simple: to make today's storage devices do more of the computational work, particularly around I/O, relieving the hosts and the local systems to concentrate on other computational processing work. At the same time, the local systems must retain some level of interactivity and management between the storage objects and the computational hosts.

In the diagram below, which compares the block-based approach and OSD,

 

you can see that the file system management interface, which sits in the kernel space of the local host/system, is replaced by the OSD management interface at the storage device.

What does this all mean? It means that the LBA type of addressing that we are familiar with in block-based and file-based storage is no longer the way to go because, as I mentioned before, LBA is dumb.

OSD, in a way, replaces LBAs with OIDs (Object IDs). The local system and/or its file system interacts with the storage device using OIDs, and each OID links to its respective stored object. The object carries a lot of metadata describing it, and that metadata is what gives the object its intelligence and manageability.

 

 

The prominence of metadata in OSD means that we can build much more intelligent systems in the future. The OIDs and the objects can be grouped together in a flat design, or they can be organized and categorized in a virtual, hierarchical model.
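Here is a rough sketch of the idea (my own illustration, not the T10 OSD command set): the store hands out OIDs, keeps rich metadata alongside every object, and that metadata, rather than a fixed directory hierarchy, is what lets objects be found and grouped.

```python
import uuid

class ObjectStore:
    """A toy object store: a flat OID namespace plus per-object metadata."""
    def __init__(self):
        self.objects = {}        # OID -> (data, metadata)

    def put(self, data, **metadata):
        oid = uuid.uuid4().hex   # the store, not the host, assigns the OID
        self.objects[oid] = (data, metadata)
        return oid

    def get(self, oid):
        return self.objects[oid][0]

    def search(self, **criteria):
        # The "intelligence" lives in the metadata: group or find objects by it.
        return [oid for oid, (_, md) in self.objects.items()
                if all(md.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
oid = store.put(b"...mri scan bytes...", owner="radiology", kind="image", year=2011)
store.put(b"...report text...", owner="radiology", kind="document", year=2011)

print(store.get(oid)[:6])               # b'...mri'
print(store.search(owner="radiology"))  # both OIDs, with no directory tree needed
```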

 

Object storage is an intelligent evolution of disk drives that can store and serve objects rather than simply place data on tracks and sectors. And it can bring the following benefits:

  • Intelligent space management in the storage layer
  • Data aware pre-fetching and caching
  • Robust shared access by multiple clients
  • Scalable performance using off-loaded data path
  • Reliability and security

Several vendors such as EMC and NetApp are already supporting OSD.

Stop stroking your …

A few days after I wrote about the performance benchmark bag of tricks, EMC fired the first salvo at NetApp's SPECsfs2008 world records on NFS IOPS.

EMC is obviously using all its ammo to deflate NetApp's chest-thumping act, via Storagezilla's blog. Mark Twomey, the alter ego of Storagezilla, posted several observations about NetApp's apparent use of disk short stroking to artificially boost its performance numbers. This puts NetApp against the wall, with Alex McDonald (who incidentally is the SNIA NFSv4 co-chairman) of the office of the CTO responding hard to Storagezilla's observations.

The news of this appeared in The Register. Read all about it.

With no letting up, the article also mentioned EMC Isilon's CTO, Rob Peglar, adding more fuel to the fire.

I spoke about short stroking as one of the tricks used to gain better benchmark numbers. I also mentioned that these numbers have little relevance to real work, and I would like to add that they are mostly there for marketing reasons. So, for you readers out there, benchmarks are really not that big of a deal.

Have a great weekend!

Ocarina rising

More than a year after Dell acquired Ocarina Networks, it finally surfaced last week in the form of the Dell DX Object Storage 6000G SCN (Storage Compression Node).

Ocarina is a content-aware storage optimization engine, and their solution is one of the best I have seen out there. Its unique ECOsystem technology, as described in the diagram below, is impressive.

Unlike most deduplication and compression solutions out there, the Ocarina Networks solution takes storage optimization a step further. Ocarina works at the file level, and given the crazy, crazy growth of unstructured files in the NAS space, the web and the clouds, storage optimization is one priority that has to be addressed immediately. It takes a 3-step process – Extract, Correlate and Optimize.

Today's files are no longer a flat structure of a single object but more of a compound file where many objects are amalgamated from different sources. Microsoft Office is a perfect example of this. An Excel file would consist of objects from Windows Metafile Formats, XML objects, OLE (Object Linking and Embedding) Compound Storage Objects and so on. (Note: that's just Microsoft's way of retaining monopolistic control.) Similarly, a web page is a compound of XML, HTML, Flash, ASP and PHP objects.

In Step 1, the technology takes files and breaks them down into their basic components. It is kind of like stripping a car down to its nuts and bolts and laying every piece out on the gravel porch. That is the "Extraction" process, and it decodes each file to get at its fundamental components.

Once the compound file object is "extracted", identified and indexed, each fundamental object is correlated in Step 2. The correlation is executed within the file and across files under the purview of Ocarina. Matching and duplicated objects are flagged and deduplicated. The deduplication is done at the byte level, unlike most deduplication solutions that operate at the block level. This deeper and more granular approach further reduces the capacity of the storage required, making Ocarina one of the most efficient storage optimization solutions currently available. That is why Ocarina can efficiently reduce the size of even zipped and highly encoded files.
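As a much-simplified illustration of the correlate-and-deduplicate step (I use fixed-size chunks and SHA-1 fingerprints here, whereas Ocarina's correlation is byte-level and content-aware), duplicated pieces are stored only once and every file becomes a recipe of references.

```python
import hashlib

def dedupe(blobs, chunk_size=4096):
    """Split each blob into chunks, store each unique chunk once,
    and represent every blob as a list of chunk fingerprints."""
    chunk_store = {}      # fingerprint -> chunk bytes (stored once)
    recipes = []          # one recipe (list of fingerprints) per input blob
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()
            chunk_store.setdefault(fp, chunk)   # duplicates are flagged here
            recipe.append(fp)
        recipes.append(recipe)
    return chunk_store, recipes

a = b"A" * 8192 + b"B" * 4096          # 3 chunks
b = b"A" * 8192 + b"C" * 4096          # shares 2 chunks with a
store, recipes = dedupe([a, b])
print(len(store))                      # 3 unique chunks stored instead of 6
```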

It takes this storage optimization even further in Step 3. It applies content-aware compactors for each fundamental object type, uniquely compressing each object further. That means that there are specialized compactors for PDF objects, ZIP objects and so on. They even have compactors for Oil & Gas seismic files. At the time I was exposed to Ocarina Networks and evaluating it, it had about 600+ unique compactors.

After Dell bought Ocarina in July 2010, the whole of Ocarina went into stealth mode. Many had already predicted that the Ocarina technology would be integrated and embedded into Dell's primary storage solutions, Compellent and EqualLogic. It is not there yet, but it will likely be soon.

Meanwhile, this first glimpse of Ocarina comes integrated as a gateway solution to the Dell DX6000 Object Storage. DX Object Storage is a technology which Dell has OEMed from Caringo. The DX6000 Object Storage (I did not read up on it in depth) has the concept of the old EMC Centera, but with a much newer approach based on XML and HTTP REST. It has a published open API, and Dell is getting ISV partners to develop their applications to interact with the DX6000; CommVault, EMC, Symantec and StoredIQ are some of the ISV partners working closely with Dell.

(24/10/2011: Editor's note: Previously I associated Dell DX6000 Object Storage with Exanet. I was wrong, and I would like to thank Jim Dutton of Caringo for pointing out my mistake.)

Ocarina's first mission is to reduce the big, big capacities in the Big Data space of the DX6000 Object Storage, and the Ocarina ECOsystem technology looks like a good bet for Dell as a key technology differentiator.

Novell Fil(e)r … Files, my way

I took a bit of time out of my busy schedule this week to learn a bit more about the Novell Filr.

Firstly, it is a F-I-L-E-R, spelled "Filr", something like Tumblr or Razr. I think it's pretty inventive, but putting marketing aside, I learned a little about how the idea behind the concept works. Right now, my evaluation is pretty much on the surface because I am working out the time for a real-life demo and hands-on later on.

As I mentioned in my previous blog, the idea behind Novell Filr is to allow users to access their files anywhere, on any device. The importance of this concept is that it allows users to stay in their comfort zone. This simple concept, of keeping users comfortable, is something that we should not overlook, because it brings together the needs of the enterprise and the IT organization and the needs of the individual users in a subtle, yet powerful way. It allows the behavioral patterns of the "lazy" users to be corralled into what IT wants them to do, that is, to have the users' files secured, protected and in IT's control. OK, that was my usual blunt way of saying it, but I believe this is a huge step forward in addressing the issues at hand. And I am sorry for saying that the users are "lazy", but that's what the IT guys would say.

What are the usual issues faced when it comes to dealing with user files? Let me count the ways:

  • Users don’t put the files in backup folders as they were told and they blame IT for not backing up their files
  • Users keep several copies of their files and share them through email, thumb drives and so on with their friends and colleagues. IT gets blamed for ever-growing storage capacity needs and, even worse, for breaches of the organization's security as internal files are shared with outsiders.
  • Users want to get their files on iPads, iPhones, Android pads, BlackBerrys and other smart devices, and say IT is too archaic. Users say that they are less productive if they can't get their files anywhere. IT gets the blame again.
  • Users have little discipline to change their habits and to think about file security and ownership of the company's private and confidential data – they share files happily and IT gets the blame.

These points, from the IT point of view, are exactly the challenges faced daily. That is why users are flocking to Box.net, DropBox and Windows Live SkyDrive: they want simplicity; they want freedom; they want IT to get off their backs. But all these "confrontations" are compromising the integrity of the organization's files and data.

Novell Filr is likely to be one of the earliest solutions to address this problem. It attempts to marry the DropBox-style simplicity and freedom for the users with an IT backend where the organization's files are stored, in which IT runs a tight ship of user AAA (authentication, authorization and auditing) and which, at the same time, includes the Novell File Management Suite. As shown below, the Novell File Management Suite consists of 3 main solutions.

I will probably talk more about the File Management Suite in another blog entry, but meanwhile, how does the Novell Filr work?

First of all, it sits in the conversation between the users' devices (typically a Windows computer accessing a network drive via CIFS) and the central file storage. You know, the usual file sharing concept – except that this traditional approach limits the users to computers only, not smart devices such as smartphones and tablets.

In the spirit of DropBox, I believe a Novell Filr client (computers, smart devices etc.) speaks with the Novell Filr "middleware" via a standard RESTful API over HTTP. I still need to ascertain this because I have not had any engagement with Novell yet, nor have I seen the product. In the slides given to me, the explanation at 10,000 feet is shown below.

I will share more details later once I have more information.
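Purely as a hypothetical illustration of what such a REST-over-HTTP file interface could look like (the host, endpoints and fields below are entirely my own invention, not Novell's published API), a client-side call might be sketched like this:

```python
# Hypothetical sketch only: the host, endpoints and JSON fields are invented
# to illustrate a REST-over-HTTP file interface, not Novell Filr's actual API.
import json
import urllib.request

BASE = "https://filr.example.com/api/v1"    # invented endpoint
TOKEN = "user-auth-token"                   # obtained after authentication

def list_files(folder_id):
    # GET the file entries in a folder, authenticated with a bearer token.
    req = urllib.request.Request(
        f"{BASE}/folders/{folder_id}/files",
        headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)              # e.g. a JSON list of file entries

def upload_file(folder_id, name, data: bytes):
    # POST raw file bytes into a folder and get back the new file's metadata.
    req = urllib.request.Request(
        f"{BASE}/folders/{folder_id}/files?name={name}",
        data=data, method="POST",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/octet-stream"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```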

At the same time, I cannot help but notice the changing trend in NAS. It seems to me that many of the traditional NAS ideas are going the way of REST, especially for object-based "file" access. In fact, the definition of a "file" is also changing into that of a web object. While the tide has certainly been rising on this subject, we shall see how it pans out as SMB 2.0 and NFS version 4.0 start making inroads to replace the older NAS protocols of CIFS 1.1 and NFS version 3.0.

As I mentioned previously, this is not disruptive to me, and I know of several vendors that already have similar developments. But the fundamental shift in user behavior towards Web 3.0-style access to data, files and information might be well addressed by the Novell Filr.

I can't wait for the hands-on and demo, knowing that much can be addressed in the enterprise file management space by changing users' habits in a subtle but definitely more effective way.

Novell Filr (How do you pronounce this?)

I'll let you in on a little secret … I am a great admirer of Novell's technology.

Ok, ok, they aren't what they used to be anymore (remember the great heydays of NetWare, ZENworks and GroupWise?), and some of their business decisions didn't win them a lot of fans either. Some notable ones in recent years were the joint patent agreement with Microsoft (November 2006) and their ownership of the Unix operating system rights. Though Novell did finally protect the Unix community by being affirmed as the rightful owner of the Unix OS rights, the negativity from the lawsuit and counter-lawsuit between SCO and Novell soured the relationship with the Unix faithful. In the end, they were acquired by Attachmate late last year.

However, I have been picking up bits of Novell technology knowledge for the past 3-4 years. Somehow, despite the negative perception that most people I know have about Novell, I strongly believe the ideas and thinking that go into their solutions and products are smart and innovative.

So, when a buddy (and ex-housemate) of mine, Mr. Ong Tee Kok, the Country Manager of Novell Malaysia, asked me to evaluate a new solution from Novell (it's not even been released yet), I jumped at the chance.

Novell will soon announce a solution called Novell Filr. I really don't know how to pronounce the name, but the concept of Novell Filr makes a lot of sense. I cannot say that it is disruptive, but it is coming to meet the changing world of how users store and access their files, balancing that with the needs of enterprise file management and access.

Yes, Novell Filr is a file virtualization solution. It comes between the user and their files. Previously, in a network attached environment, files were presented to the users via the typical file sharing protocols, CIFS for Windows and NFS for Unix/Linux. These protocols have been around for ages, with some recent advances in the last few years in SMB 2.0 and NFS version 4. However, the updates to these protocols address the greater needs of organizations and the enterprise rather than the needs of the users.

And because of this, users have been flocking to cloud-centric solutions out there such as DropBox, Box.net and Windows Live SkyDrive. These solutions cater to the needs of the users wanting to access their files anywhere, with any device. Unfortunately, the simplicity of file access the “cloud-way” is not there when the users are in the office network. They would have to be routinely reminded by the system administrator to keep the files in some special directory to have their files backed up. Otherwise, they shall be ostracized by the IT department and their straying files will not be backed up.

Well, Novell will be introducing their Novell Filr soon and they have released a video of their solution. Check this out.

I shall be spending some time this week to look into their solution deeper and hoping to see a demo soon. And I have great confidence in the Novell solutions. I intend to share more about them later.

RedHat to acquire Gluster

This is breaking news. RedHat is to acquire Gluster!

What is Gluster? Gluster is a clustered Linux distribution started by Z Research under the direction of Anand Babu (who is currently Gluster's CEO), aiming to commoditize supercomputing and supercomputing clustered storage. Gluster is open source, but there is a commercial version as well. It runs on commodity 64-bit x86 hardware. The Gluster File System (GlusterFS) aggregates disk and memory resources into a pool of storage through a single global namespace, accessed through multiple file-level protocols. The scale-out architecture means storage resources can be added as storage nodes in a building-block fashion to meet performance and capacity demands, rather like what the HP P4000 does for the block-level SAN environment.

Gluster can be integrated with most 64-bit Linux distros. This is done in Linux user space, but it can also be crafted at the Linux kernel level, where it becomes a software appliance that is easily integrated onto off-the-shelf 64-bit x86-64 platforms. This means that you can build a scale-out NAS pretty easily using your own hardware.
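To give a feel for how little is involved, here is a rough sketch of the typical flow, driving the Gluster CLI from a small Python script. The hostnames, brick paths and volume name are hypothetical, and the exact command syntax can differ between Gluster versions, so treat this as an outline rather than a recipe.

```python
# A rough outline of building a two-node Gluster volume by driving the
# gluster CLI from Python. Hostnames, brick paths and the volume name are
# hypothetical, and CLI syntax can differ between Gluster versions.
import subprocess

def run(cmd):
    """Echo and run a shell command, raising an error if it fails."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. From node1, add node2 to the trusted storage pool.
run("gluster peer probe node2.example.com")

# 2. Create a replicated volume from one brick (a local directory) on each node.
run("gluster volume create myvol replica 2 "
    "node1.example.com:/export/brick1 node2.example.com:/export/brick1")

# 3. Start the volume so that clients can mount it.
run("gluster volume start myvol")

# 4. On a client (run as root), mount the volume using the native protocol.
run("mount -t glusterfs node1.example.com:/myvol /mnt/gluster")
```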

From an architecture standpoint, GlusterFS and its integration into a storage appliance look like this:

 

Because it works in a modular, add-on fashion, this architecture is distributed and extended by replicating the same architecture across additional x86-64 platforms (each one a storage node), as shown below.

 

It's really easy to install Gluster and build a scale-out NAS. I have been saving a couple of videos about how Gluster is installed, and I must say that it's pretty easy. In less than 30 minutes, you can install your first Gluster storage node and then add additional nodes on the fly.

Enjoy the videos.

Video #1 (Gluster Installation)

(I have difficulty uploading the videos because WordPress requires me to purchase one of their solutions)

Video #2 (Creating and adding Storage Node in Gluster)

(I have difficulty uploading the videos because WordPress requires me to purchase one of their solutions)

Note: If you are interested to see the videos, please email to me at chin-fah.heoh@storagenetworking-academy.com.

This news gets me very excited because this is the perfect endorsement of what I have been saying all along. Storage networking and data management are the foundations of CLOUD and VIRTUALIZATION. Without data being stored and managed well, everything falls apart. And as I have mentioned many times before, this is a fantastic time to become an extraordinary storage engineer/consultant/architect/sales (maybe not!)