December 2011 - Storage Gaga

Apple chomps Anobit

By cfheoh | December 27, 2011 - 8:58 am |October 27, 2012 Apple, Data Corruption, Disks, Reliability

A few days ago, Apple paid USD$500 million to buy an Israeli startup, Anobit, a maker of flash storage technology.

Obviously, one of the reasons Apple did so is to move up a notch, differentiate itself from the competition and position itself as a premier technology innovator. It has won the MP3 war with its iPod, but in the smartphone, tablet and notebook space, Apple is being challenged strongly.

Today, flash storage technology is prevalent, and the demand to pack more capacity into a small piece of flash real estate will eventually lead to reliability issues. The most common type of NAND flash storage is MLC (multi-level cell), as opposed to the more expensive type called SLC (single-level cell).

Physically, the internal build of MLC and SLC is exactly the same, except that in SLC, one cell contains 1 bit of data, while in MLC, 2 or more bits occupy one cell. That’s the only difference in the physical structure of NAND flash. However, as you can see from the diagram below, SLC has advantages over MLC.

NAND Flash uses electrical voltage to program a cell and it is always a challenge to store bits of data in a very, very small cell. If you apply too little voltage, the bit in the cell does not register, resulting in something unreadable or an error. If you apply too much voltage, the adjacent cells are disturbed, resulting in errors in the flash. Voltage leakage is not uncommon.

The demand to pack more and more data (i.e. more bits) into one cell geometry results in greater unreliability. Though the reliability of NAND Flash storage is predictable, i.e. we roughly know when it will fail, we will eventually reach a point where the reliability of MLC is no longer acceptable if we continue the trend of packing in more and more capacity.
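To put rough numbers on this, a cell that stores n bits has to distinguish 2^n charge levels inside the same voltage window. The little sketch below is only an illustration: the normalised 1.0 window and the evenly spaced levels are my own simplifying assumptions, not figures from any NAND vendor.

```python
def levels_per_cell(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must hold to store this many bits."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int, window: float = 1.0) -> float:
    """Gap between adjacent levels, assuming levels are spread evenly
    across a hypothetical normalised voltage window."""
    return window / (levels_per_cell(bits_per_cell) - 1)


for name, bits in [("SLC", 1), ("MLC", 2), ("3 bits per cell", 3)]:
    print(f"{name}: {levels_per_cell(bits)} levels, "
          f"margin between levels = {relative_margin(bits):.3f} of the window")
```

The margin shrinks quickly as bits are added, which is why a small voltage disturbance that SLC shrugs off can flip a value in MLC.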

That’s where Anobit comes in. Anobit has designed and implemented architectural changes to the way NAND Flash storage is used. The technology, in layman’s terms, comes in 2 stages.

Error reduction – by understanding what causes flash impairment. This could be cross-coupling, read disturbs, data retention impairments, program disturbs or endurance impairments.
Error correction and signal processing – advanced ECC (error-correcting code), plus the patented (with other patents pending) Memory Signal Processing (TM) to improve the reliability and performance of the NAND Flash storage, as shown in the diagram below:

In a nutshell, Anobit’s new and innovative approach will result in

More reliable MLCs
Better performing MLCs
Cheaper NAND Flash technology

This will indeed open up NAND Flash technology to greater innovation in flash storage in the near future. What Apple will do with Anobit’s technology is anybody’s guess, but one thing is certain: it’s going to propel Apple to new heights.

IDC Worldwide Storage Software QView 3Q11

By cfheoh | December 24, 2011 - 8:37 am |October 27, 2012 EMC, IBM, IDC, Storage Market Share, Uncategorized

I did not miss this when the IDC report of worldwide storage software for Q3 2011 was released a couple of weeks ago. I was just too busy to work on it until just now.

The IDC QView report covers 7 functional areas of storage software:

Data protection and recovery software
Storage replication software
Storage infrastructure software
Storage management software
Device management software
Data archiving software
File system software

All areas are growing and Q3 grew 9.7% when compared with the figures of 3Q2010. In the overall software market, EMC holds the top position at 24.5% followed by Symantec (15.3%) and IBM (14.0%). Here’s a table to show the overall standings of the storage software vendors.

In fact, EMC leads in 3 areas: storage infrastructure, storage management and device management. But the fastest growing area is data archiving software, at a pace of 12.2%, followed by storage and device management at 11.3%.

HP is not in the table, but IDC reported that the biggest growth came from HP at 38.2%, boosted by its acquisition of 3PAR. Watch out for HP in the coming quarters. Also worthy of note is the growth rate Symantec has been experiencing. Theirs was only 2.2%, and IBM, at #3, is catching up fast. I wonder what’s happening at Symantec, having seen them lose their lofty heights in recent years.

The storage software market is a USD$3.5 billion market, and it is a market on which storage vendors are placing more and more importance. This market will grow.

Captain Dynamo Storage System

By cfheoh | December 23, 2011 - 8:54 pm |October 27, 2012 Amazon, Big Data, Data, Object Storage

My research on file systems brought me to a very interesting paper. It is titled “Dynamo: Amazon’s Highly Available Key-Value Store” and dated 2007.

Yes, this is an internal storage system designed and developed at Amazon to scale and support Amazon Web Services (AWS). It is a very complex piece of technology and the paper is highly technical (not for the faint of heart). Of all places, Amazon is probably the last place you would expect to find such smart technology, but it’s true. AWS engineers are slowly revealing many of their innovations (think Amazon Silk browser technology).

And it appears that many of the latest cloud-based computing and services companies such as Amazon, Google and many others have been developing new methods of storing data objects. These methods are very different from the traditional methods of storing data, and many are no longer adopting the relational database model (RDBMS) to scale their business.

The traditional 3-tier architecture often adopted by web-based applications (before the advent of “cloud”) is evolving. As shown in the diagram below:

the foundation tier is usually a relational database (or a distributed relational database), communicating with the back-end storage (usually a SAN).

All that is changing because the relational database model is not keeping up with the tremendous pace of the proliferation of web-based and cloud-based objects or unstructured data. As explained by Alex Iskold, a writer of ReadWriteWeb, there are scalability issues with the conventional relational database.

Before I get to the scalability issues mentioned in the above diagram, let me set the floor for discussion.

For students of relational database theory, the term ACID defines and guarantees the transactional reliability of relational databases. ACID stands for Atomicity, Consistency, Isolation and Durability. According to Wikipedia, “transactions provide an “all-or-nothing” proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.”

ACID has been the cornerstone of relational databases from the very beginning. But as the demands for greater scalability and greater distribution of data grow, all 4 components of ACID (Atomicity, Consistency, Isolation, Durability) can no longer hold true at the same time. Hence, the CAP Theorem.

The CAP Theorem (aka Brewer’s Theorem) takes its name from Consistency, Availability and Partition Tolerance. At an ACM (Association for Computing Machinery) conference in 2000, Eric Brewer of the University of California, Berkeley presented the theorem. It states that it is impossible for a distributed computer system (or a database system) to simultaneously guarantee all 3 components: Consistency, Availability and Partition Tolerance.

Therefore, as database systems become more and more distributed in cyberspace, the ACID model begins to break down: all 4 components of ACID can no longer be guaranteed simultaneously.

So when we get back to the diagram, both the concepts on left and right – Master/Slave OR Multiple Peers – will put a tremendous strain on the single, non-distributed relational database.

New data models are surfacing to handle these very distributed data sets. Distributed object-based “file systems” and NoSQL types of databases are some of the unconventional data storage “systems” that are beginning to surface as viable alternatives to the relational database method in cyberspace. And one of them is the Amazon Dynamo Storage System (ADSS).

ADSS is a highly available, Amazon-proprietary, distributed key-value data store. It has the properties of both a distributed hash table and a database, and it is used internally to power various cloud services in Amazon Web Services (AWS).

It behaves like a relational database where it stores data objects to be retrieved. However, the data objects are not stored in a table format of a conventional relational database. Instead, the data is stored in a distributed hash table and data content or value is retrieved with a key, hence a key-value data model.

The data content is stored and retrieved through a simple put and get interface, much like a RESTful interface would do it. From the article in ReadWriteWeb, here’s how Dynamo works:

Physical nodes are thought of as identical and organized into a ring.
Virtual nodes are created by the system and mapped onto physical nodes, so that hardware can be swapped for maintenance and failure.
The partitioning algorithm is one of the most complicated pieces of the system, it specifies which nodes will store a given object.
The partitioning mechanism automatically scales as nodes enter and leave the system.
Every object is asynchronously replicated to N nodes.
The updates to the system occur asynchronously and may result in multiple copies of the object in the system with slightly different states.
The discrepancies in the system are reconciled after a period of time, ensuring eventual consistency.
Any node in the system can be issued a put or get request for any key
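To make the ring idea above a little more concrete, here is a minimal sketch of consistent hashing with virtual nodes and a put/get interface. It is not Amazon’s code: the class name `Ring`, the MD5 hash, the number of virtual nodes and the in-memory dictionaries are all my own assumptions, and replication is done synchronously here for simplicity, whereas Dynamo replicates asynchronously.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is just a convenient choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical Dynamo-style ring: virtual nodes mapped onto physical nodes."""

    def __init__(self, physical_nodes, vnodes_per_node=8, replicas=3):
        self.replicas = replicas
        # Each physical node owns several positions (virtual nodes) on the ring.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in physical_nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]
        # One in-memory dict per physical node stands in for its local storage.
        self.stores = {node: {} for node in physical_nodes}

    def preference_list(self, key):
        """Walk clockwise from the key's position, collecting distinct physical nodes."""
        start = bisect.bisect_right(self.positions, _hash(key)) % len(self.ring)
        owners = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners

    def put(self, key, value):
        # Simplification: write to every replica immediately.
        for node in self.preference_list(key):
            self.stores[node][key] = value

    def get(self, key):
        for node in self.preference_list(key):
            if key in self.stores[node]:
                return self.stores[node][key]
        return None


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
ring.put("cart:42", {"items": 3})
print(ring.preference_list("cart:42"), ring.get("cart:42"))
```

Because only the positions near a node move when it joins or leaves, most keys stay where they are, which is the property that lets the partitioning scale as nodes come and go.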

The Dynamo architecture addresses the CAP Theorem well. It is highly available: nodes, either physical or virtual, can be easily swapped without affecting the storage services. It is also high performance: nodes (again physical or virtual) can be added to boost performance. The high performance and high availability address the “A” piece of CAP.

Its distributed nature also allows it to scale to billions and billions of data objects and hence meets the “P” requirement of CAP. The Partition Tolerance is definitely there.

However, as stated by the CAP Theorem, you can’t have all 3 happening at the same time. Therefore, the “C” or Consistency piece of CAP has to be compromised. That is why Dynamo has been labeled an “eventually consistent” storage system.

As data is stored into ADSS, the changes to the data are propagated and asynchronously replicated to other nodes in the system, eventually making all the data objects and their values consistent. However, given the speed of things in cyberspace and the nature of most Cloud Computing services, immediate consistency could be difficult to accomplish, and that is OK because in most distributed transactions, temporary inconsistency is acceptable.

So that’s a bit about the Amazon Dynamo. Alas, we may never get our grubby hands on this piece of cool data storage and management technology, but knowing that Dynamo is powering AWS and its business is an eye-opener for us into the realm of a new technology evolution.

Is there IOPS for Cloud Storage? – Nasuni style

By cfheoh | December 21, 2011 - 12:14 pm |October 27, 2012 Cloud, Data, Filesystems, Nasuni, Performance Caching

I was in Singapore last week attending the Cloud Infrastructure Services course.

In the class, one of the foundation components of Cloud Computing is, of course, storage. As the students and the instructor talked about storage, one very interesting argument surfaced. It revolved around what storage means when it is offered on the cloud. A lot of people assumed that Cloud Storage would be for their databases and their virtual machines, which, of course, is true when the communication between the applications, virtual machines and databases is in the local area network of the Cloud Service Provider (CSP).

However, if the storage is offered through the cloud to applications that are sitting on-premise in the customer’s server room, then we have to think twice about how we perceive Cloud Storage. In this case, the Cloud Storage offered by the CSP is an Infrastructure-as-a-Service (IaaS), where the key service is Storage. We have to recognise that this Storage functions as a data container, and is usually not chosen for I/O performance reasons.

Though this concept will probably be easily understood by storage professionals like us, it can cause a bit of confusion for someone new to the concept of Cloud Computing and Cloud Storage. This confusion, unfortunately, is caused by many of us who are vendors or solution providers, or even publications and magazines. We are responsible for disseminating correct information to customers, but due to our lack of knowledge and experience in this extremely new market of Cloud Storage, we have created the FUD (Fear, Uncertainty and Doubt) and hype.

Therefore, it is the duty of this blogger to clear the vapourware, and hopefully pass on the right information to accelerate the adoption of Cloud Storage in the near future. At this moment, given factors such as network costs, high network latency and the lack of key LAN-like network technologies in Cloud Computing, Cloud Storage is, most of the time, for data containership and archiving only. And there are no IOPS or other performance statistics for Cloud Storage. If any engineer or vendor tells you that they have the fastest Cloud Storage in the industry, do me a favour. Give him/her a knock on the head for me!

Of course, as technologies evolve, this could change in the near future. For now, Cloud Storage is a container, NOT high performance storage in the cloud. It is usually not meant for transactional data. There are many vendors in the Cloud Storage space, from real CSPs to storage companies offering re-packaged storage boxes that are “cloud-ready”. A good example of a CSP offering Cloud Storage is Amazon S3 (Simple Storage Service). And storage vendors such as EMC and HDS are repackaging and rebranding their storage technologies as object storage, ready for the cloud. EMC Atmos is really a repackaged and rebranded Centera, with some slight modifications, while HDS, using their archiving solution, has HCP (aka HCAP). There’s nothing wrong with what EMC and HDS have done, but before the overhyping of the world of Cloud Computing, these platforms were meant for immutable data archiving. Just thought you should know.

One particular company that captured my imagination and addresses the storage performance portion is Nasuni. They are quite inventive with the Cloud Storage Gateway approach. Nasuni comes up with a Cloud Storage Gateway filer appliance, which can be either a physical 1U server or a VMware or Hyper-V virtual appliance sitting on-premise at the customer’s site.

The key to this is “on-premise”, which allows much faster access to data because it is locally cached in the Nasuni filer appliance itself. This Nasuni filer piece addresses the Cloud Storage “performance” piece, but Nasuni does not claim any performance statistics for such an implementation. The clever bit is that this addresses data or files that are transactional in nature, i.e. NFS or CIFS, by serving data or files “locally”. (I wonder if the Nasuni filer has iSCSI as well. Hmmmm….)

In the Nasuni architecture, they “break up” their “Cloud Storage” into 2 pieces. Piece #1 sits on-premise, at the customer site, and acts as a bridge to Piece #2, which sits in a Cloud Storage service. For a simplified view, have a look at the diagram below:

Piece #1 is the component that handles some of the transactional traffic related to files. In a more technical diagram below, you can see that the Nasuni filer addresses the file sharing portion, using the local disks on the filer appliance as a local caching mechanism.

Furthermore, older file pieces are whisked away to any Cloud Storage using the Cloud Connector interface, hence giving the customer a sense that their storage capacity can be limitless if they want it to be (for a fee, of course). At the same time, the Nasuni filer supports thin provisioning and snapshots. How cool is that!
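The local-cache-plus-cloud idea can be pictured with a tiny least-recently-used cache. This is purely my own illustration and not how the Nasuni filer is actually implemented: the class name `CachingGateway`, the write-through behaviour and the plain dictionary standing in for the cloud object store are all assumptions.

```python
from collections import OrderedDict


class CachingGateway:
    """Hypothetical on-premise gateway: a small local cache in front of a cloud store."""

    def __init__(self, cache_slots, cloud):
        self.cache_slots = cache_slots   # how many files fit on the local disks
        self.cache = OrderedDict()       # least recently used entry comes first
        self.cloud = cloud               # a dict standing in for the cloud object store

    def write(self, name, data):
        # Assumption: every write is also copied to the cloud (write-through).
        self.cloud[name] = data
        self._remember(name, data)

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)          # mark as recently used
            return self.cache[name], "local cache"
        data = self.cloud[name]                   # slow path: fetch over the WAN
        self._remember(name, data)
        return data, "cloud"

    def _remember(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        # Older files fall out of the local cache but stay in the cloud.
        while len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)


gateway = CachingGateway(cache_slots=2, cloud={})
for name, data in [("a.doc", b"1"), ("b.doc", b"2"), ("c.doc", b"3")]:
    gateway.write(name, data)
print(gateway.read("c.doc"))   # recent file, served locally
print(gateway.read("a.doc"))   # older file, fetched back from the cloud
```

Recently used files are served at local speed, while the capacity the customer sees is bounded only by what the cloud side will hold.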

The Cloud Storage piece (Piece #2) is used as the data container and for archiving. This component can be hosted at Amazon S3, Microsoft Azure, Rackspace Cloud Files, Nirvanix Storage Delivery Network or Iron Mountain Archive Services Platform.

The data communication and transfer between the Nasuni filer and the Cloud Storage is secure, encrypted, deduplicated and compressed, giving it the efficiency and security that most customers would be concerned about. The diagram below explains the data communication and data transfer bit.

In this manner, the Nasuni filer could stand in for traditional NAS platforms and potentially provide a much lower total cost of ownership (TCO) in the long run, even though Nasuni itself does not pretend to be a NAS replacement. To me, this concept is very inventive and could potentially change the way we perceive file sharing and file servers, blurring the concept of NAS.

Again, I would like to reiterate that Nasuni does not attempt to say their solution is a NAS or a performance-based Cloud Storage but what they have cleverly packaged seems to be appealing to customers. Their customer base has grown 78% in Q2 of 2011. It’s just too bad they are not here in Malaysia or this part of the world (yet).

IOPS in Cloud Storage? Not yet.

A little yellow elephant

By cfheoh | December 17, 2011 - 10:35 am |October 27, 2012 Analytics, Big Data, Data, Filesystems, Hadoop

By now, I believe most of you in the storage networking world would have heard of Hadoop. Hadoop was created by Doug Cutting, while he and his team were working on an open source web search engine called Nutch. The easily recognized little yellow elephant, Hadoop, was Doug Cutting’s son’s toy, which he made Hadoop’s mascot. Pretty cool!

And today, Hadoop has become THE platform for Big Data applications. Why?

As I have mentioned before, everything that we do or don’t do generates data, either as a direct or an indirect product. I am blogging right now and I am creating data. I was in Singapore the whole of this week and everywhere I went in the MRT stations, I was being watched by the video cameras they have at the stations. A new friend in class said that Singapore is the second most “watched” city after London, with video cameras mounted everywhere, either discreetly or indiscreetly. And that’s just video data. There are plenty of other human activities that generate tons and tons of data.

The IDC Digital Universe Report for 2011 said that we have generated 1.8ZB (zettabytes) of data this year alone. I mentioned in my previous blog that this is a gold mine and companies are scrambling to tap into this massive amount of data. Extracting valuable information to anticipate the next trend or predict the next evolution in human preference is akin to the Gold Rush in the wild, wild west in the late 19th century. Folks, Big Data is going to be this generation’s “Digital Gold Rush”.

Sieving, filtering and processing gazillions of pieces of data (more unstructured than structured) will not work in defined, well-formatted relational databases. The data model of relational databases will simply break down. Of course, there are different schools of thought on different data models, but the Hadoop model seems to be gaining momentum and mind share among data scientists. That is because of Hadoop’s capability to deal with massive unstructured data, processing it and producing results in a small amount of time.

One way to process a massive pool of data is parallel programming. In parallel programming, multi-threading is commonly deployed to achieve the required performance. But implementing multi-threading is difficult. Developers often have to deal with LWPs (lightweight processes), semaphores, shared memory, mutex (mutual exclusion) locking and so on. This style of programming works with mutable state on shared data, so the same expression can produce different results depending on the state at the time it runs.

Hadoop belongs to another school of programming known as functional programming, where the different states on shared data concept is removed. With that in mind, the dependency on different states is also removed, resulting in a much easier and simpler parallel programming implementation. Hadoop borrows ideas from the MapReduce software framework made well known by Google and the Google File System.

Before we get to know Hadoop, we must know MapReduce. MapReduce is a framework which allows very large data sets to be processed with a very large set of computer nodes in a cluster. Typically the computational processing is executed in a distributed fashion, spread across many computer nodes, and the final results are consolidated from the sub-results of these distributed processing nodes.

According to Wikipedia, the 2 key functions of MapReduce are map() and reduce(). That’s pretty obvious. The extract below was taken from the Wikipedia definition, and explains both functions very well.

“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
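The classic way to see the two steps in action is a word count. The sketch below runs everything in one Python process, so it only mimics the shape of MapReduce and is not Hadoop code; the function names and the intermediate shuffle step that groups values by key are my own choices for illustration.

```python
from collections import defaultdict


def map_step(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key so each reducer sees one word's counts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, counts):
    """Reduce: combine the partial counts for one word."""
    return word, sum(counts)


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different worker node.
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    return dict(reduce_step(word, counts) for word, counts in shuffle(mapped).items())


print(word_count(["big data big", "data science"]))
```

Notice that map_step and reduce_step share no state with each other, which is what makes it safe to spread them across many machines.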

The diagram below probably can simplify the concept of MapReduce to the readers.

Hadoop is one of the open-source implementations of MapReduce. It is one of the projects of the Apache Software Foundation, and the project has sparked a brand-new niche of data search, data management and data science. The diagram below will allow our readers to juxtapose MapReduce and Hadoop, and compare them in the simplest fashion.

Hadoop’s primary development platform is Java. Hadoop’s architecture consists mainly of 2 components, Hadoop Common and a Hadoop-compatible file system, as shown in the diagram below.

The Hadoop MapReduce layer above is the file/object access interface to the Hadoop-compatible file system below. HDFS, the Hadoop Distributed File System, is just one of a few Hadoop-compatible file systems. Other file systems include:

Amazon S3 File System as part of the Amazon EC2 Infrastructure-as-a-Service (IaaS) cloud platform
CloudStore – a similar Hadoop-like implementation using C++ and also inspired by Google File System
FTP file systems
HTTP and HTTPS read-only file systems
Any file systems accessible with the file:// URL nomenclature

But the main engine of Hadoop is in the MapReduce layer. The 2 core components in this layer are the JobTracker and the TaskTracker. Both have their own individual roles to play and collectively, they are key cogs in the Hadoop distributed data processing model.

Below is an extract I picked up from Wikipedia.

Client applications submit MapReduce jobs to the JobTracker. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser. Jetty is a Java-based HTTP server, among other things.

JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Scheduling

By default Hadoop uses first-in, first-out (FIFO) scheduling, with 5 optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS (Quality of Service) for production jobs. The fair scheduler has three basic concepts.

Jobs are grouped into Pools.
Each pool is assigned a guaranteed minimum share.
Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
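The three concepts of the fair scheduler can be pictured with a small allocation function. This is only my own simplified sketch, not Hadoop’s actual algorithm: the function name, the pool names and the one-slot-at-a-time round robin for the excess capacity are assumptions, and it treats map and reduce slots as one generic pool of slots.

```python
def fair_allocate(total_slots, pools):
    """pools maps a pool name to (guaranteed_minimum, demand).

    Step 1: each pool gets its guaranteed minimum, capped by what it needs.
    Step 2: spare slots are handed out one at a time, round robin,
            to pools that still want more.
    Assumes the guaranteed minimums add up to no more than total_slots.
    """
    alloc = {name: min(minimum, demand) for name, (minimum, demand) in pools.items()}
    spare = total_slots - sum(alloc.values())
    while spare > 0:
        hungry = [name for name, (_, demand) in pools.items() if alloc[name] < demand]
        if not hungry:
            break  # every pool is satisfied; leftover slots stay idle
        for name in hungry:
            if spare == 0:
                break
            alloc[name] += 1
            spare -= 1
    return alloc


pools = {
    "production": (6, 10),  # guaranteed 6 slots, wants 10
    "adhoc": (2, 3),        # guaranteed 2 slots, wants 3
    "default": (0, 1),      # uncategorized jobs, no guarantee
}
print(fair_allocate(12, pools))
```

With 12 slots, the production pool is never squeezed below its guaranteed 6, yet the small ad hoc and default pools still get served instead of waiting behind it.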

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.

Jobs are submitted into queues.
Queues are allocated a fraction of the total resource capacity.
Free resources are allocated to queues beyond their total capacity.
Within a queue a job with a high level of priority will have access to the queue’s resources.

I took most of the extract above from Wikipedia, and I don’t claim to be a knowledgeable person on Hadoop. All the credit goes to the Wikipedia editors for putting Hadoop in layman’s terms.

Hadoop has certainly won the hearts of those in the new digital gold rush, Big Data, and is slowly becoming a force to be reckoned with among data scientists. Hadoop implementations are powering new frontiers in processing and mining the ever growing data capacity, giving solution providers a simple programming methodology and data model to gain more insights into the vast seas of data and information.

Hadoop has many fans, and is slowly becoming the data platform for large companies such as Yahoo!, Facebook, IBM, Amazon, Apple, eBay and many more. Facebook even claims to have the largest Hadoop cluster in the world, growing to 30PB in July of 2011.

This little yellow elephant is going places and is one to watch out for.

Greenplum looking mighty sweet

By cfheoh | December 16, 2011 - 9:06 am |October 27, 2012 Big Data, EMC

Big data is Big Business these days. IDC predicts that between 2012 and 2020, spending on big data solutions will account for 80% of IT spending, growing at 18% per annum. EMC predicts that the big data market is worth USD$70 billion! That’s a huge market.

We generate data, and plenty of it. In the IDC Digital Universe Report for 2011 (sponsored by EMC), approximately 1.8 zettabytes of data will be created and replicated in 2011. How much is 1 zettabyte, you say? Look at the conversion below:

                    1 zettabyte = 1 billion terabytes

That’s right, folks. 1 billion terabytes!
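If you want to check the arithmetic yourself, here is a quick sketch using decimal (SI) units, where 1 terabyte is 10^12 bytes and 1 zettabyte is 10^21 bytes. The variable names are just mine.

```python
TERABYTE = 10 ** 12    # bytes, decimal (SI) definition
ZETTABYTE = 10 ** 21   # bytes, decimal (SI) definition

terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# 1.8 ZB from the IDC report, kept in integer arithmetic (18/10) to avoid rounding.
digital_universe_2011_tb = 18 * ZETTABYTE // 10 // TERABYTE

print(f"1 ZB   = {terabytes_per_zettabyte:,} TB")
print(f"1.8 ZB = {digital_universe_2011_tb:,} TB")
```

So the 1.8 zettabytes in the IDC report works out to 1.8 billion terabytes.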

And this “mountain” of data and information is a Goldmine of goldmines, and companies around the world are scrambling to tap into this treasure chest. According to Wikibon, big data has the following characteristics:

Very large, distributed aggregations of loosely structured data – often incomplete and inaccessible
Petabytes/exabytes of data
Millions/billions of people
Billions/trillions of records
Loosely-structured and often distributed data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred

But what is relevant is not the definition of big data, but rather what you get from the mountain of information generated. The ability to “mine” the information from big data, now popularly known as Big Data Analytics, has sparked a new field within the data storage and data management industry. This is called Data Science. And companies and enterprises that are able to effectively use the new data from Big Data will win big in the next decade. Activities such as

Making business decisions
Gaining competitive advantage
Driving productivity growth in relevant industry segments
Understanding consumer and business behavioural patterns
Knowing buying decisions and business cycles
Yielding new innovation
Revealing customer insights
And much, much more

will drive a whole new paradigm that shall be known as Data Science.

EMC purchased Greenplum more than a year ago and started their Data Computing Products Division immediately after the acquisition. In October of 2010, EMC announced their Greenplum Data Computing Appliance with some impressive numbers, using the 2 configurations of the appliance noted below:

Below are 2 tables of the Greenplum performance benchmarks:

That’s what these big data appliances are capable of. The ability to load billions of either structured or unstructured files or objects in mere minutes is what drives the massive adoption of Big Data.

And a few days ago, EMC announced their Greenplum Unified Analytics Platform (UAP), which comprises 3 Greenplum components:

A relational database for structured data
An enterprise Hadoop engine for the analysis and processing of unstructured data
Chorus 2.0, which is a social media collaboration tool for data scientists

The diagram below summarizes the UAP solution:

Greenplum is certainly ahead of the curve. Competitors like IBM Netezza, Teradata and Oracle Exalogic are racing to be ahead but Greenplum is one of the early adopters of a single platform for big data. Having a consolidation platform will not only reduce costs (integration of all big data components usually incurs high professional services’ fees) but will also reduce the barrier to entry to big data, thus further accelerating the adoption of big data.

Big Data is still very much in its infancy and EMC is pushing to establish its footprint in this space. Last week, EMC Education announced the general availability of courses related to big data, as well as the EMC Data Science Architect (EMC DSA) certification. Greenplum is enjoying the early sweetness of the Big Data game and there will be more to come. I am certainly looking forward to sharing more on this plum (pun intended ;-)) of the data storage and data management excitement.

“Ugly Yellow Box” bought by private equity firm

By cfheoh | December 11, 2011 - 7:05 am |October 27, 2012 Bluecoat, NetApp, Security

Security is BIG business, probably even bigger than storage and with more “sex” appeal and pizzazz! My friends are owners of 2 of the biggest security distributors in town, so I know. I am not much of a security guy, but the reason I write about Bluecoat is that this company has something close to my heart.

In the early 2000s, NetApp used to have a separate division that was not storage. They had a product called NetCache, which was a web proxy solution. It was a pretty decent product and one of the competitors we frequently encountered in the field was an “ugly yellow box” called CacheFlow. Whenever we saw an “ugly yellow box” in a rack, we would immediately know that it was a CacheFlow box. NetApp competed strongly with CacheFlow, partly because their CEO and founder, Brian NeSmith, as we NetAppians were told, was ex-NetApp. And there was some animosity between Brian and NetApp, up to the point that I recall NetApp’s CEO then, Dan Warmenhoven, declaring that “NetApp will bury CacheFlow!”, or something of that nature. At that point, circa 2001-2002, CacheFlow was indeed in a bit of a rut as well. They suffered heavy losses and were near bankruptcy. An old news article from Forbes confirmed Brian NeSmith’s near-bankruptcy adventure.

CacheFlow survived the rut, changed their name to Bluecoat Systems, and changed their focus from Internet caching to security. Know why they are known as “Bluecoat”? They are the policemen of the Internet, and policemen are men in blue coats. I found an old article from Network World about their change. And they decided not to paint their boxes yellow anymore. 😉

Eventually, it was CacheFlow who triumphed over NetApp. And the irony was that NetApp eventually sold the NetCache unit and its technology to Bluecoat in 2006. And that is my account of the history of Bluecoat.

Yesterday, Bluecoat was in the history books again, but for a better reason. A private equity firm, Thoma Bravo, has put in USD$1.3 billion to acquire Bluecoat. News here and here.

Have a happy Sunday 😀

Gartner 3Q2011 WW ECB Disk Storage Market

By cfheoh | December 10, 2011 - 9:54 am |October 27, 2012 Commvault, Dell, EMC, Gartner, HP, IBM, NetApp, Oracle, Storage Market Share


Just after IDC released their numbers of their worldwide Disk Storage System Tracker (Read my blog) 10 days ago, Gartner released their Worldwide External Controller Based (ECB) Disk Storage Market report for Q3 of 2011.

The storage market remains resilient (for now), growing 10.4% in terms of revenue despite the hard economic conditions. The table below shows the top 7 storage vendors and how their numbers compare with Q2.

EMC remained at the top and posted a massive 3.6% jump in market share. Looks like they are firing on all cylinders and chugging along like an unstoppable steam train. IBM gained 0.1% in second place as its stable of DS8000, XIV and Storwize V7000 is taking shape. Even though IBM has been holding steady, I still think that their present storage lineup is staggered and lacks a seamless upgrade path for their customers.

NetApp, which I always term the “little engine that could”, is slowing down. They were badly hit in the last quarter, delivering lower than expected revenue numbers according to the analysts. Their stock took a tumble too. As quoted by Gartner, “NetApp’s third-quarter results reflect an overdependence on a few large customers, limited geographic coverage in high-growth countries and increased competition from Dell, EMC, HP and IBM in the midrange modular ECB disk array market segment.”

I wrote in my recent blog that NetApp has to start evolving from a pure-play storage vendor into a total storage and data management solution vendor. The recent rumours of NetApp’s interest in Commvault and Quantum would make a lot of sense if NetApp decides to make that move. Come on, NetApp! What are you waiting for?

HP came back strong in this report. They are in 4th place with 10.4% market share and hot on NetApp’s heels. After many months of nonsensical madness – the Leo Apotheker firing, trying to ditch the PC business, the killing of the WebOS tablet, the very public Oracle-HP spat – things are beginning to settle a bit under their new CEO, Meg Whitman. At the recent HP Discover conference in Vienna, it was reported that the HP storage team is gung-ho about what they have in their arsenal right now. They called it “The 4 Jewels of HP Storage Crown”, which include 3PAR, Ibrix, StoreOnce and LeftHand. They also leap-frogged over HDS and Dell in the recent Gartner Magic Quadrant (See below).

Kudos to HP and team.

HDS seems to be doing well, and so does Dell. But the Gartner numbers tell a different story. HDS lost market share and now shares 7.8% market share with Dell. Dell, despite its strong marketing on Compellent, could not make up for its loss after breaking off with EMC.

Fujitsu and Oracle complete the lineup.

My conclusion: HP and IBM are coming back; EMC is well and far ahead of everyone else; NetApp has to evolve; Dell is still lacking in enterprise storage savviness despite having good technology; no comments about HDS.

Cloud Computing and it’s not iCloud

By cfheoh | December 9, 2011 - 9:55 am |October 27, 2012 Amazon, Apple, Cloud, Steve Jobs


Steve Jobs was great at what he did, but when it comes to Cloud Computing, Jeff Bezos of Amazon is the one. And I believe that Amazon Web Services (AWS) is bigger than Apple’s iCloud, both now and in the future. Why do I say that, knowing that the Apple fan boys could be using me as target practice? Because I believe what Amazon is doing is the future of Cloud Computing. Jeff Bezos is a true visionary.

One thing we have to note is that we play different roles when it comes to Cloud Computing. There are Cloud Service Providers (CSP) and there are enterprise subscribers. On a personal level, there are CSPs that cater for consumer-level type of services and there are subscribers of this kind as well. The diagram below shows the needs from an enterprise perspective, for both providers and subscribers.

Also, we tend to see Amazon from a less enterprise-centric perspective, and they are probably better known for their engagement at the consumer level. But what Amazon is brewing could already be what Cloud Computing should be, and I don’t think Apple iCloud is quite there yet.

Amazon Web Services caters to the enterprise and the IT crowd, providing both Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) through delectable offerings such as:

Elastic Compute Cloud (EC2)
SimpleDB
Simple Storage Service (S3)
Elastic Block Store (EBS)
Elastic Beanstalk
CloudFormation
and many more

And AWS has been operational and serving enterprise customers for 5-6 years now. Netflix, Zynga, Farmville are some of AWS’s customers. This is something Apple iCloud does not have: a Cloud Computing ecosystem for enterprise customers. Apple iCloud does not offer PaaS or IaaS. Perhaps it is Apple’s vision not to get into the enterprise, but eventually the world revolves around businesses, and businesses are adopting Cloud Computing. Many readers may disagree with what I say in this paragraph, but I will share with you later that even at the consumer level, Amazon is putting the right moves in place, probably more so than Apple (more about this later).

But the recent announcement of the Kindle Fire, their USD$199 Android-based gadget, was, to me, the final piece of Amazon’s Phase I jigsaw – the move to conquer the Cloud Computing space. I read somewhere that the USD$199 Kindle Fire actually costs about USD$201.XX to manufacture. Apple’s iPad costs USD$499. So Amazon is making a loss on each gadget they sell. So what! It’s no big deal.

Let me share with you this table that will rattle your thinking a little bit. Remember this: Cloud Computing is defined as a “utility”. Cloud Computing is about services, content.

The table was taken from a recent Wired Magazine article. It featured the interview with Jeff Bezos. Go check out the interview. It’s very refreshing and humbling.

I hope the table is convincing enough to show that the device or the gadget doesn’t matter. Yes, Apple and Amazon have different visions when it comes to Cloud Computing, but if you take some time to analyze the comparison, Amazon does not lock you into buying expensive (but very good) hardware, unlike Apple.

Take for instance the last point. Apple promotes downloaded media while Amazon uses streamed media. If you think about it, that is what Cloud Computing should be, because the services and the content are a utility. Amazon is providing services and content as a utility. Apple’s thinking is more old-school, still very much the PC-era type of mentality. You have to download the applications onto your gadget before you can use them.

Even the Amazon Silk browser concept is more revolutionary than Apple’s Safari. The Silk browser offloads some of the processing to the Amazon Cloud, taking advantage of the power of the Amazon Cloud to do the processing for the user. Here’s a little video about the Amazon Silk browser.

Apple’s Safari is still very PC-centric, where most of the Web content has to be downloaded onto the browser to be viewed and processed. No doubt Amazon Silk also downloads content, but some of the processing, such as read-ahead and applet-processing functions, has been moved to the Amazon Cloud. That’s changing our paradigm. That’s Cloud Computing. And iCloud does not have anything like that yet.

Someone once told me that Cloud is about economics. How incredibly true! It is about having the lowest costs for both providers and consumers. It’s about bringing a motherlode of content that can be delivered to you over the network. Amazon has tons of digital books, music, movies, TV and computing power to sell to you. And they are doing it at a responsible pace, with low margins. With low margins, the barrier to entry is lower, which in turn accelerates Cloud Computing adoption. And Amazon is very good at that. Heck, they are selling their Kindle Fire at a loss.

Jeff Bezos has stressed that what they are doing is long term, much longer term than most. To me, Jeff Bezos is the better visionary of Cloud Computing. I am sorry but the reality is Steve Jobs wants high margins from the gadgets they sell to you. That is Apple’s vision for you.

Photo courtesy of Wired magazine.

What should be a Cloud Storage?

By cfheoh | December 8, 2011 - 1:43 pm |October 27, 2012 Analytics, Big Data, Filesystems, Object Storage


For us filesystem guys, NAS is the way to go. We are used to storing files in network file systems via the NFS and CIFS protocols and treating the NAS storage array like a refrigerator – taking stuff out and putting stuff back in. All that is fine and well as long as the data is what I would term corporate data.

Corporate data is generated by employees, applications and users of the company, and for a long time the power of data creation lay in the hands of the enterprise. That is why storage solutions are designed to address the needs of the enterprise, where the data is structured and well defined. How the data is stored, how it is formatted and how it is accessed are the “boundaries” of how the data is used. Typically a database is used to “restrict” the data created so that the information can be retrieved and modified quickly. And of course, SAN guys will tell you to put the structured data of the database into their SAN.

For the unstructured data in the enterprise, NAS file systems hold that responsibility. Files such as documents and presentations have more loosely defined “boundaries”, and hence filesystems are a more natural fit for unstructured data. Filesystems are like a free-for-all container, able to store and provide access to any file in the enterprise.

But today, as Web 2.0 applications are already taking over the enterprise, the power of data creation does not necessarily lie in the hands of the enterprise applications and users. In fact, it is estimated that the percentage of enterprise data has now exceeded 50% of the enterprise’s total data capacity. With the proliferation of personal devices such as tablets, Blackberries, smart phones, PDAs and so on, individual contributors are generating plenty of data. This situation has been made more acute by Web 2.0 applications, such as Facebook, blogs, social networking, Twitter, Google Search and so on.

Unfortunately, file systems in the NAS category are still pretty much traditional file systems, and they cannot meet the needs that call for a new type of file system. The paradigm is definitely shifting. The new unstructured data world needs a new storage concept. I would term this type of storage “Cloud Storage” because it breaks down the traditional concepts of NAS.

So what basically defines a Cloud Storage? I already mentioned that the type of unstructured data has changed. The new requirements for this unstructured data type are:

The unstructured data type is capable of being globally distributed.
There will be billions and billions of unstructured data objects created, but each object, be it a Twitter tweet, an uploaded mobile video, or even the clandestine data collected by CarrierIQ, can be accessed easily via a single namespace.
The storage file system foundation for this new unstructured data type is easily provisioned and managed. Look at Facebook. It is easy to set up and get going, and the user (and probably the data administrator) can easily manage the user interface and the platform.
For the service provider of Cloud Storage, the file system must be secure and support multi-tenancy and virtualization of storage resources.
There should be some form of policy-driven content management. That is why development platforms such as Joomla!, Drupal and WordPress are slowly becoming enterprise driven to address these unstructured data types.
The data is highly searchable and has a high degree of search optimization. A Google search does have a strong degree of intelligence and relevance to the data being searched, as well as generating tons of by-product data that feeds the need to understand the consumers or the users better. Hail Big Data!
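To make a few of these requirements a little more concrete, here is a minimal sketch in Python of a toy, in-memory object store. Everything in it (the ToyObjectStore class, the tenant names, the keys and the metadata fields) is made up for illustration and does not describe any vendor's product. It only shows three of the ideas above: every object lives in one flat namespace addressed by key, tenants are kept apart, and objects can be found by their metadata instead of by a directory path.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One unstructured data object plus its descriptive metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyObjectStore:
    """Hypothetical in-memory store with a single flat namespace."""

    def __init__(self):
        # (tenant, key) -> StoredObject; there are no directories to walk.
        self._objects = {}

    def put(self, tenant, key, data, **metadata):
        # Each tenant gets an isolated slice of the same namespace.
        self._objects[(tenant, key)] = StoredObject(data, dict(metadata))

    def get(self, tenant, key):
        return self._objects[(tenant, key)].data

    def search(self, tenant, **criteria):
        # Return the tenant's keys whose metadata matches every criterion.
        return sorted(
            key
            for (owner, key), obj in self._objects.items()
            if owner == tenant
            and all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


store = ToyObjectStore()
store.put("acme", "videos/launch.mp4", b"...", kind="video", region="apac")
store.put("acme", "tweets/1001", b"hello", kind="tweet", region="apac")
store.put("globex", "tweets/1001", b"other", kind="tweet", region="emea")

print(store.search("acme", kind="tweet"))  # ['tweets/1001']
```

A real Cloud Storage would of course spread these objects across sites and enforce security and policies on top, but the access pattern (a key plus metadata, not a path in a directory tree) is the part that differs from a traditional NAS file system.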

So when I look at traditional NAS storage solutions such as NetApp, EMC VNX or BlueArc, I ask whether their NAS solutions have the capabilities to meet the requirements of this new unstructured data type.

Most of them, no matter how they package it, still rely on files as the granular object of storage. And today, most files may have some form of metadata, such as file name, owner, size etc., but they DO NOT possess the capability of being content-aware. Here’s an example of what I want to show you:

The file properties (part of the file metadata) tell you about the file but little about the content of the file. Today, it requires more than that and the new unstructured data type should look more like this:

If you look at the diagram below, the object on the right (which is the new unstructured data type) displays much more information than a typical file in a NAS file system. This additional information becomes fodder for other applications such as search engines, RSS feeds, robots and spiders and, of course, big data analytics.
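The same contrast can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class names, the field names and the sample values (the photo name, the tags, the location) are all hypothetical and are not taken from any real file system or object store.

```python
from dataclasses import dataclass, field


@dataclass
class FileProperties:
    """What a traditional NAS file system knows about a file."""
    name: str
    owner: str
    size_bytes: int


@dataclass
class ContentAwareObject:
    """The same file wrapped with extended, content-describing metadata."""
    properties: FileProperties
    tags: list = field(default_factory=list)
    extended: dict = field(default_factory=dict)

    def matches(self, term):
        # A search engine or analytics job can match on content descriptors,
        # not just on the file name.
        term = term.lower()
        haystack = [
            self.properties.name,
            *self.tags,
            *map(str, self.extended.values()),
        ]
        return any(term in text.lower() for text in haystack)


plain = FileProperties(name="IMG_0042.jpg", owner="alice", size_bytes=2_411_724)
rich = ContentAwareObject(
    properties=plain,
    tags=["beach", "sunset"],
    extended={"location": "Penang", "camera": "phone"},
)

print("sunset" in plain.name.lower())  # False: file properties say little about content
print(rich.matches("sunset"))          # True: the extended metadata does
```

The file properties alone cannot answer "show me the sunset photos", while the object carrying extended metadata can, and that extra information is exactly what search and analytics applications feed on.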

Here’s another example of what I mean by this extended metadata. A Cloud Storage array is required to work with this new set of parameters and a new set of requirements.

There’s a new unstructured data type in town. Traditional NAS systems may not have the right features to work with this new paradigm.

Don’t be whitewashed by the fancy talk of storage vendors in town. Learn the facts, and find out what a Cloud Storage really is.

It’s time to think differently. It’s time to think of what should be a Cloud Storage.
