The Network is Still the Computer

[Preamble: I have been invited by  GestaltIT as a delegate to their TechFieldDay from Oct 17-19, 2018 in the Silicon Valley USA. My expenses, travel and accommodation are covered by GestaltIT, the organizer and I was not obligated to blog or promote their technologies presented at this event. The content of this blog is of my own opinions and views]

Sun Microsystems coined the phrase “The Network is the Computer“. It became one of the most powerful ideologies in the computing world, but over the years, many technology companies have tried to emulate and practise the mantra, but fell short.

I have never heard of Drivescale. It wasn’t in my radar until the legendary NFS guru, Brian Pawlowski joined them in April this year. Beepy, as he is known, was CTO of NetApp and later at Pure Storage, and held many technology leadership roles, including leading the development of NFSv3 and v4.

Prior to Tech Field Day 17, I was given some “homework”. Stephen Foskett, Chief Cat Herder (as he is known) of Tech Field Days and Storage Field Days, highly recommended Drivescale and asked the delegates to pick up some notes on their technology. Going through a couple of the videos, Drivescale’s message and philosophy resonated well with me. Perhaps it was their Sun Microsystems DNA? Many of the Drivescale team members were from Sun, and I was previously from Sun as well. I was drinking Sun’s Kool Aid by the bucket loads even before I graduated in 1991, and so what Drivescale preached made a lot of sense to me.Drivescale is all about Scale-Out Architecture at the webscale level, to address the massive scale of data processing. To understand deeper, we must think about “Data Locality” and “Data Mobility“. I frequently use these 2 “points of discussion” in my consulting practice in architecting and designing data center infrastructure. The gist of data locality is simple – the closer the data is to the processing, the cheaper/lightweight/efficient it gets. Moving data – the data mobility part – is expensive.

Continue reading

Own the Data Pipeline

[Preamble: I was a delegate of Storage Field Day 15 from Mar 7-9, 2018. My expenses, travel and accommodation were paid for by GestaltIT, the organizer and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

I am a big proponent of Go-to-Market (GTM) solutions. Technology does not stand alone. It must be in an ecosystem, and in each industry, in each segment of each respective industry, every ecosystem is unique. And when we amalgamate data, the storage infrastructure technologies and the data management into the ecosystem, we reap the benefits in that ecosystem.

Data moves in the ecosystem, from system to system, north to south, east to west and vice versa, random, sequential, ad-hoc. Data acquires different statuses, different roles, different relevances in its lifecycle through the ecosystem. From it, we derive the flow, a workflow of data creating a data pipeline. The Data Pipeline concept has been around since the inception of data.

To illustrate my point, I created one for the Oil & Gas – Exploration & Production (EP) upstream some years ago.

 

Continue reading

Considerations of Hadoop in the Enterprise

I am guilty. I have not been tendering this blog for quite a while now, but it feels good to be back. What have I been doing? Since leaving NetApp 2 months or so ago, I have been active in the scenes again. This time I am more aligned towards data analytics and its burgeoning impact on the storage networking segment.

I was intrigued by an article posted by a friend of mine in Facebook. The article (circa 2013) was titled “Never, ever do this to Hadoop”. It described the author’s gripe with the SAN bigots. I have encountered storage professionals who throw in the SAN solution every time, because that was all they know. NAS, to them, was like that old relative smelled of camphor oil and they avoid NAS like a plague. Similar DAS was frowned upon but how things have changed. The pendulum has swung back to DAS and new market segments such as VSANs and Hyper Converged platforms have been dominating the scene in the past 2 years. I highlighted this in my blog, “Praying to the Hypervisor God” almost 2 years ago.

I agree with the author, Andrew C. Oliver. The “locality” of resources is central to Hadoop’s performance.

Consider these 2 models:

moving-compute-storage

In the model on your left (Moving Data to Compute), the delivery process from Storage to Compute is HEAVY. That is because data has dependencies; data has gravity. However, if you consider the model on your right (Moving Compute to Data), delivering data processing to the storage layer is much lighter. Compute or data processing is transient, and the data in the compute layer is volatile. Once compute’s power is turned off, everything starts again from a clean slate, hence the volatile stage.

Continue reading

A little yellow elephant

By now, I believe most of you in the storage networking world would have heard of Hadoop. Hadoop was created by Doug Cutting, while he and his team was working on an open source web search engine called Nutch. The easily recognized little yellow elephant, Hadoop, was Doug Cutting’s son toy, which he made as Hadoop’s mascot. Pretty cool!

And today, Hadoop has become THE platform for Big Data applications. Why?

As I have mentioned before, everything that we do or don’t do, generates data, either as a direct product or in-direct product. I am blogging right now and I am creating data. I was in Singapore the whole of this week and everywhere I go in the MRT stations, I am being watched by the video cameras they have at the station. A new friend in class said that Singapore is the second most “watched” city after London, where there are video cameras mounted everywhere, either discreetly or indiscreetly. And that’s just video data. And there’s plenty of other human activities that generate tons and tons of data.

IDC Digital Universe Report for 2011 said that we have generated 1.8ZB (zettabyte) of data this year alone. I mentioned in my previous blog that this is a gold mine and companies are scrambling to tap on massive amount of data.  Extracting valuable information to anticipate the next trend or predict that next evolution in human preference is akin to the Gold Rush in the wild, wild west in the late 19th century. Folks, Big Data is going to be this generation’s “Digital Gold Rush”.

Sieving, filtering and processing gazillions of data (more unstructured than structured) will not work in defined, well-formatted relational databases. The data model of relational databases will simply break down. And of course, there are different schools of thoughts of different data models, but the Hadoop model seems to be gaining momentum and mind share of data scientists. That is because of Hadoop’s capability to deal with massive unstructured data, processing it and producing results in a small amount of time.

One way to process the pool of massive data is parallel programming. In parallel programming, multi-threading is commonly deployed to achieve the performance and effects of programming. But implementing multi-threading in parallel programming is difficult. Developers often has to deal with LWP (lightweight processes), semaphores, shared memory, mutex (mutually exclusive) locking and so on. Hence this style of programming works with different states on shared data, often resulting in different results in different states, even when using the same programming expression.

Hadoop belongs to another school of programming known as functional programming, where the different states on shared data concept is removed. With that in mind, the dependency on different states is also removed, resulting in a much easier and simpler parallel programming implementation. Hadoop borrows ideas from the MapReduce software framework made well known by Google and the Google File System.

Before, we get to know Hadoop, we must know MapReduce. MapReduce is a framework which allows very large data sets to be processed with a very large set of computer nodes in a cluster. Typically the computational processing is executed in a distributed fashion, spread across many computer nodes and final results are consolidated from the sub-results of these distributed processing nodes.

According to Wikipedia, the 2 key functions of Map Reduce are map() and reduce(). That’s pretty obvious. The extract below was taken from the Wikipedia definition, and explains both functions very well.

“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

The diagram below probably can simplify the concept of MapReduce to the readers.

 

Hadoop is one of the open-source implementations of MapReduce. It is one of the projects of Apache Foundation, and the project has sparked a brand-new niche of data search, data management and data science. The diagram below will allow our readers to juxtapose MapReduce and Hadoop, and comparing them in the simplest fashion.

Hadoop primary development platform is Java. Hadoop’s architecture consists mainly of 2 components – Hadoop Common and a Hadoop-compatible file system, as shown in the diagram below.

Hadoop MapReduce layer above is the file/object access interface to the Hadoop-compatible file system below. HDFS is Hadoop Distributed File System is just one of a few Hadoop-compatible file systems. Other file systems include:

  • Amazon S3 File System as part of the Amazon EC2 Infrastructure-as-a-Service (IaaS) cloud platform
  • CloudStore – a similar Hadoop-like implementation using C++ and also inspired by Google File System
  • FTP file systems
  • HTTP and HTTPS read-only file systems
  • Any file systems accessible with the file:// URL nomenclature

But the main engine of Hadoop is in the MapReduce layer. The 2 core components in this layer is JobTracker and TaskTracker. Both has their own individual roles to play and collectively, they are key cogs in the Hadoop distributed data processing model.

Below are extract I picked up from Wikipedia.

JobTracker submits MapReduce jobs to client applications. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser. Jetty is a Java-based HTTP server, among other things

JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Scheduling

By default Hadoop uses first-in, first-out (FIFO), and optional 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS (Quality of Service) for production jobs. The fair scheduler has three basic concepts.

  1. Jobs are grouped into Pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.

  • Jobs are submitted into queues.
  • Queues are allocated a fraction of the total resource capacity.
  • Free resources are allocated to queues beyond their total capacity.
  • Within a queue a job with a high level of priority will have access to the queue’s resources.

I took most the extract below from Wikipedia, and I don’t claim to be a knowledgeable person on Hadoop. All the credits go to Wikipedia editors to put Hadoop in layman terms.

Hadoop has certainly won the hearts of the new digital gold rush, Big Data and is slowly becoming a force to be reckoned with among data scientists. Hadoop implementations are powering new frontiers in processing and mining the ever growing data capacity, giving solution providers a simple programming methodology and data model to gain more insights into the vast seas of data and information.

Hadoop has many fans, and slowly becoming the data platform for large companies such as Yahoo!, Facebook, IBM, Amazon, Apple, eBay and many more. Facebook even claims to have the largest Hadoop clusters in the world, growing to 30PB in July of 2011.

This little yellow elephant is going places and one to watch out for.