big data – Storage Gaga

Rethinking data processing frameworks systems in real time

By cfheoh | December 27, 2021 - 8:00 am |December 26, 2021 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Confluent, Containers, Data, Data Management, Data Privacy, Data Protection, Data Security, Digital Transformation, Google, Hadoop, Hadoop Clusters, InfluxDB, Machine Learning, MapReduce, Microsoft Azure, Pravega, Scale-out architecture

We got to keep more data

By cfheoh | April 4, 2019 - 4:47 pm |April 4, 2019 Analytics, Artificial Intelligence, Big Data, Data, eDiscovery, Machine Learning

1 Comment

Guess which airport has won the most awards in the annual Skytrax list? Guess which airport won 480 awards since its opening in 1981? Guess how this airport did it?

Data Analytics gives the competitive edge.

Serving and servicing more than 65 million passengers and travellers in 2018, and growing, Changi Airport Singapore sets a very high level of customer service. And it does it with the help of technology, something they call Smart (Service Management through Analytics and Resource Transformation) Airport. In an ultra-competitive and cut-throat airline business, the deep integration of customer-centric services and the ultimate traveller’s experience are crucial to the survival and growth of airlines. And it has definitely helped Singapore Airlines to be the world’s best airline in 2018, its 4th win.

To achieve that, Changi Airport relies on technology and lots of relevant data for deep insights on how to serve its customers better. The details are well described in this old news article.

Keep More Relevant Data for Greater Insights

When I say more data, I do not mean every single piece of data. Data has to be relevant to be useful.

How do we get more insights? How can we teach systems to learn? How do we develop artificial intelligence systems? By having more relevant data feeding into data analytics systems, machine learning systems and the like.

As such, a simple framework that builds from data ingestion, to data repositories, to outcomes such as artificial intelligence, predictive and recommendation systems, automation and new data insights isn’t difficult to understand. The diagram below is a high level overview of what I work with most of the time.

The full force of Western Digital

By cfheoh | March 21, 2019 - 11:39 am |March 21, 2019 Acquisition, Analytics, API, Appliance, Artificial Intelligence, Backup, Big Data, Business Continuity, Cloud, Clusters, Composable Infrastructure, Data, Data Archiving, Data Availability, Data Management, Data Protection, Deep Learning, Disaster Recovery, Disks, Drivescale, Edge Computing, Flash, Fog Computing, Hyperconvergence, IoT, Kaminario, Machine Learning, NAS, Object Storage, Reliability, SCSI, Seagate, Solid State Devices, Storage Field Day, Storage Tiering, Tech Field Day, Tegile, Unified Storage, Western Digital

2 Comments

[Preamble: I have been invited by GestaltIT as a delegate to their Tech Field Day for Storage Field Day 18 from Feb 27-Mar 1, 2019 in the Silicon Valley USA. My expenses, travel and accommodation were covered by GestaltIT, the organizer and I was not obligated to blog or promote their technologies presented at this event. The content of this blog is of my own opinions and views]

3 weeks after Storage Field Day 18, I was still trying to wrap my head around the 3-hour session we had with Western Digital. I was like a kid in a candy store for a while, because there was too much to chew and I couldn’t munch it all.

From “Silicon to System”

Not many storage companies in the world can claim that mantra – “From Silicon to Systems“. Western Digital is probably one of 3 companies (the other 2 being Intel and nVidia) I know of at present, which develops vertical innovation and integration, end to end, from components, to platforms and to systems.

For a long time, we have always known Western Digital to be a hard disk company. It owns HGST and SanDisk, providing the drives, the Flash and the Compact Flash for both the consumer and the enterprise markets. However, in recent years, through 2 eyebrow-raising acquisitions, Western Digital has been moving itself up the infrastructure stack. In 2015, it acquired Amplidata. 2 years later, it acquired Tegile Systems. At that time, I was wondering why a hard disk manufacturer was buying storage technology companies that were not its usual bread and butter business.


Greenplum looking mighty sweet

By cfheoh | December 16, 2011 - 9:06 am |October 27, 2012 Big Data, EMC

1 Comment

Big data is Big Business these days. IDC predicts that between 2012 and 2020, spending on big data solutions will account for 80% of IT spending and grow at 18% per annum. EMC predicts that the big data market is worth USD$70 billion! That’s a huge market.

We generate data, and plenty of it. In the IDC Digital Universe Report for 2011 (sponsored by EMC), approximately 1.8 zettabytes of data will be created and replicated in 2011. How much is 1 zettabyte, you say? Look at the conversion below:

                    1 zettabyte = 1 billion terabytes

That’s right, folks. 1 billion terabytes!
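
The conversion above can be checked with a few lines of arithmetic. This is only a small sketch, assuming decimal (SI) units where 1 terabyte is 10^12 bytes and 1 zettabyte is 10^21 bytes; the variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

# How many terabytes fit in one zettabyte.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE
print(f"1 zettabyte = {terabytes_per_zettabyte:,} terabytes")

# The 1.8 zettabytes cited from the IDC report, in terabytes.
# Integer arithmetic (18 * ZETTABYTE // 10) avoids float rounding.
idc_2011_terabytes = 18 * ZETTABYTE // 10 // TERABYTE
print(f"1.8 zettabytes = {idc_2011_terabytes:,} terabytes")
```

Under binary units (powers of 1024) the figures would differ slightly, so the "1 billion" figure holds for the SI definition only.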

And this “mountain” of data and information is a goldmine of goldmines, and companies around the world are scrambling to tap into this treasure chest. According to Wikibon, big data has the following characteristics:

Very large, distributed aggregations of loosely structured data – often incomplete and inaccessible
Petabytes/exabytes of data
Millions/billions of people
Billions/trillions of records
Loosely-structured and often distributed data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred

But what is relevant is not the definition of big data, but rather what you get from the mountain of information generated. The ability to “mine” the information from big data, now popularly known as Big Data Analytics, has sparked a new field within the data storage and data management industry. This is called Data Science. And companies and enterprises that are able to effectively use the new data from Big Data will win big in the next decade. Activities such as

Making business decisions
Gaining competitive advantage
Driving productivity growth in relevant industry segments
Understanding consumer and business behavioural patterns
Knowing buying decisions and business cycles
Yielding new innovation
Revealing customer insights
And much, much more

will drive a whole new paradigm that shall be known as Data Science.

And EMC, having purchased Greenplum more than a year ago, started its Data Computing Products Division immediately after the acquisition. In October of 2010, EMC announced its Greenplum Data Computing Appliance with some impressive numbers, using the 2 configurations of the appliance noted below:

Below are 2 tables of the Greenplum performance benchmarks:

That’s what these big data appliances are able to do. The ability to load billions of either structured or unstructured files or objects in mere minutes is what drives the massive adoption of Big Data.

And a few days ago, EMC announced its Greenplum Unified Analytics Platform (UAP), which comprises 3 Greenplum components:

A relational database for structured data
An enterprise Hadoop engine for the analysis and processing of unstructured data
Chorus 2.0, which is a social media collaboration tool for data scientists

The diagram below summarizes the UAP solution:

Greenplum is certainly ahead of the curve. Competitors like IBM Netezza, Teradata and Oracle Exalogic are racing to be ahead but Greenplum is one of the early adopters of a single platform for big data. Having a consolidation platform will not only reduce costs (integration of all big data components usually incurs high professional services’ fees) but will also reduce the barrier to entry to big data, thus further accelerating the adoption of big data.

Big Data is still very much in its infancy and EMC is pushing to establish its footprint in this space. Last week, EMC Education announced the general availability of courses related to big data, as well as the EMC Data Science Architect (EMC DSA) certification. Greenplum is enjoying the early sweetness of the Big Data game and there will be more to come. I am certainly looking forward to sharing more on this plum (pun intended ;-)) of the data storage and data management excitement.

What should be a Cloud Storage?

By cfheoh | December 8, 2011 - 1:43 pm |October 27, 2012 Analytics, Big Data, Filesystems, Object Storage

2 Comments

For us filesystem guys, NAS is the way to go. We are used to storing files in network file systems via the NFS and CIFS protocols and treating the NAS storage array like a refrigerator – taking stuff out and putting stuff back in. All that is fine and well as long as the data is what I would term corporate data.

Corporate data is generated by employees, applications and users of the company, and for a long time, the power of data creation lay in the hands of the enterprise. That is why storage solutions are designed to address the needs of the enterprise, where the data is structured and well defined. How the data is stored, how it is formatted and how it is accessed form the “boundary” of how the data is used. Typically a database is used to “restrict” the data created so that the information can be retrieved and modified quickly. And of course, SAN guys will tell you to put the structured data of the database into their SAN.

For the unstructured data in the enterprise, NAS file systems hold that responsibility. Files such as documents and presentations have more loosely defined “boundaries”, and hence filesystems are a more natural fit for unstructured data. Filesystems are like a free-for-all container, able to store and provide access to any file in the enterprise.

But today, as Web 2.0 applications are already taking over the enterprise, the power of data creation does not necessarily lie in the hands of the enterprise applications and users. In fact, it is estimated that the percentage of enterprise data now has exceeded 50% of the enterprise’s total data capacity. With the proliferation of personal devices such as tablets, Blackberries, smart phones, PDAs and so on, individual contributors are generating plenty of data. This situation has been made more acute by Web 2.0 applications such as Facebook, blogs, social networking, Twitter, Google Search and so on.

Unfortunately, file systems in the NAS category are still pretty much the traditional file systems, and the needs of a new type of file system cannot be met by them. The paradigm is definitely shifting. The new unstructured data world needs a new storage concept. I would term this type of storage “Cloud Storage” because it breaks down the traditional concepts of NAS.

So what basically defines a Cloud Storage? I already mentioned that the type of unstructured data has changed. And the new requirements for the unstructured data type are:

The unstructured data type is capable of being globally distributed.
There will be billions and billions of unstructured data objects created, but each object, be it a Twitter tweet, an uploaded mobile video, or even the clandestine data collected by CarrierIQ, can be accessed easily via a single namespace.
The storage file system foundation for this new unstructured data type is easily provisioned and managed. Look at Facebook. It is easy to set up and get going, and the user (and probably the data administrator) can easily manage the user interface and the platform.
For the service provider of Cloud Storage, the file system must be secure and support multi-tenancy and virtualization of storage resources.
There should be some form of policy-driven content management. That is why development platforms such as Joomla!, Drupal and WordPress are slowly becoming enterprise driven to address these unstructured data types.
The data is highly searchable and has a high degree of search optimization. A Google search does have a strong degree of intelligence and relevance to the data being searched, as well as generating tons of by-product data that feeds the need to understand the consumers or the users better. Hail Big Data!

So when I compare traditional NAS storage solutions such as Netapp, EMC VNX or BlueArc, I ask whether their NAS solutions have the capabilities to meet the requirements of this new unstructured data type.

Most of them, no matter how they package it, are still relying on files as the granular object of storage. And today, most files may have some form of metadata such as file name, owner, size etc, but DO NOT possess the capability of being content-aware. Here’s an example of what I want to show you:

The file properties (part of the file metadata) tell you about the file but little about the content of the file. Today, it requires more than that and the new unstructured data type should look more like this:

If you look at the diagram below, the object on the right (which is the new unstructured data type) displays much more information than a typical file in a NAS file system. This additional information becomes the fodder for other applications such as search engines, RSS feeds, robots and spiders and, of course, big data analytics.
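
To make the contrast concrete, here is a minimal sketch of the idea, not a representation of any vendor's object format. The class names, the field names and the `find_by_attribute` helper are all hypothetical and chosen only for illustration: one record holds the basic file properties a NAS file system keeps, the other wraps it with free-form descriptive metadata that other applications could query.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileMetadata:
    """Basic properties a traditional file system records about a file."""
    name: str
    owner: str
    size_bytes: int


@dataclass
class ContentAwareObject:
    """Hypothetical object that carries descriptive metadata about its content."""
    system: FileMetadata
    # Free-form descriptive attributes; the keys used below are illustrative only.
    extended: Dict[str, str] = field(default_factory=dict)


def find_by_attribute(objects: List[ContentAwareObject], key: str, value: str):
    """Return the objects whose extended metadata has the given key set to value."""
    return [obj for obj in objects if obj.extended.get(key) == value]


video = ContentAwareObject(
    FileMetadata("clip001.mp4", "alice", 52_428_800),
    {"content_type": "mobile video", "location": "Singapore", "topic": "airport"},
)
report = ContentAwareObject(
    FileMetadata("q3.doc", "bob", 120_000),
    {"content_type": "document", "topic": "finance"},
)

# A search on content attributes, which plain file properties cannot answer.
matches = find_by_attribute([video, report], "topic", "airport")
print([obj.system.name for obj in matches])
```

The point of the sketch is that the query runs against what the content is about, while the file properties alone (name, owner, size) would only support lookups about the file itself.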

Here’s another example of what I mean by extended metadata. A Cloud Storage array is required to work with this new set of parameters and a new set of requirements.

There’s a new unstructured data type in town. Traditional NAS systems may not have the right features to work with this new paradigm.

Don’t be whitewashed by the fancy talk of storage vendors in town. Learn the facts, and find out what a Cloud Storage really is.

It’s time to think differently. It’s time to think of what should be a Cloud Storage.

Big data is big headache

By cfheoh | October 28, 2011 - 8:13 pm |October 23, 2012 Analytics, Big Data

1 Comment

IBM claims that we are responsible for creating 2.5 quintillion bytes of data every day. How much is 1 quintillion?

According to the web,

1 quintillion = 1,000,000,000,000,000,000

After billion, it is trillion, then quadrillion, and then quintillion. That’s what 1 quintillion is, with 18 zeroes!
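
As a quick sanity check of the figure above, the short sketch below counts the zeroes and restates IBM's daily number in exabytes. It assumes the short-scale quintillion (10^18) and the SI exabyte (10^18 bytes); the variable names are illustrative.

```python
QUINTILLION = 10**18   # short scale: 1 followed by 18 zeroes
EXABYTE = 10**18       # SI exabyte, in bytes

# Count the zeroes by looking at the decimal digits after the leading 1.
zeros = len(str(QUINTILLION)) - 1
print(f"1 quintillion = {QUINTILLION:,} ({zeros} zeroes)")

# 2.5 quintillion bytes per day, kept as an exact integer.
daily_bytes = 25 * QUINTILLION // 10
daily_exabytes = daily_bytes / EXABYTE
print(f"2.5 quintillion bytes = {daily_exabytes} exabytes per day")
```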

These data come from everything from social networking updates, meteorology (weather reports), remote sensing maps (Google Maps, GPS, Geographical Information Systems), photos (Flickr), videos (YouTube), Internet search (Google) and so on. The term big data, according to Wikipedia, refers to data that are too large to be handled and processed by conventional data management tools. This presents a new set of difficulties when it comes to collecting these data, storing them and sharing them. Indexing and searching big data would require special technologies to be able to mine and extract valuable information from big data datasets, within an acceptable period of time.

According to Wiki, “Technologies being applied to big data include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.” That is why EMC has paid big money to acquire GreenPlum and IBM acquired Netezza. Traditional data warehousing players such as Teradata, Oracle and Ingres are in the picture as well, setting a collision course between the storage and infrastructure companies and the data warehousing solutions companies.

The 2010 Gartner Magic Quadrant has non-traditional players such as IBM/Netezza and EMC/Greenplum in its leaders quadrant.

And the key word that is already on everyone’s lips is “ANALYTICS“.

The ability to extract valuable information that helps determine the next trend, along with personalized profiling, may already have arrived, as companies clamour to get more and more out of our personalities so that they can sell us more of their wares.

Meteorological organizations are using big data analytics to find out about weather patterns and climate change. Space exploration becomes more acute and precise from the tons and tons of data collected from space explorations. Big data analytics are also helping pharmaceutical companies develop new biological and pharmaceutical breakthroughs. And the list goes on.

I am new to big data and I do not claim to know a lot. But terms such as scale-out NAS, distributed file systems, grid computing and massively parallel processing are certainly bringing the data storage world into a new frontier, and it is something we as storage professionals have to adapt to. I am eager to learn and know more about big data. It is a big headache but change is inevitable.
