Hadoop Archives - Storage Gaga

As Disk Drive capacity gets larger (and larger), the resilient Filesystem matters

By cfheoh | May 30, 2022 - 8:00 am |May 30, 2022 Appliance, Backup, Ceph, CIFS, Clusters, Data Management, Data Protection, Disks, Filesystems, Flash, FreeNAS, Gluster, Hadoop, High Performance Computing, iXsystems, Joyent, Linux, Lustre, NetApp, Nexenta, NFS, OpenZFS, Oracle, RAID, Reliability, SMB, Snapshots, TrueNAS, Virtualization

2 Comments

I just got home from the wonderful iXsystems™ Sales Summit in Knoxville, Tennessee. The key highlight was to christen the opening of the iXsystems™ Maryville facility, the key operations center that will house iX engineering, support and part of marketing as well. News of this can be found here.

iX datacenter in the new Maryville facility

Western Digital® has always been a big advocate of iX, and at the Summit, they shared their roadmaps for hard disk drives (HDDs), solid state drives (SSDs), and other storage platforms. I felt like a kid in a candy store because I love all the excitement in the disk drive industry. Who says HDDs are going to be usurped by SSDs?

Several disk drive manufacturers, including Western Digital®, have announced larger capacity drives. Here is some news from each vendor in recent months:

Other than the AFR (annualized failure rates) numbers published by Backblaze every quarter, the Capacity factor has always been a measurement of high interest in the storage industry.

Continue reading →

Celebrating MinIO

By cfheoh | January 31, 2022 - 8:00 am |January 31, 2022 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Backup, Big Data, Blitzscaling, Cloud, Cloudian, Clusters, Containers, Data Archiving, Deep Learning, Gartner, Gluster, Hadoop, Hadoop Clusters, HDS, High Performance Computing, Hitachi Vantara, IBM, InfluxDB, Kubernetes, Machine Learning, Minio, NetApp, Object Storage, OpenIO, Openstack, Scale-out architecture, Software Defined Storage

2 Comments

“Essentially MinIO is a web server …“

I vaguely recall Anand Babu Periasamy (AB, as he is known), the CEO of MinIO, saying that when I first met him in 2017. I had just started “playing around” with MinIO and I instantly fell in love with the software. Wait a minute. Object storage wasn’t supposed to be so easy. It was not supposed to be that simple to set up and use, but MinIO burst into my storage universe like the birth of the Infinity Stones. There was a eureka moment. And I was attending one of the Storage Field Days in the US shortly after my MinIO discovery in late 2017. What an opportunity!

I cannot recall how I made the appointment to meet MinIO, but I remember taking an Uber to their cosy office on University Avenue in Palo Alto. Through Andy Watson (one of the CTOs then), I was introduced to AB, Garima Kapoor, MinIO’s COO and his wife, Frank Wessels, Zamin (one of the business people who is no longer there) and Ugur Tigli (East Coast CTO) who was on the Polycom. I was awestruck.

Last week, MinIO scored a major Series B funding round of USD103 million. It was delayed by the pandemic; I recall Garima telling me that the funding was supposed to happen in 2020. But I think the delay made it better, because the world now is even more ready for MinIO than ever before.

Continue reading →

Rethinking data processing frameworks systems in real time

By cfheoh | December 27, 2021 - 8:00 am |December 26, 2021 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Confluent, Containers, Data, Data Management, Data Privacy, Data Protection, Data Security, Digital Transformation, Google, Hadoop, Hadoop Clusters, InfluxDB, Machine Learning, MapReduce, Microsoft Azure, Pravega, Scale-out architecture

Leave a comment

“Row, row, row your boat, gently down the stream…”

Except the stream isn’t gentle at all in the new context of data processing.

For many of us in the storage infrastructure and data management world, the well known framework is storing data on, and retrieving data from, storage media. That media could be a disk-based storage array, a tape, or some cloud storage where the storage media is abstracted from the users and the applications. The model of post-processing the data after it has been safely and persistently stored on that media is a well understood and mature one. Users, applications and workloads (A&W) process this data in its resting phase: retrieve it, work on it, and write it back to the resting phase again.

There is another model of data processing that has been bubbling over the years and is now reaching a boiling point. Still, it has not reached its apex yet. This is processing the data in flight, while it is still flowing, as it passes through a processing engine. The nature of this kind of data was described at a 2018 conference I chanced upon a year ago.

letgo marketplace processing numbers in 2018

* NRT = near real time
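To make the contrast between the two models concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names `process_at_rest` and `process_in_flight` and the sample readings are hypothetical, and a real streaming engine does far more than this. The first function works on data that has already landed on the storage media; the second updates its result as each value flows past, without keeping the values around.

```python
from typing import Iterable, Iterator


def process_at_rest(stored: list[float]) -> float:
    """At-rest model: the data is already persisted, so read it all back and compute."""
    return sum(stored) / len(stored)


def process_in_flight(stream: Iterable[float]) -> Iterator[float]:
    """In-flight model: update the result as each value passes through.

    Only a counter and the running mean are kept, never the values themselves.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


# Hypothetical sample data, for illustration only.
readings = [10.0, 20.0, 30.0, 40.0]

batch_mean = process_at_rest(readings)
running_means = list(process_in_flight(iter(readings)))

print(batch_mean)     # one answer, available after all the data is stored
print(running_means)  # an answer after every value, while the data is still moving
```

The at-rest version cannot answer until everything is written and read back, while the in-flight version has an up-to-date answer after every single value. That difference is what matters from a storage point of view.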

From a storage technology infrastructure perspective, this kind of data processing piqued my curiosity immensely. And I have been studying this burgeoning new data processing model in my spare time, and where it fits, bringing the understanding back into the storage infrastructure and data management side.

Continue reading →

Storage Elephant Compute Birds

By cfheoh | November 22, 2021 - 8:00 am |November 21, 2021 Actifio, Amazon Web Services, Analytics, API, Artificial Intelligence, Backup, Big Data, Business Continuity, Cloud, Composable Infrastructure, Data, Data Archiving, Data Availability, Data Management, Data Privacy, Data Protection, Data Security, Deduplication, DellEMC, Delphix, Digital Transformation, Disaster Recovery, Edge Computing, Hadoop, Hyperconvergence, IoT, iXsystems, Kubernetes, Machine Learning, NetApp, Nimble Storage, Nutanix, Object Storage, Simplivity, Snapshots, Storage Optimization, Storage Tiering, TrueNAS, Virtualization, VMware

Leave a comment

Data movement is expensive. Not just in cost, but in latency and resources as well. Thus there were many narratives to move compute closer to where the data is stored because moving compute is definitely more economical than moving data. I borrowed the analogy of the 2 animals from some old NetApp® slides which depicted storage as the elephant, and compute as birds. It was the perfect analogy, because the storage is heavy and compute is light.

“Close up of a white Great Egret perching on top of an African Elephant at Amboseli national park, Kenya”

Before the animal analogy came about, I used to use the term “Data locality, Data Mobility“, because of past work on storage technology in the Oil & Gas subsurface data management pipeline.

Take stock of your data movement

I had recent conversations with an end user who has been paying a lot of money to keep their “backup” and “archive” data in AWS Glacier. The S3 storage is cheap enough to hold several petabytes of data for years, and the IT folks said that the data in AWS Glacier is for “backup” and “archive”. I put both words in quotes because the data was labelled “backup” and “archive” according to their enterprise practice. However, the face of their business is changing. They are in manufacturing, oil and gas downstream, and the definitions of “backup” and “archive” data have changed.

For one, there is a strong demand for reusing the past data for various reasons and these datasets have to be recalled from their cloud storage. Secondly, their data movement activities still mimic what they did in the past during their enterprise storage days. It was a classic lift-and-shift when they moved to the cloud, without taking stock of their data movements and the operations they ran on these datasets. This is still ongoing, and their monthly AWS bill costs a bomb.

Continue reading →

Open Source Storage Technology Crafters

By cfheoh | October 4, 2021 - 8:00 am |October 4, 2021 Appliance, Ceph, Cloud, Data Fabric, Data Management, Delphix, Filesystems, FreeNAS, Hadoop, Hadoop Clusters, iXsystems, Linux, Lustre, Minio, NAS, Object Storage, Openstack, OpenZFS, Oracle, Redhat, SMB, Snapshots, SNIA, SoftIron, Software Defined Storage, TrueNAS, Unified Storage, Virtualbox, Virtualization

Leave a comment

The conversation often starts with a challenge. “What’s so great about open source storage technology?”

For the casual end users of storage systems, whether on-premises SAN (definitely not Fibre Channel) or NAS, or “files” from personal cloud storage like Dropbox, OneDrive et al., there is a strong presumption that open source storage technology is cheap and flaky. This is not helped by the steady diet of consumer NAS brands in the market, where the price is cheap but the capabilities, reliability and performance of the storage offering are found wanting. Thus this notion floats its way to business and enterprise users, and often ends up as a negative perception of open source storage technology.

Highway Signpost with Open Source wording

Storage Assemblers

Anybody can “build” a storage system with open source storage software. Put the software together with any commodity x86 server, and it can function with the basic storage services. Most open source storage software can do the job pretty well. However, once the completed storage technology is put together, can it do the job well enough to serve a business critical end user? I have plenty of sob stories from end users I have spoken to in these many years in the industry related to so-called “enterprise” storage vendors. I wrote a few blogs in the past that related to these sad situations:

We have such storage offerings rigged with cybersecurity risks and holes too. In a recent Unit 42 report, 250,000 NAS devices are vulnerable and exposed to the public Internet. The brands in question are mentioned in the report.

I would categorize these as storage assemblers.

Continue reading →

What the heck is Storage Modernization?

By cfheoh | September 20, 2021 - 7:00 am |September 19, 2021 Acquisition, Analytics, API, Artificial Intelligence, Big Data, Business Continuity, Cloud, Containers, Data, Data Archiving, Data Availability, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, Disaster Recovery, Edge Computing, Green Computing, Hadoop, Hadoop Clusters, Kubernetes, Machine Learning, MapReduce, Reliability, Software-defined Datacenter, Solid State Devices, Storage Optimization, Tape storage

Leave a comment

We often hear the word “modernization” thrown around these days. The push is to get the end user to refresh their infrastructure, and the storage infrastructure market is rife with the word. Is your storage ripe for “modernization“?

Many possibilities to modernize storage

To modernize, it has to be relative to legacy storage hardware, and the operating environment that came with it. But if the so-called “legacy” still does the job, should you modernize?

Big Data is right

When the term “Big Data” came into prominence a while back, it stirred the IT industry into a frenzy. At one point, Apache Hadoop became the poster elephant (pun intended) for this exciting new segment. So many Vs came out, but I settled on 4 Vs as the framework of my IT conversations. The 4 Vs we often hear are:

Volume
Velocity
Variety
Veracity

Continue reading →

Give back or no give

By cfheoh | September 7, 2020 - 9:15 am |September 5, 2020 Amazon Web Services, Ceph, Cloud, Datto, Filesystems, FreeNAS, Gluster, Hadoop, iXsystems, Joyent, Linux, Lustre, Microsoft, Minio, NAS, Nexenta, Openstack, QNAP, Redhat, SuSE, Synology, Tegile

Leave a comment

[ Disclosure: I work for iXsystems™ Inc. Views and opinions are my own. ]

If my memory serves me right, I recall the illustrious leader of the Illumos project, Garrett D’Amore, ranting about companies, big and small, taking OpenZFS open source code and projects to incorporate into their own technology but hardly ever giving back to the open source community. That was almost 6 years ago.

My thoughts immediately go back to the days when open source was starting to take off back in the early 2000s. Oracle 9i database had just embraced Linux in a big way, and the book by Eric S. Raymond, “The Cathedral and The Bazaar” was a big hit.

The Cathedral & The Bazaar by Eric S. Raymond

Since then, the blooming days of the proprietary software world began to wilt, and over the next twenty-plus years, open source software has pretty much taken over the world. Even Microsoft®, the ruthless ruler of the Evil Empire, caved in to some of the open source calls. The Microsoft® “I Love Linux” embrace definitely gave the feeling of the Rebellion's victory over the Empire. Open Source won.

Open Source bag of worms

Even with the concerted efforts of the open source communities and projects, there were many situations which have caused friction and, inadvertently, major issues as well. There are several open source project licenses, and they are not always compatible when different open source projects mesh together for the greater good.

On the storage side of things, 2 “incidents” caught the attention of the masses. For instance, Linus Torvalds, Linux BDFL (Benevolent Dictator for Life) and emperor supremo, said “Don’t use ZFS”, partly due to ignorance and partly due to the incompatibility between the Linux GPL (General Public License) and the ZFS CDDL (Common Development and Distribution License). That ruffled enough feathers in the OpenZFS community that Matt Ahrens, the co-creator of the ZFS file system and OpenZFS community leader, had to defend OpenZFS from Linus’ comments.

Continue reading →

Hadoop is truly dead – LOTR version

By cfheoh | January 24, 2020 - 1:06 pm |January 24, 2020 Acquisition, Analytics, API, Artificial Intelligence, Big Data, Cloud, Cloudera, Containers, Data Management, Data Security, Deep Learning, Digital Transformation, Hadoop, Hadoop Clusters, Kubernetes, MapReduce, NAS, NetApp, Object Storage, Pure Storage, Storage Field Day, Tech Field Day

2 Comments

[Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer and I was not obligated to blog or promote the vendors’ technologies to be presented at this event. The content of this blog is of my own opinions and views]

I had not planned to write this blog. But a string of events happened in the Storage Field Day 19 week and I have the fodder to share my thoughts. Hadoop is indeed dead.

Warning: There are Lord of the Rings references in this blog. You might want to do some research. 😉

Storage metrics never happened

The fellowship of Arjan Timmerman, Keiran Shelden, Brian Gold (Pure Storage) and myself started at the office of Pure Storage in downtown Mountain View, much like Frodo Baggins, Samwise Gamgee, Peregrin Took and Meriadoc Brandybuck forging their journey vows at Rivendell. The podcast was supposed to be on the topic of storage metrics but was unanimously swung to talk about Hadoop under the stewardship of Mr. Stephen Foskett, our host of Tech Field Day. I saw Stephen as Elrond Half-elven, the Lord of Rivendell, moderating the podcast as he would have moderated the plans to destroy the One Ring in Mount Doom.

So there we were talking about Hadoop, or maybe Sauron, or both.

The photo of the Oliphaunt below seemed apt to describe the industry attacks on Hadoop.

Continue reading →

Is General Purpose Object Storage disenfranchised?

By cfheoh | December 23, 2019 - 5:40 pm |January 14, 2020 100Gigabit Ethernet, Amazon Web Services, Analytics, API, Artificial Intelligence, Big Data, BYOD, Ceph, Cloud, Cloudian, Clusters, Deep Learning, DellEMC, Docker, Dropbox, Edge Computing, Filesystems, Flash, Gartner, Hadoop, HDS, High Performance Computing, Hitachi Vantara, IDC, Industry 4.0, IoT, Lustre, Machine Learning, Mellanox Technologies, Minio, NetApp, Object Storage, OpenIO, Openstack, Performance Benchmark, Reliability, Scale-out architecture, Software Defined Storage, Storage Field Day, Storage Market Share, swiftstack, Tape storage, Tech Field Day

6 Comments

[Disclosure: I am invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees will be covered by GestaltIT, the organizer and I am not obligated to blog or promote the vendors’ technologies to be presented at this event. The content of this blog is of my own opinions and views]

This is NOT an advertisement for coloured balls.

This is the license to brag for the vendors in the next 2 weeks or so, as we approach the 2020 new year. This, of course, is the latest 2019 IDC Marketscape for Object-based Storage, released last week.

My object storage mentions

I have written extensively about Object Storage since 2011. With different angles and perspectives, here are some of them:

The Future is Intelligent Objects (2011)
What should be Cloud Storage? (2011)
APIs that stick in Storage (2012)
Has Object Storage become the Everything Store? (2013)
Of Object Storage, Filesystems and Multicloud (2017)
My Dilemma of Stateful Storage Marriage (2018)
The Malaysian Openstack Storage Conundrum (2018)
Sleepless in Malaysia with Object Storage (2019)
The Waning Light of Openstack Swift (2019)

Continue reading →

Time to Advocate Common Data Personality

By cfheoh | November 25, 2019 - 12:33 pm |November 25, 2019 Amazon Web Services, Analytics, API, Artificial Intelligence, Big Data, Commvault, Containers, Data Archiving, Data Availability, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, eDiscovery, Hadoop, Hadoop Clusters, HDS, Hedvig, Hitachi Vantara, InfluxDB, Machine Learning, MapReduce, Object Storage

3 Comments

The thought of it has been on my mind since Commvault GO 2019. It was sparked when Don Foster, VP of Storage Solutions at Commvault, answered a question posed by one of the analysts. What he said made a connection, as I was searching for better insights into how Commvault and Hedvig would end up together.

Data Deluge is a swamp thing now

Several years ago, I heard Stephen Brobst, CTO of Teradata, bring up the term “Data Swamp“. It was the antithesis of the Data Lake, and this was back when Data Lakes and Hadoop were all the rage. His comments were raw and honest, and they pointed to the truth out there.

Source: https://www.deviantart.com/rhineville/art/God-that-Crawls-Detail-2-291228644

I was enamoured by his thoughts at the time, and today, his comments about the Data Swamp have manifested themselves.

Continue reading →
