Hadoop Clusters Archives

AI and the Data Factory

By cfheoh | November 19, 2024 - 6:13 am |November 19, 2024 Algorithm, Analytics, API, Appliance, Artificial Intelligence, Cloud, Clusters, Composable Infrastructure, Data, Data Direct Networks, Data Governance, Data Management, Data Privacy, Data Protection, Data Security, DDN, Deep Learning, Digital Transformation, Filesystems, Hadoop Clusters, High Performance Computing, Lustre, Machine Learning, Mellanox, Mellanox Technologies, Minio, nVidia, Object Storage, Parallel NFS, Performance Benchmark, Performance Caching, RDMA, Scale-out architecture, Storage Optimization


When I first heard the term “AI Factory”, the world was blaring Jensen Huang’s keynote at NVIDIA GTC24. I thought those were cool words, since he mentioned the raw material of water going into the factory to produce electricity. The analogy was spot on for the AI we are building.

As I engage with many DDN partners and end users in the region, week in, week out, the term “AI Factory” keeps popping up in conversations. Yet, many still do not know how to go about building this “AI Factory”. They only know they need to buy GPUs, lots of them. These companies’ AI ambitions are unabated. IDC predicts that worldwide spending on AI will double by 2028, and yet the ROI (return on investment) remains elusive.

At the ground level, based on many conversations so far, the common theme is that the steps to begin building the AI Factory are ambiguous and fuzzy to most. I would like to share my views from a data storage point of view. Hence, my take on the Data Factory for AI.

Are you AI-ready?

We have to have a plan, but before we take the first step, we must look at where we are standing at the present moment. We know that to train AI, the proverbial first step is that we need lots of data. Deep Learning (DL), which powers Large Language Models (LLMs) and Generative AI (GenAI), needs tons of data.

If a company knows where it is, it will know which phase is next. So, in the AI Maturity Model (I simplified the diagram below), where is your company now? Are you AI-ready?

Simplified AI Maturity Model

Get the Data Strategy Right

In his interview with CRN, MinIO’s CEO AB Periasamy said, “For generative AI, they realized that buying more GPUs without a coherent data strategy meant GPUs are going to idle out”. I was struck by his wisdom about having a coherent data strategy because that is absolutely true. This is my starting point: having the right data strategy.

In the AI world, from a data storage guy’s perspective, data is the fuel. Data is the raw material that Jensen alluded to, in case it was not obvious. We have heard this quote many times before, even before the AI phenomenon took over. AI is data-driven. Data is vital for the ROI of AI projects. And thus, we must look from the point of view of the data to make the AI Factory successful.

Continue reading →

Celebrating MinIO

By cfheoh | January 31, 2022 - 8:00 am |January 31, 2022 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Backup, Big Data, Blitzscaling, Cloud, Cloudian, Clusters, Containers, Data Archiving, Deep Learning, Gartner, Gluster, Hadoop, Hadoop Clusters, HDS, High Performance Computing, Hitachi Vantara, IBM, InfluxDB, Kubernetes, Machine Learning, Minio, NetApp, Object Storage, OpenIO, Openstack, Scale-out architecture, Software Defined Storage


“Essentially MinIO is a web server …“

I vaguely recall Anand Babu Periasamy (AB as he is known), the CEO of MinIO, saying that when I first met him in 2017. I was fresh from “playing around” with MinIO and I instantly fell in love with the software technology. Wait a minute. Object storage wasn’t supposed to be so easy. It was not supposed to be that simple to set up and use, but MinIO burst into my storage universe like the birth of the Infinity Stones. There was a eureka moment. And I was attending one of the Storage Field Days in the US shortly after my MinIO discovery in late 2017. What an opportunity!

I cannot recall how I made the appointment to meet MinIO, but I remember taking an Uber to their cosy office on University Avenue in Palo Alto. Through Andy Watson (one of the CTOs then), I was introduced to AB, Garima Kapoor (MinIO’s COO and his wife), Frank Wessels, Zamin (one of the business people, who is no longer there) and Ugur Tigli (East Coast CTO), who was on the Polycom. I was awestruck.

Last week, MinIO scored a major Series B funding round of USD103 million. It was delayed by the pandemic, because I recall Garima telling me that the funding was happening in 2020. But I think the delay made it better, because the world is now more ready for MinIO than ever before.

Continue reading →

Rethinking data processing frameworks systems in real time

By cfheoh | December 27, 2021 - 8:00 am |December 26, 2021 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Confluent, Containers, Data, Data Management, Data Privacy, Data Protection, Data Security, Digital Transformation, Google, Hadoop, Hadoop Clusters, InfluxDB, Machine Learning, MapReduce, Microsoft Azure, Pravega, Scale-out architecture


“Row, row, row your boat, gently down the stream…”

Except the stream isn’t gentle at all in the new context of data processing.

For many of us in the storage infrastructure and data management world, the well-known framework is storing data on, and retrieving it from, a storage medium. That medium could be a disk-based storage array, a tape, or some cloud storage where the storage media is abstracted from the users and the applications. The model of post-processing the data after it has been safely and persistently stored on that medium is a well understood and mature one. Users, applications and workloads (A&W) process this data in its resting phase: they retrieve it, work on it, and write it back to the resting phase again.

There is another model of data processing that has been bubbling up over the years and is now reaching a boiling point. Still, it has not reached its apex yet. This is processing the data in flight, while it is still flowing, as it passes through a processing engine. The nature of this kind of data was described in a 2018 conference talk I chanced upon a year ago.

letgo marketplace processing numbers in 2018

* NRT = near real time

From a storage technology infrastructure perspective, this kind of data processing piqued my curiosity immensely. I have been studying this burgeoning new data processing model and where it fits in my spare time, bringing that understanding back into the storage infrastructure and data management side.
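To make the contrast between the two models a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names `process_at_rest` and `process_in_flight` and the sample events are hypothetical, and it does not represent any particular streaming framework. The first function works on data that has already landed on storage; the second updates its result as each event passes through.

```python
from typing import Iterable, Iterator, List


def process_at_rest(storage: List[float]) -> float:
    """At-rest model: the data is already persisted, so read it all and compute once."""
    return sum(storage) / len(storage)


def process_in_flight(stream: Iterable[float]) -> Iterator[float]:
    """In-flight model: update the result as each event flows through."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        # A running average is available before the stream has ended.
        yield total / count


# Hypothetical sample events, used for both models.
events = [10.0, 20.0, 30.0, 40.0]

# At rest: land the data first, then process it.
storage = list(events)
batch_result = process_at_rest(storage)

# In flight: results are emitted while the events are still arriving.
running = list(process_in_flight(iter(events)))

print(batch_result)
print(running)
```

Both approaches arrive at the same final average, but the in-flight version has an answer after every event, without waiting for the data to come to rest first.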

Continue reading →

Open Source Storage Technology Crafters

By cfheoh | October 4, 2021 - 8:00 am |October 4, 2021 Appliance, Ceph, Cloud, Data Fabric, Data Management, Delphix, Filesystems, FreeNAS, Hadoop, Hadoop Clusters, iXsystems, Linux, Lustre, Minio, NAS, Object Storage, Openstack, OpenZFS, Oracle, Redhat, SMB, Snapshots, SNIA, SoftIron, Software Defined Storage, TrueNAS, Unified Storage, Virtualbox, Virtualization


The conversation often starts with a challenge. “What’s so great about open source storage technology?”

For the casual end users of storage systems, regardless of SAN (definitely not Fibre Channel) or NAS on-premises, or getting “files” from personal cloud storage like Dropbox, OneDrive et al., there is a strong presumption that open source storage technology is cheap and flaky. This is not helped by the diet of consumer NAS brands in the market, where the price is cheap but the capabilities, reliability and performance of the storage offering are found wanting. Thus this notion floats its way to the business and enterprise users, and often ends up as a negative perception of open source storage technology.

Highway Signpost with Open Source wording

Storage Assemblers

Anybody can “build” a storage system with open source storage software. Put the software together with any commodity x86 server, and it can function with the basic storage services. Most open source storage software can do the job pretty well. However, once the completed storage technology is put together, can it do the job well enough to serve a business critical end user? I have plenty of sob stories from end users I have spoken to in these many years in the industry related to so-called “enterprise” storage vendors. I wrote a few blogs in the past that related to these sad situations:

We have such storage offerings riddled with cybersecurity risks and holes too. According to a recent Unit 42 report, 250,000 NAS devices are vulnerable and exposed to the public Internet. The brands in question are mentioned in the report.

I would categorize these as storage assemblers.

Continue reading →

What the heck is Storage Modernization?

By cfheoh | September 20, 2021 - 7:00 am |September 19, 2021 Acquisition, Analytics, API, Artificial Intelligence, Big Data, Business Continuity, Cloud, Containers, Data, Data Archiving, Data Availability, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, Disaster Recovery, Edge Computing, Green Computing, Hadoop, Hadoop Clusters, Kubernetes, Machine Learning, MapReduce, Reliability, Software-defined Datacenter, Solid State Devices, Storage Optimization, Tape storage


We often hear the word “modernization” thrown around these days. The push is to get the end user to refresh their infrastructure, and the storage infrastructure market is rife with the word. Is your storage ripe for “modernization”?

Many possibilities to modernize storage

To modernize, it has to be relative to legacy storage hardware, and the operating environment that came with it. But if the so-called “legacy” still does the job, should you modernize?

Big Data is right

When the term “Big Data” came into prominence a while back, it stirred the IT industry into a frenzy. At one point, Apache Hadoop became the poster elephant (pun intended) for this exciting new segment. So many Vs came out, but I settled on 4 Vs as the framework of my IT conversations. The 4 Vs we often hear are:

Volume
Velocity
Variety
Veracity

Continue reading →

Paradigm shift of Dev to Storage Ops

By cfheoh | March 2, 2020 - 5:47 am |March 2, 2020 Amazon Web Services, API, Artificial Intelligence, Ceph, Cloud, Composable Infrastructure, Containers, Data Management, Deep Learning, Docker, Drivescale, Edge Computing, Filesystems, Hadoop Clusters, High Performance Computing, IBM, Kubernetes, Linux, Liqid, Machine Learning, Minio, Object Storage, Performance Benchmark, Redhat, Scale-out architecture, Software Defined Storage, Storage Field Day, Tech Field Day, VMware


[ Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer and I was not obligated to blog or promote the vendors’ technologies presented at the event. The content of this blog is of my own opinions and views ]

A funny photo (below) came up on my Facebook feed a couple of weeks back. In an honest way, it depicted how a developer would think (or the lack of thinking) about the storage infrastructure designs and models for the applications and workloads. This also reminded me of how DBAs used to diss storage engineers. “I don’t care about storage, as long as it is RAID 10“. That was aeons ago 😉

The world of developers and the world of infrastructure people are vastly different. Since the birth of cloud computing, both worlds have collided, and programmable infrastructure-as-code (IaC) has become part and parcel of cloud native applications. Of course, there is no denying that there is friction.

Welcome to DevOps!

The Kubernetes factor

Containerized applications are quickly defining the cloud native applications landscape. The container orchestration machinery has one dominant engine – Kubernetes.

In the world of software development and delivery, DevOps has taken a liking to containers. Containers make it easier to host and manage the life cycle of web applications inside a portable environment. They package up application code and other dependencies into building blocks to deliver consistency, efficiency, and productivity. To scale to a multi-application, multi-cloud environment with thousands and even tens of thousands of microservices in containers, the Kubernetes factor comes into play. Kubernetes handles tasks like auto-scaling, rolling deployments, compute resources, volume storage and much, much more, and it is designed to run on bare metal, in the data center, in the public cloud or even in a hybrid cloud.
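As a small illustration of the auto-scaling part, the Kubernetes documentation describes the Horizontal Pod Autoscaler as sizing a workload from the ratio of the observed metric to the target metric. The Python sketch below mirrors that documented formula in simplified form. The function name `desired_replicas` and its parameters are hypothetical names of my own, and the real controller also deals with missing metrics, pod readiness and stabilization windows, none of which are modelled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified sketch of the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    ratio = current_metric / target_metric
    # Skip scaling when the ratio is close enough to 1.0
    # (the documented default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result within the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: 4 pods averaging 200m CPU against a 100m target.
print(desired_replicas(4, 200, 100))
```

With 4 pods running at twice the target, the sketch asks for 8 pods; when the load is within the tolerance band, the replica count is left alone.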

Continue reading →

Hadoop is truly dead – LOTR version

By cfheoh | January 24, 2020 - 1:06 pm |January 24, 2020 Acquisition, Analytics, API, Artificial Intelligence, Big Data, Cloud, Cloudera, Containers, Data Management, Data Security, Deep Learning, Digital Transformation, Hadoop, Hadoop Clusters, Kubernetes, MapReduce, NAS, NetApp, Object Storage, Pure Storage, Storage Field Day, Tech Field Day


[Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer and I was not obligated to blog or promote the vendors’ technologies to be presented at this event. The content of this blog is of my own opinions and views]

I had not planned to write this blog post. But a string of events happened during the Storage Field Day 19 week and I now have the fodder to share my thoughts. Hadoop is indeed dead.

Warning: There are Lord of the Rings references in this blog. You might want to do some research. 😉

Storage metrics never happened

The fellowship of Arjan Timmerman, Keiran Shelden, Brian Gold (Pure Storage) and myself started at the office of Pure Storage in downtown Mountain View, much like Frodo Baggins, Samwise Gamgee, Peregrin Took and Meriadoc Brandybuck forging their journey vows at Rivendell. The podcast was supposed to be on the topic of storage metrics but was unanimously swung to talk about Hadoop under the stewardship of Mr. Stephen Foskett, our host of Tech Field Day. I saw Stephen as Elrond Half-elven, the Lord of Rivendell, moderating the podcast as he would have moderated the plans to destroy the One Ring in Mount Doom.

So there we were talking about Hadoop, or maybe Sauron, or both.

The photo of the Oliphaunt below seemed apt to describe the industry attacks on Hadoop.

Continue reading →

Time to Advocate Common Data Personality

By cfheoh | November 25, 2019 - 12:33 pm |November 25, 2019 Amazon Web Services, Analytics, API, Artificial Intelligence, Big Data, Commvault, Containers, Data Archiving, Data Availability, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, eDiscovery, Hadoop, Hadoop Clusters, HDS, Hedvig, Hitachi Vantara, InfluxDB, Machine Learning, MapReduce, Object Storage


The thought of it has been on my mind since Commvault GO 2019. It was sparked when Don Foster, VP of Storage Solutions at Commvault, answered a question posed by one of the analysts. What he said made a connection, as I was searching for better insights into how Commvault and Hedvig would end up together.

Data Deluge is a swamp thing now

Several years ago, I heard Stephen Brobst, CTO of Teradata, bring up the term “Data Swamp”. It was the antithesis of the Data Lake, and this was back when Data Lakes and Hadoop were all the rage. His comments were raw and honest, and they pointed to the truth out there.

Source: https://www.deviantart.com/rhineville/art/God-that-Crawls-Detail-2-291228644

I was enamoured by his thoughts at the time, and today, his comments about the Data Swamp have manifested themselves.

Continue reading →

Data Renaissance in Oil and Gas

By cfheoh | October 3, 2019 - 6:11 pm |October 3, 2019 Acquisition, Amazon Web Services, Analytics, API, Appliance, Artificial Intelligence, Big Data, Cloud, Clusters, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, eDiscovery, Google, Hadoop Clusters, High Performance Computing, Hyperconvergence, Industry 4.0, Intel, Machine Learning, Microsoft, nVidia, Oracle Cloud, Virtualization


The Oil and Gas industry, especially in the upstream Exploration and Production (EP) sector, has been enjoying a renewed vigour in the past few years. I have kept in touch with the developments on the EP side because I have always had a soft spot for the industry. I was engaged in infrastructure and solutions on the petrotechnical side in my days at Sun Microsystems back in the late 90s. The engagements with EP intensified in my first stint at NetApp, wearing the regional Oil & Gas consulting engineer hat here in South Asia for almost 6 years. Then, with Interica in 2014, I was dealing with subsurface data and seismic interpretation technology. EP is certainly an exciting sector to cover because there is so much technical work involved and the technologies, especially the non-IT ones, are breathtaking.

I have been an annual registrant of the Digital Energy Journal events since 2013, except last year, and I have always enjoyed their newsletter. This week I attended the Digital Energy 2-day conference again, and I was taken in by the exciting times in EP. Here are a few of my views and observations on the trends in this data renaissance.

Continue reading →

Commvault big bet

By cfheoh | September 12, 2019 - 9:03 pm |September 12, 2019 Acquisition, Analytics, API, Appliance, Big Data, Business Continuity, Cisco, Cloud, Cohesity, Commvault, Data Archiving, Data Availability, Data Corruption, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, Filesystems, Hadoop, Hadoop Clusters, Hedvig, Hitachi Vantara, Hyperconvergence, ILM, Infrascale, Machine Learning, MapReduce, Minio, NAS, NetApp, Object Storage, Scale-out architecture, Software Defined Storage, Software-defined Datacenter, Storage Field Day, Storage Tiering, Tape storage, Tech Field Day, Unified Storage, Veeam, Veritas, Zerto


I woke up at 2.59am on the morning of Sept 5th, a bit discombobulated, and quickly jumped into the Commvault call. The damn alarm rang and I slept through it, but I got up just in time for the 3am call.

As I was going through the motions of getting onto UberConference, organized by GestaltIT, I was already sensing something big. In the call, Commvault announced it was acquiring Hedvig, and it hit me. My drowsy self snapped to attention at the big news. And I saw a few guys from Veritas and Cohesity on my social media group making gestures about the acquisition.

I spent the rest of the week thinking about the acquisition. What is good? What is bad? How is Commvault going to move forward? This is set against the stark background of the rumour mill here in South Asia, just a week before this acquisition news, where I heard that the entire Commvault teams in Malaysia and Asia Pacific were released. I couldn’t confirm the news in Asia Pacific, but the source of the news coming from Malaysia was a strong and reliable one.

What is good?

It is a big win for Hedvig. Nestled among several scale-out primary storage vendors with little competitive differentiation, Hedvig gets its pay day with this Commvault acquisition.

Continue reading →
