GPU – Storage Gaga

Rethinking Storage OKRs for AI Data Infrastructure – Part 2

By cfheoh | January 13, 2025 - 8:00 am |January 12, 2025 Acquisition, AMD, Analytics, API, Appliance, Artificial Intelligence, Big Data, Cloud, Clusters, Composable Infrastructure, Data Direct Networks, Data Governance, Data Management, Data Security, DDN, Digital Transformation, Filesystems, Gartner, High Performance Computing, Intel, Lustre, nVidia, NVMe, Performance Benchmark, Scale-out architecture, Security, Software Defined Storage, Storage Optimization

[ Preamble: This analysis focuses on my own journey as I incorporate my experiences into this new market segment called AI Data Infrastructure. There are many elements of HPC (High Performance Computing) at play here. Speeds and feeds, features and functions crowd many conversations, as they do with many enterprise storage vendors, but these conversations are, in my opinion, secondary. There are more vital operational and technical elements that an organization has to consider prudently. They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.

I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 2 ]

This is a continuation of Part 1 of my blog last week. I spoke about the 4 key OKRs (Objectives and Key Results) we look at from the storage point of view with regard to AI data infrastructure. To recap, they are:

Reliability
Speed
Power Efficiency
Security

Power Efficiency

Patrick Kennedy of ServeTheHome (STH) fame astutely explained the new generation of data center racks required by NVIDIA® and AMD® in his article “Here is how much Power we expect AMD® and NVIDIA® racks will need in 2027” 2 weeks ago. Today, the NVIDIA® GB200 NVL72 ORv3 rack design draws 120kW per rack. That’s an insane amount of power consumption that can only go up in the next 2-3 years. That is why power efficiency must be an OKR metric to be deeply evaluated.

When you operate a GPU compute farm, whether it is 8 GPUs or 16,384 GPUs, keeping operations tight is vital so that power efficiency sits right up there with the rest of the operational OKRs. Power consumption becomes a cost factor in the data infrastructure design for AI.

2 very important units of measurement I would look into, and that have become valuable OKRs to achieve, are Performance per Watt (Performance/Watt) and Performance per Rack Unit (Performance/RU).
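To make these two ratios concrete, here is a minimal Python sketch of how I would compare storage systems on Performance/Watt and Performance/RU. The `StorageSystem` class, the system names and every number in it are hypothetical and for illustration only; they are not measurements of any real product. Performance is taken here as sustained throughput in GB/s, which is an assumption; IOPS or any other metric could be substituted.

```python
from dataclasses import dataclass


@dataclass
class StorageSystem:
    """Hypothetical storage configuration; all figures are illustrative."""
    name: str
    throughput_gbps: float  # sustained throughput in GB/s (assumed metric)
    power_watts: float      # power draw under load, in watts
    rack_units: int         # rack space occupied, in RU

    def perf_per_watt(self) -> float:
        """Performance per Watt: throughput divided by power draw."""
        return self.throughput_gbps / self.power_watts

    def perf_per_ru(self) -> float:
        """Performance per Rack Unit: throughput divided by rack space."""
        return self.throughput_gbps / self.rack_units


# Two made-up systems: system_b is faster in absolute terms,
# but system_a does more with each watt and each rack unit.
systems = [
    StorageSystem("system_a", throughput_gbps=90.0, power_watts=2000.0, rack_units=2),
    StorageSystem("system_b", throughput_gbps=120.0, power_watts=4000.0, rack_units=8),
]

for s in systems:
    print(f"{s.name}: {s.perf_per_watt():.3f} GB/s per watt, "
          f"{s.perf_per_ru():.1f} GB/s per RU")
```

The point of the sketch is that the system with the highest raw throughput is not necessarily the one with the best efficiency ratios, which is why I treat both as OKRs in their own right.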

Power Efficiency in Data Center is a Must.

AI and the Data Factory

By cfheoh | November 19, 2024 - 6:13 am |November 19, 2024 Algorithm, Analytics, API, Appliance, Artificial Intelligence, Cloud, Clusters, Composable Infrastructure, Data, Data Direct Networks, Data Governance, Data Management, Data Privacy, Data Protection, Data Security, DDN, Deep Learning, Digital Transformation, Filesystems, Hadoop Clusters, High Performance Computing, Lustre, Machine Learning, Mellanox, Mellanox Technologies, Minio, nVidia, Object Storage, Parallel NFS, Performance Benchmark, Performance Caching, RDMA, Scale-out architecture, Storage Optimization

When I first heard the term “AI Factory”, the world was blaring Jensen Huang’s keynote at NVIDIA GTC24. I thought those were cool words, since he spoke of the raw material of water going into the factory to produce electricity. The analogy was spot on for the AI we are building.

As I engage with many DDN partners and end users in the region, week in, week out, the “AI Factory” term keeps popping into conversations. Yet, many still do not know how to go about building this “AI Factory”. They only know they need to buy GPUs, lots of them. These companies’ AI ambitions are unabated. IDC predicts that worldwide spending on AI will double by 2028, and yet the ROI (return on investment) remains elusive.

At the ground level, based on many conversations so far, the common theme is that the steps to begin building the AI Factory are ambiguous and fuzzy to most. I would like to share my views from a data storage point of view. Hence, my take on the Data Factory for AI.

Are you AI-ready?

We have to have a plan, but before we take the first step, we must look at where we are standing at the present moment. We know that to train AI, the proverbial first step is that we need lots of data. Deep Learning (DL), which underpins Large Language Models (LLMs) and Generative AI (GenAI), needs tons of data.

If a company knows where it is, it will know which phase comes next. So, in the AI Maturity Model (I simplified the diagram below), where is your company now? Are you AI-ready?

Simplified AI Maturity Model

Get the Data Strategy Right

In his interview with CRN, MinIO’s CEO AB Periasamy said, “For generative AI, they realized that buying more GPUs without a coherent data strategy meant GPUs are going to idle out”. I was struck by his wisdom about having a coherent data strategy because that is absolutely true. This is my starting point. Having the Right Data Strategy.

In the AI world, from a data storage guy’s point of view, data is the fuel. Data is the raw material that Jensen alluded to, in case that was not obvious. We have heard this line many times before, even before the AI phenomenon took over. AI is data-driven. Data is vital for the ROI of AI projects. And thus, we must look from the point of view of the data to make the AI Factory successful.

AI is pushing storage and data management harder than ever

By cfheoh | April 29, 2024 - 7:30 am |April 28, 2024 100Gigabit Ethernet, Algorithm, Analytics, Artificial Intelligence, Big Data, Clusters, Data, Data Governance, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, DMTF, High Performance Computing, Machine Learning, nVidia, Scale-out architecture, Solidigm, Tech Field Day

I am on a learning streak again. The most prominent technology that keeps landing on my tray at present is, of course, Artificial Intelligence (AI).

AI is hot. Very hot. And overhyped. Everyone is an expert nowadays. Yeah, right. Not me.

Underneath that glossy veneer of the AI hype, there is much going on behind the scenes to make AI great. The 2 areas I have been involved in and practiced for a long time are data infrastructure a.k.a. storage, and data management. And both are playing prominent parts in the advancement of the AI ecosystem. This makes me very excited.

I am no expert, but learning from various sources is already telling me that AI is pushing both storage and data management harder than ever before, much harder than traditional enterprise on-premises use cases and even the cloud computing applications. I ask myself, “where do I start my learning again?” as I journal my process.

Storage performance in a Data Pipeline

The speed at which AI responds is Trust. The faster it gets to accurate and relevant responses, the more trust it builds in AI. Getting to the speed that we want is not an easy thing, and storage a.k.a. data infrastructure is doing its part. I pick up my learning from understanding the AI pipeline. One early help comes from my friend, Gina Rosenthal, who attended Solidigm’s presentation at AI Field Day in February 2024. Her article, titled “Why storage matters for AI – Solidigm”, kickstarted my learning juices again.

I was particularly captivated by this slide in Gina’s article. It defines the laborious path data takes to become useful for AI applications.

Storage and the 5 stages of AI Work (reference: Solidigm presentation on AI Field Day)

Storage IO straight to GPU

By cfheoh | July 5, 2021 - 9:00 am |July 3, 2021 100Gigabit Ethernet, Algorithm, Analytics, API, Artificial Intelligence, Composable Infrastructure, compression, CXL, Deduplication, Deep Learning, Filesystems, High Performance Computing, Hyperconvergence, Machine Learning, Mellanox, Mellanox Technologies, Microsoft, nVidia, NVMe, RDMA, Vast Data, WekaIO

The parallel processing power of the GPU (Graphics Processing Unit) cannot be denied. One year ago, nVidia® overtook Intel® in market capitalization. And today, nVidia®’s market cap is more than double that of Intel®: [as of July 2, 2021] USD$510.53 billion vs USD$229.19 billion.

Thus it is not surprising that storage architectures are changing from the CPU-centric paradigm to take advantage of the burgeoning prowess of the GPU. And 2 announcements in the storage news in recent weeks have caught my attention – Windows 11 DirectStorage API and nVidia® Magnum IO GPUDirect® Storage.

nVidia GPU

Exciting the gamers

The Windows DirectStorage API feature is only available in Windows 11. It was announced as part of the Xbox® Velocity Architecture last year to take advantage of the high I/O capability of modern day NVMe SSDs. DirectStorage-enabled applications and games have several technologies at their disposal, such as the Direct3D (D3D) decompression/compression algorithm designed for the GPU, and Sampler Feedback Streaming (SFS), which uses the previously rendered frame results to decide which higher resolution textures are to be loaded into the memory of the GPU and rendered for the real-time gaming experience.

The Edge is coming! The Edge is coming!

By cfheoh | October 12, 2020 - 9:15 am |October 11, 2020 100Gigabit Ethernet, Analytics, Big Data, Containers, Data, Deep Learning, Edge Computing, Flash, Industry 4.0, InfluxDB, Linux, Machine Learning, Mellanox, Mellanox Technologies, Minio, nVidia, NVMe, Pravega, SNIA, Solid State Devices

Actually, Edge Computing is already here. It has been on everyone’s lips for quite some time, but for me and for many others, Edge Computing is still a hodgepodge of many things. The proliferation of devices, IoT, sensors and endpoints being pulled into the ubiquitous term of Edge Computing has made the scope ever changing and difficult to pin down. And it is this proliferation of edge devices that will generate voluminous amounts of data. Obvious questions emerge:

How do you store all the data?
How do you process all the data?
How do you derive competitive value from the data from these edge devices?
How do you securely transfer and share the data?

From the storage technology perspective, it might be easier to observe what the traits of the data generated on the edge device are. In this blog, we also look at some new storage technologies out there that could be part of the Edge Computing present and future.

Edge Computing overview – Cloud to Edge to Endpoint

Storage at the Edge

The mantra of putting compute as close to the data as possible and processing it where it is stored is the main crux right now, at least where storage of the data is concerned. The latency to the computing resources on the cloud and back to the edge devices will not be conducive, and in many older settings, these edge devices in factories may not even be network-enabled. In my last encounter several years ago, there were more than 40 interfaces, specifications and protocols, most of them proprietary, for the edge devices. And there is no industry-wide standard for these edge devices either.

Storage Performance Considerations for AI Data Paths

By cfheoh | June 17, 2019 - 10:50 am |June 17, 2019 100Gigabit Ethernet, Algorithm, Analytics, API, Artificial Intelligence, Big Data, Cloud, Composable Infrastructure, Data, Data Fabric, Data Management, Data Privacy, Data Security, Digital Transformation, Drivescale, E8 Storage, Edge Computing, Elastifile, Excelero, Filesystems, High Performance Computing, Hyperconvergence, Industry 4.0, Infiniband, Intel, Liqid, Lustre, Machine Learning, Mellanox Technologies, NVMe, Object Storage, Performance Benchmark, Performance Caching, Quantum Corporation, RDMA, Software-defined Datacenter, Storage Optimization, Storage Tiering, ThinkParq, Vast Data, Virtualization, WekaIO

The hype of Deep Learning (DL), Machine Learning (ML) and Artificial Intelligence (AI) has reached an unprecedented frenzy. Every infrastructure vendor from servers, to networking, to storage has a word to say or play about DL/ML/AI. This prompted me to explore this hyped ecosystem from a storage perspective, notably from a storage performance requirement point-of-view.

One question on my mind

There are plenty of questions on my mind. One stood out and that is related to storage performance requirements.

Reading and learning from one storage technology vendor to another, the context of everyone’s play against their competitors seems to be “They are archaic, they are legacy. Our architecture is built from ground up, modern, NVMe-enabled”. There is more juxtaposing along these lines, but you get the picture – “We are better, no doubt”.

Are the data patterns and behaviours of AI different? How do they affect the storage design as the data moves through the workflow, the data paths and the lifecycle of the AI ecosystem?

Scaling new HPC with Composable Architecture

By cfheoh | May 10, 2019 - 4:50 pm |May 10, 2019 100Gigabit Ethernet, Analytics, API, Appliance, Artificial Intelligence, Big Data, Cloud, Clusters, Composable Infrastructure, Containers, Data Fabric, Data Management, Deep Learning, DellEMC, Drivescale, High Performance Computing, Hyperconvergence, Infiniband, Liqid, Machine Learning, nVidia, NVMe, PCIe, RDMA, Scale-out architecture, Software Defined Storage, Software-defined Datacenter, Tech Field Day, Unified Storage, Virtualization

[Disclosure: I was invited by Dell Technologies as a delegate to their Dell Technologies World 2019 Conference from Apr 29-May 1, 2019 in Las Vegas, USA. Tech Field Day Extra was an included activity as part of Dell Technologies World. My expenses, travel, accommodation and conference fees were covered by Dell Technologies, the organizer, and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

Deep Learning, Neural Networks, Machine Learning and subsequently Artificial Intelligence (AI) are the new generation of applications and workloads for commercial HPC systems. These differ from the traditional, more scientific and engineering HPC workloads, and I have written about the new dawn of supercomputing and the attractive posture of commercial HPC.

Don’t be idle

From the business perspective, the investment in HPC systems is high most of the time, and justifying it to the executives and the investors is not easy. Therefore, it is critical to keep feeding the HPC systems and significantly minimize the idle times for compute, GPUs, network and storage.

However, almost all HPC systems today are inflexible. Once assigned to a project, the resources pretty much stay with the project, even when the workload processing of the project is idle and waiting. Of course, we have to bear in mind that not all resources are fully abstracted, virtualized and software-defined whereby you can carve out pieces of the hardware and deliver a percentage of that resource. Case in point is the CPU, where you cannot assign half of the clock cycles of a CPU to one project and the other half to another. The technology isn’t there yet. Certain resources like the GPU are going down the path of Virtual GPU, and into the realm of resource disaggregation. Eventually, all resources of the HPC systems – CPU, memory, FPGA, GPU, PCIe channels, NVMe paths, IOPS, bandwidth, burst buffers etc – should be disaggregated and pooled for disparate applications and workloads based on demands of usage, time and performance.

Hence we are beginning to see disaggregated HPC system resources composed and built up to meet the diverse mix and needs of HPC applications and workloads. This is even more acute when an AI project might grow cold, but the training of AI/ML/DL workloads continues to stay hot.
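To illustrate the idea of pooling and composing, here is a minimal Python sketch of a resource pool from which projects borrow units and hand them back when they go idle. The `ResourcePool` class and its `compose` and `release` methods are hypothetical names of my own for illustration; this is not how any particular composable infrastructure product exposes its API.

```python
class ResourcePool:
    """Hypothetical pool of one disaggregated resource type (e.g. GPUs)."""

    def __init__(self, kind, total):
        self.kind = kind
        self.total = total
        self.assigned = {}  # project name -> number of units held

    @property
    def free(self):
        """Units not currently assigned to any project."""
        return self.total - sum(self.assigned.values())

    def compose(self, project, count):
        """Assign `count` units to a project if enough are free."""
        if count <= 0 or count > self.free:
            return False
        self.assigned[project] = self.assigned.get(project, 0) + count
        return True

    def release(self, project):
        """Return all of a project's units to the pool; report how many."""
        return self.assigned.pop(project, 0)


gpus = ResourcePool("gpu", total=8)
first = gpus.compose("training", 6)     # succeeds, 2 units left
second = gpus.compose("inference", 4)   # refused, only 2 units free
freed = gpus.release("training")        # idle project hands back its 6 units
third = gpus.compose("inference", 4)    # now succeeds
print(first, second, freed, third, gpus.free)
```

In a static allocation, the second request would simply wait while the first project's GPUs sat idle. With a pool, releasing the idle project's units lets the other workload proceed, which is the behaviour composable architecture is after.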

Liqid the early leader in Composable Architecture
