Rethinking Storage OKRs for AI Data Infrastructure – Part 2

[ Preamble: This analysis focuses on my own journey as I bring my experience into the new market segment called AI Data Infrastructure. There are many elements of HPC (High Performance Computing) at play here. Speeds and feeds, features and functions crowd many conversations, as they do with many enterprise storage vendors, but in my opinion these are secondary. There are more vital operational and technical elements that an organization has to consider prudently. They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.

I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 2 ]

This is a continuation of Part 1 of my blog from last week. I spoke about the 4 key OKRs (Objectives and Key Results) we look at from the storage point of view with regard to AI data infrastructure. To recap, they are:

  • Reliability
  • Speed
  • Power Efficiency
  • Security

Power Efficiency

Patrick Kennedy of ServeTheHome (STH) fame astutely explained the new generation of data center racks required by NVIDIA® and AMD® in his article “Here is how much Power we expect AMD® and NVIDIA® racks will need in 2027” 2 weeks ago. Today, the NVIDIA® GB200 NVL72 ORv3 rack design draws 120kW per rack. That’s an insane amount of power, and it can only go up in the next 2-3 years. That is why power efficiency must be an OKR metric that is deeply evaluated.

When you operate a GPU compute farm, whether it is 8 GPUs or 16,384 GPUs, keeping operations tight is vital, and power efficiency belongs right up there with the rest of the operational OKRs. Power consumption becomes a cost factor in the data infrastructure design for AI.

Two very important units of measurement I would look into, and which have become valuable OKRs to achieve, are Performance per Watt (Performance/Watt) and Performance per Rack Unit (Performance/RU).
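To make the two ratios concrete, here is a minimal sketch in Python. It assumes "performance" means sustained throughput in GB/s, which is only one possible choice (IOPS would work the same way). The system names and all the numbers are hypothetical, made up purely for illustration, and are not taken from any vendor.

```python
from dataclasses import dataclass


@dataclass
class StorageSystem:
    """A hypothetical storage system used to illustrate the two ratios."""
    name: str
    throughput_gbs: float  # sustained throughput in GB/s (assumed performance metric)
    power_watts: float     # power draw in watts at that throughput
    rack_units: int        # rack space occupied, in RU

    def perf_per_watt(self) -> float:
        # Performance/Watt: throughput delivered per watt consumed
        return self.throughput_gbs / self.power_watts

    def perf_per_ru(self) -> float:
        # Performance/RU: throughput delivered per rack unit occupied
        return self.throughput_gbs / self.rack_units


# Illustrative, made-up figures; not measurements of any real product.
systems = [
    StorageSystem("system-a", throughput_gbs=90.0, power_watts=2000.0, rack_units=2),
    StorageSystem("system-b", throughput_gbs=120.0, power_watts=4000.0, rack_units=4),
]

for s in systems:
    print(f"{s.name}: {s.perf_per_watt():.3f} GB/s per watt, "
          f"{s.perf_per_ru():.1f} GB/s per RU")
```

In this made-up comparison, system-b has the higher raw throughput, yet system-a comes out ahead on both ratios. That is the kind of question these two OKRs are meant to surface.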

Power Efficiency in the Data Center is a Must.


AI and the Data Factory

When I first heard the term “AI Factory”, the world was blaring Jensen Huang's keynote at NVIDIA GTC24. I thought those were cool words, since he spoke of water as the raw material going into a factory to produce electricity. The analogy was spot on for the AI we are building.

As I engage with many DDN partners and end users in the region, week in, week out, the term “AI Factory” keeps popping into conversations. Yet, many still do not know how to go about building this “AI Factory”. They only know they need to buy GPUs, lots of them. These companies’ AI ambitions are unabated. IDC predicts that worldwide spending on AI will double by 2028, and yet the ROI (return on investment) remains elusive.

At the ground level, based on many conversations so far, the common theme is that the steps to begin building the AI Factory are ambiguous and fuzzy to most. I would like to share my views from a data storage point of view. Hence, my take on the Data Factory for AI.

Are you AI-ready?

We have to have a plan, but before we take the first step, we must look at where we stand at the present moment. We know that to train AI, the proverbial first step is that we need lots of data. Deep Learning (DL), which powers Large Language Models (LLMs) and Generative AI (GenAI), needs tons of data.

If a company knows where it is, it will know which phase comes next. So, in the AI Maturity Model (I have simplified the diagram below), where is your company now? Are you AI-ready?

Simplified AI Maturity Model

Get the Data Strategy Right

In his interview with CRN, MinIO’s CEO AB Periasamy said, “For generative AI, they realized that buying more GPUs without a coherent data strategy meant GPUs are going to idle out”. I was struck by his wisdom about having a coherent data strategy because it is absolutely true. This is my starting point: having the right data strategy.

In the AI world, from a data storage guy's point of view, data is the fuel. Data is the raw material that Jensen alluded to, in case it was not obvious. We have heard this saying many times before, even before the AI phenomenon took over. AI is data-driven. Data is vital for the ROI of AI projects. And thus, we must look from the point of view of the data to make the AI Factory successful.
