When I first heard the term “AI Factory”, the world was blaring Jensen Huang‘s keynote at NVIDIA GTC24. I thought those were cool words, since he spoke of a raw material, water, going into a factory to produce electricity. The analogy was spot on for the AI we are building.
As I engage with many DDN partners and end users in the region, week in, week out, the term “AI Factory” keeps popping up in conversations. Yet, many still do not know how to go about building this “AI Factory”. They only know they need to buy GPUs, lots of them. These companies’ AI ambitions are unabated. IDC predicts that worldwide spending on AI will double by 2028, and yet the ROI (return on investment) remains elusive.
At the ground level, based on many conversations so far, the common theme is that the steps to begin building the AI Factory are ambiguous and fuzzy to most. I would like to share my views from a data storage point of view. Hence, my take on the Data Factory for AI.
Are you AI-ready?
We have to have a plan, but before we take the first step, we must look at where we stand at the present moment. We know the proverbial first step to training AI: we need lots of data. Deep Learning (DL), Large Language Models (LLMs), and Generative AI (GenAI) all need tons of data.
If a company knows where it is, it will know which phase comes next. So, in the AI Maturity Model (I simplified the diagram below), where is your company now? Are you AI-ready?
Get the Data Strategy Right
In his interview with CRN, MinIO’s CEO AB Periasamy said, “For generative AI, they realized that buying more GPUs without a coherent data strategy meant GPUs are going to idle out”. I was struck by his wisdom about having a coherent data strategy because it is absolutely true. This is my starting point: having the Right Data Strategy.
In the AI world, from a data storage guy’s point of view, data is the fuel. Data is the raw material that Jensen alluded to, in case it was not obvious. We have heard this said many times, even before the AI phenomenon took over. AI is data-driven. Data is vital to the ROI of AI projects. And thus, we must look at things from the data’s point of view to make the AI Factory successful.
Get the Data Right
If we spend the time reading the tea leaves (I learned that this is called tasseography), the message is that we should be spending more time, money, and resources to get the data right before AI training. Readers of my blog know that I am a big advocate of Data Governance. The blueprint of good data, data of high quality and standards, starts with good data governance. AI is a big beneficiary of Data Governance, as I wrote in my blog on Data Governance a few months ago.
After all, we all know that higher-octane fuel lets a car engine run more efficiently. We must do the same for the data we feed into the AI Factory’s processing engines.
Get the Data Infrastructure Right
GPU starvation is a problem when the GPUs are not fed efficiently. Idle GPUs are bad investments because you are not using them to process data, train models, and churn out results faster. And AI datasets are super large. They cannot fit entirely into GPU memory, and they cannot fit into a local node’s storage either. What is needed is a supercharged shared storage cluster, a true parallel filesystem that can support petabytes and even exabytes of data. A tried-and-tested storage technology in that category of super-high speed and extremely large capacity is the Lustre filesystem, running at the heart of DDN‘s EXAScaler platforms.
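To make “GPU starvation” concrete, here is a minimal sketch (my own illustration, not DDN code) that times how long each training step waits on data versus how long it spends computing on the GPU, assuming PyTorch and a synthetic stand-in dataset.

```python
# Minimal sketch: measure how long each training step waits on data
# versus how long it spends computing on the GPU. A low compute share
# is a sign of GPU starvation (the data path cannot keep up).
import time
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticDataset(Dataset):            # stand-in for a real dataset
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), torch.tensor(0)

loader = DataLoader(SyntheticDataset(), batch_size=64, num_workers=4)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 16, 3).to(device)

wait_s, compute_s = 0.0, 0.0
it = iter(loader)
for _ in range(100):
    t0 = time.perf_counter()
    x, _ = next(it)                         # time spent waiting for data
    t1 = time.perf_counter()
    model(x.to(device)).sum().backward()    # a token amount of GPU work
    if device == "cuda":
        torch.cuda.synchronize()            # make the GPU timing meaningful
    t2 = time.perf_counter()
    wait_s += t1 - t0
    compute_s += t2 - t1

print(f"data wait: {wait_s:.1f}s, compute: {compute_s:.1f}s, "
      f"compute share: {compute_s / (wait_s + compute_s):.0%}")
```

If the data wait dominates, the storage and data delivery path, not the GPU, is the bottleneck, and that is exactly the gap a parallel filesystem is meant to close.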
In DDN’s case, it is not just the EXA storage servers that participate in keeping the GPUs busy. Smart EXA clients with intelligent data placement and data delivery are all part of the package. I have written about DDN technologies like Hot Nodes and NVIDIA® GPUDirect Storage, the cornerstones of accelerated data delivery paths to the GPU cluster.
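As a rough illustration of what GPUDirect Storage enables, reading from storage straight into GPU memory without bouncing through a CPU buffer, here is a minimal sketch using the open-source RAPIDS kvikio Python bindings for cuFile. The file path is hypothetical and this is my own sketch, not DDN’s or NVIDIA’s reference code.

```python
# Minimal sketch: read a file from a parallel filesystem mount straight
# into GPU memory using cuFile / GPUDirect Storage via RAPIDS kvikio.
# The path is hypothetical; kvikio falls back to a compatibility (POSIX)
# read path when GPUDirect Storage is not available on the system.
import cupy
import kvikio

path = "/mnt/exascaler/dataset/shard-0000.bin"  # hypothetical Lustre mount
nbytes = 256 * 1024 * 1024                      # read 256 MiB for the demo

gpu_buf = cupy.empty(nbytes, dtype=cupy.uint8)  # destination buffer on the GPU
f = kvikio.CuFile(path, "r")
bytes_read = f.read(gpu_buf)                    # blocking read into GPU memory
f.close()
print(f"read {bytes_read} bytes directly into GPU memory")
```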
Get the Data Factory Right
GPUs are big investments. There are many storage vendors to choose from, but every investment deserves a balanced analysis of costs, benefits, and risks. And it is important to choose a storage solution that is proven in the field.
Not many come close to the one vendor that has been chosen over and over again by NVIDIA® themselves, by NVIDIA® customers for NCP (NVIDIA® Cloud Partner) DGX SuperPOD and BasePOD deployments, and by many High Performance Computing customers as well. Pairing NVIDIA® GPU deployments with the most experienced storage technology vendor is therefore the most prudent proposition. That vendor is DDN, of course.
Data Gravity and cost of AI Training
Data is heavy. Data gravity attracts other bits and pieces of data, making the data heavier still. The enterprise storage philosophy of cramming in every cool feature there is to win the hearts of buyers does little to keep the GPU investments purring.
Thus, many buyers (or should I say investors) tend to carry over a deep-seated expectation that they must get more features from their data storage technologies as well. But when we look at the training phase of the AI Data Pipeline, where GPU utilization should be at its highest, enterprise features like snapshots and deduplication on the storage server side are the least desired. They are weights that heavily pull down the storage throughput needed to feed the hungry GPUs.
If we observe the AI Data Pipeline below (from Solidigm™’s blog “The role of Solidigm SSDs in AI Storage advancement”, which, by the way, is a fantastic blog), the phase most critical for the GPUs is the middle one, boxed in red below.
This is the phase we should be focusing on, because this is the phase companies buy GPUs for. The focus here is eyes on the prize: the GPUs. That is why it is important to get the Data Factory right.
The math of AI Factory investment
In a totally legit-looking blog (plenty of maths involved, though I did not do the math myself) I chanced upon the other day, the cost to train Llama 3.1 with 405 billion parameters works out to around USD$52 million. The math divides the total FLOPs (floating point operations) needed to train the model on the data by the FLOPS (floating point operations per second) of throughput each GPU can sustain, which gives the GPU hours required. Don’t believe me? Read the article titled “How long does it take to train the LLM from scratch?”
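To make the arithmetic concrete, here is a back-of-the-envelope sketch using the common approximation that training FLOPs ≈ 6 × parameters × tokens. The token count, per-GPU throughput, utilization, and hourly GPU price below are my own illustrative assumptions, not figures from that article, but they land in the same ballpark as the USD$52 million estimate.

```python
# Back-of-the-envelope training cost, using the common approximation
# total_flops ≈ 6 * parameters * training_tokens.
# All inputs below are illustrative assumptions, not quoted figures.
params = 405e9            # Llama 3.1 405B parameters
tokens = 15e12            # assumed training tokens (order of magnitude)
peak_flops = 1e15         # assumed ~1 PFLOPS peak per GPU (BF16-class)
utilization = 0.40        # assumed sustained fraction of peak throughput
price_per_gpu_hour = 2.0  # assumed USD per GPU hour

total_flops = 6 * params * tokens
effective_flops = peak_flops * utilization      # sustained FLOPS per GPU
gpu_hours = total_flops / effective_flops / 3600
cost = gpu_hours * price_per_gpu_hour

print(f"GPU hours: {gpu_hours:,.0f}")   # roughly 25 million GPU hours
print(f"cost: ${cost:,.0f}")            # roughly $50 million
```

The denominator is where the Data Factory matters: any improvement in sustained GPU throughput translates directly into fewer GPU hours and lower cost.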
With science and math backing up AI training, we now know that the higher we can push GPU throughput (assisted by hyper-performance data storage like DDN’s), the faster we can train the models and put the AI applications into production. This is what building the Data Factory for the AI Factory is all about.
So, this isn’t Charlie getting lucky with a Golden Ticket to visit the chocolate factory. This is Jensen Huang’s Golden Ticket to his AI Factory vision, and the Data Factory is there to make it happen.