[ Preamble: This analysis focuses on my own journey as I incorporate my past experiences into this new market segment called AI Data Infrastructure, and gain new ones along the way.
There are many elements of HPC (High Performance Computing) at play here. Even though speeds and feeds, features and functions crowd many conversations, as many enterprise storage vendors like them to, these conversations are, in my opinion, secondary. There are more vital operational and technical elements that an organization has to consider prudently vis-à-vis return on investment (ROI). They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.
I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 1 ]
I have just passed my 6-month anniversary with DDN. Coming into the High Performance Storage Systems (HPSS) market segment, with its strong focus on the Lustre® distributed parallel filesystem, there was a steep learning curve for me. I spent over 3 decades in Enterprise Storage, working with some of the highest-end storage technologies in that market segment, and I had already developed my own approach to enterprise storage, based on the A.P.P.A.R.M.S.C., a framework I first put together 25 years ago and have honed ever since.
The rapid adoption of Artificial Intelligence (AI) has created a technology paradigm shift and blurred many lines. It has also been evolving my thinking when it comes to storage for AI, shifting my own thoughts, opinions and experiences along with it.
AI has brought HPSS technologies like Lustre® in the DDN EXAScaler® platform, proven in the supercomputing world, to a new realm – the AI Data Infrastructure market segment. On the other side, many enterprise storage vendors aspire to supply the AI Data Infrastructure opportunities as well. This convergence brings together the top storage performers in supercomputing, the likes of DDN, IBM® (through Storage Scale) and HPE® (through Cray, which, by the way, often uses the open-source edition of Lustre® in its storage portfolio); the software-defined storage players in Weka IO, Vast Data and MinIO; and the enterprise storage array vendors such as NetApp®, Pure Storage® and Dell®.
[ Note that I take care not to name every storage vendor for AI because many either OEM or repackage and rebrand SDS technology into their gear, such as HPE® GreenLake for Files and Hitachi® IQ. You can Google to find out who the original vendor is for each of these. There are others as well. ]
Across these 3 simplified categories (HPSS, SDS, Enterprise Storage Array), I have begun to see a pattern of each calling its technology an “AI Data Infrastructure”. At the same time, I am also developing a new set of storage conversations for the AI Data Infrastructure market segment, one that is based on OKRs (Objectives and Key Results) rather than the features, features and more features that many SDS and enterprise storage vendors like to tout. Here are a few things to look for when end users are considering a high-speed storage solution for their AI journey.
GPU is king
In the AI world, the GPU infrastructure is the deity at the altar. The utilization rate of the GPUs must be kept as high as possible to get the maximum return on investment (ROI) from the compute infrastructure. Keeping the GPUs resolutely busy is a must, and HPSS is very much part of that ecosystem.
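To put a rough number on that ROI argument, here is a back-of-the-envelope sketch in Python. Every figure in it (cluster size, cost per GPU-hour, idle fraction) is a hypothetical assumption of mine for illustration only, not a measured number.

```python
# Back-of-the-envelope cost of GPU idle time caused by storage stalls.
# All figures below are hypothetical and purely illustrative.

gpus = 512                  # GPUs in the training cluster
cost_per_gpu_hour = 2.50    # amortized or rented cost per GPU-hour (USD)
hours_per_year = 24 * 365
idle_fraction = 0.15        # share of time the GPUs sit idle waiting on data

idle_gpu_hours = gpus * hours_per_year * idle_fraction
wasted_dollars = idle_gpu_hours * cost_per_gpu_hour

print(f"Idle GPU-hours per year: {idle_gpu_hours:,.0f}")
print(f"Wasted spend per year  : ${wasted_dollars:,.0f}")
# With these assumptions: ~672,768 idle GPU-hours, roughly $1.7M per year.
```

Even with modest assumptions, idle GPUs translate into real money, which is why the storage conversation should start from GPU utilization outcomes.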
These are a few OKRs I would consider for the storage or data infrastructure for AI:
- Reliability
- Speed
- Power Efficiency
- Security
Let’s look at each one of them from the point of view of a storage practitioner like me.
Reliability
Naturally, the data infrastructure must be reliable. Most storage architectures for AI are clusters, but there are many kinds of clusters. Some storage vendors tout the appliance model, where the appliances are likely to be dual-controller HA systems running in active-active or active-passive mode. Layered over these HA pairs could be a scale-out NAS filesystem or a parallel filesystem like Lustre®, often with a single distributed namespace to make the cluster look like one giant filesystem for storage services.
With HA pairs, we have to be cognizant of the type of HA. In an Active-Passive configuration, the redundancy is N+1, but in an Active-Active configuration, the redundancy is 2N. Such are the federated clusters of HA storage pairs.
On the other hand, several software-defined storage vendors have come along to name their solution an “appliance” or an “appliance experience”. Most deploy N+2 (4+2, 6+2, 8+2 …) redundancy, largely because they run on COTS (commercial off-the-shelf) x86 servers. COTS server hardware, unlike a true engineered purpose-built appliance, is less reliable, and thus the N+2 redundancy is built in to compensate for the probable lower reliability of the COTS hardware. Most of these vendors have introduced POD appliances with N+2 redundancy (sometimes even N+4), but underneath the covers, the nodes in their clusters are still COTS servers.
Again, the clusters present high availability, and thus storage services reliability, to the AI compute nodes, but not all designs are the same. Look at the 2N, N+1 and N+2 redundancy levels and give them some thought. Don’t listen to vendor marketing too much.
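As a back-of-the-envelope way to compare these redundancy levels, the Python sketch below works through the generic arithmetic. It is a simplified mental model under my own assumptions, not a description of any specific vendor’s implementation.

```python
# Simplified arithmetic behind the redundancy levels mentioned above.
# Generic illustration only -- not modeled on any specific vendor's design.

def ec_profile(k: int, m: int) -> dict:
    """N+m erasure coding: k data chunks + m protection chunks per stripe."""
    return {
        "layout": f"{k}+{m}",
        "failures_tolerated": m,                  # concurrent losses per protection group
        "usable_capacity_fraction": k / (k + m),  # the rest is protection overhead
    }

for k, m in [(4, 2), (6, 2), (8, 2), (8, 4)]:
    p = ec_profile(k, m)
    print(f"{p['layout']:>4}: tolerates {p['failures_tolerated']} failures, "
          f"{p['usable_capacity_fraction']:.0%} usable capacity")

# By contrast, an HA controller pair (2N Active-Active or N+1 Active-Passive)
# tolerates the loss of one controller; capacity protection (RAID, mirroring)
# is typically handled separately underneath the pair.
```

The raw numbers alone do not settle the question, which is exactly why the next factor, the behavior during failures, matters so much.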
The other factor to look at in this space is the impact when nodes fail in the cluster: the impact on the performance and reliability of the storage services delivered to the GPU compute during a node (or multi-node) rebuild. The blast radius in rebuild situations may drag performance down only partially in some storage clusters, and much, much more in others. This is where the redundancy levels of N+1, N+2 and 2N should be questioned, rather than outright believing that an N+2 scheme surviving 2 node failures is better than an Active-Active HA pair with 2N redundancy.
Furthermore, in some SDS models where there is no disaggregation, the COTS server is the controller, the storage and everything else in one. That means a single node in the cluster has a higher level of vulnerability during a cluster rebuild because everything is packed into that 1U or 2U COTS server. This is another thing to consider about the reliability of the data infrastructure for AI.
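To make the “blast radius” point a little more concrete, here is a rough Python sketch using a simple Reed-Solomon-style model with hypothetical numbers of my own. Real systems apply rebuild optimizations (declustered parity, partial rebuilds, QoS throttling) that change this picture, so treat it as an illustration of the questions to ask, not a prediction.

```python
# Rough model of the rebuild "blast radius" in an erasure-coded cluster.
# Hypothetical numbers; real systems apply various rebuild optimizations.

nodes = 12               # storage nodes in the cluster
node_data_tb = 100       # data held by the failed node (TB)
k, m = 8, 2              # 8+2 erasure coding
node_bw_gbs = 20         # usable bandwidth per node (GB/s)
rebuild_share = 0.30     # fraction of bandwidth reserved for the rebuild

# Reconstructing each lost chunk requires reading k surviving chunks,
# so the cluster must read roughly k x the failed node's data.
rebuild_read_tb = k * node_data_tb

surviving_bw = (nodes - 1) * node_bw_gbs     # aggregate GB/s left in the cluster
rebuild_bw = surviving_bw * rebuild_share    # GB/s devoted to the rebuild
rebuild_hours = rebuild_read_tb * 1000 / rebuild_bw / 3600

print(f"Rebuild reads          : {rebuild_read_tb} TB")
print(f"Bandwidth left for GPUs: {surviving_bw * (1 - rebuild_share):.0f} GB/s "
      f"(down from {nodes * node_bw_gbs} GB/s)")
print(f"Approximate rebuild    : {rebuild_hours:.1f} hours")
```

The point is not the exact hours; it is that for several hours the GPUs see a cluster running degraded, and the non-disaggregated designs feel it hardest because every surviving node is doing rebuild work and client I/O at the same time.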
Speed
The most tangible and obvious metrics are the storage performance numbers. Without high-performance, accelerated data delivery, the GPUs starve. They sit idle, and that means the GPUs are not working to train or infer for AI.
Today, customers are bombarded with plenty of performance statistics. Many whom I have spoken to have been blinded by each vendor’s marketing prowess. I would even go as far as to say that some wool has been pulled over some eyes. I usually point them to the independent and authoritative source, MLCommons. The MLPerf Storage benchmark results were published in September 2024 to provisionally separate the best from the rest. Many storage vendors participate, each with its own way of interpreting and explaining those venerated results to the market. Confusion ensues once again.
So, it is back to peppering my conversations with the usual speeds and feeds of this storage dinosaur. Read and Write throughput measured in GB/sec are the metrics most easily grasped by end users and IT practitioners … for now.
But there is a kicker: a lesser-known OKR that leads to time-to-result, namely checkpointing in AI training, a term often thrown around but less understood. The importance of Write I/O speed in high-performance storage technology for AI magnifies the gap between a super-charged storage technology and an ordinary one, which usually sings about Read I/O speeds but says little about Write I/O speeds. In AI training checkpointing, a very performant Write throughput matters more. Here is a look at the AI data pipeline.
Each phase in the pipeline requires a different storage performance profile, in both Reads and Writes. Both are equally important to help the AI infrastructure reach the end of a training epoch faster, and to do more, quicker and more reliably as well.
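To illustrate why Write throughput dominates the checkpointing phase in particular, here is a back-of-the-envelope Python sketch. The model size, the ~14 bytes of checkpoint state per parameter (fp16 weights plus fp32 Adam optimizer state), the checkpoint interval and the synchronous-checkpoint behavior are all my own illustrative assumptions, not figures from any specific deployment.

```python
# Back-of-the-envelope look at checkpoint Write pressure during AI training.
# All figures are hypothetical and purely illustrative.

params = 70e9              # model size: 70B parameters (assumption)
bytes_per_param = 14       # approx. fp16 weights + fp32 Adam optimizer state
checkpoint_gb = params * bytes_per_param / 1e9

ckpt_interval_s = 30 * 60                 # checkpoint every 30 minutes
write_throughputs_gbs = [20, 100, 500]    # aggregate storage Write throughput (GB/s)

print(f"Checkpoint size: ~{checkpoint_gb:,.0f} GB")
for bw in write_throughputs_gbs:
    stall_s = checkpoint_gb / bw          # GPUs wait if the checkpoint is synchronous
    lost = stall_s / (ckpt_interval_s + stall_s)
    print(f"  {bw:>3} GB/s Write -> {stall_s:5.1f} s stall per checkpoint, "
          f"~{lost:.1%} of GPU time lost")
```

Multiply those stalls by thousands of checkpoints in a long training run, and by the restart-from-checkpoint time after every failure, and the Write side of the storage story becomes very hard to ignore.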
I encourage you to watch this video. It is a presentation by John Fragalla, a Principal Storage Architect at NVIDIA®, talking about the design principles of data infrastructure technology for a very large-scale NVIDIA® cluster. It is 19.35 minutes long, but trust me, it is worth it.
In John’s presentation, the OKR thinking is evident. He spoke about the outcome of “quicker time to train” on the GPUs rather than about throughput performance numbers. He also spoke about the “efficiency of the GPUs”, something equally important in the realm of AI Data Infrastructure.
In both mentions (and there are plenty more in the video), the focus is on outcomes. Therefore, it is important for storage architects in the AI Data Infrastructure space to build upon outcomes and not be fixated on the speeds and feeds and the features and functions, as I mentioned earlier. The outcomes are what lead to strong ROI on the GPU investments.
In fact, some enterprise storage features and functions are even detrimental to the performance of the accelerated data delivery paths to the GPU compute complex. Yet I have seen several storage technology vendors pitch them to end users to create a FOMO effect, because their competitors might be doing it differently or may not have an apples-to-apples comparison to their “superior” technology feature. Again, I urge readers and end users to ask the hard questions.
That is why I am advocating for organizations to do deeper research and develop storage-related OKRs for their AI journey. Focus on the AI long game. More on that in Part 2.
Link to Part 2 here (to be updated soon)