[ Preamble: This analysis focuses on my own journey as I incorporate my experiences into this new market segment called AI Data Infrastructure. There are many elements of HPC (High Performance Computing) at play here. Speeds and feeds, features and functions crowd many of these conversations, as they do for many enterprise storage vendors, but these conversations, in my opinion, are secondary. There are more vital operational and technical elements that an organization has to consider prudently. They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.
I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 2 ]
This is a continuation from Part 1 of my blog last week. I spoke about the 4 key OKRs (Objectives and Key Results) we look at from the storage point-of-view with regard to AI data infrastructure. To recap, they are:
- Reliability
- Speed
- Power Efficiency
- Security
Power Efficiency
Patrick Kennedy of ServeTheHome (STH) fame astutely explained the new generation of data center racks required by NVIDIA® and AMD® in his article 2 weeks ago, “Here is how much Power we expect AMD® and NVIDIA® racks will need in 2027”. Today, the NVIDIA® GB200 NVL72 ORv3 rack design draws 120kW per rack. That is an insane amount of power consumption, and it can only go up in the next 2-3 years. That is why power efficiency must be an OKR metric that is deeply evaluated.
When you operate a GPU compute farm, whether it is 8 GPUs or 16,384 GPUs, keeping operations tight is vital to ensure that maximum power efficiency is right up there with the rest of the operational OKRs. Power consumption becomes a cost factor in the data infrastructure design for AI.
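To make that cost factor tangible, here is a minimal back-of-the-envelope sketch in Python. The 120kW rack draw comes from the NVL72 example above; the PUE (Power Usage Effectiveness) and electricity tariff values are my own hypothetical assumptions, purely for illustration.

```python
# Back-of-the-envelope energy cost for one GB200 NVL72-class rack.
# The 120 kW figure is from the article above; PUE and tariff are
# hypothetical assumptions chosen purely for illustration.

RACK_POWER_KW = 120.0      # NVIDIA GB200 NVL72 ORv3 rack draw (from the text)
PUE = 1.3                  # assumed Power Usage Effectiveness (cooling overhead)
TARIFF_USD_PER_KWH = 0.10  # assumed electricity tariff
HOURS_PER_YEAR = 24 * 365

facility_kw = RACK_POWER_KW * PUE
annual_kwh = facility_kw * HOURS_PER_YEAR
annual_cost = annual_kwh * TARIFF_USD_PER_KWH

print(f"Facility draw per rack : {facility_kw:.0f} kW")
print(f"Annual energy per rack : {annual_kwh:,.0f} kWh")
print(f"Annual energy cost     : ${annual_cost:,.0f}")
# ~156 kW facility draw -> ~1.37 million kWh -> ~$137k per rack, per year
```

Even under these conservative assumptions, every rack of this class carries a six-figure annual energy bill, and that is before the storage infrastructure around it is counted.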
2 very important units of measurement I would look into, and that have become valuable OKRs to achieve, are Performance per Watt (Performance/Watt) and Performance per Rack Unit (Performance/RU).
Some vendors’ storage clusters are disaggregated, separating the front-end storage processing nodes from the back-end storage capacity nodes, with NVMe-over-Ethernet switches in between. This storage architecture design, while able to achieve great Read I/O speeds, does not necessarily translate into great Performance/Watt and Performance/RU. Many of these cluster components take up a lot of rack space and eat up a lot of power as well. In the long run, these costs add up, by a lot!
Other clustered data infrastructure solutions are more purpose-built, as actual engineered appliances. These are usually more compact, taking up less rack space and consuming less power as well. In the long run, the savings in power consumption costs add up too, as the sketch below illustrates.
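To make these two OKRs concrete, here is a minimal sketch comparing a hypothetical disaggregated cluster against a hypothetical engineered appliance. Every throughput, power and rack-unit figure here is invented for illustration; substitute real vendor numbers when you run your own evaluation.

```python
from dataclasses import dataclass

@dataclass
class StorageCluster:
    name: str
    read_gbps: float    # aggregate read throughput in GB/s
    power_watts: float  # total cluster power draw
    rack_units: int     # total rack units consumed

    @property
    def perf_per_watt(self) -> float:
        return self.read_gbps / self.power_watts

    @property
    def perf_per_ru(self) -> float:
        return self.read_gbps / self.rack_units

# Hypothetical figures for illustration only; not vendor benchmarks.
candidates = [
    StorageCluster("Disaggregated cluster", read_gbps=400.0,
                   power_watts=28_000.0, rack_units=40),
    StorageCluster("Engineered appliance",  read_gbps=350.0,
                   power_watts=12_000.0, rack_units=16),
]

for c in candidates:
    print(f"{c.name:24s} {c.perf_per_watt * 1000:6.1f} MB/s per W   "
          f"{c.perf_per_ru:6.1f} GB/s per RU")
```

Note how, in this made-up comparison, the appliance wins on both Performance/Watt and Performance/RU despite a lower headline throughput; this is exactly why raw Read I/O speed alone is a misleading OKR.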
Security
The rise of AI is fueled by data. Data has to be consolidated, prepared and enriched, and models trained on it, before anything is put into production. GenAI and now Agentic AI will repeat many of the sequences in the AI Data Pipeline, driving data into many non-linear excursions and extensions in the data paths.
This means that the data has to be secured wherever it is exposed, within the shared AI data infrastructure and outside of it in cloud service providers as well. The visibility of curated data for AI is lacking because, as I have seen time and time again in this hyper-exciting AI frenzy, many are putting the cart before the horse. I will repeat it: we must put Data Governance in place for AI projects to succeed.
However, security is a vast subject matter in both AI and enterprise infrastructure. It is hard to put down a specific OKR for security in the AI Data Infrastructure space because of security’s expansive reach, especially while trying to optimize the sharing of physical infrastructure resources across various premises.
At the AI compute level, several GPU sharing technologies have been jousting to become the standardized form of resource sharing while ensuring some level of isolation and segmentation with access control. Some have security baked into them, and some rely on external methods to augment their own security techniques.
Because the domain of security is extensive, it is good to narrow the objectives of the OKR in this space. I would frame this OKR as securing the multi-tenant integration of the storage volumes (in Lustre®’s case, the secure access to the parallel filesystem namespace and shared directories) with the GPU compute nodes in the cluster.
First, this helps create targeted, more granular access control, and second, it also tightens the security boundary around the logical storage resources presented to the GPU clusters.
One common implementation I have often seen in NVIDIA® clusters working with storage resources is to mount logically separated storage volumes on shared storage infrastructure with access control and security. This involves some form of multi-tenancy, authentication of GPU nodes to the storage services, encryption, and secure grouping of mapped identities to access the segmented storage volumes.
Lustre®, and by extension DDN EXAScaler, handles client (the GPU compute nodes) and server communications through Lustre®’s networking protocol, LNET. Only EXAScaler clients permitted through the configuration of NIDs (network identifiers) are allowed to connect to the DDN parallel filesystem services. This is further strengthened through Shared-Secret Keys (SSKs), which prevent spoofed NIDs from non-authorized clients. The security model of DDN EXAScaler through Lustre® is well documented.
This is achieved through 2 key components – Nodemaps and Lustre® sub-directory mounts.
More details of how nodemapping and sub-directory mounts are architected and implemented are well articulated in this DDN blog.
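To give a flavour of what this looks like operationally, here is a minimal sketch of the pattern using Lustre®’s lctl nodemap commands, wrapped in Python for readability. The tenant name, NID range, filesystem name and fileset path are all hypothetical, and exact command syntax varies across Lustre® releases, so treat this as an outline of the pattern rather than a copy-paste recipe; the DDN blog above is the authoritative reference.

```python
def lctl(args: str) -> None:
    """Print an lctl command; on a real Lustre MGS node you would execute it."""
    print(f"lctl {args}")

# Hypothetical tenant: GPU compute nodes 10.10.1.[10-19] on the tcp LNET
# network, confined to the /tenant-a sub-directory of the "exafs" filesystem.
TENANT = "tenant_a"
NID_RANGE = "10.10.1.[10-19]@tcp"
FILESET = "/tenant-a"

lctl("nodemap_activate 1")                                      # turn nodemap enforcement on
lctl(f"nodemap_add {TENANT}")                                   # create the tenant's nodemap
lctl(f"nodemap_add_range --name {TENANT} --range {NID_RANGE}")  # bind it to these client NIDs
lctl(f"nodemap_modify --name {TENANT} --property admin --value 0")    # no root privilege
lctl(f"nodemap_modify --name {TENANT} --property trusted --value 0")  # squash client identities
lctl(f"nodemap_set_fileset --name {TENANT} --fileset {FILESET}")      # jail tenant to sub-directory
```

On the GPU client side, this pairs with a sub-directory mount along the lines of `mount -t lustre mgsnode@tcp:/exafs/tenant-a /mnt/tenant-a`, so that a tenant’s compute nodes can neither see nor traverse other tenants’ directories in the shared namespace.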
ROI.
Return on investment is a tricky business, especially in AI right now. In a Gartner survey last year (survey below), many organizations were struggling to “estimate and demonstrate AI value“.
This, to me, means that the value of AI projects is still a case of “shifting the goalposts”, and I believe many organizations are still rushing into AI (according to CIO.com) with too little prudence in developing the OKRs needed to make their projects and use cases successful. Companies are still struggling with the ROI of AI projects.
This is not helped by marketers calling for “AI-First” now. Remember the Cloud-First mentality?
Pause and define.
That is why, as storage professionals (me, a Storage Dinosaur), we must advise organizations to take their time to evaluate vendors’ technology. Separate the wheat from the chaff; remove the marketing speak that focuses on selling features. It is important to pause and define the outcomes that will match the AI investments.
And as data fuels AI, the choice of a proven technology is more important than ever. I have evaluated DDN and others, and I have made the call that DDN possesses strong storage OKRs to bring the best value to AI Data Infrastructure projects. Look for outcomes. That is what OKRs are about.
[ DDN news Jan 2025: Blackstone invests $300 million at $5 billion valuation in DDN, AI and Data Intelligence solutions leader, to fuel further rapid growth. ]