Rethinking Storage OKRs for AI Data Infrastructure – Part 1

[ Preamble: This analysis focuses on my own journey as I bring my past experiences into this new market segment called AI Data Infrastructure, and gain new ones along the way.

There are many elements of HPC (High Performance Computing) at play here. Speeds and feeds, features and functions crowd many conversations, as many enterprise storage vendors like them to, but these conversations are, in my opinion, secondary. There are more vital operational and technical elements that an organization has to consider prudently, vis-à-vis ROI (return on investment). They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.

I had to break this blog into two parts because it had become TL;DR-ish. This is Part 1. ]

I have just passed my 6-month anniversary with DDN. Coming into the High Performance Storage System (HPSS) market segment, with its strong focus on the Lustre® distributed parallel filesystem, I faced a steep learning curve. I spent over three decades in Enterprise Storage, working with some of the most advanced storage technologies in that market segment, and I had already developed my own approach to enterprise storage, based on A.P.P.A.R.M.S.C., which I formulated and have honed since 25 years ago.

The rapid adoption of Artificial Intelligence (AI) has created a technology paradigm shift and blurred many lines. It has also been evolving my thinking when it comes to storage for AI, shifting my own thoughts, opinions and experiences along with it.

AI has brought HPSS technologies like Lustre® in the DDN EXAScaler platform, proven in the Supercomputing world, to a new realm: the AI Data Infrastructure market segment. On the other side, many enterprise storage vendors aspire to supply the AI Data Infrastructure opportunities as well. This convergence draws from the top storage performers in Supercomputing, the likes of DDN, IBM® (through Storage Scale) and HPE® (through Cray, which, by the way, often uses the open-source Lustre® edition in its storage portfolio); from software-defined storage players such as WEKA, VAST Data and MinIO; and from enterprise storage array vendors such as NetApp®, Pure Storage® and Dell®.

[ Note that I take care not to name every storage vendor for AI because many either OEM, or repackage and rebrand, SDS technology into their gear, such as HPE® GreenLake for Files and Hitachi® IQ. You can Google to find out who the original vendors are for each. There are others as well. ]

In these 3 simplified categories (HPSS, SDS, Enterprise Storage Array), I have begun to see a pattern of each calling its technology an “AI Data Infrastructure”. At the same time, I am also developing a new set of storage conversations for the AI Data Infrastructure market segment, one based on OKRs (Objectives and Key Results) rather than the features, features and more features that many SDS and enterprise storage vendors like to tout. Here are a few things to look for when end users are considering a high-speed storage solution for their AI journey.

AI Data Infrastructure

GPU is king

In the AI world, the GPU infrastructure is the deity at the altar. GPU utilization is kept as high as possible to get the maximum compute infrastructure return-on-investment (ROI). Keeping the GPUs resolutely busy is a must, and HPSS is very much part of that ecosystem.
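To make that ROI point concrete, here is a back-of-the-envelope sketch of what idle GPU time costs. Every number in it (cluster size, hourly cost, utilization) is a hypothetical placeholder of mine, not a benchmark; the point is only that the idle fraction multiplies into real money:

```python
# Back-of-the-envelope cost of idle GPUs.
# All numbers are hypothetical placeholders; substitute your own.

gpus = 512                 # GPUs in the cluster (hypothetical)
cost_per_gpu_hour = 3.00   # amortized $/GPU-hour (hypothetical)
utilization = 0.70         # fraction of time the GPUs do useful work

hours_per_year = 24 * 365
idle_fraction = 1.0 - utilization

idle_cost = gpus * cost_per_gpu_hour * hours_per_year * idle_fraction
print(f"Idle GPU spend per year: ${idle_cost:,.0f}")
# With these placeholders: 512 * 3.00 * 8760 * 0.30 ≈ $4.0M per year
```

Every percentage point of utilization clawed back by a faster data infrastructure drops straight to that bottom line.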

These are a few OKRs I would consider for the storage or data infrastructure for AI.

  • Reliability
  • Speed
  • Power Efficiency
  • Security

Let’s look at each one of them from the point of view of a storage practitioner like me.
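To show what I mean by OKRs rather than feature checklists, here is a minimal sketch pairing each of the four objectives above with measurable key results. The objectives come from the list above; the specific metrics and targets are hypothetical illustrations of mine, to be set per project:

```python
# Hypothetical storage OKRs for an AI Data Infrastructure.
# Objectives come from the list above; the key results and their
# targets are illustrative placeholders, not prescriptions.

storage_okrs = {
    "Reliability": [
        "99.99% storage service availability over the quarter",
        "Zero data-loss incidents across all training runs",
    ],
    "Speed": [
        "Sustain the agreed GB/s of read throughput per GPU node",
        "Checkpoint and restore models within the agreed time budget",
    ],
    "Power Efficiency": [
        "Deliver the agreed GB/s per kW across the storage cluster",
    ],
    "Security": [
        "100% of data encrypted at rest and in flight",
        "All access audited, with monthly reviews",
    ],
}

for objective, key_results in storage_okrs.items():
    print(objective)
    for kr in key_results:
        print(f"  - {kr}")
```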

Continue reading

Accelerated Data Paths of High Performance Storage are the Cornerstone of Building AI

It has been 2 months into my new role at DDN as a Solutions Architect. With many revolving doors around me, I have been trying to find the essence, the critical cog of the data infrastructure that supports the accelerated computing of the Nvidia GPU clusters. The more I read and engaged, the more a pattern emerged. I found that cog in the supercharged data paths between the storage infrastructure systems and the GPU clusters. I will share more.

To set the context, let me start with a wonderful article I read on CIO.com back in July 2024, titled “Storage: The unsung hero of AI deployments”. It was music to my ears because, as a long-time practitioner in the storage technology industry, I believe it is time the storage industry gets the credit it deserves.

What is the data path?

To put it simply, a Data Path, from a storage context, is the communication route taken by the data bits between the compute system’s processors and program memory and the storage subsystem. The links and the established sessions can be internal to the system, across components such as the PCIe bus, or external to the system, through the shared networking infrastructure.
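As a simple way to see a data path at work, the sketch below (my own illustration) times a large sequential read and reports the effective throughput along the route from the storage subsystem into program memory. The file path is a placeholder, and a serious benchmark would defeat the page cache with direct I/O or a cold cache:

```python
import time

# Minimal sketch: time a sequential read to observe the effective
# throughput of the data path from storage into program memory.
PATH = "/data/sample.bin"   # placeholder file on the storage under test
CHUNK = 64 * 1024 * 1024    # read in 64 MiB chunks

total_bytes = 0
start = time.perf_counter()
with open(PATH, "rb", buffering=0) as f:
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total_bytes += len(buf)
elapsed = time.perf_counter() - start

print(f"Read {total_bytes / 1e9:.2f} GB in {elapsed:.2f} s "
      f"({total_bytes / 1e9 / elapsed:.2f} GB/s)")
```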

High speed accelerated data paths

In the world of accelerated computing such as AI and HPC, there are additional, more advanced technologies that create even faster delivery of the data bits. These are the accelerated data paths between the compute nodes and the storage subsystems. Below, I share a few of these technologies, which are less commonly used in the enterprise storage segment.
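One well-known example in this class (my example here; the full post covers its own list) is NVIDIA GPUDirect Storage, which moves data between storage and GPU memory without staging it in a host bounce buffer. Below is a minimal sketch using NVIDIA’s kvikio bindings with CuPy; the file path and size are placeholders, and the snippet is an assumption-laden illustration rather than any vendor’s reference code:

```python
import cupy
import kvikio

# Minimal sketch of a GPUDirect Storage-style read via kvikio.
# When GDS is available, the read DMAs directly into GPU memory;
# kvikio falls back to a POSIX path when it is not.
PATH = "/data/tensor.bin"    # placeholder file on high-speed storage
n = 256 * 1024 * 1024        # 256 Mi float32 elements (1 GiB)

gpu_buf = cupy.empty(n, dtype=cupy.float32)  # destination in GPU memory
with kvikio.CuFile(PATH, "r") as f:
    nbytes = f.read(gpu_buf)                 # storage -> GPU, no host copy
print(f"Read {nbytes / 1e9:.2f} GB directly into GPU memory")
```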

Continue reading

The All-Important Storage Appliance Mindset for HPC and AI projects

I am a strong believer in using the right tool to do the job right. I said this two years ago in my blog “Stating the case for a Storage Appliance approach”, written when I was working for an open source storage company. And I am an advocate of the crafter versus assembler mindset, especially in the enterprise and high-performance storage technology segments.

I have joined DDN, and even at DDN that same mindset does not change a bit. I have been saying all along that the storage appliance model should always be the mindset, for the business’s peace of mind.

My view of the storage appliance model began almost 25 years ago, when I came into the NAS systems world via Sun Microsystems®. Sun was famous for running NFS servers on general-purpose Sun Solaris servers: NFS services on Unix systems. Back then, I remember arguing with one of the Sun distributors about the tenets of running NFS over 100Mbit/sec Ethernet on Sun servers. I was drinking Sun’s Kool-Aid big time.

When I joined Network Appliance® (now NetApp®) in 2000, my worldview of putting software on general purpose servers changed. Network Appliance® had one product family, the FAS700 (720, 740, 760) family. All NetApp® did in the beginning was serve NFS. They were NAS filers and nothing else.

I was completely sold on the appliance way with NetApp®. Firstly, it was my very first time seeing network storage services provisioned with an appliance concept. This was different from Sun. I was used to managing NFS exports on a Sun SPARCstation 20 to Unix clients in the network.

Secondly, my mindset began to take the shape of “you have to have the right tool to do the job correctly and extremely well”. Well, the toaster toasts bread very well and nothing else. And the fridge (an analogy used by Dave Hitz, I think) does what it does very well too. That is what an appliance does. You definitely cannot grill a steak with a bread toaster, just like you cannot run excellent, ultra-high-performance storage services for demanding AI and HPC applications on a general-purpose server platform. You have to have a storage appliance solution for High-Speed Storage.

That little Network Appliance® toaster award given out to exemplary employees still stands vividly in my mind. The NetApp® tagline back then was “Fast, Simple, Reliable”. That solidified my mindset for high-speed storage in AI and HPC projects in present times.

DDN AI400X2 Turbo Appliance

Costs, Benefits and Risks

I like to think about what the end users are thinking about. There are investment costs involved, and along with them, risks to the investments as well as their benefits. Let’s simplify and lump them into a Cost-Benefit-Risk analysis triangle. These variables come into play in the decision making of AI and HPC projects, as the toy calculation below illustrates.
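As a toy illustration of that triangle, the sketch below folds risk into the cost-benefit math as an expected loss. Every figure is a hypothetical placeholder of mine:

```python
# Toy risk-adjusted view of the Cost-Benefit-Risk triangle.
# All figures are hypothetical placeholders, purely for illustration.

cost = 2_000_000          # acquisition plus operating cost ($)
benefit = 3_500_000       # expected value delivered over the term ($)
risk_probability = 0.15   # chance of a serious failure event
risk_impact = 4_000_000   # cost of that event if it happens ($)

expected_risk = risk_probability * risk_impact
net_value = benefit - cost - expected_risk
print(f"Expected risk cost: ${expected_risk:,.0f}")
print(f"Risk-adjusted net value: ${net_value:,.0f}")
# Here: 3.5M - 2.0M - 0.6M = $0.9M. A cheaper but less reliable
# option with a higher risk_probability can flip the decision.
```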

Continue reading