Intelligent Data Movement and Data Placement dictate the future of AI Data Infrastructure

I read a couple of articles over the weekend which placed the weight of the blame for slowing down AI ambitions on outdated networking infrastructure. The two articles are:

I did not fully agree that networking infrastructure is the main inhibitor of AI ambitions per se, at least not from my own experience and what I know of the present developments in high performance networking. In fact, AI networking infrastructure has been growing by leaps and bounds, laying down ultra-high throughput plumbing between the GPUs (and, further up the stack, the AI models and applications) and the data storage infrastructure.

The NVIDIA-heavy GPU compute infrastructure is, of course, dominated by NVIDIA's own networking infrastructure. NVIDIA Spectrum (Ethernet), Quantum (InfiniBand), BlueField (data processing units), ConnectX (adapters) and LinkX (cables and optics) are the mainstays of DGX Cloud, and a big part of the NVIDIA Cloud Partners (NCPs) as well.

In fact, at one of DDN's NCP customers, I have seen a 10-node DDN EXAScaler cluster deliver almost 1.1TB/sec read and 750GB/sec write throughput to the GPU compute cluster, out of the box, all with 200Gbps networking gear.
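
As a quick sanity check of how such numbers add up, here is a back-of-envelope sketch that divides the aggregate throughput across the nodes and compares it against the raw capacity of a 200Gbps port. The throughput figures come from the observation above; the links-per-node implication is my own rough arithmetic, not the actual cluster design.

    # Back-of-envelope check: aggregate storage throughput vs. raw network link capacity.
    # The layout implied below is illustrative arithmetic, not the actual DDN cluster design.

    NODES = 10                      # storage nodes in the cluster
    READ_TB_PER_SEC = 1.1           # observed aggregate read throughput (TB/s)
    WRITE_GB_PER_SEC = 750          # observed aggregate write throughput (GB/s)
    LINK_GBPS = 200                 # per-port network speed (Gbps)

    link_gb_per_sec = LINK_GBPS / 8                 # 200 Gbps ~= 25 GB/s, ignoring protocol overhead
    read_per_node = READ_TB_PER_SEC * 1000 / NODES  # GB/s each node must serve
    write_per_node = WRITE_GB_PER_SEC / NODES

    print(f"Per-node read:  {read_per_node:.0f} GB/s "
          f"(~{read_per_node / link_gb_per_sec:.1f} x 200Gbps ports)")
    print(f"Per-node write: {write_per_node:.0f} GB/s "
          f"(~{write_per_node / link_gb_per_sec:.1f} x 200Gbps ports)")

In other words, each node has to push roughly 110GB/sec of reads, which is only achievable with several 200Gbps ports per node working in parallel. The network is clearly not the laggard here.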

The rise of networking in AI infrastructure

To further support my take on ultra-high performance networking for AI compute, here are a few recent and interesting technology developments that might change your mind.

There are many more AI networking technology developments making their mark in accelerating AI ambitions. I am not convinced that these are the bottlenecks.

It is not just storage speed now.

I have been seeing the AI infrastructure (compute, networking, storage) vendors innovating to hyper-increase data throughput, both for read I/Os and write I/Os. Many have done well in the read department. A well-designed data infrastructure (storage) cluster can easily deliver three-figure GB/sec speeds with standard 100Gbps Ethernet or higher.

Write throughput is a different story, and not all data storage vendors excel in this department. Many aggregate both read and write numbers to make the whole performance picture look aesthetically pleasing.

But the hardest part of data delivery from data storage to the AI compute infrastructure is not throughput, but latency.

I have observed that many vendors are trying to collapse the service round trip between GPUs and the datasets, all in the name of reducing latency.

What is Storage Latency?

A Google search with AI Overview (copied verbatim) reveals that “Storage latency refers to the time delay between a storage device receiving a data request and the completion of that request, encompassing both read and write operations. It’s a critical performance metric in data storage, directly impacting application responsiveness and overall system efficiency. Lower latency means faster access to data, which is crucial for applications that require quick responses, like online transactions or streaming services.”
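
To make that definition concrete, here is a minimal sketch of my own (not from the quoted overview) that measures the per-request latency of small random reads against a file, which is roughly what storage latency looks like from an application's point of view. The file path and sizes are illustrative assumptions.

    # Minimal illustration of storage latency: time individual 4 KiB random reads
    # against a file and report average and tail (p99) latency.
    import os, random, time

    PATH = "/tmp/latency-test.bin"      # hypothetical test file path
    BLOCK = 4096
    REQUESTS = 1000

    # create a 100 MiB test file if it does not already exist
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(os.urandom(100 * 1024 * 1024))

    size = os.path.getsize(PATH)
    samples = []
    with open(PATH, "rb", buffering=0) as f:      # unbuffered, so every read goes to the OS
        for _ in range(REQUESTS):
            offset = random.randrange(0, size - BLOCK)
            start = time.perf_counter()
            f.seek(offset)
            f.read(BLOCK)
            samples.append((time.perf_counter() - start) * 1e6)   # microseconds

    samples.sort()
    print(f"avg: {sum(samples)/len(samples):.1f} us, "
          f"p99: {samples[int(0.99 * len(samples))]:.1f} us")

On a real system, the operating system's page cache will mask much of the device and network latency, but the shape of the measurement is the same: it is the per-request round trip, not the aggregate throughput, that matters.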

In the galumphing charge of AI into everything in technology now, latency has become a vital measure of AI inferencing's progress and success, from a data infrastructure point of view.

The AI Data Pipeline

In my work, I often bring up the AI Data Pipeline. It is about how data is sourced, procured, created and ingested into the AI data architecture, and how that data is prepared and enriched before it is used to train foundation and frontier models. Then we have to think about how the trained models are optimized and enhanced before they are deployed for inferencing and go into production. Below is an AI Data Pipeline I often use in my customer presentations.

To understand performance bottlenecks, especially around the data that is fueling the AI ecosystem, I look at data points within the AI Data Pipeline's plumbing, from the infrastructure perspective, relative to the high performance compute elements in the ecosystem.
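
One rough way to reason about those data points is to annotate each stage of the pipeline with its dominant I/O behaviour. The sketch below is my own generalisation of how these stages typically load the data infrastructure; it is illustrative, not taken from any particular vendor.

    # A generalised view of the AI Data Pipeline and the I/O behaviour each stage
    # tends to put on the data infrastructure (illustrative, not vendor-specific).
    AI_DATA_PIPELINE = [
        {"stage": "ingest",       "io_pattern": "large sequential writes",                  "sensitive_to": "write throughput"},
        {"stage": "prep/enrich",  "io_pattern": "mixed read-modify-write",                  "sensitive_to": "metadata ops, IOPS"},
        {"stage": "training",     "io_pattern": "massive random reads + checkpoint bursts", "sensitive_to": "read throughput, write bursts"},
        {"stage": "optimization", "io_pattern": "model read/rewrite cycles",                "sensitive_to": "throughput"},
        {"stage": "inference",    "io_pattern": "small reads, KV-cache lookups",            "sensitive_to": "latency"},
    ]

    for s in AI_DATA_PIPELINE:
        print(f"{s['stage']:<14} {s['io_pattern']:<45} -> {s['sensitive_to']}")

Notice how the sensitivity shifts from raw throughput at the front of the pipeline to latency at the inferencing end.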

What are some of them doing?

I see several storage-heavy vendors going deeper than just the networking infrastructure. The most obvious one is NVIDIA's GPUDirect Storage, part of the Magnum IO family. I wrote a little bit about Magnum IO back in 2021, and even with my shallow knowledge then, it was an early indicator of how NVIDIA software could be used to control data movement and data placement in order to achieve a higher magnitude of performance.
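
For readers who want to see what this looks like in code, here is a minimal sketch using the RAPIDS kvikio library, an open-source Python wrapper around the cuFile API that GPUDirect Storage exposes. It assumes a CUDA GPU, a GDS-capable filesystem, and the kvikio and cupy packages; the dataset path is an illustrative assumption.

    # Minimal sketch of a GPUDirect Storage style read: data moves from storage
    # into GPU memory via cuFile, bypassing the CPU bounce buffer where supported
    # (kvikio falls back to a POSIX compatibility path when GDS is unavailable).
    import cupy
    import kvikio

    PATH = "/mnt/dataset/shard-0000.bin"        # hypothetical dataset shard

    gpu_buffer = cupy.empty(64 * 1024 * 1024, dtype=cupy.uint8)   # 64 MiB buffer on the GPU

    f = kvikio.CuFile(PATH, "r")
    nbytes = f.read(gpu_buffer)                 # bytes land directly in GPU memory
    f.close()

    print(f"read {nbytes} bytes straight into GPU memory")

The point of the exercise is the data path: the fewer hops and copies between the storage device and GPU memory, the lower the latency the model sees.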

Techniques like data movement and data placement are not new. They have been used in supercomputing for decades, and in enterprise storage as well. The NetApp PAM (Performance Accelerator Module) card was one I was familiar with, and, earlier still, the Fusion-io flash memory modules, both of which used intelligent data caching (a way of controlling where the most frequently and most recently accessed data is placed). The ZFS filesystem likewise uses fast storage media (its L2ARC) to extend the ARC (Adaptive Replacement Cache) and improve read data serving response times.
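
The idea behind all of these can be illustrated in a few lines of code: keep the most recently used blocks on the fastest medium and evict the coldest. The sketch below is a plain least-recently-used (LRU) read cache of my own, far simpler than ZFS's ARC (which balances recency against frequency), but it shows the placement decision.

    # A deliberately simple least-recently-used (LRU) read cache, illustrating the
    # placement decision behind PAM cards, Fusion-io caching and ZFS ARC/L2ARC:
    # keep recently used blocks on the fastest medium, evict the coldest ones.
    from collections import OrderedDict

    class LRUReadCache:
        def __init__(self, capacity_blocks, backend_read):
            self.capacity = capacity_blocks
            self.backend_read = backend_read      # slow-tier read function
            self.cache = OrderedDict()            # block_id -> data, ordered by recency

        def read(self, block_id):
            if block_id in self.cache:            # cache hit: promote to most recent
                self.cache.move_to_end(block_id)
                return self.cache[block_id]
            data = self.backend_read(block_id)    # cache miss: fetch from the slow tier
            self.cache[block_id] = data
            if len(self.cache) > self.capacity:   # evict the least recently used block
                self.cache.popitem(last=False)
            return data

    # usage: wrap a (simulated) slow backend with the cache
    slow_tier = lambda block_id: f"data-for-block-{block_id}"
    cache = LRUReadCache(capacity_blocks=2, backend_read=slow_tier)
    for b in [1, 2, 1, 3, 1]:
        cache.read(b)
    print(list(cache.cache))     # [3, 1] -> block 2 was evicted as the coldest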

Another technique is to place the data (especially read data) closer to the compute. This is very much client-side data placement, as I have seen at several storage vendors. When I joined DDN, I was asked to understand the Lustre Persistent Client Cache deeply, a feature that found its way into the Sunway TaihuLight supercomputer (once the fastest in the world, between 2016 and 2018). More recently, Hammerspace introduced Tier 0, which turns local NVMe drives into an ultra-high performance data tier with ultra-low latency.
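
Neither Lustre PCC nor Hammerspace publishes internals in a form I can reproduce here, but the general client-side placement idea can be sketched simply: before handing a dataset file to the GPU node's data loader, stage it on the node's local NVMe drive and serve every subsequent read from there. The mount points below are illustrative assumptions, not any vendor's layout.

    # Sketch of client-side data placement: stage remote dataset files onto the
    # GPU node's local NVMe tier on first access, then serve reads locally.
    import os, shutil

    SHARED_TIER = "/mnt/shared-filesystem"    # remote/parallel filesystem mount
    LOCAL_TIER = "/mnt/local-nvme/cache"      # fast local NVMe drive on the client

    def open_with_local_placement(relative_path, mode="rb"):
        local_path = os.path.join(LOCAL_TIER, relative_path)
        if not os.path.exists(local_path):                        # first access: stage the file
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            shutil.copyfile(os.path.join(SHARED_TIER, relative_path), local_path)
        return open(local_path, mode)                              # later reads never leave the node

    # usage (assuming the shared file exists):
    # with open_with_local_placement("datasets/train/shard-0001.tar") as f:
    #     batch = f.read(1 << 20)

The real products do far more, of course: coherency with the shared namespace, write-back, eviction when the local drive fills up. But the latency win comes from the same placement decision.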

Weka recently introduced NeuralMesh. One of its features is the ability to control and address files with small block sizes, typically the metadata-heavy file types used in AI data. This is a data placement technique (I do not have the deep technical details yet), where small-block files are handled with greater efficiency and support. Incidentally, Weka has also introduced its Augmented Memory Grid, an integration with distributed inference engines such as NVIDIA Dynamo to provide a larger key-value (KV) cache memory footprint to support large context-aware processing.
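
I do not have Weka's implementation details, but the general pattern behind extending KV cache memory for inference can be sketched: when the in-memory key-value cache fills up, spill the least recently used entries to a fast external tier keyed by the prompt prefix they belong to, and pull them back on a hit instead of recomputing. The sketch below is purely my own illustration, with hypothetical paths.

    # Illustrative sketch of KV-cache spill/reload for inference (not Weka's design):
    # keep hot key-value entries in memory, spill cold ones to a fast tier,
    # and restore them on a hit instead of recomputing the prompt prefix.
    import hashlib, os, pickle
    from collections import OrderedDict

    class SpillableKVCache:
        def __init__(self, max_hot_entries, spill_dir="/mnt/fast-tier/kv"):
            self.hot = OrderedDict()              # prefix_key -> KV blocks (any object here)
            self.max_hot = max_hot_entries
            self.spill_dir = spill_dir            # hypothetical fast external tier
            os.makedirs(spill_dir, exist_ok=True)

        def _key(self, prompt_prefix):
            return hashlib.sha256(prompt_prefix.encode()).hexdigest()

        def put(self, prompt_prefix, kv_blocks):
            key = self._key(prompt_prefix)
            self.hot[key] = kv_blocks
            self.hot.move_to_end(key)
            while len(self.hot) > self.max_hot:                  # spill the coldest entry
                cold_key, cold_val = self.hot.popitem(last=False)
                with open(f"{self.spill_dir}/{cold_key}.pkl", "wb") as f:
                    pickle.dump(cold_val, f)

        def get(self, prompt_prefix):
            key = self._key(prompt_prefix)
            if key in self.hot:                                  # hot hit in memory
                self.hot.move_to_end(key)
                return self.hot[key]
            try:                                                 # miss in memory: try the fast tier
                with open(f"{self.spill_dir}/{key}.pkl", "rb") as f:
                    kv_blocks = pickle.load(f)
            except FileNotFoundError:
                return None                                      # truly cold: caller must recompute
            self.put(prompt_prefix, kv_blocks)                   # promote back into hot memory
            return kv_blocks

Whether the spill tier is a storage cluster, a memory grid or local NVMe, the win is the same: a fast-tier hit is far cheaper than recomputing a long context on the GPU.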

Others, like VAST Data, are integrating parts of the AI and machine learning (ML) data pipeline into their Data Platform, again to enhance its prowess in accelerating data and controlling how data movement and data placement work within their technology.

The use of BlueField-3 DPUs (data processing units) in vendor platforms like DDN Infinia, Weka on BlueField-3 and VAST Data on BlueField-3 is not just to curry favour with NVIDIA. DPUs are certainly taking over some of the repetitive and resource-intensive tasks of the CPUs, but they are also changing the paradigm of how data is processed. The BlueField-3 DPUs make data more secure, more performant and more tightly integrated with the NVIDIA ecosystem. This, again, is a data movement and data placement technique in play, all to hit the lowest storage latency possible.

Where is the Infrastructure Pendulum now?

I have been in the infrastructure industry long enough to know that performance bottlenecks never stay stuck in one place forever. The need to get the data, to move the data, to process the data and more swings like a pendulum among the three pieces of infrastructure – compute, network, storage. At this point, as the voluminous footprint of data grows exponentially, I would say storage, or in AI-speak, data infrastructure, is facing the greatest challenge. That is why the innovative techniques I have mentioned, and more that I have not covered, are all about collapsing the gap of data movement and data placement to the right compute processing, up the stack, across edge, core and cloud, and back.

Data is heavy. I describe it as the elephant of the three infrastructure layers, while in the compute infrastructure, data is as fleeting as birds. The networking infrastructure layer in between has gotten better, but storage professionals like myself understand data. It will be forever growing and forever demanding more. The GPUs have jumped ahead, and so have the networking pipes. Storage infrastructure is getting better as well, but it will be the intelligent data movement and data placement technologies that separate the data storage heroes from the wannabes.


