Accelerated Data Paths of High-Performance Storage are the Cornerstone of Building AI

It has been two months since I started my new role at DDN as a Solutions Architect. With many revolving doors around me, I have been trying to find the essence, the critical cog of the data infrastructure that supports the accelerated computing of Nvidia GPU clusters. The more I read and engaged, the more a pattern emerged. I found that cog in the supercharged data paths between the storage infrastructure and the GPU clusters. Let me share more.

To set the context, let me start with a wonderful article I read in CIO.com back in July 2024, titled “Storage: The unsung hero of AI deployments”. It was music to my ears because, as a long-time practitioner in the storage technology industry, I believe it is time the storage industry gets the credit it deserves.

What is the data path?

To put it simply, a data path, in a storage context, is the communication route taken by the data bits between the compute system’s processors and program memory and the storage subsystem. The links and the established sessions can be internal to the system, such as over the PCIe bus, or external to the system through the shared networking infrastructure.

High-speed accelerated data paths

In the world of accelerated computing, such as AI and HPC, there are additional, more advanced technologies that create even faster delivery of the data bits. These are the accelerated data paths between the compute nodes and the storage subsystems. In the sections that follow, I share a few of these technologies that are less commonly used in the enterprise storage segment.

The RDMA networking

It is obvious that we cannot have dedicated 1:1 compute-node-to-storage connections. Such dedicated direct-attached storage (DAS) may be useful in a lab setting, but the days of DAS are gone, even in the era of NVMe. NVMe-over-TCP is challenging the long-time king of SANs (storage area networks), Fibre Channel. FC, not wanting to be left out, already supports NVMe-over-Fibre Channel, just carrying a different payload. Thus, while Ethernet-TCP/IP and Fibre Channel dominate shared storage networking, RDMA (remote direct memory access) dominates the networking of the accelerated data paths of AI and HPC. Years ago, I already called out The Rise of RDMA in a blog I wrote in 2017.

RDMA is widely deployed, either as part of InfiniBand networking or with RoCEv2 (RDMA over Converged Ethernet). RDMA offers ultra-low latency and high throughput relative to traditional TCP/IP over Ethernet, and even Fibre Channel networks.

It is this very fast networking infrastructure and protocol that delivers a direct data placement path into the memory of the communicating systems, bypassing the inefficient buffer copies that network packets would otherwise incur as they traverse the layers of the underlying networking stack.
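
To illustrate what direct data placement means in practice, here is a minimal sketch using the libibverbs API in C. It only registers a buffer so that the RDMA NIC can DMA straight into application memory; the queue pair setup, connection exchange and error handling are omitted, and the buffer size is an arbitrary placeholder.

```c
/* Minimal libibverbs sketch: register a buffer for RDMA so the NIC can
 * place data directly into application memory (connection setup omitted). */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Enumerate RDMA-capable devices (InfiniBand or RoCE NICs). */
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Pin and register 1 MiB of memory so the NIC can DMA into it,
     * bypassing kernel socket buffers entirely. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* The rkey and buffer address are exchanged with the peer out-of-band;
     * the peer can then issue RDMA READ/WRITE straight into this memory. */
    printf("registered %zu bytes, rkey=0x%x addr=%p\n", len, mr->rkey, buf);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}
```

Once the rkey and the buffer address are shared with a peer, that peer can issue RDMA READ and WRITE operations that land directly in this memory, with no intermediate kernel buffer copies along the way.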

Lustre Persistent Client Cache (LPCC)

Client-side caching is common in storage infrastructure and networks. It keeps frequently accessed and most recently used data pages and blocks close to the compute node that uses them. It exists in some small part in SMB oplocks and the NFS client cache, helping deliver lower latencies than fetching the pieces of data again over the network. There are plenty of other client-side caching implementations as well, some within the filesystem in kernel or user space and some within the application itself.

The challenge of client-side caching is coherency. Keeping things consistent and semi in-sync with the rest of the shared storage ecosystem demands high-performance locking and unlocking, as well as ensuring that data integrity is preserved at all times. Most client-side caching is for reads, but write caches are available too, albeit harder to implement and maintain at scale for coherency reasons.
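
To make the idea concrete, here is a conceptual sketch in C of a client-side read cache, in the spirit of what a persistent client cache does. This is not the Lustre code, and the paths are hypothetical placeholders: serve reads from a local NVMe copy when one exists, and stage the file in from the shared filesystem on a miss.

```c
/* Conceptual sketch only, not the Lustre/LPCC code: serve reads from a
 * local NVMe copy when one exists; on a miss, stage the file in from the
 * shared filesystem. The paths below are hypothetical. */
#include <stdio.h>

#define SHARED_FS  "/mnt/lustre/dataset.bin"   /* shared parallel filesystem */
#define LOCAL_NVME "/nvme-cache/dataset.bin"   /* local NVMe cache on the client */

static FILE *cached_open(void)
{
    FILE *local = fopen(LOCAL_NVME, "rb");
    if (local)
        return local;                           /* cache hit: local NVMe latency */

    /* Cache miss: copy the file from shared storage onto the local NVMe. */
    FILE *src = fopen(SHARED_FS, "rb");
    FILE *dst = fopen(LOCAL_NVME, "wb");
    if (!src || !dst) {
        if (src) fclose(src);
        if (dst) fclose(dst);
        return NULL;
    }

    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, dst);

    fclose(src);
    fclose(dst);
    return fopen(LOCAL_NVME, "rb");             /* subsequent reads stay local */
}

int main(void)
{
    FILE *f = cached_open();
    /* The hard part, which LPCC handles inside Lustre, is coherency:
     * invalidating or restoring this local copy when the shared file
     * changes, so that no client ever reads stale data. */
    if (f)
        fclose(f);
    return 0;
}
```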

The venerable Lustre® filesystem, the distributed parallel filesystem that powers the DDN Exascaler storage platforms, also has a client-side caching mechanism known as Lustre® Persistent Client Cache (LPCC). You can read the LPCC paper presented at the Supercomputing conference in 2019, kept in the ACM Digital Library archives. I have read that PDF multiple times to understand LPCC more deeply, and I hope to provide a deeper view of LPCC in a future blog.

Meanwhile, I found out that LPCC was used to speed up the benchmark of the once fastest supercomputer in the world, the Sunway TaihuLight. You can read about the LPCC prowess in a research paper here and from the National Supercomputing Center, Wuxi, China here. Below is an early implementation architecture of the LPCC technology in Lustre.

[Figure: Lustre Persistent Client Cache architecture, 2018]

The read caching part of LPCC was further enhanced by DDN and became the Exascaler Hot Nodes feature. That went on to power the Nvidia® Eos supercomputer, ranked number 9 on the Top500 list in November 2023. This further solidifies the legend of DDN Exascaler Lustre at the altars of accelerated computing, AI and HPC.

GPUDirect® Storage

Nvidia® knew early on that it could not achieve hypergrowth if its GPU clusters did not scale in performance with shared networked storage. It announced its Magnum IO technology back in 2019, and I also blogged about it naively in 2021. All storage performance technologies awaken my enthusiasm, and Magnum IO certainly fired up my storage turbo boosters.

In a nutshell, GPUDirect® Storage creates and maintains a direct communication path between the GPU processors and their associated memory and the shared high-performance network storage, bypassing the need for the CPU and its staging memory subsystem (known as a bounce buffer) to load the I/O. This is described very well in the diagram below:

[Figure: Comparing the GPUDirect Storage path with the CPU + bounce buffer path]

This skips the need for CPU cycles to perform a POSIX file read operation, and the extra hop of copying from the bounce buffer to the GPU memory subsystem with a cudaMemcpy operation. The application within the Nvidia® node, with the right Nvidia libraries, is able to perform a cuFileRead operation directly against the shared network storage. It opens up a low-latency, accelerated data path between the GPUs and the storage.
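
Below is a minimal sketch of that flow using Nvidia’s cuFile API from the GPUDirect Storage stack. It assumes a GDS-capable driver and filesystem mount; the file path and transfer size are placeholders, and error checking is omitted for brevity.

```c
/* Minimal GPUDirect Storage sketch with the cuFile API: read a file from
 * shared storage directly into GPU memory. Error checks omitted; the path
 * and size are placeholders. Link with -lcufile -lcudart. */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const size_t len = 1 << 20;                    /* 1 MiB, for illustration */
    const char *path = "/mnt/lustre/dataset.bin";  /* hypothetical file on shared storage */

    cuFileDriverOpen();                            /* initialize the GDS driver */

    int fd = open(path, O_RDONLY | O_DIRECT);      /* O_DIRECT bypasses the page cache */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    void *dev_buf;
    cudaMalloc(&dev_buf, len);                     /* destination buffer in GPU memory */
    cuFileBufRegister(dev_buf, len, 0);            /* register it for DMA */

    /* Read straight from shared storage into GPU memory: no CPU-side
     * bounce buffer, no cudaMemcpy hop. */
    ssize_t nread = cuFileRead(fh, dev_buf, len, 0 /* file offset */, 0 /* buffer offset */);
    (void)nread;

    cuFileBufDeregister(dev_buf);
    cudaFree(dev_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

The key point is that cuFileRead() lands the data in dev_buf, which lives in GPU memory, without ever staging it in a host-side bounce buffer or issuing a cudaMemcpy.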

Storage – The unsung hero of AI deployments

There are indeed many others besides DDN Exascaler Hot Nodes (as in Lustre® Persistent Client Cache), Nvidia® GPUDirect® Storage and, of course, RDMA networking technology. I do not know every one of them, nor have I learned them all. But I hope this blog provides a glimpse of what the demands of client-side high-performance storage look like, and the advanced technologies that serve them. These are hardly ever features in an enterprise storage solution.

Two months ago, that fantastic article came up in CIO.com: “Storage: The unsung hero of AI deployments”. I am not going to recite the content of the article, but one quote struck the right chord with me. It definitely tugged at my heartstrings because of the absolute truth in it.

“AI is so intense in terms of the amount of data that needs to be stored, and how rapidly these massive data sets need to be accessed,” 

As I said before in my 2018 blog, “Sexy HPC storage is all the rage”, I am very happy that, at this juncture, storage is at the center of accelerated computing, AI and HPC. Yeah, storage may be boring, but it is essential and indispensable to building AI.


