I have been reading a couple of articles over the weekend, both of which pinned the blame for stalled AI ambitions on outdated networking infrastructure. The two articles are:
- The AI Infrastructure bottleneck no one talks about (which turned out to be a not-so-subtle play for Netris, a secure multi-tenant network provisioning technology).
- Data Infrastructure: The missing link in successful AI adoption (a more subtle introduction of Indicium, an AI data services company).
I do not fully agree that networking infrastructure is the main inhibitor of AI ambitions per se, at least not from my experience and from what I have seen of the present development in high performance networking. In fact, AI networking infrastructure has been growing by leaps and bounds, laying down ultra-high-throughput plumbing between the GPUs (and, by extension, the AI models and applications up the stack) and the data storage infrastructure.
The NVIDIA-heavy GPU compute infrastructure is, of course, dominated by NVIDIA's own networking portfolio. Spectrum (Ethernet), Quantum (InfiniBand), BlueField (data processing units), ConnectX (network adapters) and LinkX (cables and transceivers) are the mainstays of DGX Cloud, and a big part of the NVIDIA Cloud Partners (NCPs) as well.

In fact, at one of DDN's NCP customers, I have seen a 10-node DDN EXAScaler cluster deliver almost 1.1TB/sec read and 750GB/sec write throughput to the GPU compute cluster, out of the box, all over 200Gbps networking gear.
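To put those numbers in perspective, here is a rough back-of-envelope sketch (my own arithmetic, not DDN's sizing math), assuming roughly 25GB/sec of usable bandwidth per 200Gbps link and linear scaling across links:

```python
# Back-of-envelope check of the aggregate bandwidth behind those numbers.
# Assumptions (mine, not DDN's): ~25 GB/s usable per 200 Gbps link,
# protocol overhead ignored, bandwidth scaling linearly across links.

LINK_GBPS = 200                  # per-port line rate, gigabits per second
USABLE_GBS = LINK_GBPS / 8       # ~25 GB/s per link

read_target_gbs = 1100           # ~1.1 TB/s aggregate read throughput
write_target_gbs = 750           # ~750 GB/s aggregate write throughput
nodes = 10                       # EXAScaler storage nodes in the cluster

links_for_read = read_target_gbs / USABLE_GBS
links_for_write = write_target_gbs / USABLE_GBS

print(f"Reads need ~{links_for_read:.0f} x 200Gbps links "
      f"(~{links_for_read / nodes:.1f} ports per storage node)")
print(f"Writes need ~{links_for_write:.0f} x 200Gbps links "
      f"(~{links_for_write / nodes:.1f} ports per storage node)")
```

The point of the sketch is simply that an aggregate of over 1TB/sec is well within reach of a handful of 200Gbps ports per storage node, which is why "out of the box" results like this are plausible with today's networking gear.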