I have known of RDMA (Remote Direct Memory Access) for quite some time, but never in depth. Since my contract work ended last week and I have some time off for personal development, I decided to look deeper into RDMA. Why RDMA?
Over the past year or so, RDMA has been appearing on my radar very frequently, and rightly so. The speedy development and adoption of NVMe (Non-Volatile Memory Express) have pushed All Flash Arrays to the next level. This moves the I/O and throughput bottlenecks away from the NVMe storage medium and into the legacy world of SCSI.
Most storage interfaces and protocols in use today, like SAS, SATA, iSCSI and Fibre Channel, still carry SCSI (or, in SATA's case, ATA) commands and would have to translate to and from NVMe. NVMe-to-SCSI bridges have to be present to perform that translation.
In the slide below, shared at the Flash Memory Summit, numerous red boxes mark the connections and interfaces where SCSI-to-NVMe translation (and vice versa) would be required.
Thus, the NVMe envelope is now being pushed into the NVMeF/NVMeOF (NVMe over Fabrics) space. The pendulum is swinging back to the network after a few years of vendors pushing Server SAN/VSAN over the networked storage of SANs and NASes.
I spoke about NVMe being the great balancer a few blog entries ago. I truly believe that, after decades of overlaying and underlaying the piping and the networking for SCSI, NVMe and NVMeF will simplify the server-based and network-based storage landscape and bring equilibrium and peace. 😉
However, in order for NVMe to work seamlessly and transparently within the server and the network, a very low latency and very high throughput link or transport protocol is required. RDMA has emerged as the strongest foundational framework to enable low latency and high throughput network I/O.
Several network link and transport protocols are built on RDMA. The prevalent ones are InfiniBand (IB), RDMA over Converged Ethernet version 2 (RoCEv2) and Internet Wide Area RDMA Protocol (iWARP).
The beauty of RDMA is that its delivery and command channels, by their inherent nature, bypass the operating system’s kernel and thus largely bypass the host CPU as well. With this zero-copy offloading, buffer-to-buffer copying and layer-by-layer interfacing are removed or greatly reduced (depending on whether it is iWARP, iSER, SRP or something else), and latency drops accordingly, giving RDMA a tremendous advantage over traditional TCP/IP networks.
Not to be left behind, Fibre Channel has also evolved into a link and transport protocol for NVMeF (FC-NVMe), although as far as I can tell it is not RDMA-based.
RDMA requires RDMA-enabled NICs: RNICs for iWARP and RoCE-capable NICs for RoCE. In InfiniBand, RDMA is already built into the IB network of switches, cables, HCAs (Host Channel Adapters) and QSFP/SFP connectors.
To understand RDMA, we must first understand that RDMA starts by registering a specific segment of the application’s memory, called the Memory Region. Registration is akin to “pinning the memory” and “fencing the application’s memory” by informing the OS kernel, and it happens at the corresponding receiving host as well. The RDMA kernel driver works with the OS kernel to establish and structure both the Command Channel and the Data Channel through a series of “actions” called verbs.
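To make the registration step a little more concrete for myself, here is a minimal Python sketch. It is a toy model and not the verbs API: `MemoryRegion`, `register_memory` and `remote_write` are names I made up for illustration. The one thing borrowed from the real world is the idea that registration (`ibv_reg_mr` in libibverbs) hands back a local key and a remote key that guard access to the region. Nothing is actually pinned here.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical key generator; real adapters assign keys themselves.
_key_counter = count(start=0x1000)


@dataclass
class MemoryRegion:
    """Toy stand-in for a registered buffer."""
    buffer: bytearray
    lkey: int  # key the local side would quote in its own work requests
    rkey: int  # key handed to the remote peer so it may access the region


def register_memory(buffer: bytearray) -> MemoryRegion:
    """Pretend to register `buffer` and return the keys that guard it."""
    return MemoryRegion(buffer=buffer,
                        lkey=next(_key_counter),
                        rkey=next(_key_counter))


def remote_write(mr: MemoryRegion, rkey: int, offset: int, data: bytes) -> None:
    """Place `data` into the region only if the caller presents the right rkey."""
    if rkey != mr.rkey:
        raise PermissionError("rkey does not match the registered region")
    if offset < 0 or offset + len(data) > len(mr.buffer):
        raise ValueError("write falls outside the registered region")
    mr.buffer[offset:offset + len(data)] = data


# Usage: register a 16-byte buffer, then write into it using its remote key.
region = register_memory(bytearray(16))
remote_write(region, region.rkey, 0, b"hello")
```

The point of the sketch is only the "fencing": a write that presents the wrong key, or that runs past the end of the registered buffer, is refused.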
Once the Command Channel and the Data Channel have been established, the communication paths and sending/receiving “ports” are created. The “ports” are called Queues, and 3 communication Queues are created.
- The Send Queue and the Receive Queue are always created together as a pair, called the Queue Pair (QP), and are used for scheduling work.
- Each of those two queues is a Work Queue, a list of job instructions. Work instructions in a Work Queue are called Work Queue Elements (WQEs), or “Wookies”.
- The Completion Queue is a list of completed jobs. Completed work elements are … you guessed it: Completion Queue Elements (CQEs), or “Cookies”.
Each WQE contains a pointer to a buffer. In the Send Queue, the pointer is to the message to be sent. In the Receive Queue, the pointer is to a designated memory buffer where the incoming message is to be placed.
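The Send Queue and Receive Queue behaviour described above can be sketched in a few lines of Python. Again, this is a toy model under my own assumptions: `WQE`, `QueuePair`, `post_send`, `post_recv` and `deliver_one` are illustrative names, and `deliver_one` stands in for what the adapter hardware would do on the wire.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WQE:
    """Work Queue Element: one unit of work, pointing at a buffer."""
    wr_id: int         # caller-chosen identifier for this request
    buffer: bytearray  # send: the message; receive: where to place it


@dataclass
class QueuePair:
    """A Send Queue and a Receive Queue, always created together."""
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def post_send(self, wr_id: int, message: bytes) -> None:
        self.send_queue.append(WQE(wr_id, bytearray(message)))

    def post_recv(self, wr_id: int, buffer: bytearray) -> None:
        self.recv_queue.append(WQE(wr_id, buffer))


def deliver_one(sender: QueuePair, receiver: QueuePair) -> int:
    """Simulated adapter: pair the oldest send WQE with the oldest receive WQE.

    Returns the number of bytes placed into the receiver's buffer.
    """
    send_wqe = sender.send_queue.popleft()
    recv_wqe = receiver.recv_queue.popleft()
    n = len(send_wqe.buffer)
    if n > len(recv_wqe.buffer):
        raise ValueError("receive buffer is too small for the message")
    recv_wqe.buffer[:n] = send_wqe.buffer
    return n


# Usage: the receiver posts a landing buffer before the sender posts a message.
host_a, host_b = QueuePair(), QueuePair()
landing_zone = bytearray(16)
host_b.post_recv(wr_id=1, buffer=landing_zone)
host_a.post_send(wr_id=7, message=b"hello rdma")
placed = deliver_one(host_a, host_b)
```

Note that the message lands directly in the buffer the receiving application posted, which is the zero-copy idea in miniature.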
The diagram below (from the ZCopy blog) wonderfully depicts all 3 Queues.
Once the job transaction between the Sender and the Receiver has been completed, a CQE “cookie” is created and placed into the Completion Queue, and the application is notified that the job is done. (It feels similar to the queue-and-doorbell design of NVMe, where completions are signalled with MSI-X interrupts.)
Because work is posted to queues and completions are collected later, RDMA is also asynchronous, giving it a tremendous ability to scale. And best of all, RDMA has been percolating and maturing in the HPC (High Performance Computing) and parallel computing arena for the past decade.
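Here is how I picture the completion side and the asynchrony, as a small Python sketch. It is a toy model with made-up names (`CQE`, `CompletionQueue`, `simulate_adapter`); the only idea borrowed from real verbs is that polling a completion queue (`ibv_poll_cq` in libibverbs) returns whatever has finished so far, up to a limit, without blocking.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CQE:
    """Completion Queue Element: reports that one work request has finished."""
    wr_id: int   # identifies which work request completed
    status: str  # always "success" in this toy model


class CompletionQueue:
    def __init__(self) -> None:
        self._entries = deque()

    def push(self, cqe: CQE) -> None:
        """Called by the simulated adapter when a job is done."""
        self._entries.append(cqe)

    def poll(self, max_entries: int) -> list:
        """Return up to `max_entries` completions without blocking."""
        polled = []
        while self._entries and len(polled) < max_entries:
            polled.append(self._entries.popleft())
        return polled


def simulate_adapter(pending_wr_ids: deque, cq: CompletionQueue) -> None:
    """Hypothetical adapter: completes every pending request, in order."""
    while pending_wr_ids:
        cq.push(CQE(wr_id=pending_wr_ids.popleft(), status="success"))


# The application posts three requests back to back and does not wait on any.
pending = deque([101, 102, 103])
cq = CompletionQueue()
early = cq.poll(8)             # nothing has completed yet; returns immediately
simulate_adapter(pending, cq)  # the "hardware" works through the queue
first = cq.poll(2)             # collect at most two completions
rest = cq.poll(8)              # collect whatever is left
```

The application is free to do other work between posting and polling, and that is where the scaling comes from.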
RDMA seems to have all the important checkboxes ticked. It is low latency and high throughput, it bypasses the kernel and the CPU, and it is mature. It is definitely ready as the premier transport framework for NVMeF.
NOTE: I admit that deep knowledge of RDMA is still fairly new to me. I took to writing this blog as part of my learning process, and I invite feedback and input.