I have been earnestly observing the growth of Computational Storage for a number of years now. It was known by several previous names, with the name “in-situ data processing” stuck with me the most. The Computational Storage nomenclature became more cohesive when SNIA® put together the CMSI (Compute Memory Storage Initiative) some time back. This initiative is where several standards bodies, the major technology players and several SIGs (special interest groups) in SNIA® collaborated to advance Computational Storage segment in the storage technology industry we know of today.
The use cases for Computational Storage are burgeoning, and the functional implementations of Computational Storage are becoming vital to tackle the explosive data tsunami. In 2018 IDC, in its Worldwide Global Datasphere Forecast 2021-2025 report, predicted that the world will have 175 ZB (zettabytes) of data. That number, according to hearsay, has been revised to a heady figure of 250ZB, given the superlative rate data is being originated, spawned and more.
Computational Storage driving factors
If we take the Computer Science definition of in-situ processing, Computational Storage can be distilled as processing data where it resides. In a nutshell, “Bring Compute closer to Storage“. This means that there is a processing unit within the storage subsystem which does not require the host CPU to perform processing. In a very simplistic manner, a RAID card in a storage array can be considered a Computational Storage device because it performs the RAID functions instead of the host CPU. But this new generation of Computational Storage has much more prowess than just the RAID function in a RAID card.
There are many factors in Computational Storage that make a lot sense. Here are a few:
Voluminous data inundate the centralized architecture of the cloud platforms and the enterprise systems today. Much of the data come from end point devices – mobile devices, sensors, IoT, point-of-sales, video cameras, et.al. Pre-processing the data at the origin data points can help filter the data, reduce the size to be processed centrally, and secure the data before they are ingested into the central data processing systems
Real-time processing of the data at the moment the data is received gives the opportunity to create the Velocity of Data Analytics. Much of the data do not need to move to a central data processing system for analysis. Often in use cases like autonomous vehicles, fraud detection, recommendation systems, disaster alerts etc require near instantaneous responses. Performing early data analytics at the data origin point has tremendous advantages.
Moore’s Law is waning. The CPU (central processing unit) is no longer the center of the universe. We are beginning to see CPU offloading technologies to augment the CPU’s duties such as compression, encryption, transcoding and more. SmartNICs, DPUs (data processing units), VPUs (visual processing units), GPUs (graphics processing units), etc have come forth to formulate a new computing paradigm.
Freeing up central resources with Computational Storage also accelerates the overall distributed data processing in the whole data architecture. The CPU and the adjoining memory subsystem are less required to perform context switching caused by I/O interrupts as in most of the compute/storage architecture today. The total effect relieves the CPU and giving back more CPU cycles to perform higher processing tasks, resulting in faster performance overall.
The rise of memory interconnects is enabling a more distributed computing fabric of data processing subsystems. The rising CXL (Compute Express Link™) interconnect protocol, especially after the Gen-Z annex, has emerged a force to be reckoned with. This rise of memory interconnects will likely strengthen the testimony of Computational Storage in the fast approaching future.
Data movement is expensive. Not just costs, but also latency and resources as well. Thus there were many narratives to move compute closer to where the data is stored because moving compute is definitely more economical than moving data. I borrowed the analogy of the 2 animals from some old NetApp® slides which depicted storage as the elephant, and compute as birds. It was the perfect analogy, because the storage is heavy and compute is light.
“Close up of a white Great Egret perching on top of an African Elephant aa Amboseli national park, Kenya”
Before the animals representation came about I used to use the term “Data locality, Data Mobility“, because of past work on storage technology in the Oil & Gas subsurface data management pipeline.
Take stock of your data movement
I had recent conversations with an end user who has been paying a lot of dollars keeping their “backup” and “archive” in AWS Glacier. The S3 storage is cheap enough to hold several petabytes of data for years, because the IT folks said that the data in AWS Glacier are for “backup” and “archive”. I put both words in quotes because they were termed as “backup” and “archive” because of their enterprise practice. However, the face of their business is changing. They are in manufacturing, oil and gas downstream, and the definitions of “backup” and “archive” data has changed.
For one, there is a strong demand for reusing the past data for various reasons and these datasets have to be recalled from their cloud storage. Secondly, their data movement activities still mimicked what they did in the past during their enterprise storage days. It was a classic lift-and-shift when they moved to the cloud, and not taking stock of their data movements and the operations they ran on these datasets. Still ongoing, their monthly AWS cost a bomb.
Actually, Edge Computing is already here. It has been here on everyone’s lips for quite some time, but for me and for many others, Edge Computing is still a hodgepodge of many things. The proliferation of devices, IoT, sensor, end points being pulled into the ubiquitous term of Edge Computing has made the scope ever changing, and difficult to pin down. And it is this proliferation of edge devices that will generate voluminous amount of data. Obvious questions emerge:
How to do you store all the data?
How do you process all the data?
How do you derive competitive value from the data from these edge devices?
How do you securely transfer and share the data?
From the storage technology perspective, it might be easier to observe what are the traits of the data generated on the edge device. In this blog, we also observe what could some new storage technologies out there that could be part of the Edge Computing present and future.
Edge Computing overview – Cloud to Edge to Endpoint
Storage at the Edge
The mantra of putting compute as close to the data and processing it where it is stored is the main crux right now, at least where storage of the data is concerned. The latency to the computing resources on the cloud and back to the edge devices will not be conducive, and in many older settings, these edge devices in factory may not be even network enabled. In my last encounter several years ago, there were more than 40 interfaces, specifications and protocols, most of them proprietary, for the edge devices. And there is no industry wide standard for these edge devices too.