The hype around Deep Learning (DL), Machine Learning (ML) and Artificial Intelligence (AI) has reached an unprecedented frenzy. Every infrastructure vendor, from servers to networking to storage, has something to say, or a play to make, about DL/ML/AI. This prompted me to explore this hyped ecosystem from a storage perspective, notably from the point of view of storage performance requirements.
One question on my mind
There are plenty of questions on my mind, but one stood out: storage performance requirements.
Reading and learning from one storage technology vendor to another, everyone’s positioning against their competitors seems to be “They are archaic, they are legacy. Our architecture is built from the ground up, modern, NVMe-enabled“. There are more such juxtapositions, but you get the picture – “We are better, no doubt“.
Are the data patterns and behaviours of AI different? How do they affect the storage design as the data moves through the workflow, the data paths and the lifecycle of the AI ecosystem?
I met Tom Clark in 2000. He was a guest at Quantum’s event in Langkawi, a resort island up north of the Malaysian peninsula. NetApp and BakBone were the supporting sponsors, and I had the wonderful opportunity to speak with a refined gentleman in Tom. He was polite and soft-spoken, but he made an impression on me. He promoted his book “Designing Storage Area Networks” while I was preparing for my SNIA FC-SAN Professional Certification, a predecessor of the SNIA Certified Storage Professional (SCSP) program. That book became my bible for a few years until the second edition came out.
From his first book, I learned about applications and workloads, and how differing patterns and behaviours in data affected the design of storage platforms, the choice of medium (memory and hard disk drives), RAID and more. And this is how I will approach the subject here.
It was his thoughts and the ideas from his book that influenced and shaped the way I look at storage performance today. Sadly, Tom passed away in 2010.
The data patterns
Let us establish each component. Deep Learning is a subset of Machine Learning, which in turn is a subset of Artificial Intelligence. I am no expert in this vast and ever-changing field; my observations and my readings are helping me learn more deeply too. I have limited access to resources in storage, networking and compute infrastructure, and in data, applications and workloads. Therefore, I rely on my past experiences and knowledge to deduce what I am about to share. I welcome comments and feedback to help me (and others) learn more.
First, Deep Learning. We need to feed DL with heaps and heaps of data. Thus the DL workload pattern is likely to be ultra read-intensive and random, with small data sets, unless they are large images related to oil and gas, or ultra-high-resolution videos. I presume the write I/Os would be a much smaller ratio compared to the reads. DL workloads are possibly more storage-intensive than compute-intensive.
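To make that read-to-write skew concrete, here is a minimal sketch of a training ingest pattern: each epoch reads every sample in shuffled order (hence random reads), while writes are limited to an occasional checkpoint. All the figures are hypothetical, chosen purely for illustration.

```python
import random

# Hypothetical workload figures, for illustration only.
num_samples = 100_000       # training samples on shared storage
epochs = 5                  # full passes over the dataset
checkpoints_per_epoch = 1   # one checkpoint write per epoch

reads = 0
writes = 0
for epoch in range(epochs):
    order = list(range(num_samples))
    random.shuffle(order)            # shuffled access -> random small reads
    reads += len(order)              # every sample is read once per epoch
    writes += checkpoints_per_epoch  # occasional checkpoint write

read_ratio = reads / (reads + writes)
print(f"reads: {reads}, writes: {writes}, read ratio: {read_ratio:.4%}")
```

Even with generous checkpointing, the I/O mix stays overwhelmingly read-dominated, which is the point of the sketch.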
Because of the voluminous data, sharing the data to serve many training nodes in DL would require a decent network with high throughput. RDMA and InfiniBand networks are good technologies to consider, but could add to the costs.
As the data moves up to ML, after the right training models have been applied in DL, the data pattern is still read-intensive, but probably not to the degree and intensity of DL. The IOPS pattern could change, with a read/write mix that is less read-heavy than in DL. A storage system that can handle high IOPS is a consideration, together with a high-throughput network ecosystem. The compute requirement grows with ML workloads, and it is also important to feed the compute platforms to keep the GPUs utilized as much as possible. Underutilized GPUs are wasted cost.
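A quick back-of-the-envelope calculation shows why keeping the GPUs fed becomes a storage sizing question. The GPU count, ingest rate and sample size below are assumptions I have picked for illustration, not benchmarks:

```python
# Hypothetical training-node figures, purely illustrative.
gpus = 8                          # GPUs in one training node
samples_per_sec_per_gpu = 2_000   # e.g. small images consumed per second
avg_sample_bytes = 150 * 1024     # ~150 KiB per training sample

# Sustained read throughput the storage must deliver to avoid
# starving (and thus underutilizing) the GPUs.
required_bps = gpus * samples_per_sec_per_gpu * avg_sample_bytes
print(f"Required read throughput: {required_bps / 1e9:.2f} GB/s")
```

Even these modest assumed numbers land in the multi-GB/s range per node, which is why high-throughput fabrics such as RDMA or InfiniBand enter the conversation.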
As the workflows get closer to cognitive systems, AI, the compute demand grows. Inferencing, the act of making highly accurate assumptions based on known thought and learning models, is an example of foundational cognitive intelligence, and relies on heavy computation and neural networks. In this part of the data path, where there is inferencing, learning, re-learning, and re-establishing accuracy and cognition, the performance demands on the storage system would again shift to a mix of high IOPS and small, highly randomized decision data sets.
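One rough way to reason about that high-IOPS, small-random-read demand is Little’s Law: sustained IOPS is approximately the number of outstanding I/Os divided by the average latency per I/O. The latency and queue-depth figures below are assumed, illustrative values, not measurements of any particular system:

```python
# Little's Law applied to small random reads; illustrative numbers only.
avg_latency_s = 0.0002   # assume 200 microseconds per small read
queue_depth = 32         # assume 32 concurrent outstanding I/Os

# Sustained IOPS the workload can drive at this latency and concurrency.
iops = queue_depth / avg_latency_s
print(f"Estimated sustained IOPS: {iops:,.0f}")
```

The sketch also shows why lower-latency media shift the ceiling: halving the per-I/O latency at the same queue depth doubles the achievable IOPS.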
Another consideration is, of course, the data fabric and the data patterns between the central cloud and the edge. The storage systems at the edge would have a smaller footprint and lower resources and power compared to the immense capability in the cloud. Again, we look at the near-real-time latency of these storage systems as well, albeit at a smaller scale of performance delivery.
These are the thoughts and considerations I have developed so far.
An ever changing data landscape
What I have written so far is not permanent. There is no single way to define the data patterns and behaviours as data traverses the neural pathways of the AI ecosystem through different stages of the data lifecycle. For the moment, what I have described will hold true, but soon it will change again.
I do not profess to be an authority on the subject matter. In fact, I am far from it. In this ever-changing data landscape, I have to change along with it to learn, and I learn from the change. But I hope my thoughts – these considerations – can be helpful to all.
Final thoughts? I welcome your feedback and comments.