Data movement is expensive, not just in dollars but in latency and resources as well. Hence the many narratives about moving compute closer to where the data is stored, because moving compute is far more economical than moving data. I borrowed the analogy of the two animals from some old NetApp® slides which depicted storage as an elephant and compute as birds. It is a fitting analogy, because storage is heavy and compute is light.
Before the animal representation came about, I used to use the phrase “Data Locality, Data Mobility”, from my past work on storage technology in the Oil & Gas subsurface data management pipeline.
Take stock of your data movement
I had a recent conversation with an end user who has been paying a lot of dollars keeping their “backup” and “archive” data in AWS Glacier. The Glacier storage is cheap enough to hold several petabytes of data for years, and the IT folks filed the data there because it was “backup” and “archive”. I put both words in quotes because they were labelled that way out of enterprise practice. However, the face of their business is changing. They are in manufacturing and oil and gas downstream, and the definitions of “backup” and “archive” data have changed.
For one, there is strong demand to reuse past data for various reasons, and these datasets have to be recalled from cloud storage. Secondly, their data movement activities still mimic what they did in their enterprise storage days. It was a classic lift-and-shift when they moved to the cloud, without taking stock of their data movements and the operations they run on these datasets. Still ongoing, their monthly AWS bill costs a bomb.
Again, People and Process
To address these challenges, an organization has to take a fresh look at its data profiles and data activities. These fall under the people and the processes of the organization. Under the umbrella of data management, data should be classified under a taxonomy of usage. With that classification, we define the data profiles, including the availability of the data, the RPO (recovery point objective), and the performance, security and compliance aspects of each dataset. This allows an organization to know the cost related to the data, both in its presence and in its absence. Then we look at the data movements: activities such as ingestion into different data repositories, how the data is used, how it is shared, and how it is protected under the different classifications for data privacy and data compliance.
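A minimal sketch of such a data profile as a record, with hypothetical field names and tiers (none of this is a standard taxonomy, just an illustration of the idea):

```python
from dataclasses import dataclass, field

@dataclass
class DataProfile:
    """Classification record for a dataset under a usage taxonomy.
    Field names and tier values are illustrative assumptions."""
    name: str
    usage_class: str                 # e.g. "active", "backup", "archive"
    rpo_hours: float                 # recovery point objective
    availability_pct: float          # e.g. 99.9
    performance_tier: str            # e.g. "hot", "warm", "cold"
    compliance_tags: list = field(default_factory=list)

def placement_mismatch(profile: DataProfile, recalls_per_month: int) -> bool:
    """Flag datasets filed as archive but recalled regularly,
    i.e. the kind of mislabelling described in the Glacier example."""
    return profile.usage_class == "archive" and recalls_per_month > 1

seismic = DataProfile("subsurface-seismic", "archive", rpo_hours=72,
                      availability_pct=99.0, performance_tier="cold",
                      compliance_tags=["oil-and-gas"])
print(placement_mismatch(seismic, recalls_per_month=12))  # → True
```

Even a simple record like this makes the mismatch between a dataset's label and its actual usage visible, which is the first step to attaching a cost to it.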
As in my earlier example, attaching the wrong profile to a dataset can have a heavy cost, in many ways.
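To make that cost concrete, here is a back-of-envelope sketch. The per-GB rates below are made-up placeholders, not actual AWS Glacier or S3 pricing; the point is the shape of the arithmetic, not the numbers:

```python
# Illustrative monthly cost of storing vs. repeatedly recalling a dataset.
# All rates are hypothetical placeholders, NOT real cloud pricing.
ARCHIVE_STORE = 0.004    # $/GB-month, archive tier (assumed)
ARCHIVE_RETRIEVE = 0.03  # $/GB per recall (assumed)
HOT_STORE = 0.023        # $/GB-month, hot tier (assumed)

def monthly_cost(gb: float, recalls: int,
                 store_rate: float, retrieve_rate: float) -> float:
    """Storage cost plus the cost of pulling the whole dataset back."""
    return gb * store_rate + gb * recalls * retrieve_rate

dataset_gb = 500_000  # 500 TB

# Filed as "archive" but recalled twice a month:
archive = monthly_cost(dataset_gb, 2, ARCHIVE_STORE, ARCHIVE_RETRIEVE)
# Kept on a hot tier with no per-read charge:
hot = monthly_cost(dataset_gb, 0, HOT_STORE, 0.0)
print(f"archive-with-recalls: ${archive:,.0f}, hot: ${hot:,.0f}")
```

With these assumed rates, the "cheap" archive tier ends up costing nearly three times the hot tier once regular recalls are factored in, which is exactly the trap of attaching the wrong profile.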
Storage technologies to consider
One of the philosophies of Apache® Hadoop was to bring compute closer to data, and it endeared the world to the yellow elephant technology. The trouble with Hadoop was that it was too complex, too sluggish, and did not adapt well to the cloud.
Hyperconvergence is another storage technology to consider for putting data close to compute, but the trouble with generation 1 hyperconverged platforms is that they are too confining. Many cannot separate the compute and storage components without one affecting the performance of the other. NetApp® tried with its SolidFire storage to pitch scaling compute and storage separately, but decided that the hyperconverged market was not its cup of tea. A few other vendors addressed the limitations of first-generation hyperconvergence by putting a SAN behind it, and called it dHCI (disaggregated Hyper Converged Infrastructure).
A second generation of hyperconverged infrastructure is emerging, and this segment comes from pedigreed enterprise storage vendors, unlike the previous ones which were designed from the compute side. iXsystems™ TrueNAS® SCALE runs Debian with KVM, while Dell EMC PowerStore X runs VMware® ESXi. TrueNAS® SCALE looks to be the better bet, with impressive integration with containers, Kubernetes, and more. Check out this recent webinar from BayLISA (Bay Area Large Installation System Administrators) on TrueNAS® SCALE: Open Source, Scale-Out HCI & Kubernetes Storage.
Copy data management can also be considered a storage solution, as it reduces the amount of duplicate data to move.
Computational storage is another storage technology brewing rapidly in a burgeoning new market segment. SNIA® has been a strong advocate since its inception, and here is a look at the state of the Computational Storage market.
Plain and simple data management
I have listed a few storage-related technologies that can alleviate the issue of data movement and bring compute closer to the data. But fundamentally, this is just plain and simple data management discipline. Knowing where your data repositories are, what their data profiles are, how data is moved and used over its lifecycle in the organization, and finding crevices where data can be optimized closer to compute, is just the simple symbiotic relationship between the two animals.
Bringing compute to data not only speeds up the processing of that data, especially at the edge and on IoT devices, but also reduces the need to send voluminous data to a cloud facility. Those data movement dynamics must be understood before dumping data into a repository such as cloud storage, just as we understand the nature of the elephants and the birds.