The thought of it has been on my mind since Commvault GO 2019. It was sparked when Don Foster, VP of Storage Solutions at Commvault, answered a question posed by one of the analysts. What he said clicked with me, as I had been searching for better insight into how Commvault and Hedvig would come together.
Data Deluge is a swamp thing now
Several years ago, I heard Stephen Brobst, CTO of Teradata, bring up the term “Data Swamp“. It was the antithesis of the Data Lake, and this was back when Data Lakes and Hadoop were all the rage. His comments were raw, honest, and they pointed to the truth out there.
I was enamoured with his thoughts at the time, and today, his comments about the Data Swamp have manifested themselves.
We have too much data. We have too little data. Either way, the data that we collect, the data that we create, and the data we get from many other sources have to be of value. From my nascent experience working with a time-series database like InfluxDB, I can comment on this. Before data can be ingested, there is a lot of work to prepare and enrich it in order to get useful output and visualization in Chronograf or Grafana. A big part of the work is aligning the data so that it is relevant to the database.
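As an illustration of that preparation work, here is a minimal sketch, with made-up column names and measurement, of shaping a pandas DataFrame into InfluxDB line protocol before ingestion:

```python
# A minimal sketch of the data preparation described above: shaping a
# pandas DataFrame into InfluxDB line protocol. The column names
# (sensor_id, temperature) and the measurement name are hypothetical.
import pandas as pd

def to_line_protocol(df: pd.DataFrame, measurement: str) -> list:
    """Convert each row into an InfluxDB line protocol string."""
    lines = []
    for _, row in df.iterrows():
        tags = f"sensor_id={row['sensor_id']}"          # tag set
        fields = f"temperature={row['temperature']}"    # field set
        ts = int(pd.Timestamp(row['time']).value)       # epoch nanoseconds
        lines.append(f"{measurement},{tags} {fields} {ts}")
    return lines

df = pd.DataFrame({
    "time": ["2019-11-01T08:00:00Z", "2019-11-01T08:01:00Z"],
    "sensor_id": ["s1", "s2"],
    "temperature": [21.5, 22.1],
})
print("\n".join(to_line_protocol(df, "room_climate")))
```

None of this tells the database anything about how valuable or sensitive the data is; that context still has to be bolted on by the data worker.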
Label data personality at ingest
How do we get valuable and relevant data from the data lake? There are probably billions of data bits to label, but applying common definitions to the data is possible. In fact, this will allow disparate storage and information systems to recognize these common “labels” and gain tremendous benefits from them.
Many years ago, I was introduced to Project COMET at Hitachi Data Systems. I never knew what the acronym stood for, and I assumed that COMET meant “Common Metadata”. I was the Oil & Gas Industry Manager at the time, tasked with introducing the Hitachi Content Platform (HCP) to upstream E&P (Exploration & Production), because there was a natural integration with seismic data if it was stored as objects.
Metadata, and tons of it, is the compass to the voluminous data in seismic interpretation, the most valuable phase in upstream. This is where the G&G (geological and geophysical) engineers are paid big bucks, because they point to the locations where oil and gas are likely to be found. And the object storage of HCP has a natural affinity for metadata.
Metadata is invaluable to the data. From the photo below, you can see the valuable insight the metadata brings to the X-Ray. From the metadata of the photo on the right, we know so much more.
The idea of Project COMET was to insert common metadata details into the groups of data. For instance, we can define the data protection level of the data. This means that there must always be one or more copies of the data, hence protected. We can also define GDPR (General Data Protection Regulation) settings for the data, giving the data a common compliance identity. This is what I mean by advocating common data personality in the title of this blog post.
And the “injection” is applied at the ingestion point of the data, as it enters the storage repository. I don’t think this process is very difficult, because we are already applying XML (eXtensible Markup Language) and JSON (JavaScript Object Notation) to describe data, which serves a similar objective to metadata. I am no expert in this area, but COMET was based on XML, and the implementation of this common data personality is highly plausible. When the data personality and its identified characteristics are imbued in the metadata, the value of the data can be marked and stated.
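To make the idea concrete, here is a hedged sketch of what labeling at ingest might look like. This is not COMET’s actual schema; the field names (protection_copies, gdpr_personal_data, retention_years) are hypothetical illustrations of a common data personality serialized as JSON:

```python
# A hypothetical sketch, not COMET's real schema: wrap an incoming object
# with a common "data personality" block at the point of ingest.
import json
import hashlib
from datetime import datetime, timezone

def label_at_ingest(payload: bytes, personality: dict) -> dict:
    """Attach common metadata to an object as it enters the repository."""
    return {
        "object_sha256": hashlib.sha256(payload).hexdigest(),   # content identity
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # ingest timestamp
        "personality": personality,                              # the common labels
    }

record = label_at_ingest(
    b"...seismic trace data...",
    {"protection_copies": 2, "gdpr_personal_data": False, "retention_years": 7},
)
print(json.dumps(record, indent=2))
```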
“You had me at hello”. The famous quote from the movie Jerry Maguire. It is time to say “You label me at ingest”.
Intent-based Data Personality
This leads to the concept of intent-based data personalities, with outcomes. We want the data to always be protected with 2 copies at all times within the data ecosystem. Or we want the data to comply with the criteria of basic data compliance, or we want the data to land on storage that can deliver more than 20,000 IOPS. These are the outcomes, the intent of common data personalities. Laying down the foundation of common data personalities will lead to better outcomes and value for the data.
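Here is a rough sketch of how such an intent could be expressed and checked. The attribute names are hypothetical; the two-copy and 20,000 IOPS thresholds are the examples from above:

```python
# A minimal sketch of an intent-based data personality, assuming
# hypothetical attribute names for the declared outcomes.
from dataclasses import dataclass

@dataclass
class DataPersonality:
    min_copies: int      # intent: always protected with at least this many copies
    compliant: bool      # intent: must meet basic data compliance criteria
    min_iops: int        # intent: land on storage delivering at least this many IOPS

def placement_satisfies(intent: DataPersonality, copies: int, iops: int, compliant: bool) -> bool:
    """Check whether a candidate placement delivers the declared outcomes."""
    return (copies >= intent.min_copies
            and iops >= intent.min_iops
            and (compliant or not intent.compliant))

policy = DataPersonality(min_copies=2, compliant=True, min_iops=20_000)
print(placement_satisfies(policy, copies=2, iops=25_000, compliant=True))  # True
```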
Helping data at the Edge
Edge Computing will be the new black. No doubt about it, because we will have billions of devices, sensors and edge components generating data at the end points. Systems will be inundated with data that has not passed through any filter. This means that data workers have to do a lot of data preparation to sieve through the data, sorting and classifying it into classes of data requirements. Why not put the data through a storage and data management platform that can inject data personalities, such as data protection level, data governance and other requirements, right at the birth of the data, right at its source at the edge? It is something worth considering.
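As a purely hypothetical sketch, the labeling could happen right where the reading is created, so downstream systems receive pre-classified data rather than a raw, unfiltered stream; the classification rule and field names below are illustrative only:

```python
# A hypothetical sketch of labelling data at the edge, at the point of
# creation, before it is sent downstream. Field names are illustrative.
import json
from datetime import datetime, timezone

def tag_edge_reading(device_id: str, value: float) -> str:
    """Attach governance and protection hints before the reading leaves the edge."""
    reading = {
        "device_id": device_id,
        "value": value,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "personality": {
            # example rule: keep an extra copy of out-of-range readings
            "protection_copies": 2 if value >= 100.0 else 1,
            "governance_class": "telemetry",   # hypothetical classification label
        },
    }
    return json.dumps(reading)

print(tag_edge_reading("edge-sensor-07", 120.4))
```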
Reducing the pain of data preparation
Although we can give this data personality to all bits and pieces of data, the importance of these common and foundational requirements in the data itself cannot be overstated. They lay the bricks for data to be important, to be relevant, to be of value.
We have all been through the pains of preparing data. Putting data together for various business and operational needs, making it functional for those purposes, and getting it to conform to the required formats and structures is tedious, error-prone and draining. Putting in common data foundations can help alleviate these pains and give data workers the freedom to focus on more important requirements.
Besides soothing the pain of data preparation, a common data personality also improves data portability, data security, data classification and eDiscovery.
No to Data Graveyards
After all, we do not want our Data Lakes to become Data Swamps, or worse, end up as Data Graveyards. This article on Data Lakes and Data Swamps highlights the differences between the two. It is time to put in the common data personalities to enhance and maximize the value of data.
CTO Advisor Take
Update: While waiting for the right time to publish this blog, CTO Advisor Keith Townsend shared “Are CIOs missing the Metadata Bus?“. His views were apt, discussing the notion of leveraging metadata to decrease time to value for data analysis. The time wasted preparing and normalizing data before applying the algorithms, as Keith mentions, is similar to the work I had to do with the Python Pandas modules to prepare my data and convert it into line protocol for ingestion into the InfluxDB database.
It is time to seriously think about integrating a common data personality into the data. It starts with enriching the metadata with that common data personality.