Preliminary Data Taxonomy at Ingestion: An Opportunity for Computational Storage

Data governance has been on my mind a lot lately. Amid all the incessant talk and hype about Artificial Intelligence, the true value of AI comes from good data. Therefore, it is vital for any organization embarking on its AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion: the data source where data is created or acquired before being presented to the processing workflows and data pipelines for AI training, and onwards to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these three topics together – data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy post-ingestion

I see that data, any data, has to arrive at a repository before it is given meaning, context and specification. These attributes are different from file permissions, ownership, and the ctime and atime timestamps; they concern the content of the ingested data stream as it is made to fit the mould of the repository it is written to. Metadata about the content gives the data meaning, context and, most importantly, value as it moves through the data lifecycle. However, metadata tagging and data preparation in the ETL (extract, transform, load) or ELT (extract, load, transform) process are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive in terms of resources, time and money.
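To make the point concrete, here is a minimal sketch of the conventional post-ingestion approach: the file has already landed in the repository, and only afterwards does a preparation job come along to compute content metadata and tags. The function name and metadata fields are illustrative, not any particular tool's schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def enrich_post_ingestion(path: Path) -> dict:
    """Build content metadata for a file that has ALREADY landed in the
    repository -- the conventional, after-the-fact preparation step."""
    raw = path.read_bytes()
    return {
        "file": path.name,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        # Content-level tags are normally assigned here, in the data
        # preparation phase, long after the data was first written.
        "tags": ["unclassified"],
        "enriched_at": datetime.now(timezone.utc).isoformat(),
    }

# Usage: write a sample file first, then enrich it after the fact.
sample = Path("ingested.csv")
sample.write_text("id,amount\n1,9.99\n")
metadata = enrich_post_ingestion(sample)
print(json.dumps(metadata, indent=2))
```

The gap between the write and the enrichment is exactly the window this post argues Computational Storage could close.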

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in these burgeoning times of open table formats (Apache Iceberg, Apache Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON), the format specifications with added context and meaning are still augmented post-ingestion.

Computational Storage objective

One Computational Storage (CS) design principle that I always relate to is moving Compute closer to the Data. Data is heavy and difficult to move, while Compute is relatively lightweight and fleet-footed. I even made the analogy of storage being the elephant and compute being the birds in a previous blog. Thus in-situ processing, as Computational Storage was previously known, means that CS has the opportunity to process data as it arrives at the data source, at the ingestion point.

And if one spends the time to read the SNIA® Computational Storage Architecture and Programming Model Version 1.0, performing data taxonomy and metadata tagging at the CSD (Computational Storage Drive) is certainly very possible. A CSD is a storage component that has a CSE (Computational Storage Engine) with persistent storage, usually a solid-state medium. A CSE is a CS component that can execute one or more CSFs (Computational Storage Functions). The architecture of the CSD component is shown below:
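A rough sketch of that component relationship, in code: a CSD bundles persistent media with a CSE, and the CSE executes registered CSFs on the data path. The class and method names here are my own illustration of the SNIA model, not an actual SNIA or vendor API.

```python
import zlib
from typing import Callable, Dict, Optional

class ComputationalStorageEngine:
    """CSE: a CS component that can execute one or more CSFs."""
    def __init__(self) -> None:
        self._csfs: Dict[str, Callable[[bytes], bytes]] = {}

    def register_csf(self, name: str, fn: Callable[[bytes], bytes]) -> None:
        self._csfs[name] = fn

    def execute(self, name: str, data: bytes) -> bytes:
        return self._csfs[name](data)

class ComputationalStorageDrive:
    """CSD: persistent storage paired with a CSE that runs CSFs in-situ."""
    def __init__(self) -> None:
        self.cse = ComputationalStorageEngine()
        self._media: Dict[str, bytes] = {}  # stand-in for the SSD medium

    def write(self, key: str, data: bytes, csf: Optional[str] = None) -> None:
        # Optionally run a CSF on the data path before it is persisted.
        self._media[key] = self.cse.execute(csf, data) if csf else data

    def read(self, key: str) -> bytes:
        return self._media[key]

# Usage: an in-line Compression CSF, applied as the data is written.
csd = ComputationalStorageDrive()
csd.cse.register_csf("compress", zlib.compress)
csd.write("log-0001", b"payload " * 100, csf="compress")
print(len(csd.read("log-0001")))  # far smaller than the 800 raw bytes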

Components of the Computational Storage Drive with CSE and CSF elements

For a fuller understanding of the acronyms, please read the SNIA® document mentioned above.

Today, I observe that most CSFs focus on encryption, compression (ScaleFlux in-line compression), deduplication, ransomware detection (IBM® FlashCore Module with Storage Defender software) and a few more. The example CSFs listed in the CS Architecture and Programming Model are:

  • Compression CSF
  • Database Filter CSF
  • Encryption CSF
  • Erasure Coding CSF
  • Regex CSF
  • Scatter-Gather CSF
  • Pipeline CSF
  • Video Compression CSF
  • Hash/CRC CSF
  • Data deduplication CSF
  • Large Dataset CSF

The capability to introduce a Data Taxonomy CSF is very real. This CSF could tag the data, imbue it with context, classify it and give it meaning almost in real time as the source data is being ingested, before it moves along the data pipeline downstream to the relevant data repositories and data processing factories, including AI training, already enriched and better prepared for the ETL or ELT preparation phases. The obvious advantage is that it would cut out much of the heavy lifting and time-consuming portions of these phases. After all, time is of the essence in preparing data for AI training.
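What might such a Data Taxonomy CSF look like? A minimal sketch, assuming a function that inspects each incoming record at the ingestion point and attaches classification metadata before the data moves downstream. The detection rules and classification labels are illustrative only, not a standard.

```python
import json
import re

# Toy content-detection rules a taxonomy CSF might apply in-situ.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
}

def taxonomy_csf(record: bytes) -> dict:
    """Classify a record as it is ingested, returning content metadata
    that travels with the data into the downstream pipeline."""
    text = record.decode("utf-8", errors="replace")
    labels = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    return {
        "classification": "restricted" if labels else "internal",
        "labels": labels,
    }

# Usage: a record arriving at the ingestion point, tagged immediately.
record = b'{"user": "alice@example.com", "action": "login"}'
print(json.dumps(taxonomy_csf(record)))
```

Running such a function on the drive itself means the record reaches the data preparation phase already carrying its classification, which is the heavy lifting this section argues can be cut.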

Data Governance, E-C-C and EDA – The Bigger Picture

I ponder this from two different angles. From the data management perspective, preliminary data taxonomy activities enrich the data at the very beginning of ingestion. The value of the data is realized early, and data quality is enhanced before it reaches the data preparation phase. Computational Storage drives and the CS ecosystem can further improve data performance with compression, and secure it with encryption and ransomware detection and response, on top of data that is already enriched. This ensures that data quality is curated and managed well, in line with the organization's data governance practices.

Secondly, from a data infrastructure angle, this lends itself to the Edge-to-Core-Cloud (E-C-C) infrastructure architecture, where an organization can achieve a fluid data path from data ingestion to AI data training. Similarly, for real-time data ingestion processing, as opposed to batch processing, Event-Driven Architectures (EDAs) are very much part of the modern data architecture of many organizations.
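The EDA angle can be sketched in a few lines: each arriving record is an event, and enrichment happens in the handler as the event arrives, rather than in a later batch pass. This is illustrative only; a production EDA would sit on a broker such as Kafka rather than an in-process queue.

```python
from queue import Queue

def handle_event(event: dict) -> dict:
    # Stand-in for taxonomy tagging / metadata enrichment at arrival time.
    event["enriched"] = True
    return event

# Simulate three ingestion events arriving on an event stream.
events: Queue = Queue()
for i in range(3):
    events.put({"id": i, "payload": f"record-{i}"})

# Process each event as it arrives -- no waiting for a batch window.
processed = []
while not events.empty():
    processed.append(handle_event(events.get()))
print(len(processed))  # → 3
```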

To consolidate the bigger picture: data that is well defined and well managed sets the roadmap for good data quality within the organization, starting the data pipeline sequence right at the ingestion source.

My CS wish

I have never worked with Computational Storage. I have written about Computational Storage because I see that it has tremendous business value. In the age of AI today, the data journey is not just about LLMs (large language models), generative AI, or novel applications like ChatGPT. It begins at the data source, and the data goes through transformation and training phases before technologies like AI can deliver the intelligence we humans expect.

Therefore, it is my wish that at every data ingestion point, every storage element at the data source starting point, there is a CS element in it. Every solid state drive should already be a CSD drive with a CSE engine in it, ready to enrich the data. That, in my opinion, is Computational Storage for the AI age.


About cfheoh

I am a technology blogger with 30 years of IT experience. I write heavily on technologies related to storage networking and data management because those are my areas of interest and expertise. I introduce technologies with the objective of getting readers to know the facts and use that knowledge to cut through the marketing hype, FUD (fear, uncertainty and doubt) and other fancy stuff. Only then will there be progress. I am involved in SNIA (Storage Networking Industry Association), and between 2013 and 2015 I was the SNIA South Asia and SNIA Malaysia non-voting representative to the SNIA Technical Council. I am currently employed at iXsystems as their General Manager for Asia Pacific Japan.
