Preliminary Data Taxonomy at ingestion. An opportunity for Computational Storage

Data governance has been on my mind a lot lately. With all the incessant talk and hype about Artificial Intelligence, the true value of AI comes from good data. Therefore, it is vital for any organization embarking on its AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion, the data source where data is created or acquired, then presented to the processing workflows and data pipelines for AI training, and onwards to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these 3 topics together: data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy post-ingestion

I see that data, any data, has to arrive at a repository first before it is given meaning, context and specifications. These requirements are different from file permissions, ownership, and ctime and atime timestamps, which merely make the ingested data stream fit the mould of the repository it is written to. Metadata about the content of the data gives the data meaning, context and, most importantly, value as it is used within the data lifecycle. However, the metadata tagging and the preparation of the data in the ETL (extract, transform, load) or ELT (extract, load, transform) process are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive in terms of resources, time and money.

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in these burgeoning times of open table formats (Apache Iceberg, Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON, et al.), the format specifications with added context and meaning are added and augmented post-ingestion.
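To make the idea concrete, here is a minimal sketch of what tagging at the point of ingestion could look like, assuming a Python pipeline that writes Parquet with pyarrow. The taxonomy keys used here (taxonomy.domain, taxonomy.sensitivity, taxonomy.retention) are hypothetical labels for illustration, not an established standard.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of newly ingested records.
table = pa.table({"sensor_id": [101, 102], "reading_c": [21.4, 22.8]})

# Hypothetical taxonomy tags, attached as Parquet key-value metadata
# at write time so the file carries its own classification.
tags = {
    b"taxonomy.domain": b"plant-telemetry",
    b"taxonomy.sensitivity": b"internal",
    b"taxonomy.retention": b"90-days",
}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **tags})
pq.write_table(table, "ingested.parquet")

# Downstream tools can read the tags from the file footer
# without scanning the data itself.
print(pq.read_schema("ingested.parquet").metadata)
```

In a Computational Storage setting, this same tagging step could conceivably be pushed down to the storage device as the data lands, rather than being deferred to a later ETL or ELT pass.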


Data Trust and Data Responsibility. Where we should be before responsible AI.

Last week, there was a press release by Qlik™ about a study it sponsored with TechTarget®‘s Enterprise Strategy Group (ESG) on the state of responsible AI practices across industries. The study highlighted critical gaps in the approach to responsible AI, ethical AI practices and AI regulatory compliance. From the study, Qlik™ emphasizes having a solid data foundation. To get to that bedrock foundation, we must trust the data and we must be responsible for the kinds of data that built that foundation. Hence, Data Trust and Data Responsibility.

There is an AI boom right now. Last year alone, the AI machine and its hype added USD$2.4 trillion in market cap to US tech companies. 5 months into 2024, AI is still supernova hot. And many are very much fixated on the infallible fables and tales of AI’s pompous splendour. It is with this blind faith that I see many users and vendors alike sidestepping the realities of AI in its present state.

AI is not always responsible. That begs the question: “Are we really working with a responsible set of AI applications and ecosystems?”

Responsible AI. Are we there yet?

AI still hallucinates, unfortunately. And there is a lack of transparency in how AI applications arrive at a conclusion and a recommended decision. What if you had a conversation with ChatGPT and it said that you were dead? Well, that was exactly what happened when Tom’s Guide writer, Tony Polanco, found out from ChatGPT that he had passed away in September 2021.
