I am on a learning streak again. The most prominent technology that keeps landing on my tray at present is, of course, Artificial Intelligence (AI).
AI is hot. Very hot. And overhyped. Everyone is an expert nowadays. Yeah, right. Not me.
Underneath that glossy veneer of the AI hype, there is much going on behind the scenes to make AI great. The two areas I have been involved in and practiced for a long time are data infrastructure, a.k.a. storage, and data management. Both are playing prominent parts in the advancement of the AI ecosystem, and this makes me very excited.
I am no expert, but learning from various sources already tells me that AI is pushing both storage and data management harder than ever before, much harder than traditional enterprise on-premises use cases and even cloud computing applications. I ask myself, “where do I start my learning again?” as I journal my process.
Storage performance in a Data Pipeline
The speed at which AI responds is Trust. The faster it gets to accurate and relevant responses, the more it builds trust in AI. Getting to the speed we want is not easy, and storage, a.k.a. data infrastructure, is doing its part. I pick up my learning from understanding the AI pipeline. One early help comes from my friend, Gina Rosenthal, who attended Solidigm’s presentation at AI Field Day in February 2024. Her article, titled “Why storage matters for AI – Solidigm”, kickstarted my learning juices again.
I was particularly captivated by this slide in Gina’s article. It shows the laborious path data takes to become useful for AI applications.
AI data moves through these five phases of ingestion, preparation, training, checkpointing and inferencing, and each phase requires very high storage performance. At some stages there is an incredible demand for high read IOPS; at others it is ultra-high write throughput at very low latency. All these high-performance storage I/O behaviors are well summarized in the last row of the table above.
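To make those demands a little more concrete, here is a back-of-envelope sketch of how checkpointing alone stresses write throughput. None of the numbers come from the Solidigm slide; the model size, precision and bandwidth figures are purely my own assumptions for illustration.

```python
# Back-of-envelope sketch: how storage write bandwidth turns into GPU stall time
# during a synchronous checkpoint. All figures are illustrative assumptions,
# not vendor numbers, and optimizer state is ignored for simplicity.

def checkpoint_stall_seconds(params_billions: float,
                             bytes_per_param: int,
                             write_bandwidth_gbs: float) -> float:
    """Seconds the GPUs may wait while one checkpoint is flushed to storage."""
    checkpoint_size_gb = params_billions * bytes_per_param  # 1e9 params x bytes = GB
    return checkpoint_size_gb / write_bandwidth_gbs

# Example: a hypothetical 70B-parameter model checkpointed in fp16 (2 bytes/param)
for bandwidth in (2, 10, 50):  # assumed aggregate write bandwidth in GB/s
    stall = checkpoint_stall_seconds(70, 2, bandwidth)
    print(f"{bandwidth:>3} GB/s write bandwidth -> ~{stall:.0f} s stall per checkpoint")
```

Even in this simplified view, the same checkpoint that stalls a cluster for over a minute on slow storage becomes a few seconds on fast storage, which is exactly why the pipeline’s write phases matter so much.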
At the same time, I am also taking in a different view from another source, SNIA®. In another February 2024 presentation, “Addressing the hidden costs of AI”, the SNIA® delegates articulated the challenges placed on the storage infrastructure, and then some. One slide interested me in particular.
The GPUs are the ultimate workhorses in this AI ecosystem. The goal of all the supporting cast, and this includes storage, is to keep feeding the GPUs, to keep them working to develop and birth new AI-based applications. An idle cluster of GPUs is extremely expensive, both operationally and investment-wise.
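A quick, hypothetical calculation shows why idle GPUs hurt so much. The cluster size, hourly rate and utilization levels below are assumptions I picked for illustration, not figures from the SNIA® presentation.

```python
# Rough sketch: the cost of GPUs sitting idle while they wait for data.
# The cluster size, hourly rate and utilization levels are assumed values.

def idle_cost_per_day(num_gpus: int, hourly_cost_per_gpu: float, utilization: float) -> float:
    """Dollars spent per day on GPU hours that do no useful work."""
    idle_fraction = 1.0 - utilization
    return num_gpus * hourly_cost_per_gpu * 24 * idle_fraction

# Example: 512 GPUs at an assumed $2.50 per GPU-hour
for util in (0.70, 0.95):
    print(f"~${idle_cost_per_day(512, 2.50, util):,.0f} per day wasted at {util:.0%} utilization")
```

The point of the exercise is simple: every percentage point of utilization the storage and data pipeline can claw back translates directly into money not burned on waiting GPUs.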
Data Management
Data is the fuel that drives AI applications. Bad fuel quality equals bad outcomes, and hence the GIGO (garbage in, garbage out) mantra holds true. I have written about this in my previous blog: great AI starts with good data management.
However, there are many considerations with data. Should an organization use its own data? Should it use public data? Is it even permitted to use the sourced data? To what extent can it use the data? What about data created using generative AI services? What about data scraping or information scraping? So many questions are thrashing about in my mind right now.
After all, if an organization uses unsanctioned data or data it sourced from unknown parties, will the AI applications that use that data be safe? Can we trust AI applications built from data that is yet to be trusted? We are already informed of AI hallucinations, information disseminated by some AI applications in misleading ways. If misleading data, whether misinformation or disinformation, grows to 51% or more over actual truth, will it tip the scales of how human beings understand Truth?
OK. That may be a bit far-fetched, for now. This has led me down the path of Data Governance.
Organizations now have a duty to ensure that the data sourced, prepared and trained on is designed and developed for truthful AI. I started down this learning path because NIST has just recognized the governing of data in the latest release of its Cybersecurity Framework (CSF) 2.0. Data Governance has just been given prominent recognition in the management of data security, and I wrote about it in my previous blog amid the heightened awareness of data thefts in Malaysia, and pretty much all around the world.
On the flip side, what about our data that is given willingly? If users and consumers are easily giving away personal data to an organization with purportedly strong data security and data privacy, can we truly, truly trust that organization? This is what happened when the Malaysian government recently tried to initiate PADU (Pangkalan Data Utama), a centralized database hub for all Malaysian citizens. It is to allow the Malaysian government to dispense targeted subsidies and assistance, for a start.
Reception to PADU has been tepid. To me, that is a good thing, because Malaysians are now warier than ever about sharing their PII (Personally Identifiable Information) so easily. There is a Malay proverb, “Harapkan pagar, pagar makan padi”, which means trusting someone who might betray you later. I am happy for Malaysians and their digital maturity.
But in other parts of the world, I am perplexed at how poorly PII protection is playing out. I am upset by the Worldcoin project collecting very personal details of an individual’s eyeballs through an orb in exchange for some cryptocurrency reward. Or the Tencent Weixin Palm Scan Wallet payment using the veins in your hands. All in the name of technological advancement and convenience.
But who are these organizations that many human beings have given their PII to? Are they given the mandate to use our PII? Even if assurance is given, how safe can the data be? Can these organizations be compelled to pass on the PII of an individual to the authorities? The China Social Credit System is already doing that to some extent.
Data Governance is at the heart of ensuring data is properly procured, handled, shared, stored, protected and more. I am interested in learning in this space. In the background, ESG (environmental, social and governance) is also slowly making sense now in South East Asia.
As I was researching Data Governance over the weekend, I chanced upon a very interesting and apt conversation by Straits Interactive, a Singapore data protection consulting company. The conversation brought up the term “explainability”, which links to the topic of how AI is using the data, good and bad. This is the part where we have to know the reasons why the algorithms behind the AI applications made a decision or recommendation based on the data. It presents a critical question of what sort of data the AI models should be trained on. Data governance becomes vital to ensure proper and trustworthy data is presented to AI in a fair and transparent manner. It boils down to Trust, the element of accepting the “truth” presented.
I also got to know of the ISO/IEC 38505 standard, which provides the guidelines for the governance of data, as well as the ISO/IEC 42001 standard.
You can watch the YouTube video below:
It’s just the beginning
Those are the tough questions I am exploring as I embark on this journey of furthering data infrastructure, storage and data management in the era of AI. The pressure on these areas is much more demanding now, pushing the envelope harder than ever before.
These are my starting points. It is still early days, and things in the AI infrastructure and data management ecosystem are still fluid as we speak. I am just taking my early steps, gathering and absorbing knowledge before I can dispense my own take on designing a storage architecture and a data management framework for AI workloads and use cases. I will share more as I go deeper into my learnings in future blogs.
I am still that storage dinosaur; I have been called that to my face. But I am evolving. Learning never stops with me. I am never stop never stopping (borrowed from Conner4Real). It will be fun times again. I am just loving this AI with storage and data management stuff.