Rethinking Storage OKRs for AI Data Infrastructure – Part 2

[ Preamble: This analysis focuses on my own journey as I incorporate my experiences into this new market segment called AI Data Infrastructure. There are many elements of HPC (High Performance Computing) at play here. Even though things such as speeds and feeds, features and functions crowd many conversations, as many enterprise storage vendors do, these conversations, in my opinion, are secondary. There are more vital and important operational technology and technical elements that an organization has to consider prudently. They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.

I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 2 ]

This is a continuation of Part 1 of my blog last week. I spoke about the 4 key OKRs (Objectives and Key Results) we look at from the storage point of view with regard to AI data infrastructure. To recap, they are:

  • Reliability
  • Speed
  • Power Efficiency
  • Security

Power Efficiency

Patrick Kennedy, of ServeTheHome (STH) fame, astutely explained the new generation of data center racks required by NVIDIA® and AMD® in his article “Here is how much Power we expect AMD® and NVIDIA® racks will need in 2027” 2 weeks ago. Today, the NVIDIA® GB200 NVL72 ORv3 rack design draws 120kW per rack. That’s an insane amount of power consumption, and it can only go up in the next 2-3 years. That is why power efficiency must be an OKR metric to be deeply evaluated.

When you operate a GPU compute farm, whether it is 8 GPUs or 16,384 GPUs, keeping operations tight is vital, and maximum power efficiency belongs right up there with the rest of the operational OKRs. Power consumption becomes a cost factor in the data infrastructure design for AI.
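To put the 120kW figure into cost terms, here is a rough back-of-the-envelope sketch. It assumes the rack draws its full 120kW around the clock, which is a simplification, and the tariff of 0.10 per kWh is a made-up placeholder, not a quoted price. The function names are mine, for illustration only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_kwh(rack_kw, racks=1):
    """Energy used in a year, assuming constant draw at rack_kw per rack."""
    return rack_kw * HOURS_PER_YEAR * racks


def annual_energy_cost(rack_kw, tariff_per_kwh, racks=1):
    """Yearly energy cost for a given (hypothetical) tariff per kWh."""
    return annual_energy_kwh(rack_kw, racks) * tariff_per_kwh


if __name__ == "__main__":
    kwh = annual_energy_kwh(120)
    cost = annual_energy_cost(120, tariff_per_kwh=0.10)  # placeholder tariff
    print(f"One 120kW rack: {kwh:,} kWh per year, about {cost:,.0f} at 0.10 per kWh")
```

Even with a placeholder tariff, a single rack runs past a million kWh a year, and that is before cooling and the storage tier are counted.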

Two very important units of measurement I would look into, which have become valuable OKRs to achieve, are Performance per Watt (Performance/Watt) and Performance per Rack Unit (Performance/RU).
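As a minimal sketch of how these two ratios can be compared across candidate systems, the snippet below divides a performance figure by power draw and by rack space. I am assuming sustained throughput in GB/s as the "performance" number (it could just as well be IOPS), and the two systems and all their figures are invented for illustration, not vendor data.

```python
from dataclasses import dataclass


@dataclass
class StorageSystem:
    name: str
    throughput_gbps: float  # sustained throughput in GB/s (hypothetical)
    power_watts: float      # power draw in watts (hypothetical)
    rack_units: int         # rack space occupied, in RU (hypothetical)

    def perf_per_watt(self) -> float:
        """Performance/Watt: GB/s delivered per watt consumed."""
        return self.throughput_gbps / self.power_watts

    def perf_per_ru(self) -> float:
        """Performance/RU: GB/s delivered per rack unit occupied."""
        return self.throughput_gbps / self.rack_units


# Invented example systems, for illustration only.
systems = [
    StorageSystem("system-a", throughput_gbps=90.0, power_watts=3000.0, rack_units=2),
    StorageSystem("system-b", throughput_gbps=120.0, power_watts=6000.0, rack_units=8),
]

for s in systems:
    print(f"{s.name}: {s.perf_per_watt():.3f} GB/s per W, "
          f"{s.perf_per_ru():.1f} GB/s per RU")
```

In this made-up comparison the system with the higher headline throughput loses on both ratios, which is exactly the kind of hard question these OKRs are meant to surface.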

Power Efficiency in Data Center is a Must.

Rethinking Storage OKRs for AI Data Infrastructure – Part 1

[ Preamble: This analysis focuses on my own journey as I incorporate my past experiences into this new market segment called AI Data Infrastructure, and gain new ones.

There are many elements of HPC (High Performance Computing) at play here. Even though things such as speeds and feeds, features and functions crowd many conversations, as many enterprise storage vendors like to do, these conversations, in my opinion, are secondary. There are more vital and important operational technology and technical elements that an organization has to consider prudently, vis-à-vis ROI (return on investment). They involve asking the hard questions beyond the marketing hype and fluff. I call these elements of consideration Storage Objectives and Key Results (OKRs) for AI Data Infrastructure.

I had to break this blog into 2 parts. It has become TL;DR-ish. This is Part 1 ]

I have just passed my 6-month anniversary with DDN. Coming into the High Performance Storage System (HPSS) market segment, with the strong focus on the distributed parallel filesystem of Lustre®, there was a steep learning curve for me. I spent over 3 decades in Enterprise Storage, with some of the highest levels of storage technologies there were in that market segment. And I have already developed my own approach to enterprise storage, based on the A.P.P.A.R.M.S.C. framework. I started developing and honing that approach 25 years ago.

The rapid adoption of AI has created a technology paradigm shift. Artificial Intelligence (AI) came in and blurred many lines. It also has been evolving my thinking when it comes to storage for AI. There is also a paradigm shift in my thoughts, opinions and experiences as well.

AI has brought HPSS technologies like Lustre® in the DDN EXAscaler platform, proven in the Supercomputing world, to a new realm – the AI Data Infrastructure market segment. On the other side, many enterprise storage vendors aspire to be suppliers to the AI Data Infrastructure opportunities as well. This convergence comes from the top storage performers for Supercomputing, the likes of DDN, IBM® (through Storage Scale) and HPE® (through Cray, which by the way often uses the open-source Lustre® edition in its storage portfolio); from the software-defined storage players such as Weka IO, Vast Data and MinIO; and from the enterprise storage array vendors such as NetApp®, Pure Storage® and Dell®.

[ Note that I take care not to name every storage vendor for AI because many either do OEMs or repackage and rebrand SDS technology into their gear, such as HPE® GreenLake for Files and Hitachi® IQ. You can Google to find out who the original vendors are for each respectively. There are others as well. ]

In these 3 simplified categories (HPSS, SDS, Enterprise Storage Array), I have begun to see a pattern of each calling its technology an “AI Data Infrastructure”. At the same time, I am also developing a new set of storage conversations for the AI Data Infrastructure market segment, one that is based on OKRs (Objectives and Key Results) rather than just features, features and more features that many SDS and enterprise storage vendors like to tout. Here are a few things that end users should look for when they are considering a high-speed storage solution for their AI journey.

AI Data Infrastructure

GPU is king

In the AI world, the GPU infrastructure is the deity at the altar. The utilization rate of the GPUs is kept at the highest to get the maximum compute infrastructure return-on-investment (ROI). Keeping the GPUs resolutely busy is a must. HPSS is very much part of that ecosystem.

These are a few OKRs I would consider for the storage or data infrastructure for AI.

  • Reliability
  • Speed
  • Power Efficiency
  • Security

Let’s look at each one of them from the point of view of a storage practitioner like me.

AI and the Data Factory

When I first heard the term “AI Factory”, the world was blaring Jensen Huang’s keynote at NVIDIA GTC24. I thought those were cool words, since he mentioned the raw material of water going into the factory to produce electricity. The analogy was spot on for the AI we are building.

As I engage with many DDN partners and end users in the region, week in, week out, the term “AI Factory” keeps popping into conversations. Yet, many still do not know how to go about building this “AI Factory”. They only know they need to buy GPUs, lots of them. These companies’ AI ambitions are unabated. IDC predicts that worldwide spending on AI will double by 2028, and yet the ROI (return on investment) remains elusive.

At the ground level, based on many conversations so far, the common theme is that the steps to begin building the AI Factory are ambiguous and fuzzy to most. I would like to share my views from a data storage point of view. Hence, my take on the Data Factory for AI.

Are you AI-ready?

We have to have a plan, but before we take the first step, we must look at where we are standing at the present moment. We know that to train AI, the proverbial step is, we need lots of data. Deep Learning (DL), which works with Large Language Models (LLMs) and Generative AI (GenAI), needs tons of data.

If the company knows where they are, they will know which phase is next. So, in the AI Maturity Model (I simplified the diagram below), where is your company now? Are you AI-ready?

Simplified AI Maturity Model

Get the Data Strategy Right

In his interview with CRN, MinIO’s CEO AB Periasamy said, “For generative AI, they realized that buying more GPUs without a coherent data strategy meant GPUs are going to idle out”. I was struck by his wisdom about having a coherent data strategy because that is absolutely true. This is my starting point. Having the Right Data Strategy.

In the AI world, from the point of view of a data storage guy, data is the fuel. Data is the raw material that Jensen alluded to, if it wasn’t obvious. We have heard this anecdotal quote many times before, even before the AI phenomenon took over. AI is data-driven. Data is vital for the ROI of AI projects. And thus, we must look from the point of view of the data to make the AI Factory successful.

What next after Cyber Resiliency?

There was a time some years ago when some storage vendors, especially the object storage ones, started calling themselves the “last line of defence”. And even further back, when the purpose-built backup appliances (PBBAs) first appeared, a very smart friend of mine commented that they shouldn’t call it “backup appliance”, but rather they should call it “restore appliance”. That was because the data restoration part, or to be more relevant in today’s context, data recovery is the key to a crucial line of defence against cybersecurity threats to data, especially ransomware. We have a saying in the industry. “Hundreds of good backups are not as good as one good restore.” Of course, this data restoration part has become more sophisticated in the data recovery processes.

In recent years, we have also seen the amalgamation of both data protection species – the backup/restore side and the cybersecurity side – giving rise to the term and the proliferation of Cyber Resilience.

Dialing Cyber Resilience (Picture from tehtris.com)

I have no qualms about, or lack of confidence in, the cyber resilience technologies. I am pretty sure they can do the job extremely well, so much so that some vendors give million-dollar guarantees should their solution ever fail. Druva announced their Data Resiliency Guarantee of USD$10 million and Rubrik has their Ransomware Recovery Warranty.

Of course, these warranties and guarantees come with terms and conditions, and caveats, and not everyone is besotted by these big-number payouts. My friend, Andrew Martin, wrote a tongue-in-cheek piece about Rubrik’s warranty in his Data Storage Asia blog last year, which discussed whether it was Rubrik’s genuineness or spuriousness that might win or lose customers’ affections. You should read his blog to decide.

Nurturing Data Governance for Cybersecurity and AI

Towards the middle of the 2000s, I started getting my exposure to Data Governance. This began as I was studying and practising to be certified as an Oracle Certified Professional (OCP) circa 2002-2003. My understanding of the value of data and databases in the storage world, now better known as data infrastructure, grew and expanded quickly. I never got my OCP certification because I ran out of money investing in the 5 required classes that included PL/SQL, DBA Admin I and II, and Performance Tuning. My son, Jeffrey, was born in 2002, and money was tight.

Among most organizations I engaged with at that time, and over the course of the next 18 years or so, pre-Covid, the sentiment was that the practice of data governance existed to comply with some regulatory requirements.

All that is changing. In early 2024, NIST released the second version of their Cybersecurity Framework (CSF). CSF 2.0 placed Data Governance in the center of the previous 5 pillars of CSF 1.1. The diagram below shows the difference between the versions.

High level change of Cybersecurity Framework 1.1 to 2.0.

Ripples like this on my data management radar are significant, noticeable and important to me. I blogged about it in my April 2024 blog “NIST CSF 2.0 brings Data Governance into the Light”.

Preliminary Data Taxonomy at ingestion. An opportunity for Computational Storage

Data governance has been on my mind a lot lately. With all the incessant talk and hype about Artificial Intelligence, the true value of AI comes from good data. Therefore, it is vital for any organization embarking on their AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion, the data source where data is either created or acquired, and then presented to the processing workflow and data pipelines for AI training and onwards to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these 3 topics together – data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy in post-ingestion

I see that data, any data, has to arrive at a repository first before it is given meaning, context and specifications. These requirements are different from file permissions, ownerships, and ctime and atime timestamps. The content of the ingested data stream is made to fit into the mould of the repository the data is written to. Metadata about the content of the data gives the data meaning, context and, most importantly, value as it is used within the data lifecycle. However, the metadata tagging and the data preparation in the ETL (extract, transform, load) or the ELT (extract, load, transform) process are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive in terms of resources, time and currency.

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in the burgeoning times of open table formats (Apache Iceberg, Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON, et al.), the format specifications with added context and meanings are added in and augmented post-ingestion.

Making Immutability the key factor in a Resilient Data Protection strategy

We often hear the term “Cyber Resilience” thrown around these days. Every backup vendor has a cybersecurity play nowadays. Many have morphed into cyber resilience warrior vendors, and there is a great amount of validation in terms of Cyber Resilience in a data protection world. Don’t believe me?

Check out this Tech Field Day podcast video from a month ago, where my friends, Tom Hollingsworth and Max Mortillaro discussed the topic meticulously with Krista Macomber, who has just become the Research Director for Cybersecurity at The Futurum Group (Congrats, Krista!).

Cyber Resilience, as well articulated in the video, is not old wine in a new bottle. The data protection landscape has changed so significantly since the emergence of cyber threats and ransomware that it warrants the coining of the Cyber Resilience terminology.

But I want to talk about one very important cog in the data protection strategy, of which cyber resilience is a part. That is Immutability, because it is super important to always consider immutable backups as part of that strategy.

It is no longer 3-2-1 anymore, Toto. 

When it comes to backup, I always start with the 3-2-1 backup rule. 3 copies of the data; 2 different media; 1 offsite. This rule has been ingrained in me since the day I entered the industry over 3 decades ago. It is still the most important opening line for a data protection specialist or a solution architect. 3-2-1 is table stakes.
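For those who like to see a rule as a check, here is a minimal sketch of the 3-2-1 rule exactly as stated above: at least 3 copies, on at least 2 different media, with at least 1 offsite. The `BackupCopy` class, its field names and the sample inventory are my own illustration, not any backup product’s data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    label: str
    media: str     # e.g. "disk", "tape", "object" (illustrative values)
    offsite: bool  # True if the copy is kept away from the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule as described in the text."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory: production data, a local backup, and a tape offsite.
inventory = [
    BackupCopy("production", media="disk", offsite=False),
    BackupCopy("local-backup", media="disk", offsite=False),
    BackupCopy("vaulted-tape", media="tape", offsite=True),
]

print(satisfies_3_2_1(inventory))       # all three conditions are met
print(satisfies_3_2_1(inventory[:2]))   # only 2 copies, 1 media, none offsite
```

Whether the production data counts as one of the 3 copies is a convention; I have counted it here, which is the common reading of the rule.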

Yet, over the years, the cybersecurity threat landscape has moved closer and closer to the data protection, backup and recovery realm. This is now a merged super-segment, a Pangaea called cyber resilience. With it, the conversation about the 3-2-1 backup rule has, in these last few years, been evolving into something like the 3-2-1-1-0 backup rule, a modern take on the 3-2-1 backup rule. Let’s take a look at the 3-2-1-1-0 rule (simplified by me).

The 3-2-1-1-0 Backup rule (Credit: https://www.dataprise.com/services/disaster-recovery/baas/)

Data Trust and Data Responsibility. Where we should be at before responsible AI.

Last week, there was a press release by Qlik™ announcing a sponsored study by TechTarget®’s Enterprise Strategy Group (ESG) on the state of responsible AI practices across industries. The study highlighted critical gaps in the approach to responsible AI, ethical AI practices and AI regulatory compliance. From the study, Qlik™ emphasizes having a solid data foundation. To get to that bedrock foundation, we must trust the data and we must be responsible for the kinds of data that build that foundation. Hence, Data Trust and Data Responsibility.

There is an AI boom right now. Last year alone, the AI machine and its hype added USD$2.4 trillion in market cap to US tech companies. 5 months into 2024, AI is still supernova hot. And many are very much fixated on the infallible fables and tales of AI’s pompous splendour. It is with this blind faith that I see many users and vendors alike sidestepping the realities of AI in its present state.

AI is not always responsible. That raises the question, “Are we really working with a responsible set of AI applications and ecosystems?”

Responsible AI. Are we there yet?

AI still hallucinates, unfortunately. How AI applications come to a conclusion and a recommended decision is not always transparent. What if you had a conversation with ChatGPT and it said that you were dead? Well, that was exactly what happened when Tom’s Guide writer, Tony Polanco, found out from ChatGPT that he had passed away in September 2021.

NIST CSF 2.0 brings Data Governance into the light

In the past weekend, I watched a CNA Insider video delving into Data Theft in Malaysia. It is titled “Data Theft in Malaysia: How your personal information may be exploited | Cyber Scammed”.

You can watch the 45-minute video below.

Such dire news is nothing new. We Malaysians are numb to those telemarketers calling and messaging to offer their credit card services, loans and health spa services. You name it; there is something to sell. Of course, these “services” are mostly innocuous, but in recent years, the scams have risen several notches in form and severity. The levels of sophistication, the impacts, and the damages (counting financial and human casualties) have rocketed exponentially. Along with the news, mainstream and others, the levels of awareness of and interest in data, especially PII (personally identifiable information), among Malaysians are at their highest yet.

Yet the data theft continues unabated. Cybersecurity Malaysia (CSM), just last week, reported a 1,192% jump in data theft cases in Malaysia in 2023. In older news from last year, cybersecurity firm Surfshark ranked Malaysia as the 8th most breached country in Q3 of 2023.