The AI Platformization of Storage – The Data Intelligence Platform

The IT industry uses the word “platform” all the time. I often find myself shifting loosely between the many pieces of jargon circling “platform”, and I am pretty sure many others are doing the same.

I finally found the word “platformization” giving off the right vibes in a meaningful way in February last year, when Palo Alto Networks pivoted to platformization. Their stock tumbled that day. Despite the ambiguous definition of “platformization” when Palo Alto Networks (PANW) announced it, I understood their strategy.

Defence-in-Depth in cybersecurity wasn’t exactly working for many organizations. Cybersecurity point solutions peppered the landscape, and there were so many leaks and gaps. Platformization, from PANW’s point of view, is the reverse C&C (command & control), if you know the cybersecurity speak. PANW wants to take charge, end to end, of all things cybersecurity, and that made sense to me from a data perspective.

Paradigm shift for Data. 

For the longest time, networked storage technology has been about data sharing, be it blocks, files or objects. The data from these protocols is delivered over the network, mostly over Fibre Channel and/or Ethernet (although I remember implementing NFS over Asynchronous Transfer Mode at Sarawak Shell in East Malaysia), in a client-server fashion.

By the late 2000s, unified storage or multi-protocol storage (where the storage array is able to serve all three SAN, NAS and S3 services) was all the rage. All the prominent enterprise storage vendors had a solution or two in their portfolios. I started viewing networked storage as a Data Services Platform, a view I began explaining in 2017. Within the data services platform, various features revolve around my A.P.P.A.R.M.S.C. framework (I crafted the initial framework in 2000, thanks to Jon Toigo‘s book – The Holy Grail of Data Management). This framework and the approach I used for my consulting and analyst work have served me well and are still relevant, even after 25 years.

But AI is changing the data landscape. AI is changing the way data is consumed and processed across the networks between the compute layer and the storage layer. It is indeed, for me, a paradigm shift for data, and the storage layer, now better known as AI Data Infrastructure, is shifting as well. This shift will accelerate the exponential growth of innovation, with AI and super-charged data leading the way.

DDN Infinia Data Intelligence Platform (screencapture from DDN Beyond Artificial webinar)

Continue reading

Preliminary Data Taxonomy at ingestion. An opportunity for Computational Storage

Data governance has been on my mind a lot lately. With all the incessant talk and hype about Artificial Intelligence, the true value of AI comes from good data. Therefore, it is vital for any organization embarking on its AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion, the source where data is created or acquired before it is presented to the processing workflows and data pipelines for AI training and, onwards, to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these 3 topics together – data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy post-ingestion

I see that data, any data, has to arrive at a repository first before it is given meaning, context and specifications. These attributes are different from file permissions, ownership and ctime/atime timestamps; the content of the ingested data stream is made to fit into the mould of the repository the data is written to. Metadata about the content of the data gives the data meaning, context and, most importantly, value as it is used within the data lifecycle. However, the metadata tagging, and the preparation of the data in the ETL (extract, transform, load) or ELT (extract, load, transform) process, are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive, in terms of resources, time and currency.

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in the burgeoning times of open table formats (Apache Iceberg, Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON, et al.), the format specifications with added context and meaning are added in and augmented post-ingestion.
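To make the argument concrete, here is a minimal sketch of what tagging at ingestion could look like if the ingest layer (or a computational storage device sitting in the data path) wrote preliminary taxonomy into the file itself. It uses PyArrow’s Parquet key-value metadata; the tag names and values are my own hypothetical examples, not any standard.

```python
# A minimal sketch of preliminary taxonomy tagging at ingestion time.
# The taxonomy keys/values below are hypothetical illustrations, not a standard.
import pyarrow as pa
import pyarrow.parquet as pq

# Pretend this is the ingested data stream, already parsed into columns
table = pa.table({
    "sensor_id": ["p-001", "p-002"],
    "reading": [42.7, 39.1],
})

# Attach content metadata (taxonomy, classification) as Parquet key-value
# metadata at write time, instead of waiting for a post-ingestion ETL/ELT pass
taxonomy_tags = {
    "taxonomy.domain": "plant-telemetry",
    "taxonomy.sensitivity": "internal",
    "taxonomy.retention": "7y",
}
table = table.replace_schema_metadata({
    **(table.schema.metadata or {}),
    **{k.encode(): v.encode() for k, v in taxonomy_tags.items()},
})

pq.write_table(table, "readings.parquet")

# Any downstream engine that reads the schema sees the tags immediately
print(pq.read_schema("readings.parquet").metadata)
```

The point is not the library; it is that the classification travels with the data from the moment of ingestion, rather than being bolted on later at ETL/ELT expense.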

Continue reading

Data Trust and Data Responsibility. Where we should be at before responsible AI.

Last week, there was a press release by Qlik™, announcing a study it sponsored with TechTarget®‘s Enterprise Strategy Group (ESG) on the state of responsible AI practices across industries. The study highlighted critical gaps in the approach to responsible AI, ethical AI practices and AI regulatory compliance. From the study, Qlik™ emphasizes having a solid data foundation. To get to that bedrock foundation, we must trust the data and we must be responsible for the kinds of data that build that foundation. Hence, Data Trust and Data Responsibility.

There is an AI boom right now. Last year alone, the AI machine and its hype added USD$2.4 trillion of market cap to US tech companies. 5 months into 2024, AI is still supernova hot. And many are very much fixated on the infallible fables and tales of AI’s pompous splendour. It is with this blind faith that I see many users and vendors alike sidestepping the realities of AI as it stands today.

AI is not always responsible. So it begs the question: “Are we really working with a responsible set of AI applications and ecosystems?”

Responsible AI. Are we there yet?

AI still hallucinates, unfortunately. How an AI application arrives at a conclusion and a recommended decision is not always transparent or known. What if you had a conversation with ChatGPT and it said that you were dead? Well, that was exactly what happened when Tom’s Guide writer Tony Polanco found out from ChatGPT that he had passed away in September 2021.

Continue reading

Unstructured Data Observability with Datadobi StorageMAP

Let’s face it. Data is bursting through its storage seams. And every organization is now storing so much data that they don’t know what they have.

IDC predicts that by 2025, 80% of the world’s data will be unstructured. IDC‘s Global DataSphere Forecast 2021-2025 report expects global data creation and replication to expand to 181 zettabytes, an unfathomable figure. Organizations are inundated. They struggle with data growth, with little understanding of what data they have, where the data is residing, what to do with the data, and how to manage the voluminous data deluge.

The simple knee-jerk action is to store it in cloud object storage where the price of storage is $0.0000xxx/GB/month. But many IT departments in these organizations overlook the fact that the data they have parked in the cloud requires movement between the cloud and on-premises. I have been involved in numerous discussions where the customers realized that the data they had parked in the cloud moved too frequently. Often it was an error of judgement or short-term blindness (blinded by the cheap storage costs, no doubt), further exacerbated by the pandemic. These oversights have resulted in expensive and painful monthly API calls and egress fees. Welcome to reality. Suddenly the cheap cloud storage doesn’t sound so cheap after all.
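To illustrate the gap between the sticker price and the bill, here is a back-of-the-envelope sketch. Every rate and volume below is a hypothetical placeholder; substitute your provider’s actual pricing and your own access patterns.

```python
# Back-of-the-envelope illustration of why "cheap" cloud object storage can
# surprise you. All rates below are hypothetical placeholders.
capacity_gb      = 100_000     # 100 TB parked in cloud object storage
storage_rate     = 0.004       # $/GB/month, archive-class tier (hypothetical)
egress_rate      = 0.09        # $/GB transferred back on-premises (hypothetical)
request_rate     = 0.0004      # $ per 1,000 GET/LIST API calls (hypothetical)

monthly_storage  = capacity_gb * storage_rate                 # $400
monthly_egress   = 0.20 * capacity_gb * egress_rate           # 20% pulled back: $1,800
monthly_requests = (5_000_000 / 1_000) * request_rate         # 5M API calls: $2

total = monthly_storage + monthly_egress + monthly_requests
print(f"storage ${monthly_storage:,.0f} | egress ${monthly_egress:,.0f} | "
      f"API ${monthly_requests:,.0f} | total ${total:,.0f}")
```

Even with made-up numbers, the pattern is the familiar one: the storage line item is dwarfed by egress the moment the data has to come home.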

The same can be said about storing non-active unstructured data on primary storage. Many organizations have not been disciplined in practising good data management. The primary Tier 1 storage becomes bloated over time, grinding sluggishly as the data capacity grows. I/O processing becomes painfully slow and backups take longer and longer. Sound familiar?

The A in ABC

I brought up the ABC mantra a few blogs ago. A is for Archive First. It is part of my data protection consulting practice conversation repertoire, and I use it often to advise IT organizations to be smart with their data management. Before archiving (some folks like to call it tiering, but I am not going down that argument today), we must know what to archive. We cannot blindly send all sorts of junk data to the secondary or tertiary storage premises. If we do that, it is akin to digging another hole to fill up the first hole.

We must know which unstructured data to move, replicate or sync from the Tier 1 storage to a second (or third) less taxing storage premises. We must be able to see this data, observe its behaviour over time, and decide the best data management practice to apply to it. Take note that I said best data management practice and not best storage location in the previous sentence. There has to be a clear distinction that a data management strategy is more prudent than a “best” storage premises. The reason is that many organizations ignorantly think the best storage location (the thought of the “cheapest” always seems to creep up) is a good strategy, while ignoring the fact that data is like water. It moves from premises to premises, from on-prem to cloud, from cloud to other clouds. Data mobility is a variable in data management.

Continue reading

Backup – Lest we forget

World Backup Day – March 31st

Last week was World Backup Day. It falls on March 31st every year so that you don’t lose your data and become an April Fool the next day.

Amidst the growing awareness of the importance of backup, no thanks to the ever growing destructive nature of ransomware, it is important to look into other aspects of data protection as well – from both a data backup/recovery and a data security point of view.

3-2-1 Rule, A-B-C and Air Gaps

I highlighted the basic 3-2-1 rule before – at least 3 copies of the data, on 2 different media, with 1 copy offsite. This must always be paired with a set of practised processes and policies that cultivate, among all stakeholders (aka the people) in the organization, an understanding of the importance of protecting the data and ensuring data recoverability.
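For the record, here is a toy sketch of the 3-2-1 rule expressed as a checklist. The copy records are hypothetical examples, not a prescription of where your copies should live.

```python
# A toy sketch of the 3-2-1 rule as a checklist: at least 3 copies of the
# data, on at least 2 different media types, with at least 1 copy offsite.
# The copy records below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class Copy:
    media: str        # e.g. "disk", "tape", "object"
    offsite: bool

def meets_3_2_1(copies: list[Copy]) -> bool:
    return (
        len(copies) >= 3                              # 3 copies
        and len({c.media for c in copies}) >= 2       # 2 different media
        and any(c.offsite for c in copies)            # 1 copy offsite
    )

copies = [
    Copy(media="disk", offsite=False),    # production copy on Tier 1
    Copy(media="disk", offsite=False),    # local backup repository
    Copy(media="tape", offsite=True),     # vaulted / air-gapped copy
]
print(meets_3_2_1(copies))   # True
```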

The A-B-C is to look at the production dataset and decide if the data should stay in the Tier 1 storage. In most cases, the data becomes less active over time, and these datasets may be good candidates to be archived. Once archived, the production dataset is smaller, and data backup operations become lighter and faster, with positive knock-on effects as well.

Air gaps have returned to prominence with the heightened threats to data in recent years. The threats have pushed organizations to consider keeping data offsite and offline with air gaps. Cost considerations and speed of recovery can be of concern, and logical air gaps are also gaining traction as an acceptable extra layer of data protection.

Backup is not total Data Protection cyberdefence

If we view data protection more holistically and comprehensively, backup (and recovery) is not the total data protection solution. We must ignore the fancy rhetoric of the technology marketers claiming that backup is the solution that ensures data protection, because there is much more to it than that.

The well-respected NIST (National Institute of Standards and Technology) Cybersecurity Framework places Recover (along with backup) as the last of its five functions – Identify, Protect, Detect, Respond and Recover.

NIST Cybersecurity Framework

Continue reading

Right time for Andrew. The Filesystem that is.

I couldn’t hold back my excitement when I discovered Auristor® early last week. I stumbled upon this ComputerWeekly article “Want to side step Public Cloud? Auristor® offers global file storage.” Given the amount of news not exactly praising the public cloud storage vendors nowadays, the article’s title caught my attention. Immediately, Andrew File System (AFS) was there. I was perplexed at first because I had never seen or heard of a commercial version of AFS before. This news gave me goosebumps.

For the curious, I am sure many will ask: who is this Andrew anyway? And what is my relationship with this Andrew?

One time with Andrew

A bit of my history. I recall quite vividly helping Intel in Penang, Malaysia to implement their globally distributed file caching mechanism with the NetApp® filer’s NFS. It was probably 2001, and I believe Intel wanted to share their engineering computing (EC) files between their US facilities and the Intel Penang Design Center (PDC). As I worked alongside the Intel folks, I found out that this distributed file caching technology was called the Andrew File System (AFS).

Although I can’t really recall how the project went, I remember it being a bed of bugs at that time. But being the storage geek that I am, I obviously took some time to get to know Andrew the File System. 20 years have gone by, and I never really thought of AFS coming out as a commercial solution, or even knew of it as one, until Auristor®.

Auristor Logo

Continue reading

Falconstor Software Defined Data Preservation for the Next Generation

Falconstor® Software is gaining momentum. Given its arduous climb back to the fore, it is beginning to soar again.

Tape technology and Digital Data Preservation

I mentioned that long-term digital data preservation is a segment within the data lifecycle which has merits and prominence. SNIA® has shown through its 2007 and 2017 “100 Year Archive” surveys that this is a strong, growing market segment. The 3 critical challenges of this long, long-term digital data preservation are to keep the archives

  • Accessible
  • Undamaged
  • Usable

For the longest time, tape technology has been the king of the hill for digital data preservation. The technology is cheap and mature, and many enterprises have built their long-term strategies around it. And the pulse of the tape technology market is still very healthy.

The challenges of tape remain. Every 5 years or so, companies have to consider moving the data on the existing tape technology to the next generation. It is widely known that LTO drives can read tapes of the previous 2 generations and write to tapes of the previous generation. This tape transcription process of migrating digital data for the sake of data preservation is bad because it puts the structural integrity and quality of the content of the data at risk.
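As a quick illustration of why the migration treadmill never stops, here is a small sketch of that traditional LTO compatibility rule. The drive and tape generations chosen are just examples.

```python
# A small sketch of the traditional LTO compatibility rule described above:
# a drive reads tapes up to two generations older and writes tapes one
# generation older (newer drives such as LTO-8/9 have narrowed this further).
def lto_compatibility(drive_gen: int, tape_gen: int) -> str:
    if tape_gen in (drive_gen, drive_gen - 1):
        return "read/write"
    if tape_gen == drive_gen - 2:
        return "read-only"
    return "incompatible"

# An LTO-7 drive against an ageing tape library
for tape in (7, 6, 5, 4):
    print(f"LTO-{tape} tape in LTO-7 drive: {lto_compatibility(7, tape)}")
# LTO-7: read/write, LTO-6: read/write, LTO-5: read-only, LTO-4: incompatible
```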

In my time covering Oil & Gas subsurface data management, I have seen NOCs (national oil companies) with 500,000 tapes of all generations, from 1/2″ to DDS, DAT to SDLT, 3590 to LTO 1-7. Millions are spent to transcribe these tapes every few years, and we have folks like Katalyst DM, Troika and more hovering over this landscape for their fill.

Continue reading

Commvault coming all together

[Disclosure: I was invited by Commvault as a Media person and Social Ambassador to their Commvault GO 2019 Conference, and was also a Tech Field Day eXtra delegate, from Oct 13-17, 2019 in Denver, CO, USA. My expenses, travel, accommodation and conference fees were covered by Commvault, the organizer, and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views.]

This trip to the Commvault GO conference was pretty much a mission to find answers about their Hedvig acquisition just a month earlier. It was an unprecedented move for Commvault and I, as an industry observer and pundit, took the news positively. I wrote in my blog about Commvault’s big bet, and I liked the boldness of their approach.

But the news did not bode well back here in Malaysia. The local technology news portal, Data Storage Asean, picked up the news in a rather unconvinced tone. Two long-time Commvault partners I spoke to were obviously unhappy because the acquisition made little sense to them, coming on the back of the closure of the Commvault Malaysia office just weeks before, with more unsettling rumours swirling around the Commvault team in Asia Pacific. The broken trust, and the fear of what the future held for the Commvault customers in Malaysia and in the region, were riding along with me on this trip.

But I have seen the beginning of the Commvault transformation from the Commvault GO conferences I have attended since 2017. This is my 3rd Commvault GO and I ended Day 1 with good vibes.

Here were some of my highlights from the first day. Continue reading