iRODS Archives - Storage Gaga

Preliminary Data Taxonomy at ingestion. An opportunity for Computational Storage

By cfheoh | June 3, 2024 - 8:18 am |June 3, 2024 Algorithm, Analytics, API, Artificial Intelligence, Big Data, Cloud, Clusters, Composable Infrastructure, compression, Containers, Data Fabric, Data Governance, Data Management, Data Security, Deduplication, Deep Learning, Digital Transformation, eDiscovery, IBM, ILM, iRODS, Machine Learning, Scaleflux, Solid State Devices

Data governance has been on my mind a lot lately. With all the incessant talk and hype about Artificial Intelligence, it is easy to forget that the true value of AI comes from good data. Therefore, it is vital for any organization embarking on its AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion, where data is either created or acquired and then presented to the processing workflows and data pipelines for AI training and onwards to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these 3 topics together: data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy in post-ingestion

I see that data, any data, has to arrive at a repository first before it is given meaning, context and specifications. These requirements are different from file permissions, ownership, and ctime and atime timestamps; the content of the ingested data stream is made to fit the mould of the repository the data is written to. Metadata about the content of the data gives the data meaning, context and, most importantly, value as it is used within the data lifecycle. However, the metadata tagging and the preparation of the data in the ETL (extract, transform, load) or ELT (extract, load, transform) process are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive in terms of resources, time and currency.
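To make "content metadata" concrete, here is a minimal sketch of the kind of tagging and classification described above, applied to a payload as it arrives. Everything in it is an assumption for illustration: the `IngestRecord` schema, the `TAG_RULES` keyword table and the `classify_at_ingest` function are hypothetical and do not come from any product or from iRODS.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IngestRecord:
    """Content metadata captured for one ingested object (hypothetical schema)."""
    name: str
    size_bytes: int
    sha256: str
    media_type: str
    ingested_at: str
    tags: list = field(default_factory=list)


# Hypothetical keyword-to-tag rules; a real taxonomy would be far richer.
TAG_RULES = {"invoice": "finance", "patient": "healthcare", "sensor": "telemetry"}


def classify_at_ingest(name: str, payload: bytes) -> IngestRecord:
    """Build a metadata record from the payload while it is being written."""
    media_type, _ = mimetypes.guess_type(name)
    text = payload.decode("utf-8", errors="ignore").lower()
    # Tag the object with every taxonomy label whose keyword appears in it.
    tags = sorted({tag for keyword, tag in TAG_RULES.items() if keyword in text})
    return IngestRecord(
        name=name,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        media_type=media_type or "application/octet-stream",
        ingested_at=datetime.now(timezone.utc).isoformat(),
        tags=tags,
    )


record = classify_at_ingest("q2.csv", b"invoice_id,amount\n1001,250.00\n")
print(record.media_type, record.size_bytes, record.tags)
```

The point of the sketch is only that the record is produced in the same pass that writes the data, rather than in a separate preparation phase afterwards.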

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in the burgeoning times of open table formats (Apache Iceberg, Apache Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON, et al.), the format specifications, with their added context and meaning, are added and augmented post-ingestion.

And great AI starts with good Data Management

By cfheoh | November 20, 2023 - 7:00 am |November 15, 2023 Algorithm, Analytics, API, Artificial Intelligence, Backup, Big Data, Business Continuity, Composable Infrastructure, Containers, Data, Data Archiving, Data Management, Data Protection, Data Security, Deep Learning, Digital Transformation, Disaster Recovery, DMTF, iRODS, Machine Learning, Object Storage, Reliability, Security, Software-defined Datacenter, Storage Tiering, Uncategorized, Virtualization

Processing data has become more expensive.

Somewhere, there is a misconception that data processing is cheap. That stems from the well-known pricing of public cloud storage capacity, which runs to fractions of a cent per month. But data in storage has to be worked upon, and has to be built up and protected to increase its value. Data has to be processed, moved, shared, and used by applications. Data induces workloads. Nobody stores data forever, never to use it again. Nobody buys storage for capacity alone.

We have a great saying in the industry: no matter where the data moves, it will land in storage. So, it is clear that data does not exist in the ether. And yet, I often see how little attention, prudence and care is given to data infrastructure and data management technologies, the very components that are foundational to great data.

Great data management for Great AI

AI is driving up costs in data processing

A few recent articles drew my focus to the cost of data processing.

Here is one posted by a friend on Facebook. It is titled “The world is running out of data to feed AI, experts warn.”

My first reaction was, “How can we run out of data?” We have so much data in the world today that the 175 zettabytes predicted by IDC for 2025 might be grossly inaccurate. According to Exploding Topics, it is estimated that we create 328.77 million terabytes of data per day, or about 120 zettabytes per year. While I cannot vouch for the accuracy of the numbers, the numbers are humongous.
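As a sanity check on the units, 120 zettabytes spread over 365 days works out to roughly 328.77 million terabytes (about a third of a zettabyte) per day. The short calculation below assumes decimal (SI) units and a 365-day year; it only checks the arithmetic and says nothing about the accuracy of the underlying estimate.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
ZB = 10**21

yearly_zb = 120                      # yearly estimate quoted in the text
daily_bytes = yearly_zb * ZB / 365   # assumes a 365-day year
daily_million_tb = daily_bytes / TB / 1e6

print(f"{daily_million_tb:.2f} million TB per day")
```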

Truthful information under attack. The call for Data Preservation

By cfheoh | July 11, 2022 - 8:00 am |July 10, 2022 Algorithm, Artificial Intelligence, Backup, Cloud, Data, Data Archiving, Data Availability, Data Management, Data Privacy, Data Protection, Decentralized Storage, Digital Transformation, iRODS, Reliability, Security, SNIA, Tape storage

The slogan of The Washington Post is “Democracy Dies in Darkness”. Although not everyone agrees with the US brand of democracy, the altruism in the slogan of WaPo (the publication’s informal name) is powerful. The venerable newspaper remains a beacon in the US as one of the most trustworthy sources of truthful, honest information.

4 Horsemen of Apocalypse with the 5th joining

Misinformation

Misinformation has become a clear and present danger to humanity. Fake news, misleading information and lies are fueling and propelling the propaganda and agenda of the powerful (and the deranged). Facts are blurred, obfuscated, and even removed and replaced with misinformation, with undesirable effects that will be felt by present and future generations.

The work of SNIA®

Data preservation is part of Data Management. More than a decade ago, SNIA® had already set up a technical work group (TWG) on Long Term Retention and proposed a format for the long-term storage of digital information. It was called SIRF (Self-contained Information Retention Format). In the words of SNIA®, “The SIRF format enables long-term physical storage, cloud storage and tape-based containers effective and efficient ways to preserve and secure digital information for many decades, even with the ever-changing technology landscape.”

I don’t think battling misinformation was SNIA®’s original intent, but a vendor-neutral organization such as this, presenting and promoting long-term data preservation, is needed more than ever. The need to protect the truth is paramount.

SNIA® continues to work with many organizations to create and grow the ecosystem for long term information retention and data preservation.

NFTs can save data

Despite the hullabaloo around NFTs (non-fungible tokens), a technology very much soiled and discredited by present-day cryptocurrency speculation, I view data (and metadata) preservation as a strong use case for NFTs. The action is to digitalize data into an NFT asset.

Here are a few arguments:

NFTs are unique. Once they are verified and inserted into the blockchain, they are immutable. They cannot be modified, and each blockchain transaction is created with a hash value that is never to be replicated.
NFTs are decentralized. Most of the NFTs we know of today are minted via a decentralized process. This means that the powerful cannot (most of the time) affect the state of the NFTs according to their whims and fancies, unless the perpetrators know how to mount a Sybil attack on the blockchain.
NFTs are secure. I have to state up front that NFTs in themselves are mostly very secure. Most of the high-profile incidents related to NFTs have more to do with internal authentication vulnerabilities and phishing, the result of poor security housekeeping and hygiene among the participants.
NFTs represent authenticity. The digital certification of an NFT as a data asset also defines its ownership and originality. The record of provenance is present and accounted for.
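The immutability argument rests on hashing: each entry carries a hash that covers both its own content and the hash of the entry before it, so altering any record is detectable. The toy hash chain below is a sketch of that idea only. It is not a real blockchain or NFT implementation (there is no consensus, signing or minting), and the function names `block_hash`, `build_chain` and `is_valid` are hypothetical.

```python
import hashlib
import json


def block_hash(index: int, payload: str, prev_hash: str) -> str:
    """Hash a block's content together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link payloads into a list of blocks, each pointing at its predecessor."""
    chain, prev = [], "0" * 64  # all-zero hash marks the start of the chain
    for i, payload in enumerate(payloads):
        digest = block_hash(i, payload, prev)
        chain.append({"index": i, "payload": payload, "prev": prev, "hash": digest})
        prev = digest
    return chain


def is_valid(chain) -> bool:
    """Recompute every hash and check every link back to the start."""
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev:
            return False
        if block_hash(block["index"], block["payload"], block["prev"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True


chain = build_chain(["record A", "record B", "record C"])
print(is_valid(chain))                      # chain as originally written
chain[1]["payload"] = "record B (altered)"  # tamper with one record
print(is_valid(chain))                      # validation after tampering
```

In a real decentralized ledger, many independent parties hold copies of the chain, which is what makes quietly recomputing the hashes after a change impractical.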

Since NFTs started as a technology to authenticate the assets and artifacts of the creative industry, there are already a few organizations playing this role. Orygin Art is one that I found intriguing. Museums are also beginning to explore the potential of NFTs, including validating and verifying the origins of many historical artifacts, and digitizing these physical assets to preserve their value forever.

The technology behind NFTs is not without its weaknesses, but knowing what we know today, the potential is evident and the power of the technology has yet to be explored fully. It presents a strong case for preserving the integrity of truthful data, and of data as historical artifacts.

Protect data safety and data integrity

Misinformation is damaging. Regardless of whether we believe in the Butterfly Effect or not, misinformation can cause a ripple effect that could turn into a tidal wave. We need to uphold the sanctity of Truth, and continue to protect data safety and data integrity. The world is already damaged, and it will be damaged even more if we allow misinformation to permeate the fabric of global societies. Unfortunately, we may be welcoming a dystopian future.

This blog hopes to shake up the nonchalance with which we view “information” and “misinformation” today. There is a famous quote that says, “Repeat a lie often enough and it becomes the truth”. We must lead the call to combat misinformation. What we do now will shape present and future generations. Preserve Truth.

WaPo “Democracy Dies in Darkness”

[ Condolence: Japan’s former Prime Minister, Shinzo Abe, was assassinated last week. News sources mentioned that the man who killed him had information that the slain former PM had ties to a religious group that bankrupted his mother. Misinformation may have played a role in the killing of the Japanese leader. ]

Is there no end to the threat of ransomware?

By cfheoh | June 20, 2022 - 8:00 am |June 20, 2022 Appliance, Artificial Intelligence, Backup, Business Continuity, Cloud, Cohesity, Data, Data Archiving, Data Availability, Data Management, Data Privacy, Data Protection, Data Security, Digital Transformation, Disaster Recovery, Druva, Filesystems, HDS, Hitachi Vantara, ILM, iRODS, Racktop Systems, Reliability, Rubrik, SASE, Security, Sophos, Storage Field Day, Tech Field Day

I find it blasphemous that with all the rhetoric of data protection and cybersecurity technologies and solutions in the market today, ransomware threats and damages have grown proportionately larger each year. In a recent report released by Kaspersky on Anti-Ransomware Day, May 12th, 9 out of 10 organizations previously attacked by ransomware said they would be willing to pay again if attacked again. A day before my scheduled talk in Surabaya, East Java, 2 weeks back, the chatter through the grapevine was that one bank in Indonesia had been attacked by ransomware on that day. This news shows how virulent and dangerous the ransomware scourge is and has become.

And the question that everyone wants an answer to is … why are ransomware threats getting bigger and more harmful, and why are there no solutions to them?

Digital transformation and its data are very attractive targets

Today, all we hear from the data protection and storage vendors is recovery, restore that data, blah, blah, blah and more blah, blah, blahs. The EDR (endpoint detection and response) vendors say their solutions can stop it; the cybersecurity experts preach defense in depth; and the network security guys say use perimeter fencing. And the anti-phishing chaps say more awareness and education are required. None of these, alone or together, has worked effectively these few years. Ransomware’s threats and damages are getting worse. Why?

How well do you know your data and the storage platform that processes the data

By cfheoh | December 20, 2021 - 7:30 am |December 20, 2021 100Gigabit Ethernet, 10Gigabit Ethernet, Algorithm, Analytics, Appliance, Backup, Big Data, Business Continuity, Cloud, Clusters, Composable Infrastructure, compression, Confluent, Data Archiving, Data Availability, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deduplication, Digital Transformation, Disaster Recovery, Edge Computing, Filesystems, Hyperconvergence, ILM, Industry 4.0, InfluxDB, iRODS, Machine Learning, NAS, NFS, NVMe, Object Storage, Performance Caching, Pravega, Reliability, SATA, Scale-out architecture, Security, Software Defined Storage, Storage Area Network, Storage Optimization, Storage Tiering, Unified Storage, VDI, Virtualization

Last week was consumed by many conversations on this topic. I was quite jaded, really. Unfortunately, many still take a very simplistic view of storage technology, or should I say the over-marketing of storage technology. So much so that end users make incredible assumptions about the benefits of a storage array, a software defined storage platform or even cloud storage. And too often, the caveats of turning on a feature and tuning a configuration to the max are discarded or neglected. Regard for good storage and data management best practices? What’s that?

I share some of my thoughts on handling conversations like these, and on setting the right expectations rather than overhyping a feature or a function in the data storage services.

Complex data networks and the storage services that serve it

I/O Characteristics

Applications and workloads (A&W) read from and write to the data storage services platforms. These could be local DAS (direct-attached storage), network storage arrays in SAN and NAS, and now object storage, or cloud storage services. Regardless of whether the data is structured or unstructured, different A&Ws have different behavioural I/O patterns in accessing data from storage. Therefore, storage has to be configured to match these patterns as closely as possible, so that it can perform optimally for these A&Ws. Without going into deep details, here are a few to think about:

Random and Sequential patterns
Block sizes of these A&Ws, typically ranging from 4K to 1024K
Causal effects of synchronous and asynchronous I/Os to and from the storage
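The first two characteristics in the list above can be illustrated without touching a disk. The sketch below generates the request offsets an application would issue against a file under a sequential and a random access pattern, at a few block sizes in the 4K to 1024K range, and measures how many requests start exactly where the previous one ended. The 64 MiB file size, the fixed random seed and all function names are assumptions chosen for illustration; it models the shape of the I/O stream, not storage performance.

```python
import random


def sequential_offsets(file_size: int, block_size: int):
    """Offsets for reading the file front to back, one block at a time."""
    return list(range(0, file_size - block_size + 1, block_size))


def random_offsets(file_size: int, block_size: int, seed: int = 0):
    """The same blocks as the sequential case, visited in shuffled order."""
    offsets = sequential_offsets(file_size, block_size)
    random.Random(seed).shuffle(offsets)
    return offsets


def contiguous_fraction(offsets, block_size: int) -> float:
    """Share of requests that start exactly where the previous one ended."""
    if len(offsets) < 2:
        return 1.0
    hits = sum(1 for a, b in zip(offsets, offsets[1:]) if b == a + block_size)
    return hits / (len(offsets) - 1)


FILE_SIZE = 64 * 1024 * 1024  # 64 MiB, an arbitrary illustrative size

for block_kib in (4, 64, 1024):
    bs = block_kib * 1024
    seq = sequential_offsets(FILE_SIZE, bs)
    rnd = random_offsets(FILE_SIZE, bs)
    print(
        f"{block_kib:>5} KiB blocks: {len(seq):>6} requests, "
        f"sequential contiguity {contiguous_fraction(seq, bs):.2f}, "
        f"random contiguity {contiguous_fraction(rnd, bs):.2f}"
    )
```

Smaller blocks mean many more requests for the same amount of data, and a random pattern removes almost all contiguity, which is why a configuration tuned for one pattern can behave very differently under another.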

Rethinking File Security Fundamentals

By cfheoh | May 24, 2021 - 9:00 am |May 24, 2021 Algorithm, Analytics, API, Artificial Intelligence, Business Continuity, Data Availability, Data Corruption, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, Disaster Recovery, eDiscovery, Filesystems, iRODS, Machine Learning, Object Storage, Snapshots, Virtualization

I took a week off blogging last week but the lazy days were inundated with bad news: a few more devastating ransomware attacks. This time, Colonial Pipeline in the US was hacked and its networks were shut down by ransomware. These ransomware threats are never ending, and they are getting more damaging than ever. It is like trying to plug a leaking boat with your hands, and more leaks appear as you plug them.

More ransomware news hitting healthcare around the world last week:

[ May 15, 2021 ] Ireland’s health service hit by ‘significant’ ransomware attack
[ May 20, 2021 ] Irish hospitals are latest to be hit by ransomware attacks
[ May 19, 2021 ] Ransomware attacks hit AXA’s Asia unit, New Zealand health provider
[ May 20, 2021 ] Ransomware attacks are spiking. Is your company prepared?
[ May 20, 2021 ] RansomCloud: It’s new, it’s here now and it’s coming to a server near you

We are forever chasing a solution and forever losing, because almost all technology defenses that protect data against ransomware are reactive. Why is ransomware still such a big threat then? Time to rethink file security fundamentals.

Data everywhere
