compression – Storage Gaga

Intelligent Data Movement and Data Placement dictate the future of AI Data Infrastructure

By cfheoh | July 29, 2025 - 7:48 am |July 29, 2025 100Gigabit Ethernet, Algorithm, Analytics, Artificial Intelligence, BeeGFS, Big Data, Big Switch Networks, Broadcom, compression, Computational Storage, Containers, CXL, Data Direct Networks, Data Management, DDN, Filesystems, Flash, Hammerspace, High Performance Computing, Machine Learning, NFS, nVidia, NVMe, Parallel NFS, Performance Benchmark, Performance Caching, pNFS, Scale-out architecture, Software Defined Storage, Storage Optimization, Storage Tiering, Vast Data, WekaIO

1 Comment

I read a couple of articles over the weekend, both of which began by blaming outdated networking infrastructure for slowing down AI ambitions. The 2 articles are:

The AI Infrastructure bottleneck no one talks about (which turned out to be a not-so-subtle play for Netris, a secure multi-tenant network provisioner technology).
Data Infrastructure: The missing link in successful AI adoption (a more subtle introduction of Indicium, an AI data services company).

I do not fully agree that networking infrastructure is the main inhibitor of AI ambitions per se, at least not from my own experience and from what I know so far of the present developments in high performance networking. In fact, AI networking infrastructure has been growing by leaps and bounds, laying down ultra-high throughput plumbing between the GPUs (and, by extension, up the stack to the AI models and applications) and the data storage infrastructure.

The NVIDIA-heavy GPU compute infrastructure is, of course, dominated by NVIDIA’s own networking infrastructure. NVIDIA Spectrum (Ethernet), Quantum (InfiniBand), BlueField (data processing units), ConnectX and LinkX are the mainstays of DGX Cloud, and a big part of the NVIDIA NCPs as well.

In fact, at one of DDN’s NCP customers, I have seen a 10-node DDN EXAscaler cluster deliver almost 1.1TB/sec read and 750GB/sec write throughput to the GPU compute cluster, out-of-the-box, all with 200Gbps networking gear.
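To put those figures in perspective, here is a rough back-of-envelope sketch. It assumes the load is spread evenly across the 10 nodes, uses decimal units, and ignores protocol overhead; none of these assumptions come from the deployment itself, and the function names are only illustrative.

```python
import math

# Back-of-envelope check of the quoted throughput figures.
# Assumptions (not from the source): even spread across nodes,
# decimal units (1 TB/s = 1000 GB/s), no protocol overhead.

def per_node_gb_per_sec(cluster_gb_per_sec: float, nodes: int) -> float:
    """Average throughput each node must sustain, in GB/s."""
    return cluster_gb_per_sec / nodes

def link_gb_per_sec(link_gbps: float) -> float:
    """Convert a link speed in gigabits/s to gigabytes/s."""
    return link_gbps / 8

def min_links_per_node(node_gb_per_sec: float, link_gbps: float) -> int:
    """Smallest whole number of links that can carry the per-node load."""
    return math.ceil(node_gb_per_sec / link_gb_per_sec(link_gbps))

read_per_node = per_node_gb_per_sec(1100, 10)   # 110.0 GB/s per node
write_per_node = per_node_gb_per_sec(750, 10)   # 75.0 GB/s per node

print(min_links_per_node(read_per_node, 200))   # 5 links of 200Gbps for reads
print(min_links_per_node(write_per_node, 200))  # 3 links of 200Gbps for writes
```

Under these assumptions, each node would need several 200Gbps ports working in parallel, which suggests the aggregate comes from many links per node rather than from faster individual links.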

Continue reading →

Preliminary Data Taxonomy at ingestion. An opportunity for Computational Storage

By cfheoh | June 3, 2024 - 8:18 am |June 3, 2024 Algorithm, Analytics, API, Artificial Intelligence, Big Data, Cloud, Clusters, Composable Infrastructure, compression, Containers, Data Fabric, Data Governance, Data Management, Data Security, Deduplication, deduplication, Deep Learning, Digital Transformation, eDiscovery, IBM, ILM, iRODS, Machine Learning, Scaleflux, Solid State Devices

Data governance has been on my mind a lot lately. With all the incessant talk and hype about Artificial Intelligence, the true value of AI comes from good data. Therefore, it is vital for any organization embarking on its AI journey to have good quality data. And the lifecycle of data in an organization starts at the point of ingestion, the source where data is either created or acquired before it is presented to the processing workflows and data pipelines for AI training and onwards to AI applications.

In biology, taxonomy is the scientific study and practice of naming, defining and classifying biological organisms based on shared characteristics.

And so begins my argument for meshing these 3 topics together – data ingestion, data taxonomy and Computational Storage. Here goes my storage punditry.

Data Taxonomy in post-ingestion

I see that data, any data, has to arrive at a repository first before it is given meaning, context and specifications. These requirements are different from file permissions, ownerships, ctime and atime timestamps; the content of the ingested data stream is made to fit into the mould of the repository the data is written to. Metadata about the content of the data gives the data meaning, context and most importantly, value as it is used within the data lifecycle. However, the metadata tagging and the preparation of the data in the ETL (extract, transform, load) or the ELT (extract, load, transform) process are only applied post-ingestion. This data preparation phase, in which data is enriched with content metadata, tagging, taxonomy and classification, is expensive in terms of resources, time and currency.

Elements of a modern event-driven architecture including data ingestion (Credit: Qlik)

Even in the burgeoning times of open table formats (Apache Iceberg, Hudi, Delta Lake, et al.), open big data file formats (Avro, Parquet) and open data formats (CSV, XML, JSON, et al.), the format specifications with added context and meanings are added in and augmented post-ingestion.
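The difference between filesystem metadata and content metadata can be illustrated with a small sketch. The two helper functions below are hypothetical and use only the Python standard library; the content side assumes a simple CSV file with a header row, which is just one convenient example format.

```python
import csv
import os

def filesystem_metadata(path: str) -> dict:
    """Metadata the filesystem already keeps: size, permissions, timestamps."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "mode": oct(st.st_mode & 0o777),
        "mtime": st.st_mtime,
    }

def content_metadata(path: str) -> dict:
    """Metadata derived by reading the content itself (assumes CSV with a header)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # column names describe what the data means
        row_count = sum(1 for _ in reader)
    return {"columns": header, "row_count": row_count}
```

The first function costs a single stat call, while the second has to read the whole file. That extra pass over the data is a small-scale picture of why enrichment done after ingestion is expensive.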

Continue reading →

FDT – Deduplication Reimagined in OpenZFS

By cfheoh | April 18, 2024 - 1:17 pm |April 18, 2024 compression, Deduplication, deduplication, Delphix, Filesystems, FreeNAS, iXsystems, Klara Systems, OpenZFS, Storage Optimization, TrueNAS

2 Comments

Deduplication in OpenZFS has been a bugbear for some years now. As data sets get larger, it has become even more difficult to use the present DeDuplication Table (DDT) method. Deduplication in OpenZFS is often derided as overwhelming and sluggish in performance.

Moreover, there is a common folklore passed on and on about allocating 5GB of RAM for every 1TB to dedupe in OpenZFS. I don’t know where this “sizing” came from. It was probably derived from something Jeff Bonwick wrote back in the early days of ZFS. But there is some truth to this “rule of thumb”, commonly passed around in the TrueNAS® circles.
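One commonly cited way to arrive at the 5GB-per-1TB figure is sketched below. It assumes roughly 320 bytes of memory per dedup table entry and an average block size of 64KiB; both numbers are assumptions for illustration, not values taken from this post, and real pools will differ.

```python
def ddt_ram_bytes(pool_bytes: int,
                  avg_block_bytes: int = 64 * 1024,
                  entry_bytes: int = 320) -> int:
    """Estimate in-memory dedup table size: one entry per unique block.

    Assumes every block is unique (worst case) and a fixed per-entry cost.
    """
    entries = pool_bytes // avg_block_bytes
    return entries * entry_bytes

TIB = 2 ** 40
GIB = 2 ** 30

print(ddt_ram_bytes(TIB) / GIB)                                # 5.0  (64KiB blocks)
print(ddt_ram_bytes(TIB, avg_block_bytes=128 * 1024) / GIB)    # 2.5  (128KiB blocks)
print(ddt_ram_bytes(TIB, avg_block_bytes=8 * 1024) / GIB)      # 40.0 (8KiB blocks)
```

The estimate is very sensitive to block size, which is one reason a single rule of thumb can be both "about right" and badly wrong depending on the workload.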

Nevertheless, given the exponential growth of data, and the advancement of processing power in modern day computer systems, the OpenZFS development community has decided to revamp the DDT method. Several prominent luminaries from iXsystems™, Klara Systems and the OpenZFS community got together in mid-2023 to develop FDT or Fast Dedupe Table. And we got to see FDT announced to the world in the most recent OpenZFS Developer Summit in November 2023.

Fast Dedupe Table (FDT)

Fast Dedupe Table (FDT) is a log-based dedupe. In OpenZFS, all the write block I/Os that come into OpenZFS are coalesced into transaction groups (TXGs), hashed and checksummed, before they are committed to persistent media.

The new implementation in FDT puts the checksums and hashes of these incoming TXGs into an append-only log structure in persistent storage, and also tracks the hashed changes in an AVL tree residing in memory. An AVL tree is a self-balancing binary search tree structure that is very efficient in searching, thus giving FDT the speed in initiating the deduplication lookups and updates.

OpenZFS Fast Dedupe Table (FDT) in a nutshell

The append-only log structure works hand-in-hand with the AVL tree to accept and stage (including intelligent sorting) the hash entries that are coming in after the TXG writes. Then at a certain marker, which could be a time-based trigger or a high-water mark, the entries in the log and the AVL tree are flushed to the ZAP (ZFS Attribute Processor), where the actual full map of the OpenZFS blocks resides.
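The staging-then-flushing flow described above can be sketched as a toy model. This is not OpenZFS code: a Python list stands in for the on-disk append-only log, a plain dict sorted at flush time stands in for the in-memory AVL tree, another dict stands in for the ZAP-backed table, and the class and method names are hypothetical.

```python
import hashlib

class ToyDedupLog:
    """Toy model of log-based dedup staging with a high-water-mark flush."""

    def __init__(self, high_water_mark: int = 4):
        self.log = []        # stand-in for the append-only log on persistent storage
        self.pending = {}    # stand-in for the in-memory AVL tree: hash -> refcount delta
        self.table = {}      # stand-in for the ZAP-backed dedup table: hash -> refcount
        self.high_water_mark = high_water_mark

    def write_block(self, data: bytes) -> bool:
        """Stage one block; return True if its hash has been seen before."""
        digest = hashlib.sha256(data).hexdigest()
        duplicate = digest in self.pending or digest in self.table
        self.log.append(digest)                          # append only, never rewritten
        self.pending[digest] = self.pending.get(digest, 0) + 1
        if len(self.pending) >= self.high_water_mark:    # high-water mark reached
            self.flush()
        return duplicate

    def flush(self) -> None:
        """Apply staged entries to the table in sorted order, then drop the log."""
        for digest in sorted(self.pending):
            self.table[digest] = self.table.get(digest, 0) + self.pending[digest]
        self.pending.clear()
        self.log.clear()     # flushed entries are no longer needed in the log
```

Lookups first check the small in-memory structure and only then the full table, while updates are batched and applied in sorted order at flush time. A time-based trigger would call `flush()` from a timer instead of from `write_block()`.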

Continue reading →

Backup – Lest we forget

By cfheoh | April 4, 2022 - 8:00 am |April 20, 2022 Backup, Business Continuity, Cloud, Clumio, Cohesity, Commvault, compression, Data Archiving, Data Availability, Data Management, Data Protection, Data Security, Digital Transformation, Disaster Recovery, Druva, eDiscovery, ExaGrid, Falconstor, OwnBackup, Snapshots, Tape storage, Veeam, Veritas, Zerto

World Backup Day – March 31st

Last week was World Backup Day. It is on March 31st every year so that you don’t lose your data and become an April Fool the next day.

Amidst the growing awareness of the importance of backup, no thanks to the ever-growing destructive nature of ransomware, it is important to look into other aspects of data protection as well, from both a data backup/recovery and a data security point of view.

3-2-1 Rule, A-B-C and Air Gaps

I highlighted the basic 3-2-1 rule before. This must always be paired with a set of practised processes and policies to cultivate all stakeholders (aka the people) in the organization to understand the importance of protecting the data and ensuring data recoverability.

The A-B-C is to look at the production dataset and decide if the data should be stored in the Tier 1 storage. In most cases, the data becomes less active over time and these datasets may be good candidates to be archived. Once archived, the production dataset is smaller and data backup operations become lighter and faster, with positive knock-on effects as well.

Air gaps have returned to prominence since the heightened threats on data in recent years. The threats have pushed organizations to consider keeping data offsite and offline with air gaps. Cost considerations and speed of recovery can be of concern, and logical air gaps are also gaining traction as an acceptable extra layer of data protection.

Backup is not total Data Protection cyberdefence

If we view data protection more holistically and comprehensively, backup (and recovery) is not the total data protection solution. We must ignore the fancy rhetoric of the technology marketers that backup is the solution to ensure data protection, because there is much more to it than that.

The well respected NIST (National Institute of Standards and Technology) Cybersecurity Framework places Recovery (along with backup) as the last pillar of its framework.

NIST Cybersecurity Framework

Continue reading →

Nakivo Backup Replication architecture and installation on TrueNAS – Part 1

By cfheoh | March 28, 2022 - 8:00 am |March 24, 2022 API, Appliance, Arcserve, Backup, Carbonite, Cloud, Commvault, compression, Data Archiving, Data Availability, Data Management, Data Protection, Data Security, deduplication, Deduplication, Disaster Recovery, Filesystems, FreeNAS, Infrascale, iSCSI, iXsystems, Linux, Microsoft, Microsoft Azure, Nakivo, NAS, NFS, Nutanix, Oracle, Quest Software, Security, Snapshots, Tape storage, TrueNAS, Veritas, virtual tape library, Virtualization, VMware

Backup and Replication software has received strong mandates in organizations with enterprise mindsets and vision. But lower down the rung, small and medium organizations are less invested in backup and replication software. These organizations know full well that they must back up, replicate and protect their servers, physical and virtual, and also new workloads in the clouds, given that the threat of security breaches and ransomware is looming larger and larger all the time. But many are often put off by the cost of implementing and deploying Backup and Replication software.

So I explored a lesser known backup and recovery software product called Nakivo® Backup and Replication (NBR) and took the opportunity to build a backup and replication appliance in my homelab with TrueNAS®. My objective was to create a cost effective option for small and medium organizations to enjoy enterprise-grade protection and recovery without the hefty price tag.

This blog, Part 1, writes about the architecture overview of Nakivo® and the installation of the NBR software in TrueNAS® to bake in and create the concept of a backup and replication appliance. Part 2, in a future blog post, will cover the administrative and operations usage of NBR.

Continue reading →

Computational Storage embodies Data Velocity and Locality

By cfheoh | March 21, 2022 - 8:00 am |March 31, 2022 Analytics, Appliance, Artificial Intelligence, Big Data, Blockchain, Cloud, Composable Infrastructure, compression, Containers, Data Management, Data Security, Deduplication, Deep Learning, Edge Computing, IDC, Machine Learning, NVMe, PCIe, SNIA, Solid State Devices

1 Comment

I have been earnestly observing the growth of Computational Storage for a number of years now. It was known by several previous names, with the name “in-situ data processing” sticking with me the most. The Computational Storage nomenclature became more cohesive when SNIA® put together the CMSI (Compute Memory Storage Initiative) some time back. This initiative is where several standards bodies, the major technology players and several SIGs (special interest groups) in SNIA® collaborated to advance the Computational Storage segment in the storage technology industry we know of today.

The use cases for Computational Storage are burgeoning, and the functional implementations of Computational Storage are becoming vital to tackle the explosive data tsunami. In 2018 IDC, in its Worldwide Global Datasphere Forecast 2021-2025 report, predicted that the world will have 175 ZB (zettabytes) of data by 2025. That number, according to hearsay, has been revised to a heady figure of 250ZB, given the superlative rate at which data is being originated, spawned and more.

Computational Storage driving factors

If we take the Computer Science definition of in-situ processing, Computational Storage can be distilled as processing data where it resides. In a nutshell, “Bring Compute closer to Storage“. This means that there is a processing unit within the storage subsystem which does not require the host CPU to perform processing. In a very simplistic manner, a RAID card in a storage array can be considered a Computational Storage device because it performs the RAID functions instead of the host CPU. But this new generation of Computational Storage has much more prowess than just the RAID function in a RAID card.

There are many factors in Computational Storage that make a lot of sense. Here are a few:

Voluminous data inundate the centralized architecture of the cloud platforms and the enterprise systems today. Much of the data come from end point devices – mobile devices, sensors, IoT, point-of-sales, video cameras, et al. Pre-processing the data at the origin data points can help filter the data, reduce the size to be processed centrally, and secure the data before they are ingested into the central data processing systems.
Real-time processing of the data at the moment the data is received gives the opportunity to create the Velocity of Data Analytics. Much of the data do not need to move to a central data processing system for analysis. Use cases like autonomous vehicles, fraud detection, recommendation systems and disaster alerts often require near instantaneous responses. Performing early data analytics at the data origin point has tremendous advantages.
Moore’s Law is waning. The CPU (central processing unit) is no longer the center of the universe. We are beginning to see CPU offloading technologies to augment the CPU’s duties such as compression, encryption, transcoding and more. SmartNICs, DPUs (data processing units), VPUs (visual processing units), GPUs (graphics processing units), etc. have come forth to formulate a new computing paradigm.
Freeing up central resources with Computational Storage also accelerates the overall distributed data processing in the whole data architecture. The CPU and the adjoining memory subsystem are less often required to perform the context switching caused by I/O interrupts, as in most of the compute/storage architectures today. The total effect relieves the CPU and gives back more CPU cycles to perform higher processing tasks, resulting in faster performance overall.
The rise of memory interconnects is enabling a more distributed computing fabric of data processing subsystems. The rising CXL (Compute Express Link™) interconnect protocol, especially after the annexation of Gen-Z, has emerged as a force to be reckoned with. This rise of memory interconnects will likely strengthen the case for Computational Storage in the fast approaching future.
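The first factor, pre-processing at the data origin, can be sketched in a few lines. This is only an illustration on an ordinary CPU, not a Computational Storage device API: the sensor record layout, the threshold filter and the function names are all assumptions, and `zlib` simply stands in for whatever compression engine the device would offer.

```python
import json
import zlib

def preprocess_at_origin(readings: list, threshold: float) -> bytes:
    """Filter out uninteresting readings, then compress what is left."""
    kept = [r for r in readings if r["value"] >= threshold]
    return zlib.compress(json.dumps(kept).encode("utf-8"))

def central_ingest(payload: bytes) -> list:
    """What the central system does on arrival: decompress and parse."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))

# Hypothetical sensor data: 100 readings, only the top 10 are of interest.
readings = [{"sensor": "s1", "value": v} for v in range(100)]
raw_size = len(json.dumps(readings).encode("utf-8"))
payload = preprocess_at_origin(readings, threshold=90)

print(len(central_ingest(payload)))   # 10 readings reach the central system
print(len(payload) < raw_size)        # True: far fewer bytes cross the network
```

Filtering decides what is worth moving and compression shrinks what remains, so both the network and the central processing systems see less data.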

Computational Storage Deployment Models

SNIA Computational Storage Universe in 2019

Continue reading →

How well do you know your data and the storage platform that processes the data

By cfheoh | December 20, 2021 - 7:30 am |December 20, 2021 100Gigabit Ethernet, 10Gigabit Ethernet, Algorithm, Analytics, Appliance, Backup, Big Data, Business Continuity, Cloud, Clusters, Composable Infrastructure, compression, Confluent, Data Archiving, Data Availability, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deduplication, Digital Transformation, Disaster Recovery, Edge Computing, Filesystems, Hyperconvergence, ILM, Industry 4.0, InfluxDB, iRODS, Machine Learning, NAS, NFS, NVMe, Object Storage, Performance Caching, Pravega, Reliability, SATA, Scale-out architecture, Security, Software Defined Storage, Storage Area Network, Storage Optimization, Storage Tiering, Unified Storage, VDI, Virtualization

Last week was consumed by many conversations on this topic. I was quite jaded, really. Unfortunately many still take a very simplistic view of all the storage technology, or should I say the over-marketing of the storage technology. So much so that the end users make incredible assumptions about the benefits of a storage array or software defined storage platform or even cloud storage. And too often the caveats of turning on a feature and tuning a configuration to the max are discarded or neglected. Regard for good storage and data management best practices? What’s that?

I share some of my thoughts on handling conversations like these and try to set the right expectations rather than overhype a feature or a function in the data storage services.

Complex data networks and the storage services that serve it

I/O Characteristics

Applications and workloads (A&W) read and write from the data storage services platforms. These could be local DAS (direct-attached storage), network storage arrays in SAN and NAS, and now objects, or cloud storage services. Regardless of structured or unstructured data, different A&Ws have different behavioural I/O patterns in accessing data from storage. Therefore storage has to be configured to match these patterns as closely as possible, so that it can perform optimally for these A&Ws. Without going into deep details, here are a few to think about:

Random and Sequential patterns
Block sizes of these A&Ws ranging from typically 4K to 1024K.
Causal effects of synchronous and asynchronous I/Os to and from the storage
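Block size matters because throughput is simply IOPS multiplied by the block size. The sketch below works through that arithmetic; the IOPS figures are made-up examples, not measurements from any particular storage platform.

```python
def throughput_mib_per_sec(iops: float, block_size_kib: float) -> float:
    """Throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / 1024

def iops_needed(target_mib_per_sec: float, block_size_kib: float) -> float:
    """IOPS required to sustain a target throughput at a given block size."""
    return target_mib_per_sec * 1024 / block_size_kib

# Small random I/O: many operations, modest throughput.
print(throughput_mib_per_sec(100_000, 4))    # 390.625 MiB/s
# Large sequential I/O: few operations, high throughput.
print(throughput_mib_per_sec(2_000, 1024))   # 2000.0 MiB/s
# The same 2000 MiB/s at 4K blocks would need this many IOPS.
print(iops_needed(2000, 4))                  # 512000.0
```

A platform tuned and benchmarked with large sequential blocks can therefore look very different when an application issues small random I/Os, which is why quoting one number without the I/O pattern sets the wrong expectations.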

Continue reading →

What happened to NDMP?

By cfheoh | October 18, 2021 - 8:00 am |October 16, 2021 API, Appliance, Arcserve, Backup, Big Data, CIFS, Cohesity, compression, Data Management, Data Protection, Deduplication, EMC, Filesystems, HDS, IBM, LTO, NAS, NetApp, Rubrik, Snapshots, SNIA, Tape storage, virtual tape library, VTL

The acronym NDMP shows up once in a while in NAS (Network Attached Storage) upgrade tenders. And for the less informed, NDMP (Network Data Management Protocol) was one of the early NAS data management (more like data mover specifications) initiatives to backup NAS devices, especially the NAS appliances that run proprietary operating systems code.

NDMP Logo

Backup software vendors often have agents developed specifically for an operating system or an operating environment. But back in the mid-1990s and 2000s, the internal file structures of these proprietary vendors were less exposed, making it harder for backup vendors to develop agents for them. Furthermore, there was a need to simplify the data movements of NAS files between backup servers and the NAS as a client, to the media servers and eventually to the tape or disk targets. The dominant network at the time ran at 100Mbits/sec.
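A quick calculation shows why data movement over a 100Mbits/sec network was a concern. The 1TB dataset size below is an assumed example, and the calculation ignores protocol overhead, so real transfers would be slower still.

```python
def transfer_hours(dataset_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move a dataset over a link (decimal units).

    efficiency is the fraction of the link speed actually achieved (1.0 = ideal).
    """
    bits = dataset_gb * 8 * 1000 ** 3
    seconds = bits / (link_mbps * 1000 ** 2 * efficiency)
    return seconds / 3600

print(round(transfer_hours(1000, 100), 1))    # 22.2 hours for 1TB at 100Mbits/sec
print(round(transfer_hours(1000, 1000), 1))   # 2.2 hours at 1Gbit/sec, for comparison
```

With almost a full day needed for a single terabyte, every extra hop between the NAS, the backup server and the media server was worth avoiding.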

To overcome this, Network Appliance® and PDC Solutions/Legato® developed NDMP, allowing proprietary NAS devices to run a standardized client-server architecture with the NDMP server daemon in the NAS and the backup service running as an NDMP client. Here is a simplified look at the NDMP architecture.

NDMP Client-Server Architecture

Continue reading →

Storage IO straight to GPU

By cfheoh | July 5, 2021 - 9:00 am |July 3, 2021 100Gigabit Ethernet, Algorithm, Analytics, API, Artificial Intelligence, Composable Infrastructure, compression, CXL, Deduplication, Deep Learning, Filesystems, High Performance Computing, Hyperconvergence, Machine Learning, Mellanox, Mellanox Technologies, Microsoft, nVidia, NVMe, RDMA, Vast Data, WekaIO

The parallel processing power of the GPU (Graphics Processing Unit) cannot be denied. One year ago, nVidia® overtook Intel® in market capitalization. And today, they have doubled their market cap lead over Intel®, [as of July 2, 2021] USD$510.53 billion vs USD$229.19 billion.

Thus it is not surprising that storage architectures are changing from the CPU-centric paradigm to take advantage of the burgeoning prowess of the GPU. And 2 announcements in the storage news in recent weeks have caught my attention – Windows 11 DirectStorage API and nVidia® Magnum IO GPUDirect® Storage.

nVidia GPU

Exciting the gamers

The Windows DirectStorage API feature is only available in Windows 11. It was announced as part of the Xbox® Velocity Architecture last year to take advantage of the high I/O capability of modern day NVMe SSDs. DirectStorage-enabled applications and games have several technologies such as the D3D (Direct3D) decompression/compression algorithm designed for the GPU, and SFS (Sampler Feedback Streaming), which uses the previous rendered frame results to decide which higher resolution texture frames are to be loaded into the memory of the GPU and rendered for the real-time gaming experience.

Continue reading →

First looks into Interplanetary File System

By cfheoh | June 28, 2021 - 9:00 am |June 27, 2021 API, Blitzscaling, Chia, Cloud, compression, Deduplication, deduplication, Edge Computing, Filesystems, FreeNAS, iXsystems, Linux, Object Storage, Reliability, Security, Software Defined Storage

1 Comment

The cryptocurrency craze has elevated another strong candidate in recent months. Filecoin is leading the voice of a decentralized Internet, the next generation Web 3.0. In this blog, I am not going to write much about the Filecoin frenzy but about the underlying distributed file system that powers this phenomenon – The Interplanetary File System.

[ Note: This is still a very new area for me, and the rest of the content of this blog is still nascent and developing ]

Interplanetary File System

Tremulous Client-Server web architecture

The Internet architecture is almost entirely client-server. Clients like browsers and apps connect to Web services served from a collection of servers. As Web 3.0 approaches (some say it is already here), the client-server model is no longer perceived as the Internet architecture of choice. Billions and billions of users, applications and devices relying solely on a centralized service would lead to many impactful consequences, and the reasons for decentralization, away from the client-server architecture models of the Internet, are cogent.

Continue reading →
