Unstructured Data Observability with Datadobi StorageMAP

Let’s face it. Data is bursting through its storage seams. And every organization is now storing too much data, much of which it doesn’t know it has.

By 2025, IDC predicts that 80% of the world’s data will be unstructured. IDC’s Global Datasphere Forecast 2021-2025 report sees global data creation and replication expanding to 181 zettabytes, an unfathomable figure. Organizations are inundated. They struggle with data growth, with little understanding of what data they have, where the data resides, what to do with the data, and how to manage the voluminous data deluge.

The simple knee-jerk action is to store it in cloud object storage where the price of storage is $0.0000xxx/GB/month. But many IT departments in these organizations overlook the fact that the data they have parked in the cloud requires movement between the cloud and on-premises. I have been involved in numerous discussions where the customers realized that they moved the data in the cloud too frequently. Often it was an error of judgement or short-term blindness (blinded by the cheap storage costs no doubt), further exacerbated by the pandemic. These oversights have resulted in expensive and painful monthly API call charges and egress fees. Welcome to reality. Suddenly the cheap cloud storage doesn’t sound so cheap after all.
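To make the arithmetic behind that surprise concrete, here is a minimal sketch of how a monthly bill breaks down. Every rate in it is a hypothetical placeholder (the class name, field names and numbers are all my assumptions, not prices from any provider), so plug in the figures from your own bill.

```python
from dataclasses import dataclass


@dataclass
class CloudRates:
    """Hypothetical unit prices; replace with figures from your own provider."""
    storage_per_gb_month: float   # cost to keep 1 GB parked for a month
    egress_per_gb: float          # cost to move 1 GB out of the cloud
    per_1000_requests: float      # cost per 1,000 API calls


def monthly_cost(rates: CloudRates, stored_gb: float,
                 egress_gb: float, api_requests: int) -> dict:
    """Break a monthly bill into storage, egress and API request charges."""
    storage = stored_gb * rates.storage_per_gb_month
    egress = egress_gb * rates.egress_per_gb
    requests = (api_requests / 1000) * rates.per_1000_requests
    return {
        "storage": storage,
        "egress": egress,
        "requests": requests,
        "total": storage + egress + requests,
    }


# Illustrative numbers only (assumptions, not quotes from any price list).
rates = CloudRates(storage_per_gb_month=0.001, egress_per_gb=0.05,
                   per_1000_requests=0.01)

# 500 TB parked, with a tenth of it pulled back on-premises every month.
bill = monthly_cost(rates, stored_gb=500_000, egress_gb=50_000,
                    api_requests=20_000_000)
for item, cost in bill.items():
    print(f"{item:>8}: ${cost:,.2f}")
```

With these made-up rates, the egress and request lines come to several times the storage line, even though only a tenth of the data moves each month. The point is the shape of the bill, not the specific figures.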

The same can be said about storing non-active unstructured data on primary storage. Many organizations have not been disciplined enough to practise good data management. The primary Tier 1 storage becomes bloated over time, grinding sluggishly as the data capacity grows. I/O processing becomes painfully slow and backup takes longer and longer. Sound familiar?

The A in ABC

I brought up the ABC mantra a few blogs ago. A is for Archive First. It is part of my data protection consulting practice conversation repertoire, and I use it often to advise IT organizations to be smart with their data management. Before archiving (some folks like to call it tiering, but I am not going down that argument today), we must know what to archive. We cannot blindly send all sorts of junk data to the secondary or tertiary storage premises. If we do that, it is akin to digging another hole to fill up the first hole.

We must know which unstructured data to move, replicate or sync from the Tier 1 storage to a second (or third) less taxing storage premises. We must be able to see this data, observe its behaviour over time, and decide the best data management practice to apply to this data. Take note that I said best data management practice and not best storage location in the previous sentence. There has to be a clear distinction: a data management strategy is more prudent than a “best” storage premises. The reason is that many organizations ignorantly think that picking the best storage location (the thought of the “cheapest” always seems to creep in) is a good strategy, while ignoring the fact that data is like water. It moves from premises to premises, from on-prem to cloud, from cloud to other clouds. Data mobility is a variable in data management.
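As a rough illustration of what "seeing" the data can mean at its simplest, the sketch below walks a directory tree and totals the bytes in each age bucket. It is only a starting point under stated assumptions: the 30-day and 365-day thresholds and the function names are mine, and last-modified time is used as a crude proxy for activity because access times are often not updated reliably, depending on mount options. A real observability tool looks at far more than this.

```python
import os
import tempfile
import time
from collections import defaultdict
from typing import Optional

DAY = 86400  # seconds in a day


def bucket_for(age_days: float) -> str:
    """Map a file's age in days to a label (thresholds are arbitrary assumptions)."""
    if age_days < 30:
        return "hot (< 30 days)"
    if age_days < 365:
        return "warm (30-364 days)"
    return "cold (365+ days)"


def profile_tree(root: str, now: Optional[float] = None) -> dict:
    """Walk a directory tree and total the bytes in each age bucket,
    using last-modified time as a rough proxy for activity."""
    now = time.time() if now is None else now
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            totals[bucket_for((now - st.st_mtime) / DAY)] += st.st_size
    return dict(totals)


# Demo on a throwaway directory with three files of different ages.
with tempfile.TemporaryDirectory() as root:
    now = time.time()
    samples = [("report.docx", 2, 1_000), ("scan.tif", 90, 5_000),
               ("old.iso", 800, 9_000)]
    for name, age_days, size in samples:
        path = os.path.join(root, name)
        with open(path, "wb") as fh:
            fh.write(b"\0" * size)
        aged = now - age_days * DAY
        os.utime(path, (aged, aged))  # backdate atime and mtime
    for label, size in sorted(profile_tree(root, now).items()):
        print(f"{label:<20} {size:>8,} bytes")
```

Run against a real share, the cold bucket is the first candidate list for an Archive First conversation. It says nothing yet about who owns the data or whether it will be recalled, which is why the practice matters more than the location.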


Ridding consumer storage mindset for Enterprise operations

I have cut my teeth in Enterprise Storage for 3 decades. On and off, I get the opportunity to work on Cloud Storage as well, mostly the more structured storage infrastructure services such as blocks and files, in cloud offerings on AWS, Azure and Alibaba Cloud. I am familiar with S3 operations (mostly the CRUD operations and HTTP headers stuff) too, although I have yet to go deep into S3 with the RESTful API. And I really want to work on stuff with S3 Select when the opportunity arises. (Note: Homelab project to-do list)

Along with the experience is the enterprise mindset of designing and crafting storage infrastructure and data management practices that revolve around data. Understanding the characteristics of data and the behaviours of data in motion is part of my skills repertoire, and I continue to have conversations with organizations, small and large alike, every day of the week.

This week’s blog was triggered by Jack Wallen’s Tech Republic® interview with Fedora project leader Matthew Miller. I have been craning my neck waiting for the full release of Fedora 36 (which has now been pushed to May 10th 2022), and the Tech Republic® article, “The future of Linux: Fedora project leader weighs in”, touched me. Let me set the context of my expanded commentary here.

History of my open source experience: bringing Enterprise to the individual

I have been working with open source software for a long time. My first Linux experience was Softlanding Linux System (SLS) in the early 90s. It was a bunch of diskettes I purchased online while dabbling with FreeBSD® on the side. Even though my day job was on SunOS, and later Solaris®, having the opportunity to build stuff and learn the enterprise ways with Sun Microsystems® hardware and software was difficult at my homelab. I did bring home a SPARCstation® 2 once but the CRT monitor almost broke my computer table at that time.

Having open source software on 386i (before x86) architecture was great (no matter how buggy it was) because I got to learn hardcore enterprise technology at home. I am a command line person, so the desktop experience does not bother me much because my OS foundation is there. Open source gave me a world where I could master my skills as an individual. Even as an individual, my mindset is always on the Enterprise.

The Tech Republic interview and my reflections

I know the journey open source OSes have taken at the server (aka Enterprise) level. They are great, and are getting better and better. But at the desktop (aka consumer) level, the Linux desktop experience has been an arduous one even though the open source Linux desktop experience is so much better now. This interview reflected on that.

There were a few significant points that were brought up. Those poignant moments, which explained free software in open source projects and how consumers glazed over (if I get what Matt Miller meant) the cosmetics of the open source software without grasping its deeper, more meaningful objectives, left me feeling empty. Many assumed that just because the software is open source, it should be free or low cost, and they continue to apply a consumer mindset to the delivery and the capability of the software.

Case in point is the way I have been seeing many TrueNAS®/FreeNAS™ individuals who downloaded the free software and use it in consumer ways. That is perfectly fine, but when they want to migrate their consumer experience with the TrueNAS® software to their critical business operations, things suddenly do not look so rosy anymore. From my experience, having built enterprise-grade storage solutions with open source software like ZFS on OpenSolaris/OpenIndiana, FreeNAS™ and TrueNAS® for over a decade, plus gaining plenty of experience on many proprietary and software-defined storage platforms along this 30-year career, consumer mindsets do not work well in enterprise missions.

And over the years, I have been seeing this newer generation of infrastructure people taking less and less interest in learning the enterprise ways or diving deep into the workings of the open source platforms I have mentioned. Yet, they have lofty enterprise expectations while carrying a consumer mindset. More and more, I am seeing a greying crew of storage practitioners with enterprise experience dealing with a new generation of organizations and end users with consumer practices and mindsets.

Open Source Word Cloud


All the Sources and Sinks going to Object Storage

The vocabulary of sources and sinks is beginning to appear in the world of data storage as we witness the new addition of data processing frameworks and the applications in this space. I wrote about this in my blog “Rethinking data processing frameworks systems in real time” a few months ago, introducing my take on this budding new set of I/O characteristics and data ecosystem. I also started learning about the Kappa Architecture (and Lambda as well), a framework designed to craft and develop a set of amalgamated technologies to handle stream processing of a series of data in relation to time.

In Computer Science, sources and sinks are considered external entities that often serve as connectors of input and output of disparate systems. They are often not in the purview of data storage architects. Also often, these sources and sinks are viewed as black boxes, and their inner workings are hidden from the views of the data storage architects.

Diagram from https://developer.here.com/documentation/get-started/dev_guide/shared_content/topics/olp/concepts/pipelines.html

The changing facade of data stream processing presents the constant motion of data, with the data continuously altered as it passes through the many integrated sources and sinks. We also see the data being processed in-memory as much as possible. Thus, the data services from a traditional storage model of SAN and NAS may struggle with the requirements demanded by this new generation of data stream processing.
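For readers who have not met the vocabulary before, a toy pipeline helps. The sketch below is purely illustrative: the function names and the temperature readings are my own inventions, and a Python list stands in for whatever a real sink would be (an object store bucket, a message topic). It shows data entering at a source, being altered in memory as it flows, and landing at a sink.

```python
from typing import Iterable, Iterator


def sensor_source(readings: Iterable[float]) -> Iterator[dict]:
    """Source: emits one event per reading, tagged with a sequence number."""
    for seq, value in enumerate(readings):
        yield {"seq": seq, "value": value}


def to_fahrenheit(events: Iterator[dict]) -> Iterator[dict]:
    """Transform: alters each event in flight, entirely in memory."""
    for event in events:
        yield {**event, "value": event["value"] * 9 / 5 + 32}


def list_sink(events: Iterator[dict], store: list) -> int:
    """Sink: the point where data leaves the pipeline and lands somewhere.
    Here a Python list stands in for an object store or a topic."""
    count = 0
    for event in events:
        store.append(event)
        count += 1
    return count


landed: list = []
written = list_sink(to_fahrenheit(sensor_source([0.0, 100.0, 25.0])), landed)
print(written, landed)
```

Because the stages are generators, each event flows through one at a time and nothing is written anywhere until it reaches the sink. That is the part a storage architect eventually has to care about: the sink is where the stream meets the storage.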

As the world of traditional data storage processing expands into data stream processing and vice versa, the chatter of sources and sinks can no longer be ignored.


The young report card on Decentralized Storage

I kept this blog in my queue for over 4 months. I was reluctant to publish it because I thought the outrageous frenzies of NFTs (non-fungible tokens), metaverses and web3 were convoluting the discussions on the decentralized storage topic. 3 weeks back, a Google Trends search for these 3 opaque terms over 90 days showed that the worldwide fads were waning. Here was the Google Trends output on April 2, 2022:

Google Trends on NFT, metaverse and web3

Decentralized storage intrigues me. I like to believe in its potential and I often try to talk to people to strengthen the narratives, and support its adoption where it fits. But often, the real objectives of decentralized storage are obfuscated by the polarized conversations about cryptocurrencies that are pegged to their offerings, NFTs (non-fungible tokens), DAOs (decentralized autonomous organizations) and plenty of hyperboles with bewildering facts as well.

But I continue to seek sustainable conversations about decentralized storage without the sway of the NFTs or the cryptos. After dipping my toes in and engaging with HODLers, and looking at the return to sanity, I believe we can discuss decentralized storage with better clarity now. The context is to position decentralized storage for the mainstream, specifically for business organizations already immersed in centralized storage. Here is my fledgling report card on decentralized storage.


HODLing Decentralized Storage is not zero sum

I have been dipping my toes into decentralized storage. Last month I wrote about “Crossing the Chasm”, which most early technologies have to experience to move into mainstream adoption. I believe the same undertaking is going on for decentralized storage and the undercurrents are beginning to feel like a tidal wave. However, the clarion calls and the narratives around decentralized storage are beginning to sound the same after several months of researching the subject.

Salient points of decentralized storage

I have summarized a bunch of these arguments for decentralized storage. They are:

  • Democratization of cloud storage services separate from the hyperscaling behemoths of Web2
  • Inherent data security with default encryption, immutability and a blockchain foundation (most decentralized storage platforms are blockchain-based; a few are not)
  • Data privacy with the security key for data decryption and authentication with the data owner(s)
  • No centralized control of data storage services, prices, market transparency and sovereignty
  • Green with more efficient energy consumption compared to Bitcoin
  • Data durability with data sharding creating no single point of failure and maintaining continuous data access services with geo content dispersal
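The durability argument in the last point is easier to picture with a toy example. The sketch below splits data into shards and adds a single XOR parity shard, so any one lost shard can be rebuilt from the survivors. This is a deliberately simplified stand-in of my own: real decentralized storage networks typically rely on stronger erasure coding schemes that tolerate multiple losses, along with encryption and geo-dispersal, and the function names here are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def shard(data: bytes, k: int) -> tuple:
    """Split data into k equal-size shards plus one XOR parity shard.
    Returns (list of k + 1 shards, original length for stripping padding)."""
    size = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(size * k, b"\0")
    shards = [padded[i * size:(i + 1) * size] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))   # parity shard goes last
    return shards, len(data)


def recover(shards: list, original_len: int) -> bytes:
    """Rebuild the data when at most one shard (marked None) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can only repair a single lost shard")
    if missing:
        survivors = [s for s in shards if s is not None]
        shards = list(shards)
        # XOR of all surviving shards equals the missing one.
        shards[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(shards[:-1])[:original_len]


# Spread the data over 4 data shards + 1 parity shard, one per "node".
pieces, length = shard(b"unstructured data lives everywhere", k=4)
pieces[2] = None                               # simulate one node going offline
print(recover(pieces, length))
```

Each shard would live on a different node, so losing any single node costs nothing but a rebuild. Losing two at once defeats this simple scheme, which is exactly why production systems use codes with more parity.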

Rocket fuel – The cryptos

Most early adoptions of a new technology require some sort of blitzscaling momentum to break free from the gravity of the old one. The cryptocurrencies pegged to many decentralized storage platforms are the rocket fuel powering the conversations and the narratives of decentralized storage today. I have probably counted over a hundred of these types of cryptocurrencies, with more jumping onto the bandwagon as the gravy train moves ahead.

The table below is part of a TechTarget Search Storage article “7 Decentralized Storage Networks compared“. I found this article most enlightening.

7 Decentralized Storage Compared


Celebrating MinIO

Essentially MinIO is a web server …

I vaguely recall Anand Babu Periasamy (AB as he is known), the CEO of MinIO, saying that when I first met him in 2017. I was fresh from “playing around” with MinIO and I instantly fell in love with the software technology. Wait a minute. Object storage wasn’t supposed to be so easy. It was not supposed to be that simple to set up and use, but MinIO burst into my storage universe like the birth of the Infinity Stones. There was a eureka moment. And I was attending one of the Storage Field Days in the US shortly after my MinIO discovery in late 2017. What an opportunity!

I cannot recall how I made the appointment to meet MinIO, but I recall taking an Uber to their cosy office on University Avenue in Palo Alto. Through Andy Watson (one of the CTOs then), I was introduced to AB, Garima Kapoor (MinIO’s COO and AB’s wife), Frank Wessels, Zamin (one of the business people who is no longer there) and Ugur Tigli (East Coast CTO), who was on the Polycom. I was awestruck.

Last week, MinIO scored a major Series B round funding of USD103 million. It was delayed by the pandemic because I recalled Garima telling me that the funding was happening in 2020. But I think the delay made it better, because the world now is even more ready for MinIO than ever before.


A conceptual distributed enterprise HCI with open source software

Cloud computing has changed everything, at least at the infrastructure level. Kubernetes is changing everything as well, at the application level. Enterprises are attracted by tenets of cloud computing and thus, cloud adoption has escalated. But it does not have to be a zero-sum game. Hybrid computing can give enterprises a balanced choice, and they can take advantage of the best of both worlds.

Open Source has changed everything too because organizations now have a choice to balance their costs and expenditures with top enterprise-grade software. The challenge is: what can organizations do to put these pieces together using open source software? Integration of open source infrastructure software and applications can be complex and costly.

The next version of HCI

Hyperconverged Infrastructure (HCI) also changed the game. Integration of compute, network and storage became easier, more seamless and less costly when HCI entered the market. Wrapped with a single control plane, the HCI management component can orchestrate VM (virtual machine) resources without much friction. That was HCI 1.0.

But HCI 1.0 was challenged, because several key components of its architecture were based on DAS (direct-attached storage). Scaling storage from a capacity point of view was limited by the storage components attached to the HCI architecture. Some storage vendors decided to be creative and created dHCI (disaggregated HCI). If you break down the components one by one, in my opinion, dHCI is just a SAN (storage area network) attached to HCI. Maybe this should be HCI 1.5.

A new version of an HCI architecture is swimming in as Angelfish

Kubernetes came into the HCI picture in recent years. Without the weights and dependencies of VMs and DAS at the HCI server layer, lightweight containers, orchestrated mostly by Kubernetes, made distribution of compute easier. From on-premises to cloud and in between, compute resources can easily be spun up or down anywhere.


How well do you know your data and the storage platform that processes the data

Last week was consumed by many conversations on this topic. I was quite jaded, really. Unfortunately many still take a very simplistic view of all the storage technology, or should I say over-marketing of the storage technology. So much so that the end users make incredible assumptions of the benefits of a storage array or software defined storage platform or even cloud storage. And too often caveats of turning on a feature and tuning a configuration to the max are discarded or neglected. Regards for good storage and data management best practices? What’s that?

I share some of my thoughts on handling conversations like these, and I try to set the right expectations rather than overhype a feature or a function in the data storage services.

Complex data networks and the storage services that serve it

I/O Characteristics

Applications and workloads (A&W) read from and write to the data storage services platforms. These could be local DAS (direct-attached storage), network storage arrays in SAN and NAS, and now objects, or cloud storage services. Regardless of structured or unstructured data, different A&Ws have different behavioural I/O patterns in accessing data from storage. Therefore storage has to be configured to match these patterns as closely as possible, so that it can perform optimally for these A&Ws. Without going into deep details, here are a few to think about:

  • Random and Sequential patterns
  • Block sizes of these A&Ws ranging from typically 4K to 1024K.
  • Causal effects of synchronous and asynchronous I/Os to and from the storage
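To make the first two points tangible, the sketch below generates the byte offsets a workload would touch under a sequential or a random pattern at a given block size, then measures the average jump between consecutive I/Os. It is an illustration of the access patterns only, not a benchmark: the function names, the 1 GiB file size and the fixed seed are my own assumptions, and no actual I/O is issued.

```python
import random


def io_offsets(file_size: int, block_size: int, pattern: str,
               count: int, seed: int = 0) -> list:
    """Return byte offsets for `count` block-aligned I/Os.

    pattern="sequential": consecutive blocks from the start, wrapping around.
    pattern="random": blocks picked uniformly at random (seeded for repeatability).
    """
    blocks = file_size // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")
    if pattern == "sequential":
        return [(i % blocks) * block_size for i in range(count)]
    if pattern == "random":
        rng = random.Random(seed)
        return [rng.randrange(blocks) * block_size for _ in range(count)]
    raise ValueError(f"unknown pattern: {pattern}")


def mean_jump(offsets: list) -> float:
    """Average absolute distance between consecutive offsets, in bytes."""
    jumps = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(jumps) / len(jumps) if jumps else 0.0


FILE_SIZE = 1 << 30                            # 1 GiB, an arbitrary example size
for block_size in (4 * 1024, 1024 * 1024):     # the 4K and 1024K ends of the range
    for pattern in ("sequential", "random"):
        offs = io_offsets(FILE_SIZE, block_size, pattern, count=1000)
        print(f"{block_size // 1024:>5}K {pattern:<10} "
              f"mean jump = {mean_jump(offs):,.0f} bytes")
```

A sequential workload advances by exactly one block each time, while a random one leaps across the whole file regardless of block size. That difference is what caching, prefetching and the underlying media either absorb or choke on.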


OpenZFS with Object Storage

At AWS re:Invent last week, Amazon Web Services announced Amazon FSx for OpenZFS. This is the 4th managed service under the Amazon FSx umbrella, joining NetApp® ONTAP™, Lustre and Windows File Server. The highly scalable OpenZFS filesystem can provide high throughput and IOPS bandwidth to Amazon EC2, ECS, EKS and VMware® Cloud on AWS.

I am assuming the AWS OpenZFS uses EBS as the block storage backend, given the announcement that it can deliver 4GB/sec of throughput and 160,000 IOPS from the “drives” without caching. How OpenZFS is provisioned to AWS clients is well documented in this blog here. It is an absolute joy (for me) to see the open source OpenZFS filesystem getting the validation and recognition from AWS. This is one hell of a filesystem.

But this blog isn’t about AWS FSx for OpenZFS with block storage. It is about what is coming, and eventually AWS FSx for OpenZFS could expand into AWS’s proficient S3 storage as well. Can OpenZFS integrate with an S3 object storage backend? This blog looks into that burning question.

In the recently concluded OpenZFS Developer Summit 2021, one of the topics was “ZFS on Object Storage“, and the short answer is a resounding YES!

OpenZFS Developer Summit 2021


Storage Elephant Compute Birds

Data movement is expensive. Not just costs, but also latency and resources as well. Thus there were many narratives to move compute closer to where the data is stored because moving compute is definitely more economical than moving data. I borrowed the analogy of the 2 animals from some old NetApp® slides which depicted storage as the elephant, and compute as birds. It was the perfect analogy, because the storage is heavy and compute is light.

“Close up of a white Great Egret perching on top of an African Elephant at Amboseli national park, Kenya”

Before the animal representation came about, I used to use the term “Data locality, Data Mobility“, because of past work on storage technology in the Oil & Gas subsurface data management pipeline.

Take stock of your data movement

I had recent conversations with an end user who has been paying a lot of dollars keeping their “backup” and “archive” in AWS Glacier. The S3 storage is cheap enough to hold several petabytes of data for years, and the IT folks said that the data in AWS Glacier is for “backup” and “archive”. I put both words in quotes because that is how the data was termed in their enterprise practice. However, the face of their business is changing. They are in manufacturing, oil and gas downstream, and the definitions of “backup” and “archive” data have changed.

For one, there is a strong demand for reusing the past data for various reasons and these datasets have to be recalled from their cloud storage. Secondly, their data movement activities still mimic what they did in the past during their enterprise storage days. It was a classic lift-and-shift when they moved to the cloud, without taking stock of their data movements and the operations they ran on these datasets. This is still ongoing, and their monthly AWS bill costs a bomb.
