The vocabulary of sources and sinks is beginning to appear in the world of data storage as we witness the arrival of new data processing frameworks and applications in this space. I wrote about this in my blog “Rethinking data processing frameworks systems in real time” a few months ago, introducing my take on this budding new set of I/O characteristics and data ecosystem. I also started learning about the Kappa Architecture (and Lambda as well), a framework designed to bring together a set of technologies to handle stream processing of a series of data in relation to time.
In Computer Science, sources and sinks are considered external entities that often serve as connectors of input and output of disparate systems. They are often not in the purview of data storage architects. Also often, these sources and sinks are viewed as black boxes, and their inner workings are hidden from the views of the data storage architects.
The changing facade of data stream processing presents the constant motion of data, the continuous data being altered as it passes through the many integrated sources and sinks. We also see as much of the data as possible being processed in-memory. Thus, the data services from a traditional storage model of SAN and NAS may struggle with the requirements demanded by this new generation of data stream processing.
As the world of traditional data storage expands into data stream processing and vice versa, the chatter of sources and sinks can no longer be ignored.
MQs and Pubs-Subs Plumbing
The emergence of message queues and publisher/subscriber (pub-sub) data processing frameworks has enveloped storage infrastructure and data services. These are part of a larger data ecosystem where the movement and quick processing of data define the demands of a new generation of data, one that creates information and value in real time, on-the-spot.
I am not in a position to explain the differences between a message queue and a pub/sub subsystem. I am still a novice here but for the interested, you can read one of the many articles about both here.
The data stream processing frameworks universe has expanded rapidly. For instance, InfluxDB’s Telegraf has hundreds of server agents, Confluent’s Kafka has scores of connectors, and Elasticsearch has many integration points with different sources and sinks. A simple read into the TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack or the ELK (Elasticsearch, Logstash, Kibana) stack reveals the message passing and queues, and the publishing/subscribing mechanisms that attend to and process the data streams.
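For readers who want a feel for the two delivery models named in this section, here is a minimal Python sketch of how they are commonly distinguished. It is a toy under stated assumptions, not how Kafka, Telegraf or Logstash work internally: the class names `MessageQueue` and `PubSubTopic` are hypothetical, and everything runs in memory. In the queue, each message is taken by one consumer; in the pub/sub topic, every subscriber receives its own copy.

```python
from collections import deque


class MessageQueue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Publish/subscribe: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


# Queue: two workers compete for messages, so each reading is taken once.
queue = MessageQueue()
queue.send("reading-1")
queue.send("reading-2")
worker_a = queue.receive()
worker_b = queue.receive()

# Pub/sub: two subscribers (hypothetical sinks) both see the same reading.
topic = PubSubTopic()
dashboard, archive = [], []
topic.subscribe(dashboard.append)
topic.subscribe(archive.append)
topic.publish("reading-1")
```

The two lists `dashboard` and `archive` stand in for two different sinks fed from the same source.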
Given this new adjunctive segment, we are beginning to see the rise of a new sub-discipline of data engineering within the storage infrastructure and data management segment. Welcoming the Data Storage Plumber!
Sources and sinks still need storage
From a data storage architect’s point of view, and incorporating data engineering and plumbing, these sources and sinks still use storage. In a less intimidating way, sources and sinks are a little bit like ETL (extract, transform, load) subsystems. For non-database/data warehouse people, the use of Unix pipes with grep, sed and awk is close to the analogy we are looking for in explaining sources and sinks. But while ETL workloads are often seen as intermediaries between 2 data premises, often with different schemas and contexts, and run in batch jobs, sources and sinks require more agile storage. Time is of the essence in this new “data kitchen” of sources and sinks. The traditional data storage services, block-based or file-based, are considered laggards because data stream processing requires quick assembling and dismantling of storage repositories on-the-fly, often through API-based connectors and infrastructure-as-code (IaC).
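The Unix pipes analogy above can be sketched in a few lines of Python using generators. This is only an illustration: the function names (`source`, `grep`, `to_upper`, `sink`) and the sample records are made up for the example, and a plain list stands in for the storage behind the sink.

```python
def source(lines):
    """Source: emits records one at a time (like `cat` feeding a pipe)."""
    for line in lines:
        yield line


def grep(records, needle):
    """Filter stage: pass through only records containing `needle`."""
    for record in records:
        if needle in record:
            yield record


def to_upper(records):
    """Transform stage: rewrite each record as it flows past."""
    for record in records:
        yield record.upper()


def sink(records, store):
    """Sink: the end of the pipe, where records land in storage."""
    for record in records:
        store.append(record)
    return store


# Hypothetical sample data; the list passed to sink() stands in for storage.
events = ["sensor=a temp=21", "sensor=b temp=35", "sensor=a temp=22"]
stored = sink(to_upper(grep(source(events), "sensor=a")), [])
```

Each record moves through the whole chain before the next one is read, which is the “constant motion” that separates this from a batch ETL job.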
So, the data storage “fixtures” are still very much needed but procured through methods different from storage LUNs (logical unit numbers) or an exposed mount point via a folder/file hierarchy. The new data storage “fixtures” are requested and released through API calls, to support the scaffolding of sources and sinks.
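To make the idea of storage procured through API calls a little more concrete, here is a toy sketch. `ToyObjectStore` is a hypothetical in-memory stand-in, not any real product’s SDK, and the bucket and key names are invented. It only shows the shape of the interaction: create a repository, write and read by key, then dismantle it, with no LUN or mount point involved.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an API-driven object store."""

    def __init__(self):
        self._buckets = {}

    def create_bucket(self, name):
        self._buckets.setdefault(name, {})

    def bucket_exists(self, name):
        return name in self._buckets

    def put_object(self, bucket, key, data):
        self._buckets[bucket][key] = bytes(data)

    def get_object(self, bucket, key):
        return self._buckets[bucket][key]

    def delete_bucket(self, name):
        self._buckets.pop(name, None)


store = ToyObjectStore()

# Assemble a scratch repository for a stream job on-the-fly...
store.create_bucket("stream-scratch")
store.put_object("stream-scratch", "window/0001.json", b'{"count": 3}')
payload = store.get_object("stream-scratch", "window/0001.json")

# ...and dismantle it when the job is done.
store.delete_bucket("stream-scratch")
```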
Object storage is a natural fit and others aren’t
Object storage through natural selection has become the dominant storage repository over the SAN and NAS architectures for the described sources and sinks. So, to argue the case, why aren’t SAN and NAS a good fit to deal with data stream processing? Here is how I view this examination.
Block-based SANs are too restrictive and rigid. Provisioning, even with automation, is still too clunky. Ensuring the LUN (logical unit number) volume is configured correctly in the Fibre Channel zones on-the-fly, or having the IQNs (iSCSI Qualified Names) easily re-provisioned whenever there are instant changes in the data stream, is just too darn complicated. Furthermore, with the SAN framework, the data stored at the block level is too far removed from the data stream applications, piled with layers upon layers of data hierarchies before it can be of value.
Unstructured data seems to be a sticking point for SAN. What about file-based NAS frameworks? The POSIX (Portable Operating System Interface) political correctness was meaningful during the Unix balkanization wars in the 90s and transitioned Unix into Linux very well. But we are in a cloud world now, and operating systems have become pretty much invisible, let alone the requirement to be POSIX-compliant across all clouds. On top of that, POSIX filesystems would struggle when they try to move from cloud to cloud.
The design mindsets of SAN and NAS were about designing and provisioning storage infrastructures to fit a given requirement. Once SAN and NAS frameworks are deployed, the applications and workloads that they serve are supposed to stay within the range of a status quo environment. The fluidity and real time changes in data stream processing would create multiplexes of multiverses that would be impossible to operate and maintain when it comes to SAN and NAS.
Adding to the design mindsets I mentioned, data storage architects have to look at storage architectures that are designed to be durable in the constant flux, resilient to frequent breakdowns in data communications, and able to move from premises to clouds and back easily, without the baggage of inflexibility.
The mindset has been shifting from “Design to Succeed” to “Design to Fail”. Enterprise data storage architects overcompensate in designing data storage and data processing frameworks, resulting in unequal load balancing of usage and processing. The flip side is to always assume that the data ecosystem will fail at some point, and to make provisions to respond to these failures and work around them so that the data processing services continue.
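A small Python sketch of the “Design to Fail” idea, under loud assumptions: `FlakySink` and `write_with_fallback` are hypothetical names, the failure is simulated, and a plain list plays the role of the fallback. The point is only that the write path expects the primary sink to fail and already knows where the data goes when it does.

```python
class FlakySink:
    """Hypothetical sink that fails a fixed number of times before accepting writes."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.records = []

    def write(self, record):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("sink unavailable")
        self.records.append(record)


def write_with_fallback(record, primary, fallback, max_attempts=3):
    """Try the primary sink a few times, then divert the record to the fallback."""
    for _ in range(max_attempts):
        try:
            primary.write(record)
            return "primary"
        except ConnectionError:
            continue  # expected failure: try again
    fallback.append(record)
    return "fallback"


spill = []  # stand-in for a secondary location that keeps the stream moving
recovering = FlakySink(failures_before_success=2)
down = FlakySink(failures_before_success=10)

first = write_with_fallback("event-1", recovering, spill)
second = write_with_fallback("event-2", down, spill)
```

A real deployment would likely add a delay or backoff between attempts and a way to replay the spilled records later; both are left out here to keep the sketch short.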
Furthermore, SAN and NAS designs are seen as subsystems beneath the applications layer, designed to serve data separately. Data stream applications would prefer the data storage to be adjunct and side-by-side to them, since many of these applications heavily utilize in-memory architectures for high speed processing. Object storage has the best qualities and is the most natural fit for these types of applications.
Not all object storage platforms are the same
The voices of sources and sinks can only get louder from here. Their rise to prominence, from data stream applications into the world of data storage, helps in creating a new universe of data storage and data processing. I see no data storage architecture in the near term, other than object storage, that can address what data stream applications are looking for. It is time we strengthen the narratives of object storage.
One very important point to note here is that not all object storage is the same. Industry analysts have already acknowledged this divergence, and prominent technology analyst firm GigaOM has rightly pointed out this dichotomy. One is Enterprise Object Storage and the other is High Performance Object Storage.
Object storage should no longer be looked upon as the “cheap-and-deep” archive storage that was once considered slow and clumsy. That view is still advocated by the earlier (and archaic) object storage technologies like Hitachi HCP (acquired from Archivas in 2007 and previously known as Hitachi Content ARCHIVE Platform or HCAP) or NetApp StorageGRID (acquired from Bycast in 2010). These architectures are dated and meant for the era of spinning hard disk drives. And adding flash SSDs to the bulky architecture does not count as high performance object storage, because their object storage architectures are designed for the “cheap-and-deep” archiving category of the market segment.
Furthermore, many of the sources and sinks integrations are located at the edge and end points. Having a thick architecture like the technologies in the Enterprise Object Storage category defeats the purpose of being agile and nimble in data stream applications.
The right type of object storage to address these data stream applications is high performance object storage. It is time to look at an object storage with a high performance core engine and a small footprint that integrates extremely well with data stream applications. MinIO, an object storage technology I have been following (and practising occasionally) for about 5 years now, has already amassed a large integrated universe with many data stream processing frameworks, and more are coming to the party. Built with blazing fast GoASM and Go, it is extremely efficient, and the server component is less than 50MB (yes, Megabytes!). And yes, MinIO is strictly consistent with strong data persistence.
- [ Note ] One of the challenges in early distributed object storage is consistency. This is where the CAP (Consistency, Availability, Partition Tolerance) theorem comes into play. Object storage can be eventually consistent or strictly consistent.
MinIO has also made smart integration frameworks with Kubernetes, VMware® Tanzu, Splunk and scores of cloud native applications such as PrestoDB, CockroachDB, Apache Spark, InfluxDB, Apache Kafka and so many more.
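The difference between the two consistency models in the note can be shown with a toy two-replica store in Python. This is purely illustrative and is not how MinIO or any other product implements consistency: `ReplicatedStore` and its methods are hypothetical, and “replication” is just a dictionary copy triggered by hand.

```python
class ReplicatedStore:
    """Toy two-replica store; `strict` controls when the second replica is updated."""

    def __init__(self, strict):
        self.strict = strict
        self.replicas = [{}, {}]
        self._pending = []  # writes not yet applied to replica 1

    def put(self, key, value):
        self.replicas[0][key] = value
        if self.strict:
            # Strict: the write returns only after every replica has it.
            self.replicas[1][key] = value
        else:
            # Eventual: return now, replicate later.
            self._pending.append((key, value))

    def get(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        """Stand-in for background replication catching up."""
        for key, value in self._pending:
            self.replicas[1][key] = value
        self._pending.clear()


strict = ReplicatedStore(strict=True)
strict.put("obj", "v1")
strict_read = strict.get("obj", replica=1)    # sees the write immediately

eventual = ReplicatedStore(strict=False)
eventual.put("obj", "v1")
stale_read = eventual.get("obj", replica=1)   # replica 1 has not caught up yet
eventual.sync()
fresh_read = eventual.get("obj", replica=1)   # now it has
```

For a stream application reading right behind a writer, the stale read in the eventual case is the situation to watch out for.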
It’s a changing world. We as data storage architects are taking on additional roles now and this includes being a storage plumber as well. And these are sources and sinks that are fast becoming part of our world as well.