“Row, row, row your boat, gently down the stream…”
Except the stream isn’t gentle at all in the new context of data processing.
For many of us in the storage infrastructure and data management world, the well-known framework is storing and retrieving data from a storage medium. That medium could be a disk-based storage array, a tape, or some cloud storage where the storage media is abstracted from the users and the applications. The model of post-processing the data after it has been safely and persistently stored on that medium is a well-understood and mature one. Users, applications and workloads (A&W) process this data in its resting phase, retrieve it, work on it, and write it back to the resting phase again.
There is another model of data processing that has been bubbling up over the years and is now reaching a boiling point, although it has not hit its apex yet. This is processing data in flight, while it is still flowing, as it passes through a processing engine. The nature of this kind of data was described in a 2018 conference presentation I chanced upon a year ago.
From a storage technology infrastructure perspective, this kind of data processing piqued my curiosity immensely. I have been studying this burgeoning new data processing model in my spare time, and where it fits, bringing the understanding back to the storage infrastructure and data management side.
No storage can handle that
From a traditional mindset, few storage platforms, on-premises or in the cloud, can handle that kind of volume and workload efficiently and still respond with sub-second latency. In describing this type of data processing, we are really seeing the 4Vs of Big Data: Volume, Velocity, Variety and Veracity.
If there were a way to describe this kind of data, it would be an F5 tornado on the Fujita scale. The sheer magnitude of its force would decimate any traditional storage platform, unless you have a supercomputing-class storage infrastructure.
The storage to meet this type of data processing has to be ultra low latency, able to handle very high ingestion rates, and agile enough to absorb complex I/O data patterns.
The 4 areas of learning
I made up these 4 areas purely as my learning structure:
- Ingestion and Message Queues Processing
- Streams Processing
- Next generation database processing
- Real-time visualization
Ingestion and Message Queues Processing
Consuming the voluminous myriad of data at a high ingress rate would inundate any storage platform, so there has to be a subsystem to handle these workloads. The data could come from different sources, with different data types and characteristics, as well as different intensities and capacities of workloads.
Integration with different sources requires different types of connectors. The sources could be regular databases, logs from different devices and software, search and click streams, and more. Data formats such as Apache Avro and Apache Parquet are alien to me, as are a ton of other emerging formats.
The actions of collecting, aggregating and moving large volumes of data at high speed are part of the vocabulary now. Message queuing is part of this framework, linking applications and datastores. On my radar, there are open source projects such as Apache Flume, and commercial ones such as InfluxData® Telegraf (part of the TICK stack), Elastic Logstash and Elastic Beats (both part of the ELK stack), and a few more.
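To make that collect-aggregate-move pattern a little more concrete, here is a minimal Python sketch of a batching ingestion buffer. This is purely my own illustration, not the API of Flume, Telegraf or any of the tools named above; the `IngestBuffer` class, its `sink` callback and the record shapes are all made-up names.

```python
from collections import deque

class IngestBuffer:
    """Toy collector: accumulate incoming records and flush them
    downstream in batches once a size threshold is reached."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink          # callable that receives one batch (a list)
        self.buffer = deque()

    def ingest(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # move whatever has accumulated downstream, then reset
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

batches = []
buf = IngestBuffer(batch_size=3, sink=batches.append)
for i in range(7):
    buf.ingest({"source": "sensor-1", "value": i})
buf.flush()                       # drain the leftover records
print([len(b) for b in batches])  # → [3, 3, 1]
```

Real collectors add persistence, retries and back-pressure on top of this basic batching idea, but the shape of the flow is the same.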
Streams Processing
Here we look at the streams processing pipeline. The large volume of data coming from the ingestion subsystem has to be linked up with some sort of “sorting” mechanism, so that the data is “assigned” to the right pipeline to be processed.
Here, publisher/subscriber subsystems appear to be gaining popularity, but there are other streams processing and data analytics engines as well. I am picking up bits of learning from Apache Spark, Apache Kafka, InfluxData Kapacitor, Apache Samza (from LinkedIn®) and many more.
While learning this segment, the Men In Black II post office scene, where an alien with many hands rapidly sorts letters, comes to mind.
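As a toy illustration of that publisher/subscriber “sorting” idea, here is a minimal in-memory broker in Python. It is my own sketch of the pattern, not Kafka’s or any other engine’s API; the `Broker` class and the topic names are invented for this example.

```python
from collections import defaultdict

class Broker:
    """Toy publish/subscribe broker: publishers send records to a topic,
    and every subscriber registered on that topic receives them."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, record):
        # "sort" the record to every pipeline subscribed to this topic
        for handler in self.subscribers[topic]:
            handler(record)

metrics, logs = [], []
broker = Broker()
broker.subscribe("metrics", metrics.append)
broker.subscribe("logs", logs.append)

broker.publish("metrics", {"cpu": 87})
broker.publish("logs", "disk /dev/sda1 at 91%")
broker.publish("metrics", {"cpu": 45})

print(len(metrics), len(logs))  # → 2 1
```

Real brokers add partitioning, persistence and replay on top of this routing idea, which is what lets them absorb the ingestion rates discussed earlier.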
Next generation database processing
Moving along, the data has to be stored in optimized temporal databases, or in some sort of wide-column NoSQL databases and/or key-value data stores.
A TSDB (time series database) stores many kinds of data with timestamps (measured down to milliseconds in some) to effectively aggregate and query logs, metrics, and other measured data for analytics. NoSQL databases have more flexible schemas and are better optimized to handle unstructured data, while key-value data stores offer simpler access to data through a hashed key.
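To give a feel for why timestamp-ordered storage makes aggregation queries cheap, here is a toy time series store in Python. This is only my sketch of the idea, assuming points arrive in timestamp order; `TinyTSDB` is a made-up name, not any real product.

```python
from bisect import bisect_left, bisect_right

class TinyTSDB:
    """Toy time series store: points are kept sorted by timestamp, so a
    time-range query reduces to two binary searches over the index."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def insert(self, ts, value):
        # assumes points arrive in timestamp order (typical for ingestion)
        self.timestamps.append(ts)
        self.values.append(value)

    def query_avg(self, start, end):
        """Average of all values with start <= ts <= end, or None."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_right(self.timestamps, end)
        window = self.values[lo:hi]
        return sum(window) / len(window) if window else None

db = TinyTSDB()
for ts, cpu in [(1000, 40), (1001, 55), (1002, 90), (1003, 85), (1010, 30)]:
    db.insert(ts, cpu)

print(db.query_avg(1001, 1003))  # average of 55, 90 and 85
```

Production TSDBs layer compression, downsampling and retention policies on top, but the timestamp-as-primary-index idea is the common core.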
Real-time visualization
At the other end of data processing, we now have to see the patterns, the valleys and the peaks of streaming data in near real time. At the visualization point, there are many frameworks and new application sets that deliver data panels providing reports, presentations, dashboards and graphs. Here we can observe trends, changes, movements, courses and shifts, visualize them in real time, and even set up what-if scenarios, alerts, notifications and more.
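A trivial way to picture the “alerts on peaks” idea is a sliding-window moving average with a threshold. Again, this is a toy Python sketch, not any dashboard product’s mechanism; the `alert_on_peaks` function, the window size and the threshold are all assumptions of mine.

```python
from collections import deque

def alert_on_peaks(stream, window=3, threshold=80):
    """Toy near-real-time check: keep a sliding window over the incoming
    stream and record an alert whenever the window's moving average
    crosses the threshold."""
    recent = deque(maxlen=window)
    alerts = []
    for ts, value in stream:
        recent.append(value)
        avg = sum(recent) / len(recent)
        if avg > threshold:
            alerts.append((ts, round(avg, 1)))
    return alerts

# simulated CPU utilisation samples: (timestamp, percent busy)
cpu_stream = [(1, 40), (2, 70), (3, 95), (4, 92), (5, 88), (6, 30)]
print(alert_on_peaks(cpu_stream))  # → [(4, 85.7), (5, 91.7)]
```

A dashboard would render the same window as a graph and fire a notification at those two points instead of printing them.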
As 4Vs (volume, velocity, variety, veracity) data gets distilled and filtered through these data processing frameworks, the end game is Data Insight. With insight, organizations will be able to understand the data collected and processed, and take prudent, actionable steps. Seeing gaps in strategies, solving problems, being alerted to possible impacts, improving the processing, fixing issues and many, many more outcomes can be derived from the visualization of data, transforming it into relevant and applicable duties. This is what Digital Transformation is about, in my opinion. Why else would you call it Data Driven Insights?
Wrapping it all up for 2021
I am an absolute novice in this new generation of data processing. Many things I wrote here could be very wrong, and my thoughts and learning of these frameworks and applications are very rudimentary. But I am trying to make sense of a changing data world, and this nascent learning structure of mine will serve as my scaffolding to grasp these new concepts, and turn these new abstractions into more concrete understanding. Metaphysical to physical, perhaps? Maybe?
The next obvious question is how all of this links back to storage, persistent or transient. I have been accumulating bits and pieces here and there to assemble the knowledge, and hopefully the experience as well, to equip myself to master these new data processing concepts. I still have a long, long way to go. I am open to new learnings, and new teachings too.
This wraps up my last blog of 2021, my 370th to be exact. Happy New Year!