[ Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in Silicon Valley, USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer, and I was not obligated to blog about or promote the vendors’ technologies presented at this event. The content of this blog is my own opinions and views ]
Cloud computing will have challenges processing data at the outer reach of its tentacles. Edge Computing, as it melds with the Internet of Things (IoT), needs a different approach to data processing and data storage. Data generated at source has to be processed at source, to respond to the event or events which have happened. Cloud Computing, even with 5G networks, carries too much latency for an autonomous vehicle to react to pedestrians on the road at speed, for a sprinkler system to activate in a fire, or for a fraud detection system to flag money laundering activities as they occur.
Furthermore, not all sensors, devices, and IoT end-points are connected to the cloud at all times. To understand this new way of data processing and data storage, have a look at this video by Jay Kreps, CEO of Confluent and co-creator of Apache Kafka®, to view this new perspective.
Data is continuously and infinitely generated at source, and this data has to be compiled, controlled and consolidated with nanosecond precision. At Storage Field Day 19, an interesting open source project, Pravega, was introduced to the delegates by DellEMC. Pravega is an open source storage framework for streaming data and is part of Project Nautilus.
Rise of streaming time series data
Processing data at source has many advantages, and this has popularized time series analytics. Many time series and streams-based databases such as InfluxDB, TimescaleDB and OpenTSDB have sprouted over the years, along with open source projects such as Apache Kafka®, Apache Flink and Apache Druid.
The data generated at source (end-points, sensors, devices) is serialized, timestamped (as the event occurs), continuous and infinite. These are the properties of a time series data stream, and to make sense of the streaming data, new data formats such as Avro, Parquet and ORC pepper the landscape along with the more mature JSON and XML, each with its own strengths and weaknesses.
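To make those properties concrete, here is a minimal sketch of one timestamped, serialized event as it might leave a sensor. The field names are illustrative assumptions, not a standard schema, and JSON is used only because it needs no extra libraries; Avro or Parquet would add a schema and a compact binary encoding on top of the same idea.

```python
import json
import time

def make_event(sensor_id, value, ts=None):
    """Serialize one sensor reading as a JSON event, timestamped at source."""
    event = {
        "sensor_id": sensor_id,                               # which end-point produced it
        "timestamp": ts if ts is not None else time.time(),   # event time, set as it occurs
        "value": value,                                       # the reading itself
    }
    return json.dumps(event)

# Each event is one point in a continuous, append-only time series stream.
payload = make_event("temp-01", 21.7, ts=1579700000.0)
decoded = json.loads(payload)
```

In a schema-based format such as Avro, the `sensor_id`/`timestamp`/`value` structure would be declared once in a schema and enforced for every event, which is one of the strengths those newer formats trade against JSON's flexibility.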
You can learn more about these data formats in the two links below:
DIY is difficult
Many time series projects started as DIY efforts in many organizations, and many of them remain DIY projects in production systems as well. They depend on tribal knowledge, and these databases are tied to unmanaged storage that is not congruent with the properties of streaming data.
At the storage end, the technologies today still rely on the SAN and NAS protocols and, in recent years, S3 with object storage. Block, file and object storage introduce layers of abstraction which may not be a good fit for streaming data.
Streams as first-class storage citizens
The Pravega storage framework is based on many of the properties of time series data streams, with strong influence from Apache Flink. Some of the required properties on which Pravega is constructed include:
- Named partitions, where different “swim lanes” are divided by each application’s stream identifier, and data streams can be independently accessed and processed according to each application’s rate and service level objectives.
- Durable, where the data streams must be stored persistently and accessed in a strongly consistent manner, guaranteed to be processed exactly once. Durability must scale in lock-step with the storage and processing components of the ecosystem and integrate with the checkpoint semantics of the chosen framework.
- Append-only, where the storage must offer low latency for append-only write I/O and for read I/O at the tail end of the data stream sequence. It must also scale read I/O operations for historical data streams.
- Infinite serialization of data, where the storage framework addresses the high-speed, high-volume chains of data streams that link different jobs with different stream semantics, dynamically and elastically.
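The properties above can be sketched as a tiny, conceptual in-memory model: named partitions keyed by routing key, append-only writes, low-latency tail reads, and historical catch-up reads from an offset. This is an illustration of the semantics only and an assumption of mine, not the Pravega API, which provides real durability, scaling and exactly-once guarantees.

```python
from collections import defaultdict

class Stream:
    """Conceptual stream: named partitions of append-only event logs."""

    def __init__(self, name):
        self.name = name
        self.partitions = defaultdict(list)  # routing key -> append-only log

    def append(self, routing_key, event):
        """Append-only write; returns the event's offset in its partition."""
        log = self.partitions[routing_key]
        log.append(event)
        return len(log) - 1

    def read_tail(self, routing_key):
        """Low-latency read at the tail of one partition ("swim lane")."""
        log = self.partitions[routing_key]
        return log[-1] if log else None

    def read_from(self, routing_key, offset):
        """Historical read: replay events from a given offset onward."""
        return self.partitions[routing_key][offset:]

stream = Stream("sensor-events")
stream.append("temp-01", {"ts": 1, "value": 21.5})
stream.append("temp-01", {"ts": 2, "value": 21.7})
stream.append("door-02", {"ts": 1, "value": "open"})
```

Note how each routing key is processed independently at its own rate, which is what lets readers with different service level objectives share the same stream without interfering with one another.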
With the advent of Pravega, data streams are promoted to a new class of storage primitives and may take their place alongside first-class storage protocols such as Fibre Channel, SMB or NFS. The intent is clear, and its ambitious ecosystem is shared below.
Project Nautilus is the composite software platform for both streaming real-time data and historical batch data across many industries. Several use cases in Manufacturing, Healthcare, Financial and even Smart Buildings were shared by DellEMC.
The Pravega storage framework is at the heart of it. To the north, it handles the ingestion and storage of streaming data, supplying the streaming data in real time or in batches to unified analytics platforms such as Flink and Apache Spark via storage connectors. To the south, it integrates and interacts with persistent storage infrastructure platforms such as Isilon and ECS. The software components of Project Nautilus are containerized and run on Pivotal Container Service (PKS).
When Pravega and Project Nautilus were unveiled in 2017, they may have appeared to be a novel idea to storage technologists. But the meteoric rise of Edge Computing and IoT is pushing the envelope, and present storage infrastructure platforms and frameworks are beginning to sound and look archaic for streaming data.
Both Pravega and Project Nautilus are disruptive technologies that bring new ways to store, manage and process data from the infrastructure end. DellEMC is challenging a new breed of data processing platforms such as Confluent for Kafka® and InfluxData’s InfluxDB™ with its TICK and TIG stacks, which invade from the developers’ end.
The time to bring this data streaming storage framework and platform to fruition is now, as streams are becoming more prevalent at the Edge.
[ Note: While this blog was in the queue to be published, Dell EMC announced their “data center in a box” solution for edge storage and analytics. Announcement here. ]