Pravega Archives - Storage Gaga

All the Sources and Sinks going to Object Storage

By cfheoh | April 25, 2022 - 8:00 am |April 24, 2022 Algorithm, Analytics, API, Artificial Intelligence, Big Data, Confluent, Containers, Data, Data Management, Deep Learning, Digital Transformation, Edge Computing, Filesystems, Google Anthos, HDS, Hitachi Vantara, Hyperconvergence, InfluxDB, Kubernetes, Linux, Machine Learning, Minio, NAS, NetApp, Object Storage, Pravega, Storage Area Network


The vocabulary of sources and sinks is beginning to appear in the world of data storage as we witness the addition of new data processing frameworks and applications in this space. I wrote about this in my blog “Rethinking data processing frameworks systems in real time” a few months ago, introducing my take on this budding new set of I/O characteristics and data ecosystem. I also started learning about the Kappa Architecture (and Lambda as well), a framework designed to craft and develop a set of amalgamated technologies to handle stream processing of a series of data in relation to time.

In Computer Science, sources and sinks are considered external entities that often serve as connectors of input and output between disparate systems. They are often outside the purview of data storage architects. Just as often, these sources and sinks are viewed as black boxes, and their inner workings are hidden from the view of the data storage architects.

Diagram from https://developer.here.com/documentation/get-started/dev_guide/shared_content/topics/olp/concepts/pipelines.html

The changing facade of data stream processing presents data in constant motion, continuously altered as it passes through the many integrated sources and sinks. We also see the data processed in-memory as much as possible. Thus, the data services from a traditional storage model of SAN and NAS may struggle with the requirements demanded by this new generation of data stream processing.

As the world of traditional data storage processing expands into data stream processing and vice versa, the chatter of sources and sinks can no longer be ignored.
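To make the source and sink idea concrete, here is a minimal sketch in Python, not tied to any real streaming framework. The names sensor_source, to_fahrenheit and ListSink are hypothetical and exist only for illustration. The point is that each event is altered in memory while it moves from the source to the sink, and nothing lands on storage until the sink decides to write it.

```python
import time
from typing import Dict, Iterable, Iterator, List


def sensor_source(n: int) -> Iterator[Dict]:
    """Hypothetical source: emits n timestamped Celsius readings (0, 5, 10, ...)."""
    for i in range(n):
        yield {"ts": time.time(), "sensor": "s1", "value": float(i * 5)}


def to_fahrenheit(events: Iterable[Dict]) -> Iterator[Dict]:
    """In-flight transform: each event is altered in memory as it passes through."""
    for event in events:
        yield {**event, "value": event["value"] * 9 / 5 + 32}


class ListSink:
    """Hypothetical sink: collects events in a list.

    A real sink might write to object storage, a database or another stream.
    """

    def __init__(self) -> None:
        self.records: List[Dict] = []

    def write(self, event: Dict) -> None:
        self.records.append(event)


def run_pipeline(source: Iterable[Dict], sink: ListSink) -> ListSink:
    """Pull events from the source, transform them, and push them to the sink."""
    for event in to_fahrenheit(source):
        sink.write(event)
    return sink


sink = run_pipeline(sensor_source(3), ListSink())
values = [record["value"] for record in sink.records]
```

Because the source and the transform are generators, only one event is held in memory at a time, which is roughly the behaviour a stream processing engine aims for at much larger scale.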


Rethinking data processing frameworks systems in real time

By cfheoh | December 27, 2021 - 8:00 am |December 26, 2021 Algorithm, Amazon Web Services, Analytics, API, Artificial Intelligence, Confluent, Containers, Data, Data Management, Data Privacy, Data Protection, Data Security, Digital Transformation, Google, Hadoop, Hadoop Clusters, InfluxDB, Machine Learning, MapReduce, Microsoft Azure, Pravega, Scale-out architecture


“Row, row, row your boat, gently down the stream…”

Except the stream isn’t gentle at all in the new context of data processing.

For many of us in the storage infrastructure and data management world, the well known framework is storing and retrieving data from a storage medium. That medium could be a disk-based storage array, a tape, or some cloud storage where the storage medium is abstracted from the users and the applications. The model of post-processing the data after it has been safely and persistently stored on that medium is a well understood and mature one. Users, applications and workloads (A&W) process this data in its resting phase, retrieve it, work on it, and write it back to the resting phase again.

There is another model of data processing that has been bubbling over the years and is now reaching a boiling point. Still, it has not reached its apex yet. This is processing the data in flight, while it is still flowing as it passes through a processing engine. The nature of this kind of data was described at a 2018 conference I chanced upon a year ago.

letgo marketplace processing numbers in 2018

* NRT = near real time

From a storage technology infrastructure perspective, this kind of data processing piqued my curiosity immensely. I have been studying this burgeoning new data processing model in my spare time to see where it fits, and bringing that understanding back into the storage infrastructure and data management side.


How well do you know your data and the storage platform that processes the data

By cfheoh | December 20, 2021 - 7:30 am |December 20, 2021 100Gigabit Ethernet, 10Gigabit Ethernet, Algorithm, Analytics, Appliance, Backup, Big Data, Business Continuity, Cloud, Clusters, Composable Infrastructure, compression, Confluent, Data Archiving, Data Availability, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deduplication, Digital Transformation, Disaster Recovery, Edge Computing, Filesystems, Hyperconvergence, ILM, Industry 4.0, InfluxDB, iRODS, Machine Learning, NAS, NFS, NVMe, Object Storage, Performance Caching, Pravega, Reliability, SATA, Scale-out architecture, Security, Software Defined Storage, Storage Area Network, Storage Optimization, Storage Tiering, Unified Storage, VDI, Virtualization


Last week was consumed by many conversations on this topic. I was quite jaded, really. Unfortunately many still take a very simplistic view of storage technology, or should I say, of the over-marketing of storage technology. So much so that end users make incredible assumptions about the benefits of a storage array, a software defined storage platform or even cloud storage. And too often the caveats of turning on a feature and tuning a configuration to the max are discarded or neglected. Regard for good storage and data management best practices? What’s that?

I share some of my thoughts on handling conversations like these, and I try to set the right expectations rather than overhype a feature or a function in the data storage services.

Complex data networks and the storage services that serve it

I/O Characteristics

Applications and workloads (A&W) read from and write to the data storage services platforms. These could be local DAS (direct-attached storage), network storage arrays in SAN and NAS, and now object storage, or cloud storage services. Regardless of structured or unstructured data, different A&Ws have different behavioural I/O patterns in accessing data from storage. Therefore storage has to be configured to match these patterns as closely as possible, so that it can perform optimally for these A&Ws. Without going into deep details, here are a few to think about:

Random and Sequential patterns
Block sizes of these A&Ws ranging from typically 4K to 1024K.
Causal effects of synchronous and asynchronous I/Os to and from the storage
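As a rough illustration of the first two points, the Python sketch below reads the same 1 MiB scratch file with three block sizes, once in sequential order and once in shuffled (random) order, and counts the read operations issued. The helper names read_blocks and make_offsets are my own for this example. It deliberately does not time anything, because a small file like this is served from the operating system page cache and the timings would say little about a real storage array.

```python
import os
import random
import tempfile
from typing import List


def make_offsets(file_size: int, block_size: int, sequential: bool, seed: int = 0) -> List[int]:
    """Return the block-aligned offsets to read, in order or shuffled."""
    offsets = list(range(0, file_size, block_size))
    if not sequential:
        random.Random(seed).shuffle(offsets)  # random access pattern
    return offsets


def read_blocks(path: str, block_size: int, offsets: List[int]) -> int:
    """Issue one unbuffered read per offset and return the total bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(block_size))
    return total


file_size = 1024 * 1024  # 1 MiB scratch file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(file_size))
    path = tmp.name

results = {}
try:
    for block_size in (4 * 1024, 64 * 1024, 1024 * 1024):  # 4K, 64K, 1024K
        for sequential in (True, False):
            offsets = make_offsets(file_size, block_size, sequential)
            bytes_read = read_blocks(path, block_size, offsets)
            # Store (number of I/O operations, bytes read) per combination.
            results[(block_size, sequential)] = (len(offsets), bytes_read)
finally:
    os.remove(path)
```

The same amount of data costs 256 read operations at 4K and a single operation at 1024K, which is one reason the block size of an A&W matters so much when sizing and configuring storage.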


The Starbucks model for Storage-as-a-Service

By cfheoh | October 11, 2021 - 8:00 am |October 9, 2021 Amazon Web Services, API, Appliance, Artificial Intelligence, Big Data, Cloud, Data Archiving, Data Availability, Data Corruption, Data Fabric, Data Management, Data Privacy, Data Protection, Data Security, Deep Learning, Digital Transformation, FreeNAS, Google Anthos, ILM, iXsystems, Kubernetes, Nextcloud, Pravega, Pure Storage, Software Defined Storage, TrueNAS


Starbucks™ is not a coffee shop. It purveys more than coffee, tea and food; it puts together the yuppie beverage experience. The intention is to get the customers to stay as long as they can, and keep purchasing Starbucks’ smorgasbord of high margin provisions in volume. Wifi, ambience, status, coffee or tea with your name on it (plenty of jokes and memes there), energetic baristas and servers, fancy coffee roasts and beans, et al. All part of the Starbucks™-as-a-Service pleasurable affair that intends to lock the customer in and have them keep coming back.

The Starbucks experience

Data is heavy and they know it

Unlike compute and network infrastructures, storage infrastructures hold data persistently and permanently. Data has to land on a piece of storage medium. Couple that with the fact that data is heavy, forever growing and has gravity, and you have a perfect recipe for lock-in. All storage purveyors, whether they sell on-premises data center enterprise storage, public cloud storage, or anything in between, have many, many methods to keep the data chained to a storage technology or a storage service for a long time. Storage-as-a-service is like tying the cow to the stake and milking it continuously. This business model is very sticky. This stickiness is also a lock-in mechanism.


The Edge is coming! The Edge is coming!

By cfheoh | October 12, 2020 - 9:15 am |October 11, 2020 100Gigabit Ethernet, Analytics, Big Data, Containers, Data, Deep Learning, Edge Computing, Flash, Industry 4.0, InfluxDB, Linux, Machine Learning, Mellanox, Mellanox Technologies, Minio, nVidia, NVMe, Pravega, SNIA, Solid State Devices


Actually, Edge Computing is already here. It has been on everyone’s lips for quite some time, but for me and for many others, Edge Computing is still a hodgepodge of many things. The proliferation of devices, IoT, sensors and endpoints being pulled into the ubiquitous term of Edge Computing has made the scope ever changing and difficult to pin down. And it is this proliferation of edge devices that will generate voluminous amounts of data. Obvious questions emerge:

How do you store all the data?
How do you process all the data?
How do you derive competitive value from the data from these edge devices?
How do you securely transfer and share the data?

From the storage technology perspective, it might be easier to observe the traits of the data generated on the edge device. In this blog, we also look at some new storage technologies out there that could be part of the present and future of Edge Computing.

Edge Computing overview – Cloud to Edge to Endpoint

Storage at the Edge

The mantra of putting compute as close to the data as possible, and processing the data where it is stored, is the main crux right now, at least where storage of the data is concerned. The latency of a round trip to the computing resources in the cloud and back to the edge devices is not conducive, and in many older settings, the edge devices in a factory may not even be network enabled. In my last encounter several years ago, there were more than 40 interfaces, specifications and protocols for the edge devices, most of them proprietary. And there is no industry wide standard for these edge devices either.


DellEMC Project Nautilus Re-imagine Storage for Streams

By cfheoh | February 24, 2020 - 5:56 am |February 25, 2020 Algorithm, Analytics, API, Artificial Intelligence, Big Data, Cloud, Confluent, Data, Data Management, Deep Learning, Dell, DellEMC, Edge Computing, EMC, Fog Computing, Industry 4.0, InfluxDB, IoT, Isilon, Kubernetes, Linux, Machine Learning, Pravega, Storage Field Day, Tech Field Day


[ Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer and I was not obligated to blog or promote the vendors’ technologies presented at this event. The content of this blog is of my own opinions and views ]

Cloud computing will have challenges processing data at the outer reach of its tentacles. Edge Computing, as it melds with the Internet of Things (IoT), needs a different approach to data processing and data storage. Data generated at source has to be processed at source, to respond to the event or events which have happened. Cloud Computing, even with 5G networks, has latency that is too high for an autonomous vehicle to react to pedestrians on the road at speed, for a sprinkler system to be activated in a fire, or even for a fraud detection system to signal money laundering activities as they occur.

Furthermore, not all sensors, devices, and IoT end-points are connected to the cloud at all times. To understand this new way of data processing and data storage, have a look at this video by Jay Kreps, CEO of Confluent for Kafka® to view this new perspective.

Data is continuously and infinitely generated at source, and this data has to be compiled, controlled and consolidated with nanosecond precision. At Storage Field Day 19, an interesting open source project, Pravega, was introduced to the delegates by DellEMC. Pravega is an open source storage framework for streaming data and is part of Project Nautilus.

Rise of streaming time series Data

Processing data at source has a lot of advantages and this has popularized Time Series analytics. Many time series and streams-based databases such as InfluxDB, TimescaleDB, OpenTSDB have sprouted over the years, along with open source projects such as Apache Kafka®, Apache Flink and Apache Druid.

The data generated at source (end-points, sensors, devices) is serialized, timestamped (as each event occurs), continuous and infinite. These are the properties of a time series data stream, and to make sense of the streaming data, newer data formats such as Avro, Parquet and ORC pepper the landscape along with the more mature JSON and XML, each with its own strengths and weaknesses.
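As a small illustration of "serialized and timestamped", the Python sketch below builds one event and serializes it to JSON bytes, the way it might travel on the wire. The field names sensor_id, ts and value, and the helper names, are assumptions made for this example and do not come from any particular product. JSON is used only because it ships with the Python standard library; Avro, Parquet and ORC need third-party libraries and add schemas and binary or columnar encodings.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def make_event(sensor_id: str, value: float, ts: datetime) -> Dict:
    """Build one event, stamped with the time at which it occurred."""
    return {"sensor_id": sensor_id, "ts": ts.isoformat(), "value": value}


def serialize(event: Dict) -> bytes:
    """Serialize the event to UTF-8 JSON bytes with a stable key order."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> Dict:
    """Reverse of serialize: bytes back to a dictionary."""
    return json.loads(payload.decode("utf-8"))


# A fixed UTC timestamp keeps the example reproducible; a real source
# would use the time at which the event actually occurred.
ts = datetime(2020, 2, 24, 5, 56, 0, tzinfo=timezone.utc)
event = make_event("sensor-01", 21.5, ts)
wire = serialize(event)
round_trip = deserialize(wire)
```

A stream is then simply an unbounded sequence of such payloads, ordered by their timestamps.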


DIY is difficult

Many time series projects started as DIY projects in many organizations. And many of them are still DIY projects in production systems as well. They depend on tribal knowledge, and these databases are tied to unmanaged storage that is not congruent with the properties of streaming data.

At the storage end, the technologies today still rely on the SAN and NAS protocols, and in recent years, S3, with object storage. Block, file and object storage introduce layers of abstraction which may not be a good fit for streaming data.

