Data Processing Archives

And great AI starts with good Data Management

By cfheoh | November 20, 2023 - 7:00 am |November 15, 2023 Algorithm, Analytics, API, Artificial Intelligence, Backup, Big Data, Business Continuity, Composable Infrastructure, Containers, Data, Data Archiving, Data Management, Data Protection, Data Security, Deep Learning, Digital Transformation, Disaster Recovery, DMTF, iRODS, Machine Learning, Object Storage, Reliability, Security, Software-defined Datacenter, Storage Tiering, Uncategorized, Virtualization

AI is driving up costs in data processing

A few recent articles drew my attention to the cost of data processing.

Here is one posted by a friend on Facebook. It is titled “The world is running out of data to feed AI, experts warn.”

My first reaction was, “How can we run out of data?” We have so much data in the world today that the 175 zettabytes predicted by IDC for 2025 might be grossly inaccurate. According to Exploding Topics, we create an estimated 328.77 million terabytes of data per day, or about 120 zettabytes per year. While I cannot vouch for the accuracy of the numbers, they are humongous.
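As a quick sanity check on the units, here is a minimal sketch that converts 120 zettabytes per year into terabytes per day. It assumes decimal (SI) units and a 365-day year, neither of which is stated in the cited estimate; the function name is my own. Under those assumptions the result is roughly 328.77 million TB per day.

```python
# Assumption: decimal (SI) units, i.e. 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes.
TB_PER_ZB = 10**21 // 10**12  # 1,000,000,000 TB in one ZB


def zb_per_year_to_tb_per_day(zb_per_year: float, days_per_year: int = 365) -> float:
    """Convert a yearly volume in zettabytes to a daily volume in terabytes."""
    return zb_per_year * TB_PER_ZB / days_per_year


daily_tb = zb_per_year_to_tb_per_day(120)
print(f"{daily_tb / 1e6:.2f} million TB per day")
```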

Continue reading →

DellEMC Project Nautilus Re-imagine Storage for Streams

By cfheoh | February 24, 2020 - 5:56 am |February 25, 2020 Algorithm, Analytics, API, Artificial Intelligence, Big Data, Cloud, Confluent, Data, Data Management, Deep Learning, Dell, DellEMC, Edge Computing, EMC, Fog Computing, Industry 4.0, InfluxDB, IoT, Isilon, Kubernetes, Linux, Machine Learning, Pravega, Storage Field Day, Tech Field Day


[ Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in Silicon Valley, USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer, and I was not obligated to blog or promote the vendors’ technologies presented at this event. The content of this blog reflects my own opinions and views ]

Cloud computing will have challenges processing data at the outer reach of its tentacles. Edge Computing, as it melds with the Internet of Things (IoT), needs a different approach to data processing and data storage. Data generated at source has to be processed at source, to respond to the event or events which have happened. Cloud Computing, even with 5G networks, has latency that is too high for an autonomous vehicle to react to pedestrians on the road at speed, for a sprinkler system to be activated in a fire, or even for a fraud detection system to flag money laundering activities as they occur.
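To put the latency argument in physical terms, the sketch below computes how far a vehicle travels while waiting for a round trip to a remote service. The speed and latency figures are hypothetical values chosen for illustration, not measurements of any real network or vehicle.

```python
def distance_travelled_m(speed_kmh: float, latency_ms: float) -> float:
    """Distance in metres covered at speed_kmh during latency_ms of waiting."""
    speed_m_per_s = speed_kmh * 1000 / 3600
    return speed_m_per_s * (latency_ms / 1000)


# Hypothetical example: 60 km/h and a 100 ms round trip to the cloud.
blind_distance = distance_travelled_m(speed_kmh=60, latency_ms=100)
print(f"{blind_distance:.2f} m travelled before a response arrives")
```

Processing at the edge shrinks the latency term, which is the only term in this calculation that the infrastructure controls.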

Furthermore, not all sensors, devices, and IoT end-points are connected to the cloud at all times. To understand this new way of data processing and data storage, have a look at this video by Jay Kreps, CEO of Confluent and co-creator of Kafka®, for this new perspective.

Data is continuously and infinitely generated at source, and this data has to be compiled, controlled and consolidated with nanosecond precision. At Storage Field Day 19, an interesting open source project, Pravega, was introduced to the delegates by DellEMC. Pravega is an open source storage framework for streaming data and is part of Project Nautilus.

Rise of streaming time series Data

Processing data at source has a lot of advantages, and this has popularized Time Series analytics. Many time series and streams-based databases such as InfluxDB, TimescaleDB and OpenTSDB have sprouted over the years, along with open source projects such as Apache Kafka®, Apache Flink and Apache Druid.

The data generated at source (end-points, sensors, devices) is serialized, timestamped (as each event occurs), continuous and infinite. These are the properties of a time series data stream, and to make sense of the streaming data, newer data formats such as Avro, Parquet and ORC pepper the landscape along with the more mature JSON and XML, each with its own strengths and weaknesses.
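These properties can be illustrated with a minimal sketch: an unbounded generator that timestamps each event as it occurs and serializes it to JSON. The field names (`sensor_id`, `seq`, `ts_ns`, `value`) and the synthetic readings are assumptions made for illustration; a real deployment might use Avro or another schema-based format instead.

```python
import itertools
import json
import time
from typing import Iterator


def sensor_events(sensor_id: str) -> Iterator[dict]:
    """Yield an unbounded (continuous, infinite) stream of timestamped events."""
    for seq in itertools.count():
        yield {
            "sensor_id": sensor_id,
            "seq": seq,
            "ts_ns": time.time_ns(),  # timestamped as the event occurs
            "value": 20.0 + (seq % 5) * 0.1,  # synthetic reading
        }


def serialize(event: dict) -> bytes:
    """Serialize one event to UTF-8 encoded JSON."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


# The stream never ends, so a consumer takes a bounded slice of it.
first_three = [serialize(e) for e in itertools.islice(sensor_events("sensor-1"), 3)]
for record in first_three:
    print(record)
```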

You can learn more about these data formats in the 2 links below:

DIY is difficult

Many time series projects started as DIY projects in many organizations, and many of them are still DIY projects in production systems. They depend on tribal knowledge, and these databases are tied to unmanaged storage that is not suited to the properties of streaming data.

At the storage end, the technologies today still rely on the SAN and NAS protocols, and in recent years, S3, with object storage. Block, file and object storage introduce layers of abstraction which may not be a good fit for streaming data.

Continue reading →

Considerations of Hadoop in the Enterprise

By cfheoh | September 9, 2016 - 10:10 pm |September 10, 2016 10Gigabit Ethernet, Data Management, Deduplication, Filesystems, Flash, Hadoop, Hadoop Clusters, High Performance Computing, MapReduce, NetApp, Performance Caching, RAID, Reliability, Server SAN, Software Defined Storage, Solid State Devices, Storage Optimization, Storage Tiering, Virtualization


I am guilty. I have not been tending this blog for quite a while now, but it feels good to be back. What have I been doing? Since leaving NetApp 2 months or so ago, I have been active on the scene again. This time I am more aligned towards data analytics and its burgeoning impact on the storage networking segment.

I was intrigued by an article posted by a friend of mine on Facebook. The article (circa 2013) was titled “Never, ever do this to Hadoop”. It described the author’s gripe with the SAN bigots. I have encountered storage professionals who throw in the SAN solution every time, because that was all they knew. NAS, to them, was like that old relative who smelled of camphor oil, and they avoided NAS like the plague. Similarly, DAS was frowned upon, but how things have changed. The pendulum has swung back to DAS, and new market segments such as VSANs and Hyper Converged platforms have been dominating the scene in the past 2 years. I highlighted this in my blog, “Praying to the Hypervisor God”, almost 2 years ago.

I agree with the author, Andrew C. Oliver. The “locality” of resources is central to Hadoop’s performance.

Consider these 2 models:

In the model on your left (Moving Data to Compute), the delivery process from Storage to Compute is HEAVY. That is because data has dependencies; data has gravity. However, if you consider the model on your right (Moving Compute to Data), delivering data processing to the storage layer is much lighter. Compute or data processing is transient, and the data in the compute layer is volatile. Once compute’s power is turned off, everything starts again from a clean slate, hence the volatile state.
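The difference between the two models can be sketched in a few lines. This is a toy illustration, not how Hadoop itself is implemented: the node names, the partition contents and both function names are hypothetical, and "values sent" simply counts how many items would cross the network in each model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cluster: each storage node holds a local partition of readings.
partitions: Dict[str, List[int]] = {
    "node-a": [3, 8, 1],
    "node-b": [7, 2],
    "node-c": [9, 4, 6, 5],
}


def move_data_to_compute(parts: Dict[str, List[int]]) -> Tuple[int, int]:
    """Ship every record to a central compute node, then aggregate there."""
    shipped = [x for records in parts.values() for x in records]
    return sum(shipped), len(shipped)  # (result, values sent over the network)


def move_compute_to_data(
    parts: Dict[str, List[int]], local_fn: Callable[[List[int]], int]
) -> Tuple[int, int]:
    """Run local_fn where the data lives; only the partial results travel."""
    partials = [local_fn(records) for records in parts.values()]
    return sum(partials), len(partials)


total_a, sent_a = move_data_to_compute(partitions)
total_b, sent_b = move_compute_to_data(partitions, sum)
print(f"data to compute: total={total_a}, values sent={sent_a}")
print(f"compute to data: total={total_b}, values sent={sent_b}")
```

Both models produce the same total, but the second sends one partial result per node instead of every record, which is the sense in which moving compute is lighter.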

Continue reading →
