Last week was consumed by many conversations on this topic, and frankly, I came away quite jaded. Many still take a very simplistic view of storage technology, or should I say, of the over-marketing of storage technology. So much so that end users make incredible assumptions about the benefits of a storage array, a software-defined storage platform, or even cloud storage. Too often, the caveats of turning on a feature or tuning a configuration to the max are discarded or neglected. Regard for good storage and data management best practices? What’s that?
Here I share some of my thoughts on handling conversations like these, and on setting the right expectations rather than overhyping a feature or a function of the data storage services.
I/O Characteristics
Applications and workloads (A&W) read from and write to data storage services platforms. These could be local DAS (direct-attached storage), network storage arrays in SAN and NAS environments, and now object stores or cloud storage services. Regardless of whether the data is structured or unstructured, different A&Ws have different behavioural I/O patterns when accessing data from storage. Therefore, storage has to be configured to match these patterns as closely as possible, so that it can perform optimally for those A&Ws. Without going into deep details, here are a few to think about (with a small sketch after the list):
- Random and sequential access patterns
- Block sizes used by these A&Ws, typically ranging from 4K to 1024K
- The causal effects of synchronous and asynchronous I/Os to and from the storage
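To make the access pattern point concrete, here is a minimal sketch in Python that reads the same file sequentially and then at random offsets. It is an illustration, not a benchmark (a real test would use a tool like fio with direct I/O to bypass the page cache), and the file path and sizes below are my own assumptions:

```python
# A minimal sketch (not a benchmark tool) to illustrate how access pattern
# changes what the same storage can deliver for the same amount of data.
# Note: the OS page cache will flatter both numbers; real tests use O_DIRECT.
import os, random, time

PATH = "/tmp/io_pattern_test.bin"   # hypothetical scratch file
FILE_SIZE = 256 * 1024 * 1024       # 256 MiB test file
BLOCK_SIZE = 4 * 1024               # try 4K, then 64K, then 1024K

# Create the test file once
with open(PATH, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

def read_pattern(sequential: bool) -> float:
    """Read the whole file in BLOCK_SIZE chunks, sequentially or randomly."""
    offsets = list(range(0, FILE_SIZE, BLOCK_SIZE))
    if not sequential:
        random.shuffle(offsets)     # random pattern: same bytes, scattered seeks
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK_SIZE)
    elapsed = time.perf_counter() - start
    return (FILE_SIZE / (1024 * 1024)) / elapsed   # MB/s

print(f"sequential: {read_pattern(True):8.1f} MB/s")
print(f"random:     {read_pattern(False):8.1f} MB/s")
```

On hard disk drives especially, the gap between those two numbers is exactly why the same capacity can be "fast" for one A&W and hopeless for another.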
Here are a few anecdotes to share:
- One end user asked for 50,000 IOPS for backup storage because a competitor told her that backup is an I/O-intensive job. She ate it up. Backup is really a sequential, large-block, throughput-bound workload, so a raw IOPS number says very little on its own (see the quick arithmetic after this list).
- Another end user wanted to run deep learning and machine learning applications on drone images directly against AWS S3 object storage because it is the cheapest storage.
- An end user complained that their TrueNAS® CORE system on an ASUS x86 server with hard disk drives was very slow, because each user had an average of 200,000 mails on the Postfix mail server. FYI, Postfix uses the maildir format (as opposed to mbox), where one email equates to one file. The hard disks in the configured vdev were never meant to take that kind of small-file I/O load.
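On the first anecdote, a bit of arithmetic helps. Throughput is roughly IOPS multiplied by I/O size, so the same "50,000 IOPS" means wildly different things at 4K and at 1M. A trivial sketch, with illustrative numbers rather than figures from any particular array:

```python
# Throughput (MB/s) = IOPS x I/O size. An IOPS figure is meaningless without
# the block size; backup streams care about MB/s, not IOPS.
def throughput_mb_s(iops: int, block_size_kb: int) -> float:
    return iops * block_size_kb / 1024.0

for bs in (4, 64, 256, 1024):                     # typical A&W block sizes in KB
    print(f"50,000 IOPS at {bs:4d}K = {throughput_mb_s(50_000, bs):8.1f} MB/s")
```

A backup job streaming a few hundred MB/s at 256K or 1M block sizes will report a very modest IOPS number, and that is perfectly fine.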
Data paths
Between the compute and the storage location lies a path. It can be local NVMe on the PCIe bus, a local area network, or a wider area network. I often look at this from the “data locality, data mobility” point of view. Having the data close to the compute reduces the turnaround response time over that path, near or far. That is the data locality part, and it largely defines overall performance.
On the other hand, there are other aspects to understanding data. Correlations and dependencies are two of the most important considerations in data processing, and these tie into the storage services as well. I sometimes think about tail latency too, because even the fastest storage platform still has to deal with the slowest functions in the pipeline.
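On tail latency, it is the p99 or p99.9 of the response time distribution that end users actually feel, not the average. A small sketch of how I would look at it, using made-up latency samples:

```python
# Tail latency sketch: the average hides the slow outliers that users feel.
# The latency samples below are synthetic, purely for illustration.
import random
import statistics

random.seed(42)
# 10,000 I/O completions: mostly ~1 ms, plus a slow tail around 20-50 ms
samples_ms = [random.expovariate(1.0) + 0.5 for _ in range(10_000)]
samples_ms += [random.uniform(20, 50) for _ in range(100)]   # the long tail

def percentile(data, pct):
    ordered = sorted(data)
    idx = min(len(ordered) - 1, int(round(pct / 100.0 * (len(ordered) - 1))))
    return ordered[idx]

print(f"mean  : {statistics.mean(samples_ms):6.2f} ms")
print(f"p50   : {percentile(samples_ms, 50):6.2f} ms")
print(f"p99   : {percentile(samples_ms, 99):6.2f} ms")
print(f"p99.9 : {percentile(samples_ms, 99.9):6.2f} ms")
```

The mean and the median look healthy; the p99.9 is where the complaints come from.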
Understanding how data traverses the data paths, end-to-end, from A&W to storage processing and back, should be part of every data framework design, at both the infrastructure and the software architecture levels.
Data reduction
I often link data footprint reduction (DFR) in storage to deduplication, compression and thin provisioning. This conversation comes up a lot, especially around deduplication. Somehow, many have the notion that turning on deduplication (or leaving it always on) will magically shrink the storage consumed without any impact or degradation. A couple of anecdotes below:
In one conversation last week, a user wanted to run a 50TB Oracle® database with over 1,000 concurrent users, with storage deduplication turned on. Enabling deduplication at the Oracle database level is not a problem, because the database has deduplication insight into its own records and LOBs (large objects). At the storage deduplication layer, however, there is no knowledge of Oracle files. With no such insight, the storage simply dedupes everything it sees. When the database then accesses the deduped data, it has to be rehydrated (undeduped) for processing. With over 1,000 concurrent users, the I/Os induced by these dedupe-rehydrate operations can be amplified and drag down the performance of Oracle®.
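To see why the storage layer has no application insight, here is a minimal sketch of fixed-block deduplication, the generic technique rather than any vendor's implementation. The storage only sees block fingerprints; it has no idea which blocks belong to which Oracle record or LOB:

```python
# Minimal fixed-block deduplication sketch (generic technique, not any
# vendor's implementation). The "storage" only sees opaque 4K blocks and
# their hashes; it has no idea what the application stored in them.
import hashlib

BLOCK_SIZE = 4096
block_store = {}          # fingerprint -> unique block data
volume_map = []           # logical block index -> fingerprint (metadata)

def write_data(data: bytes) -> None:
    """Split incoming data into fixed blocks; store each unique block once."""
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in block_store:          # new unique block
            block_store[fp] = block
        volume_map.append(fp)              # every logical block needs a lookup

def read_data() -> bytes:
    """Rehydrate: every logical read is an extra metadata lookup and fetch."""
    return b"".join(block_store[fp] for fp in volume_map)

# Example: 1,000 logical blocks with heavy repetition dedupe beautifully on
# capacity, but every read still pays the rehydration (lookup) cost.
write_data(b"A" * BLOCK_SIZE * 800 + b"B" * BLOCK_SIZE * 200)
print(f"logical blocks : {len(volume_map)}")
print(f"unique blocks  : {len(block_store)}")
```

The capacity maths can look great, but every read now goes through the fingerprint map, and under 1,000 concurrent users that metadata and rehydration work is where the performance goes.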
Similarly, I encountered another conversation with a cloud gaming company that wants to run over 1,000 VDI desktops, where the virtual desktops will run high-resolution games as well as content development with 3D Maya, Adobe Premiere and DaVinci Resolve. Asking for storage-side deduplication on top of that would require rocket-fuelled storage; it might save storage capacity, but it would degrade storage performance tremendously.
Long story short: don’t assume deduplication will save you money all the time. There are always trade-offs.
Data demographics
Don’t assign profiles to data if you don’t have a plan for it. Every piece of data is different, and so, to manage it better, profiles are assigned to consolidate protection, security and compliance practices. However, data is always changing, and boxing it into fixed definitions can be detrimental if the way the data will be used over its lifecycle is not well understood.
There were several conversations about repatriating data from the cloud to on-premises storage. The usual argument is cost. The usual culprit is S3 storage being so damn cheap to put data into. But what many users do not seem to understand is that moving the data out is a variable cost that adds up. It is not like a local network, where getting data out of the storage is free. Thus, some thought has to go into how the data stored in the cloud will actually be used after it has been stored.
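To put a shape on that variable cost, here is a back-of-the-envelope sketch. The rates and volumes are illustrative assumptions only, not current AWS list prices, so check the pricing for your region before drawing conclusions:

```python
# Back-of-the-envelope: cheap-to-store vs not-cheap-to-keep-pulling-out.
# Rates and volumes are illustrative assumptions, not current AWS pricing.
STORAGE_RATE = 0.023      # USD per GB-month of object storage (assumed)
EGRESS_RATE = 0.09        # USD per GB transferred out to the internet (assumed)

stored_gb = 50_000        # 50 TB of "backup" sitting in the cloud
daily_download_gb = 200   # gigabytes pulled back out every day for analytics

storage_cost = stored_gb * STORAGE_RATE
egress_cost = daily_download_gb * 30 * EGRESS_RATE

print(f"monthly storage cost : ${storage_cost:,.0f}")
print(f"monthly egress cost  : ${egress_cost:,.0f}")
```

The exact numbers matter less than the shape: the storage line is flat and predictable, while the egress line grows with how the data is used, and that is the part that usually gets forgotten.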
One company I was talking to said it started with them using AWS S3 storage for “backup”. I put “backup” in inverted commas because that was their definition of that data. The purpose of a backup is protection, with restores only when required. However, this company’s “backup” became a repository that their on-premises applications queried for operational analysis of how well the business was performing. So they were downloading the “backup” data from cloud storage, gigabytes of it daily, for this operational function.
You can see where the disjoint is. The “backup” data was constantly used for production analytics, and yet they still treated it as if it were backup.
This situation goes back to enterprise storage practice, where we often have to introduce file tiering to move obsolete data (based on the age and usage of the data) to a cheaper storage medium. Tier 3, if you will.
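An age-based tiering pass is simple to sketch. This is only the skeleton of the idea, with hypothetical paths and thresholds; real file tiering engines also leave behind stubs or links so the application still sees a single namespace:

```python
# Skeleton of an age-based file tiering pass: files not accessed for N days
# are moved to a cheaper tier. Paths and the threshold are hypothetical, and
# real tiering engines also leave a stub/link so the namespace stays intact.
import os
import shutil
import time

HOT_TIER = "/mnt/tier1"          # hypothetical fast tier
COLD_TIER = "/mnt/tier3"         # hypothetical cheap tier
MAX_IDLE_DAYS = 180

def tier_down(hot_root: str, cold_root: str, max_idle_days: int) -> None:
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirs, files in os.walk(hot_root):
        for name in files:
            src = os.path.join(dirpath, name)
            if os.stat(src).st_atime < cutoff:          # not read recently
                rel = os.path.relpath(src, hot_root)
                dst = os.path.join(cold_root, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)                   # demote to cold tier
                print(f"tiered down: {rel}")

tier_down(HOT_TIER, COLD_TIER, MAX_IDLE_DAYS)
```

Notice that the policy hinges entirely on age and access time; if the "backup" data is actually read every day for analytics, it will never qualify as cold, which is the whole point of understanding data demographics first.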
Thus data demographics, and assigning the right identity and definition to the data, require prudent thought in the name of good data management.
New data behaviours
Fast-changing digital transformation initiatives are instigating rapid changes in data behaviours again. From the present model of processing data at rest, after it has landed on the data storage services platform, we are beginning to hear the clarion call of processing data in flight, the streaming model. Streams of data are processed in real time, and explored, analyzed, presented and acted upon even before they land and are stored persistently on a storage services platform. This is the new world of stream processing and time-series data frameworks, where the element of time is counted in minutes, seconds and even milliseconds.
Data processing platforms such as Confluent/Apache Kafka, Vectorized.io/Redpanda and Datastax/Yahoo/Apache Pulsar, and time-series databases such as OpenTSDB, Timescale, InfluxData/InfluxDB and more, are now dotting the data storage landscape. How will this new generation of data services affect I/O, and how will a storage platform respond to these data patterns? Thoughts abound …
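For a flavour of the pattern, here is a minimal sketch of producing time-stamped events into a stream. It assumes the kafka-python client and a broker at localhost:9092, and the topic name and payload are made up; it is an illustration of the in-flight model, not a production pipeline:

```python
# A minimal sketch of producing time-stamped events into a stream, using the
# kafka-python client. Assumes a Kafka broker at localhost:9092; the topic
# name and payload fields are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event carries its own timestamp; downstream consumers analyze it
# within seconds, long before anything is persisted to tiered storage.
for i in range(10):
    event = {"sensor_id": "dev-001", "ts": time.time(), "reading": 21.5 + i * 0.1}
    producer.send("sensor-readings", value=event)

producer.flush()
```

The I/O profile here is many small, latency-sensitive appends arriving continuously, which is a very different ask of the storage services than the batch, at-rest model most arrays were designed around.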
Happy Holidays
Writing this was a relief for my pent-up frustration. Many take the easy way out when defining their data and storage requirements, and set expectations well beyond the clarity of their plans. Perhaps some of the thoughts here will rub off on those who read this.
Anyway, next week is Christmas, and the week after that, the New Year. Happy holidays!