Somewhere, there is a misconception that data processing is cheap. That stems from the well-known pricings of the capacities of public cloud storage that are a fraction of cents per month. But data in storage has to be worked upon, and has to be built up and protected to increase its value. Data has to be processed, moved, shared, and used by applications. Data induce workloads. Nobody keeps data stored forever and never be used again. Nobody buys storage just for capacity alone.
We have a great saying in the industry. No matter, where the data moves, it will land in a storage. So, it is clear that data does not exist in ether. And yet, I often see how little attention and prudence and care, when it comes to data infrastructure and data management technologies, the very components that are foundational to great data.
Great data management for Great AI
AI is driving up costs in data processing
A few recent articles drew my focus into the cost of data processing.
I was listening to several storage luminaries in the GestaltIT’s podcast “No one understands Storage anymore” a few of weeks ago. Around the minute of 11.09 in the podcast, Dr. J. Metz, SNIA® Chair, brought up this is powerful quote “Storage does not mean Capacity“. It struck me, not in a funny way. It is what it is, and it something I wanted to say to many who do not understand the storage solutions they are purchasing. It exemplifies what is wrong in the many organizations today in their understanding of investing in a storage infrastructure project.
This is my pet peeve. The first words uttered in most, if not all storage requirements in my line of work are, “I want this many Terabytes of storage“. There are no other details and context of what the other requirement factors are, such as availability, performance, future growth, etc. Or even the goals to achieve when purchasing a storage system and operating it. What is the improvement they are looking for?What are the problems to solve?
Where is the OKR?
It pains me to say this. For the folks who have in the IT industry for years, both end users and IT purveyors alike, most are absolutely clueless about OKR (Objectives and Key Results) for their storage infrastructure project. Many cannot frame the data challenges they are facing, and they have no idea where to go next. There is no alignment. There is no strategy. Even worse, there is no concept of how their storage infrastructure investments will improve their business and operations.
Just the other day, one company director from a renown IT integrator here in Malaysia came calling. He has been in the IT industry since 1989 (I checked his Linkedin profile), asking to for a 100TB storage quote. I asked a few questions about availability, performance, scalability; the usual questions a regular IT guy would ask. He has no idea, and instead of telling me he didn’t know, he gave me a runaround of this and that. Plenty of yada, yada nonsense.
In the end, I told him to buy a consumer grade storage appliance from Taiwan. I will just let him make a fool of himself in front of his customer since he didn’t want to take accountability of ensuring his customer get a proper enterprise storage solution in good faith. His customer is probably in the same mould as well.
Defensive Strategies as Data Foundations
A strong storage infrastructure foundation is vital for good Data Credibility. If you do the right things for your data, there is Data Value, and it will serve your business well. Both Data Credibility and Data Value create confidence. And Confidence equates Trust.
In order to create the defensive strategies let’s look at storage Availability, Protection, Accessibility, Management Security and Compliance. These are 6 of the 8 data points of the A.P.P.A.R.M.S.C. framework.
Offensive Strategies as Competitive Advantage
Once we have achieved stability of the storage infrastructure foundation, then we can turn over and drive towards storage Performance, Recovery, plus things like Scalability and Agility.
With a strong data infrastructure foundation, the organization can embark on the offensive, and begin their business transformation journey, knowing that their data is well run, protection, and performs.
Alignment with Data and Business Goals
Why are the defensive and offensive strategies requiring alignment to business goals?
The fact is simple. It is about improving the business and operations, and setting OKRs is key to measure the ROI (return of investment) of getting the storage systems and the solutions in place. It is about switching the cost-fearing (negative) mindset to a profit-conviction (positive) mindset.
For example, maybe the availability of the data to the business is poor. Maybe there is the need to have access to the data 24×7, because the business is going online. The simple measurable fact is we can move availability from 95% uptime to 99.99% uptime with an HA storage system.
Perhaps there are concerns about recoverability in the deluge of ransomware threats. Setting new RPO goals from 24 hours to 4 hours is a measurable objective to enhance data resiliency.
Or getting the storage systems to deliver higher performance from 350 IOPS to 5000 IOPS for the database.
What I am saying here is these data points are measurable, and they can serve as checkpoints for business and operational improvements. From a management perspective, these can be used as KPI (key performance index) to define continuous improvement of Data Confidence.
Furthermore, it is easy when a OKR dashboard is used to map the improvement markers when organizations use storage to move from point A to point B, where B equates to a new success milestone. The alignment sets the paths to the business targets.
Storage does not mean only Capacity.
The sad part is what the OKRs and the measured goals alignments are glaringly missing in the minds of many organizations purchasing a storage infrastructure and data management solution. The people tasked to source a storage technology solution are not placing a set of goals and objectives. Capacity appears to be the only thing on their mind.
I am about to meet a procurement officer of a customer soon. She asked me this question “Why is your storage so expensive?” over email. I want to change her mindset, just like the many officers and C-levels who hold the purse strings.
Let’s frame the use storage infrastructure in the real world. Nobody buys a storage system just to keep data in there much like a puddle keeps stagnant water. Sooner or later the value of the data in the storage evaporates or the value becomes dull if the data is not used well in any ways, shape or form.
Storage systems and the interconnected pathways from on premises, to the next premises, to the edge and to the clouds serve the greater good for Data. Data is used, shared, shaped, improved, enhanced, protected, moved, and more to deliver Value to the Business.
Storage capacity is just one of the few factors to consider when investing in a storage infrastructure solution. In fact, capacity is probably the least important piece when considering a storage solution to achieve the company’s OKRs. If we think about it deeper, setting the foundation for Data in the defensive manner will help elevate value of the data to be promoted with the offensive strategies to gain the competitive advantage.
Storage infrastructure and storage solutions along with data management platforms may appear to be a cost to the annual budgets. If you know set the OKRs, define A to get to B, alignment the goals, storage infrastructure and the data management platforms and practices are investments that are worth their weight in gold. That is my guarantee.
On the flip side, ignoring and avoiding OKRs, and set the strategies without prudence will yield its comeuppance. Technical debts will prevail.
On the road, seat belt saves lives. So does the motorcycle helmet. But these 2 technologies alone are probably not well received and well applied daily unless there is a strong ecosystem and culture about road safety. For decades, there have been constant and unrelenting efforts to enforce the habits of putting on the seat belt or the helmet. Statistics have shown they reduce road fatalities, but like I said, it is the safety culture that made all this happen.
On the digital front, the ransomware threats are unabated. In fact, despite organizations (and individuals), both large and small, being more aware of cyber-hygiene practices more than ever, the magnitude of ransomware attacks has multiplied. Threat actors still see weaknesses and gaps, and vulnerabilities in the digital realms, and thus, these are lucrative ventures that compliment the endeavours.
Time to look at Data Management
The Cost-Benefits-Risks Conundrum of Data Management
And I have said this before in the past. At a recent speaking engagement, I brought it up again. I said that ransomware is not a cybersecurity problem. Ransomware is a data management problem. I got blank stares from the crowd.
I get it. It is hard to convince people and companies to embrace a better data management culture. I think about the Cost-Benefits-Risk triangle while I was analyzing the lack of data management culture used in many organizations when combating ransomware.
I get it that Cybersecurity is big business. Even many of the storage guys I know wanted to jump into the cybersecurity bandwagon. Many of the data protection vendors are already mashing their solutions with a cybersecurity twist. That is where the opportunities are, and where the cool kids hang out. I get it.
Cybersecurity technologies are more tangible than data management. I get it when the C-suites like to show off shiny new cybersecurity “toys” because they are allowed to brag. Oh, my company has just implemented security brand XXX, and it’s so cool! They can’t be telling their golf buddies that they have a new data management culture, can they? What’s that?
Let’s face it. Data is bursting through its storage seams. And every organization now is storing too much data that they don’t know they have.
By 2025, IDC predicts that 80% the world’s data will be unstructured. IDC‘s report Global Datasphere Forecast 2021-2025 will see the global data creation and replication capacity expand to 181 zettabytes, an unfathomable figure. Organizations are inundated. They struggle with data growth, with little understanding of what data they have, where the data is residing, what to do with the data, and how to manage the voluminous data deluge.
The simple knee-jerk action is to store it in cloud object storage where the price of storage is $0.0000xxx/GB/month. But many IT departments in these organizations often overlook the fact that that the data they have parked in the cloud require movement between the cloud and on-premises. I have been involved in numerous discussions where the customers realized that they moved the data in the cloud moved too frequently. Often it was an erred judgement or short term blindness (blinded by the cheap storage costs no doubt), further exacerbated by the pandemic. These oversights have resulted in expensive and painful monthly API calls and egress fees.Welcome to reality. Suddenly the cheap cloud storage doesn’t sound so cheap after all.
The same can said about storing non-active unstructured data on primary storage. Many organizations have not been disciplined to practise good data management. The primary Tier 1 storage becomes bloated over time, grinding sluggishly as the data capacity grows. I/O processing becomes painfully slow and backup takes longer and longer. Sounds familiar?
The A in ABC
I brought up the ABC mantra a few blogs ago. A is for Archive First. It is part of my data protection consulting practice conversation repertoire, and I use it often to advise IT organizations to be smart with their data management. Before archiving (some folks like to call it tiering, but I am not going down that argument today), we must know what to archive. We cannot blindly send all sorts of junk data to the secondary or tertiary storage premises. If we do that, it is akin to digging another hole to fill up the first hole.
We must know which unstructured data to move replicate or sync from the Tier 1 storage to a second (or third) less taxing storage premises. We must be able to see this data, observe its behaviour over time, and decide the best data management practice to apply to this data. Take note that I said best data management practice and not best storage location in the previous sentence. There has to be a clear distinction that a data management strategy is more prudent than to a “best” storage premises. The reason is many organizations are ignorantly thinking the best storage location (the thought of the “cheapest” always seems to creep up) is a good strategy while ignoring the fact that data is like water. It moves from premises to premises, from on-prem to cloud, cloud to other cloud. Data mobility is a variable in data management.
Last week was consumed by many conversations on this topic. I was quite jaded, really. Unfortunately many still take a very simplistic view of all the storage technology, or should I say over-marketing of the storage technology. So much so that the end users make incredible assumptions of the benefits of a storage array or software defined storage platform or even cloud storage. And too often caveats of turning on a feature and tuning a configuration to the max are discarded or neglected. Regards for good storage and data management best practices? What’s that?
I share some of my thoughts handling conversations like these and try to set the right expectations rather than overhype a feature or a function in the data storage services.
Complex data networks and the storage services that serve it
Applications and workloads (A&W) read and write from the data storage services platforms. These could be local DAS (direct access storage), network storage arrays in SAN and NAS, and now objects, or from cloud storage services. Regardless of structured or unstructured data, different A&Ws have different behavioural I/O patterns in accessing data from storage. Therefore storage has to be configured at best to match these patterns, so that it can perform optimally for these A&Ws. Without going into deep details, here are a few to think about:
Random and Sequential patterns
Block sizes of these A&Ws ranging from typically 4K to 1024K.
Causal effects of synchronous and asynchronous I/Os to and from the storage
Data movement is expensive. Not just costs, but also latency and resources as well. Thus there were many narratives to move compute closer to where the data is stored because moving compute is definitely more economical than moving data. I borrowed the analogy of the 2 animals from some old NetApp® slides which depicted storage as the elephant, and compute as birds. It was the perfect analogy, because the storage is heavy and compute is light.
“Close up of a white Great Egret perching on top of an African Elephant aa Amboseli national park, Kenya”
Before the animals representation came about I used to use the term “Data locality, Data Mobility“, because of past work on storage technology in the Oil & Gas subsurface data management pipeline.
Take stock of your data movement
I had recent conversations with an end user who has been paying a lot of dollars keeping their “backup” and “archive” in AWS Glacier. The S3 storage is cheap enough to hold several petabytes of data for years, because the IT folks said that the data in AWS Glacier are for “backup” and “archive”. I put both words in quotes because they were termed as “backup” and “archive” because of their enterprise practice. However, the face of their business is changing. They are in manufacturing, oil and gas downstream, and the definitions of “backup” and “archive” data has changed.
For one, there is a strong demand for reusing the past data for various reasons and these datasets have to be recalled from their cloud storage. Secondly, their data movement activities still mimicked what they did in the past during their enterprise storage days. It was a classic lift-and-shift when they moved to the cloud, and not taking stock of their data movements and the operations they ran on these datasets. Still ongoing, their monthly AWS cost a bomb.
The world has pretty much settled that hybrid cloud is the way to go for IT infrastructure services today. Straddled between the enterprise data center and the infrastructure-as-a-service in public cloud offerings, hybrid clouds define the storage ecosystems and architecture of choice.
As storage practitioners, we are often faced with certain “dogmatic” arguments which were often a mix of measured actuality and marketing magic – aka FUD (fear, uncertainty, doubt). Time and again, we are thrown a curve ball, like “Oh, your competitor can do this. Can you?” Suddenly you are feeling pinned to a corner, and the pressure to defend your turf rises. You fumbled; You have no answer; Game over!
I experienced these hearty objections many times over. The best experience was one particular meeting I had during my early days with NetApp® in 2000. I was only 1-2 months with the company, still wet between the ears with the technology. I was pitching the SnapMirror® to Ericsson Malaysia when the Scandinavian manager said, “I think you are lying!“. I was lost without a response. I fumbled spectacularly although I couldn’t remember if we won or lost that opportunity.
Here are a few I often encountered. Let’s play the game of What If …?
In a recent conversation with an iXsystems™ reseller in Hong Kong, the topic of Storage Tiering was brought up. We went about our banter and I brought up the inter-array tiering and the intra-array tiering piece.
After that conversation, I started thinking a lot about intra-array tiering, where data blocks within the storage array were moved between fast and slow storage media. The general policy was simple. Find all the least frequently access blocks and move them from a fast tier like the SSD tier, to a slower tier like the spinning drives with different RPM speeds. And then promote the data blocks to the faster media when accessed frequently. Of course, there were other variables in the mix besides storage media and speeds.
My mind raced back 10 years or more to my first encounter with Compellent and 3PAR. Both were still independent companies then, and I had my first taste of intra-array tiering
It is obvious that the storageless jargon wants to ride on the hype of serverless computing, an abstraction method of computing resources where the allocation and the consumption of resources are defined by pieces of programmatic code of the running application. The “calling” of the underlying resources are based on the application’s code, and thus, rendering the computing resources invisible, insignificant and not sexy.
Among the 3 main infrastructure technology – compute, network, storage, storage technology is a bit of a science and a bit of dark magic. It is complex and that is what makes storage technology so beautiful. The constant innovation and technology advancement continue to make storage as a data services platform relentlessly interesting.
Cloud, Kubernetes and many data-as-a-service platforms require strong persistent storage. As defined by NIST Definition of Cloud Computing, the 4 of the 5 tenets – on-demand self-service, resource pooling, rapid elasticity, measured service – demand storage to be abstracted. Therefore, I am all for abstraction of storage resources from the data services platform.
But the storageless jargon is doing a great disservice. It is not helping. It does not lend its weight glorifying the innovations of storage. In fact, IMHO, it felt like a weighted anchor sinking storage into the deepest depth, invisible, insignificant and not sexy. I am here dutifully to promote and evangelize storage innovations, and I am duly unimpressed with such a jargon.