Storage Performance Considerations for AI Data Paths

The hype around Deep Learning (DL), Machine Learning (ML) and Artificial Intelligence (AI) has reached an unprecedented frenzy. Every infrastructure vendor, from servers to networking to storage, has something to say or a play to make around DL/ML/AI. This prompted me to explore this hyped ecosystem from a storage perspective, notably from the point of view of storage performance requirements.

One question on my mind

There are plenty of questions on my mind. One stood out, and it relates to storage performance requirements.

Reading and learning from one storage technology vendor to another, the gist of everyone’s play against their competitors seems to be “They are archaic, they are legacy. Our architecture is built from the ground up, modern, NVMe-enabled”. There are more such comparisons, but you get the picture – “We are better, no doubt”.

Are the data patterns and behaviours of AI different? How do they affect the storage design as the data moves through the workflow, the data paths and the lifecycle of the AI ecosystem?

The full force of Western Digital

[Preamble: I have been invited by GestaltIT as a delegate to their Tech Field Day for Storage Field Day 18 from Feb 27-Mar 1, 2019 in the Silicon Valley USA. My expenses, travel and accommodation were covered by GestaltIT, the organizer and I was not obligated to blog or promote their technologies presented at this event. The content of this blog is of my own opinions and views]

3 weeks after Storage Field Day 18, I was still trying to wrap my head around the 3-hour session we had with Western Digital. I was like a kid in a candy store for a while, because there was too much to chew and I couldn’t munch it all.

From “Silicon to System”

Not many storage companies in the world can claim that mantra – “From Silicon to System”. Western Digital is probably one of 3 companies I know of at present (the other 2 being Intel and NVIDIA) that develop vertical innovation and integration, end to end, from components to platforms to systems.

For a long time, we have known Western Digital as a hard disk company. It owns HGST and SanDisk, providing the drives, the Flash and the Compact Flash for both the consumer and the enterprise markets. However, in recent years, through 2 eyebrow-raising acquisitions, Western Digital has been moving itself up the infrastructure stack. In 2015, it acquired Amplidata. 2 years later, it acquired Tegile Systems. At that time, I was wondering why a hard disk manufacturer was buying storage technology companies that were not its usual bread and butter business.

WekaIO controls their performance destiny

I was first introduced to WekaIO at Storage Field Day 15. I did not blog about them back then, but I followed their progress quite attentively throughout 2018. 2 Storage Field Days and a year later, they were back for Storage Field Day 18 with a new CTO, Andy Watson, and several performance benchmark records.

Blowout year

2018 was a blowout year for WekaIO. They experienced over 400% growth, placed #1 in the Virtual Institute IO-500 10-node performance challenge, and also became #1 in the SPEC SFS 2014 performance and latency benchmark. (Note: This record was broken by NetApp a few days later, but at a higher cost per client.)

The Virtual Institute for I/O IO-500 10-node performance challenge was particularly interesting, because it pitted WekaIO against the Oak Ridge National Lab (ORNL) Summit supercomputer, and WekaIO won. Details of the challenge were listed in Blocks and Files, and the WekaIO Matrix Filesystem became the fastest parallel file system in the world to date.

Control, control and control

I studied WekaIO’s architecture prior to this Field Day. And I spent quite a bit of time digesting and understanding their data paths, I/O paths and control paths, in particular, the diagram below:

Starting from the top right corner of the diagram, applications run on the Linux client (running the Weka Client software), and WekaIO presents itself to that client as a POSIX-compliant file system. Through the network, the Linux client interacts with the WekaIO kernel-based VFS (virtual file system) driver, which coordinates the Front End (grey box in upper right corner) to the Linux client. Other client-based protocols such as NFS, SMB, S3 and HDFS are also supported. The Front End then interacts with the NIC (which can be 10/100G Ethernet, InfiniBand, and NVMeoF) through SR-IOV (single root I/O virtualization), bypassing the Linux kernel for maximum throughput. This is done with WekaIO’s own networking stack in user space.

Bridges to the clouds and more – NetApp NDAS

The NetApp Data Fabric Vision

The NetApp Data Fabric vision has always been clear to me. Maybe it was because of my 2 stints with them, and I got well soaked in their culture. 3 simple points define the vision.

  • The Data Fabric is THE data singularity. Data can be anywhere – on-premises, in the clouds, and more.
  • Build bridges, paths and workflow management around the data, to move it to wherever it needs to be.
  • Work with technology partners to build tools and data systems that elevate the value of the data.

That is how I see it. I wrote about the Transcendence of the Data Fabric vision 3+ years ago, and I emphasized the importance of the Data Pipeline in another NetApp blog almost a year ago. The introduction of NetApp Data Availability Services (NDAS) in the recently concluded Storage Field Day 18 was no different as NetApp constructs data bridges and paths to the AWS Cloud.

NetApp Data Availability Services

The NDAS feature is only available with ONTAP 9.5. With fewer than 5 clicks, data from ONTAP primary systems can be backed up to the secondary ONTAP target (running the NDAS proxy and the Copy to Cloud API), and from there to AWS S3 buckets in the cloud.

Own the Data Pipeline

[Preamble: I was a delegate of Storage Field Day 15 from Mar 7-9, 2018. My expenses, travel and accommodation were paid for by GestaltIT, the organizer and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

I am a big proponent of Go-to-Market (GTM) solutions. Technology does not stand alone. It must be in an ecosystem, and in each industry, in each segment of each respective industry, every ecosystem is unique. And when we amalgamate data, the storage infrastructure technologies and the data management into the ecosystem, we reap the benefits in that ecosystem.

Data moves in the ecosystem, from system to system, north to south, east to west and vice versa, random, sequential, ad-hoc. Data acquires different statuses, different roles, different relevances in its lifecycle through the ecosystem. From it, we derive the flow, a workflow of data creating a data pipeline. The Data Pipeline concept has been around since the inception of data.

To illustrate my point, I created one for the Oil & Gas – Exploration & Production (EP) upstream some years ago.

 

Cohesity SpanFS – a foundational shift

[Preamble: I was a delegate of Storage Field Day 15 from Mar 7-9, 2018. My expenses, travel and accommodation were paid for by GestaltIT, the organizer and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

Cohesity SpanFS impressed me. Their filesystem was designed from the ground up to meet the demands of voluminous cloud-scale data, and yes, the sheer magnitude of data everywhere needs to be managed.

We all know that primary data is always the more important piece of the data landscape, but there is a growing need to address the secondary data segment as well.

Like a floating iceberg, the piece that is sticking out is the more important primary data but the larger piece beneath the surface of the water, which is the secondary data, is becoming more valuable. Applications such as file shares, archiving, backup, test and development, and analytics and insights are maturing as the foundational data management frameworks and fast becoming the bedrock of businesses.

The ability of businesses to bounce back after a disaster; the relentless testing of large data sets to develop new competitive advantage for businesses; the affirmations and the insights from analyzing data to reduce risks in decision making; all these are the powerful back-end engines that thrust businesses forward. Even the ability to search for the right information in a sea of data for regulatory and compliance reasons is part of the organization’s data management application.

DellEMC SC progressing well

[Preamble: I was a delegate of Storage Field Day 14. My expenses, travel and accommodation were paid for by GestaltIT, the organizer and I was not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

I haven’t had a preview of the Compellent technology for a long time. My buddies at Impact Business Solutions were the first to introduce the Compellent technology called Data Progression to the local Malaysian market, and I was invited to a preview back then. Around the same time, I also recall another rather similar preview invitation from PTC Singapore for the 3PAR technology called Adaptive Provisioning (it is called Adaptive Optimization now).

Storage tiering was on the rise in 2009-2010. Compellent and 3PAR were neck and neck, leading the conversation and mind share of storage tiering, while IBM Easy Tier and EMC FAST (Fully Automated Storage Tiering) were nowhere to be seen or heard. I vividly remember the Compellent Data Progression technology being much more elegant than the 3PAR technology. While both intelligent storage tiering technologies were equally good, I put the difference down to the 3PAR founders being ex-Sun Microsystems folks, and Unix folks sucked at UX. In this case, Compellent’s Data Progression was definitely a leg up on 3PAR.

History aside, this week I have the chance to get a new preview of the Compellent technology. Compellent has since been rebranded as the SC series and is positioned as the mid-range storage array line of DellEMC. Together with the other Storage Field Day 14 delegates, I have the pleasure of experiencing the latest SC Data Progression technology update, as well as their latest SC All-Flash.

In Data Progression, one interesting feature which caught my attention was RAID Tiering. This was a set of RAID tiers that expanded and contracted dynamically and automatically: RAID 10 and RAID 5/6 in the Fast Tier, and RAID 5/6 in the Lower Tier. RAID 10, RAID 5 and RAID 6 coexisted on the same set of drives (including SSDs) and, depending on the “hotness” of the data, the data blocks moved between the several RAID tiers in the Fast Tier. Over a longer period, the data blocks would relocate transparently from the Fast Tier to the Capacity Tier.
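To make the idea of hotness-driven placement concrete, here is a minimal sketch in Python. It is not DellEMC’s algorithm. The thresholds, the `Page` fields and the `place` function are all hypothetical names and numbers I made up for illustration; only the tier and RAID level names come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical thresholds, chosen only for illustration.
HOT_THRESHOLD = 100      # accesses per day at or above which data counts as "hot"
WARM_THRESHOLD = 10      # accesses per day below which data counts as "cold"
DEMOTE_AFTER_DAYS = 12   # idle age after which cold data leaves the Fast Tier


@dataclass
class Page:
    page_id: int
    accesses_per_day: int
    days_since_last_write: int


def place(page: Page) -> Tuple[str, str]:
    """Return (tier, raid_level) for a page under the assumed policy."""
    is_old = page.days_since_last_write >= DEMOTE_AFTER_DAYS
    is_cold = page.accesses_per_day < WARM_THRESHOLD
    if is_old and is_cold:
        # Old, rarely touched data moves to the capacity-oriented tier.
        return ("capacity", "RAID 5/6")
    if page.accesses_per_day >= HOT_THRESHOLD:
        # Hot data stays in the Fast Tier on RAID 10 for cheaper writes.
        return ("fast", "RAID 10")
    # Everything else stays in the Fast Tier on space-efficient parity RAID.
    return ("fast", "RAID 5/6")


if __name__ == "__main__":
    for p in (Page(1, 500, 0), Page(2, 20, 3), Page(3, 1, 30)):
        print(p.page_id, place(p))
```

The point of the sketch is only that placement is a function of how hot the data is and how long it has been idle, re-evaluated over time, so the same drives can serve several RAID levels at once.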

The Data Progression technology is extremely efficient. The movement of the data between the RAID Tiers and between the Performance/Capacity Tiers is done in pages instead of blocks, keeping the write penalty and the bandwidth consumed to a negligible minimum.

The Storage Field Day 14 delegates were also privileged to be the first to get a deep dive into the new All-Flash SC, just days after its announcement. The All-Flash SC refines Data Progression and takes it to the next level. Among the new optimizations, the NAND Flash in the SC (both SLCs and MLCs, read-intensive and write-intensive) reduces the Data Progression default page size from 2MB to 512KB. These smaller 512KB pages reduce the bandwidth needed for tiering between the write-intensive and the read-intensive tiers.
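A rough back-of-the-envelope calculation shows why a smaller page can cut tiering traffic. The Python sketch below assumes a page is always moved whole, and uses a made-up workload of 1,000 hot blocks scattered so that each one falls in a different page. The function name and the workload are my own assumptions, not anything presented by DellEMC.

```python
KIB = 1024
MIB = 1024 * KIB


def bytes_moved(hot_offsets, page_size):
    """Bytes relocated if every page containing a hot offset is moved whole."""
    pages = {offset // page_size for offset in hot_offsets}
    return len(pages) * page_size


# Hypothetical workload: 1,000 hot blocks spaced 2 MiB apart, so each
# block lands in its own page for both page sizes being compared.
hot_offsets = [i * 2 * MIB for i in range(1000)]

large = bytes_moved(hot_offsets, 2 * MIB)
small = bytes_moved(hot_offsets, 512 * KIB)

print(large // MIB, "MiB moved with 2MB pages")
print(small // MIB, "MiB moved with 512KB pages")
```

For this scattered case, the 512KB pages move a quarter of the data that the 2MB pages do. If the hot data were densely packed instead, the two page sizes would move about the same amount, so the saving depends on how scattered the hot blocks are.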

I didn’t get the latest SC family photos yet, but I managed to grab a screenshot of the announcement from The Register of the new DellEMC SC Series.

I was very encouraged by the DellEMC Midrange Storage presentation. Besides the fantastic deep dive into the DellEMC SC All-Flash Storage, I was also very impressed by the candid and straightforward attitude of the team, led by their VP of Product Management, Pierluca Chiodelli. An EMC veteran, he fielded the onslaught of hard questions from the SFD14 delegates like a pro. His team’s demeanour was critical in instilling confidence and trust in how the bloggers and the analysts viewed the Dell EMC merger, and how the SC and the Unity series would pan out in the technology roadmap.

Unlike the fiasco I went through with the DellEMC Forum 2017 in Malaysia, where I was disturbed with 3 calls in 3 consecutive days by DellEMC Malaysia, I was left with a profound respect for this DellEMC Storage team. They strongly supported their position within the DellEMC storage universe, and imparted their confidence in their technology solution in the marketplace.

Without a doubt, in my point of view, this DellEMC Mid-Range Storage team was the best I have enjoyed in Storage Field Day 14. Thank you.

Commvault UDI – a new CPUU

[Preamble: I am a delegate of Storage Field Day 14. My expenses, travel and accommodation are paid for by GestaltIT, the organizer and I am not obligated to blog or promote the technologies presented at this event. The content of this blog is of my own opinions and views]

I am here at the Commvault GO 2017. Bob Hammer, Commvault’s CEO is on stage right now. He shares his wisdom and the message is clear. IT to DT. IT to DT? Yes, Information Technology to Data Technology. It is all about the DATA.

The data landscape has changed. The cloud has changed everything. And data is everywhere. This omnipresence of data presents new complexity and new challenges. It is great to see Commvault acknowledging and accepting this change and the challenges that come along with it, and introducing their HyperScale technology and their secret sauce – Universal Dynamic Index.

Pure Electric!

I didn’t get a chance to attend the Pure Accelerate event last month. From the blogs and tweets of my friends, Pure Accelerate was an awesome event. When I got the email invitation for the localized Pure Live! event in Kuala Lumpur, I told myself that I had to attend.

The event was yesterday, and I was not disappointed. Coming off a strong fiscal Q1 2018, it appears that Pure Storage has gotten many things together, chugging full steam on all fronts.

When Pure Storage first came out, I was one of the early bloggers who took a fancy to them. My 2011 blog mentioned the storage luminaries in their team. Since then, they have come a long way. And it was apt that on the same morning yesterday, the latest Gartner Magic Quadrant for Solid State Arrays 2017 was released.
