The All-Important Storage Appliance Mindset for HPC and AI Projects

I am a strong believer in using the right tool to do the job right. I said this 2 years ago in my blog “Stating the case for a Storage Appliance approach“, written when I was working for an open source storage company. And I am an advocate of the crafter versus assembler mindset, especially in the enterprise and high-performance storage technology segments.

I have joined DDN. Even at DDN, that same mindset does not change one bit. I have been saying all along that the storage appliance model should always be the mindset for a business’s peace-of-mind.

My view of the storage appliance model began almost 25 years ago. I came into the NAS world via Sun Microsystems®. Sun was famous for running NFS services on general-purpose Solaris servers. Back then, I remember arguing with one of the Sun distributors about the tenets of running NFS over 100Mbit/sec Ethernet on Sun servers. I was drinking Sun’s Kool-Aid big time.

When I joined Network Appliance® (now NetApp®) in 2000, my worldview of putting software on general-purpose servers changed. Network Appliance® had one product family, the FAS700 (720, 740, 760) family. In the beginning, all NetApp® did was serve NFS. They were NAS filers and nothing else.

I was completely sold on the appliance way with NetApp®. Firstly, it was my very first time seeing network storage services provisioned with an appliance concept. This was different from Sun, where I was used to managing NFS exports on a Sun SPARCstation 20 to Unix clients on the network.

Secondly, my mindset began to take shape around the idea that “you have to have the right tool to do the job correctly and extremely well“. Well, the toaster toasts bread very well and nothing else. And the fridge (an analogy used by Dave Hitz, I think) does what it does very well too. That is what the appliance does. You definitely cannot grill a steak with a bread toaster, just like you cannot run excellent, ultra-high-performance storage services for demanding AI and HPC applications on a general server platform. You have to have a storage appliance solution for high-speed storage.

That little Network Appliance® toaster award given out to exemplary employees still stands vividly in my mind. The NetApp® tagline back then was “Fast, Simple, Reliable”, and it solidifies my mindset for high-speed storage in AI and HPC projects to this day.

DDN AI400X2 Turbo Appliance

Costs, Benefits and Risks

I like to think about what the end users are thinking about. There are investment costs involved, and along with them, risks to those investments as well as their benefits. Let’s simplify and lump them into a cost-benefit-risk analysis triangle. These variables come into play in the decision-making of AI and HPC projects.

All AI and HPC projects are expensive. They are here to solve the world’s problems: finding breakthroughs in cancer research, accurately predicting climate change, solving the universe’s mysteries in space, Generative AI and many more. You have to have the right tool to do these complex jobs. You certainly cannot put in a QNAP® NAS (God forbid!) to serve hundreds of GB/sec of throughput to a supercomputer. And you cannot put software-only storage services on a branded OEM whitebox to power a software-defined storage solution for the ultra demands of AI and HPC.

These projects are high risk as well. However, even costlier are the risks of not deploying a highly optimized storage appliance solution, along with the unpredictable cost of operating the alternative. The early investment cost bites, but the deeper pain of making risky bets on software-defined storage solutions for AI and HPC will linger for years.

Which is why Nvidia® DGX solutions are highly sought after. Which is why Apple MacBooks are prized over Windows laptops on general x86, ARM and now RISC-V. Which is also the absolute reason to pair these projects with the storage technology that delivers the highest read and write performance, the lowest latency, the best reliability and resiliency, and the strongest security, all optimized in the tightly integrated software and hardware package of a storage appliance.

Demanding customers know which solutions are the best. They demand white-glove deployment services and support that link them directly to the vendor’s entire engineering stack. They want to minimize the risks of their expensive AI and HPC projects to reap the greatest benefits. They want the project’s peace-of-mind. They want predictability, not uncertainty.

Views on software-defined storage in AI and HPC

Many can argue that software-defined storage is more flexible. It is more open, with less vendor lock-in. Yes, from an enterprise perspective, this argument has its merits. But from the point of view of AI and HPC projects, the flexibility and openness of software-defined storage come with other, often overlooked, costs and risks.

Software-defined storage is often less optimized for the hardware it runs on. It is designed to run on general-purpose x86 (sometimes ARM) hardware, with commodity parts and firmware. The optimization can only go as far as a hardware platform supplied by others, who are often not well versed in the HPC world, allows.

I often hear about Weka having N+2 redundancy. That is great, but we have to consider that when Weka designed its N+2, it did so knowing full well that the server hardware running its software-defined storage technology is inherently less reliable than a purpose-built, HA active-active storage appliance. The redundancy at the software layer is great, but the hardware underneath that N+2, well, not so great. The hardware redundancy in those Weka-type solution stacks (not Weka’s fault) is shaky.
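To illustrate why the reliability of the underlying hardware still matters even with N+2 protection, here is a rough back-of-the-envelope sketch. The per-node availability figures below are purely hypothetical numbers of my own choosing (not Weka’s, DDN’s or anyone’s published figures), and the model assumes independent node failures, which is a simplification.

```python
from math import comb

def cluster_availability(n_data: int, n_parity: int, node_avail: float) -> float:
    """Probability that at least n_data of (n_data + n_parity) nodes are up,
    assuming independent node failures (a simplification)."""
    total = n_data + n_parity
    return sum(
        comb(total, failed) * (1 - node_avail) ** failed * node_avail ** (total - failed)
        for failed in range(n_parity + 1)
    )

def ha_pair_availability(ctrl_avail: float) -> float:
    """Active-active pair: the service survives as long as one controller is up."""
    return 1 - (1 - ctrl_avail) ** 2

# Hypothetical figures for illustration only.
commodity_node = 0.999    # assumed availability of a general-purpose x86 node
appliance_ctrl = 0.9999   # assumed availability of a purpose-built controller

print(f"20+2 commodity cluster : {cluster_availability(20, 2, commodity_node):.7f}")
print(f"HA active-active pair  : {ha_pair_availability(appliance_ctrl):.7f}")
```

The point of the sketch is not the exact numbers but the shape of the maths: as the node count grows and per-node availability drops, the N+2 protection has to work much harder, which is exactly why the reliability of the platform underneath cannot be an afterthought.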

I mentioned NetApp® earlier. In high-speed storage for HPC, NetApp® does not deploy its famed ONTAP™ storage appliances. Instead, it has chosen to pair its E-Series with BeeGFS®, a software-defined, Lustre®-like parallel filesystem. Thus, NetApp® has brought its lesser-known brethren to a high-performance game that demands even more than its enterprise FAS and AFF lines.

Because their solution is a 2-vendor combination, it is hard to say whether the Level-3 and Level-4 support SLAs come from NetApp® or ThinkParQ, the creator of BeeGFS®. While I respect the deep collaboration between the two companies, supporting complex HPC and AI problems still requires handoffs between the deep engineering and development teams on both sides. Such support collaboration schemes require the mobilization and assembly of both teams and their resources, which can add latency to problem resolution. As we all know, problem ownership and accountability (i.e. one neck to choke) have to wade through both companies’ escalation workflows and processes. The rhetoric sounds good when communicated, but in real life that notion is tested again and again when “it” hits the fan, especially when an AI or HPC project has a priority-1 meltdown. The latency to problem resolution will be immensely costly.

Another, more inconspicuous, consideration for high-speed storage in the software-defined storage argument comes from a 3-legged stool story. I mentioned this in one of my blogs years ago. I recall an email conversation with Shahar Frank, the key architect of the XtremIO storage technology, just before EMC® acquired XtremIO.

The 3 legs of the 3-legged stool are Performance, Scalability and Reliability. I remember Shahar saying very clearly that making a storage technology run at very high speed, and making it scale, was not difficult. The most difficult part, according to Shahar, was making it reliable enough for the other 2 legs – Performance and Scalability – to do very well. Designing reliability into a high-speed, highly scalable storage system is very, very hard.

Among most new entrants to the HPC storage market segment, and even more so in high-speed storage for AI accelerated computing, the deep integration that makes reliability the apex tenet of high-speed storage is rarely spoken about. Many continue to tout performance and scalability as their main selling points, leaving reliability and resiliency as secondary features, when it should be the other way around. My example of Weka’s N+2 highlights how the less reliable general-purpose hardware underneath gets obfuscated.

In the storage appliance model, reliability is baked into the entire storage technology.

Oh, he works for DDN now. 

Oh, many will say that I am writing this blog because I work for DDN now, or, in the past, iXsystems™, NetApp®, Hitachi Data Systems® or EMC®. I want to look good to my bosses and my new employer, so I am drinking the DDN Kool-Aid now. Yadda, yadda, yadda … Not so fast.

For many less risky projects, software-defined storage solutions are OK. I have run TrueNAS®, Proxmox and a slew of open-source storage and backup solutions over the years. They are fun, but these software-defined storage solutions are not capable of serving accelerated computing applications that demand extremely high throughput and I/O. Small and medium businesses use COTS x86 servers and try to be frugal with their storage purchases. Downtime and data loss are, relatively speaking, less costly to them.

Enterprise customers demand more. They are willing to pay for top-grade storage solutions along with premier support. They want to mitigate their business risks by investing in and partnering with the right storage solution vendors, even though these can be expensive. Downtime and data loss are costly, but relative to AI and HPC, they are tepid in comparison.

The third category is the AI and HPC customers. Their demands and requirements for high-speed storage are way above those of enterprise storage. Their investments are nuclear relative to enterprise customers’, and so are the risks. Thus, it is my strong opinion that these customers should invest in purpose-built, mature HPC and AI storage appliance solutions for their projects.

Furthermore, the new generation of AI customers is less technical than the HPC customers before them. The risks are greater, because highly experienced and deeply skilled talent is less available to them. With the world of AI moving at light speed, betting on software-defined storage on x86 OEM boxes, or on software-defined enterprise storage, for AI and HPC requirements is akin to entering a regular car in an F1 race. It is a totally different mindset.

That is why I stick with optimized, purpose-built high-speed storage appliance solutions. They usually come with unique, tightly coupled hardware and software configurations for security, manageability, performance and more. The storage appliance is more efficient by design, and it is attended to with white-glove support and professional services.

I have been fortunate to be in the storage technology segment for over 30 years. I am here because I take a stand. I fight for what I believe is right. The storage appliance mindset is right, and it will win for high-profile AI and HPC projects.

