[Preamble: I was invited by GestaltIT as a delegate to their Tech Field Day for Storage Field Day 18 from Feb 27-Mar 1, 2019 in Silicon Valley, USA. My expenses, travel and accommodation were covered by GestaltIT, the organizer, and I was not obligated to blog about or promote the technologies presented at this event. The content of this blog represents my own opinions and views]
Vast Data coming out bash!
The delegates of Storage Field Day have always been a lucky bunch. We have witnessed several storage technology companies coming out of stealth at these Tech Field Days; the recent ones in memory for me were Excelero and Hammerspace. But to have the venerable storage doyen, Mr. Howard Marks, Vast Data's new tech evangelist, introduce the deep dive of Vast Data's technology was something special.
For those who know Howard, he is fiercely independent, very smart about storage technology, opinionated and not easily impressed. As a storage technology connoisseur myself, I believe Howard must have seen something special in Vast Data. They must be doing something so unique and impressive that someone like Howard could not resist, something that made him jump to the vendor side. This sets the tone of my blog.
The Vast Data architecture (not so deep dive)
From a high level, Vast Data is separated into Compute Nodes and Databoxes. The Compute Nodes host the Vast Universal File System (UFS) in containers but are stateless. The metadata is housed on the Intel Optane 3D XPoint media in the Databoxes, resulting in a loosely coupled cluster with a single global namespace across tens of thousands of Compute Nodes and thousands of Databoxes. Here is an architecture diagram of how the Vast Data pieces fit.
The UFS datastore is disaggregated and spread across all responsible Compute Nodes (called Element Stores), which defines the scaling framework of the Vast Data technology. In between, the chosen network is NVMe-over-Fabrics, with RoCEv2 and TCP probably the most viable transports to grow with.
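The stateless-compute idea above can be sketched in a few lines of Python. To be clear, this is purely my own illustration and not Vast Data's code; the point is that a Compute Node keeps no state of its own, so any node can serve any request simply by consulting the metadata held on the shared Databoxes.

```python
# My own illustrative sketch (not Vast Data's code): stateless compute
# nodes over shared-state databoxes.

class Databox:
    """Holds both the data and the metadata; the single source of truth."""
    def __init__(self):
        self.metadata = {}   # path -> block id
        self.blocks = {}     # block id -> bytes

    def put(self, path, payload):
        block_id = len(self.blocks)
        self.blocks[block_id] = payload
        self.metadata[path] = block_id

    def get(self, path):
        return self.blocks[self.metadata[path]]

class ComputeNode:
    """Stateless: holds only a handle to the shared Databoxes, no local caches."""
    def __init__(self, databoxes):
        self.databoxes = databoxes

    def read(self, path):
        # Every node resolves the same shared metadata, so nodes are
        # interchangeable and trivially replaceable.
        for box in self.databoxes:
            if path in box.metadata:
                return box.get(path)
        raise FileNotFoundError(path)

# Two nodes over the same databoxes give identical answers.
shared = [Databox()]
shared[0].put("/data/a", b"hello")
node1, node2 = ComputeNode(shared), ComputeNode(shared)
assert node1.read("/data/a") == node2.read("/data/a") == b"hello"
```

Because neither node holds state, losing one costs nothing: the surviving node answers from the same Databoxes, which is what makes the loose coupling scale.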
I like their Data Reduction
One of their key technologies which has impressed me is their Data Footprint Reduction (DFR) technology. Yes, I still use the word “footprint” (for legacy reasons) because I wrote about this concept, called Native Format Optimization, some 6 years ago. It was, I believed, the truest form of data reduction technology, getting the best out of the data. It very much reminded me of Ocarina Networks and the POC (proof-of-concept) my team and I did at Petronas on their SEG-Y and SEG-D data. The reduction was so astounding that Ocarina just blew away NetApp and Data Domain at the time. That was at the end of 2009.
Unlike the rigid confines of most DFR schemes (both fixed and variable block), Vast Data's reduction dedupes variable, GB-sized chunks of data with a unique hashed fingerprint and then clusters “close enough” deduped data chunks together. This technique resembles a master raw block with its incremental-forever offspring. And this, of course, is processed at the Intel Optane layer, resulting in fast writes and reads. Because the dedupe happens after the writes of the blocks have persisted, I would term this post-production dedupe, but the Optane SCM (Storage Class Memory) medium probably gives it inline deduplication speed as well. Oh, how times have changed!
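As a thought experiment, the two-stage idea can be approximated in Python. This is my own sketch, not Vast Data's algorithm: exact duplicates are caught by a cryptographic fingerprint, while “close enough” chunks collide on a coarse similarity key and are stored only as a delta against that group's reference block (the sketch assumes equal-sized chunks for simplicity).

```python
# My own sketch of similarity-based reduction (not Vast Data's algorithm).
import hashlib

def fingerprint(chunk: bytes) -> str:
    """Exact-match fingerprint for classic dedupe."""
    return hashlib.sha256(chunk).hexdigest()

def similarity_key(chunk: bytes, stride: int = 64) -> bytes:
    """Coarse key: sample every Nth byte so near-identical chunks collide."""
    return chunk[::stride]

def delta(reference: bytes, chunk: bytes) -> list:
    """Record only the byte positions where chunk differs from the reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, chunk)) if a != b]

class ReductionStore:
    def __init__(self):
        self.exact = {}       # fingerprint -> reference chunk (unique data)
        self.references = {}  # similarity key -> reference fingerprint
        self.deltas = []      # (reference fingerprint, delta) pairs

    def write(self, chunk: bytes):
        fp = fingerprint(chunk)
        if fp in self.exact:
            return                          # exact duplicate: store nothing
        key = similarity_key(chunk)
        if key in self.references:          # "close enough": keep only a delta
            ref_fp = self.references[key]
            self.deltas.append((ref_fp, delta(self.exact[ref_fp], chunk)))
        else:                               # first of its kind: becomes the master
            self.exact[fp] = chunk
            self.references[key] = fp

store = ReductionStore()
base = bytes(4096)                       # a 4 KB chunk of zeros
near = bytearray(base); near[10] = 1     # differs in a single byte
store.write(base)
store.write(base)                        # exact dupe: deduped away
store.write(bytes(near))                 # similar: stored as a 1-byte delta
assert len(store.exact) == 1 and len(store.deltas) == 1
```

The third write persists a one-entry delta instead of another 4 KB chunk, which is the “master block plus incremental offspring” pattern in miniature.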
Here’s a look at Vast Data’s data reduction technology.
The deduplicated blocks are compressed locally at the Databoxes for fast access, and the technology can even perform data reduction on already optimized data. As mentioned in this article, Commvault's already deduped/compressed data could be further reduced with Vast Data's reduction technology. Vast Data shared the compounded reduction result.
Will Vast Data continue to be the special one?
It is still too early to tell. NVMe-oF (NVMe-over-Fabrics) adoption is still at the doorstep. NVMe over RoCEv2/iWARP/TCP/Fibre Channel is still being sorted out by invested storage vendors. The embrace of QLC (Quad Level Cell) flash is slowly gaining momentum as SCM (Storage Class Memory) becomes the dominant performance tier, and new interconnects like CCIX and Gen-Z are also coming into the picture.
Where does Vast Data want to play? The QLCs in their Databoxes certainly mean that it will be general data workloads at the webscale level, where massive data lakes are becoming a problem. The data protection features and future replication definitely tell the tale that what Vast Data has engineered today is really for the future. In a year or two, many webscalers will reach breaking points with their invested infrastructures and data management services.
Commercial HPC workloads are also gaining ground, with Machine Learning/Deep Learning/Artificial Intelligence applications leading the onslaught. I believe data tools and pipeline frameworks should bring greater integration (and I stress the word “integration”) to new and future workloads.
At the Storage Field Day session, there were discussions addressing the secondary data/data protection market. With the lines blurring between primary and secondary, there may be just one tier of data, and one that will be a massive data lake for both storage and analytics.
I especially liked the loose coupling and stateless Compute Nodes, because modern edge computing demands faster yet simpler data storage and compute technology. Automation with disaggregated and composable frameworks at the edge is certainly a needed feature play. Partnerships with cloud/edge service providers seem to be a logical path for Vast Data's expansion plans.
I got great vibes from Vast Data, and I think with Mr. Howard Marks as Chief Rah-Rah, the path to becoming the special one is well lit. The executive team present at Storage Field Day were Renen Hallak (CEO and Co-Founder), Jeff Denworth (VP of Products and Co-Founder) and Michael Wing (President).
A great start for Vast Data. They must be doing something special.