Crash consistent data recovery for ZFS volumes

While TrueNAS® CORE and TrueNAS® Enterprise are more well known for its NAS (network attached storage) prowess, many organizations are also confidently placing their enterprise applications such as hypervisors and databases on TrueNAS® via SANs (storage area networks) as well. Both iSCSI and Fibre Channel™ (selected TrueNAS® Enterprise storage models) protocols are supported well.

To reliably protect these block-based applications via the SAN protocols, ZFS snapshot is the key technology that can be dependent upon to restore the enterprise applications quickly. However, there are still some confusions when it comes to the state of recovery from the ZFS snapshots. On that matter, this situations are not unique to the ZFS environments because as with many other storage technologies, the confusion often stem from the (mis)understanding of the consistency state of the data in the backups and in the snapshots.

Crash Consistency vs Application Consistency

To dispel this misunderstanding, we must first begin with the understanding of a generic filesystem agnostic snapshot. It is a point-in-time copy, just like a data copy on the tape or in the disks or in the cloud backup. It is a complete image of the data and the state of the data at the storage layer at the time the storage snapshot was taken. This means that the data and metadata in this snapshot copy/version has a consistent state at that point in time. This state is frozen for this particular snapshot version, and therefore it is often labeled as “crash consistent“.

In the event of a subsystem (application, compute, storage, rack, site, etc) failure or a power loss, data recovery can be initiated using the last known “crash consistent” state, i.e. restoring from the last good backup or snapshot copy. Depending on applications, operating systems, hypervisors, filesystems and the subsystems (journals, transaction logs, protocol resiliency primitives etc) that are aligned with them, some workloads will just continue from where it stopped. It may already have some recovery mechanisms or these workloads can accept data loss without data corruption and inconsistencies.

Some applications, especially databases, are more sensitive to data and state consistencies. That is because of how these applications are designed. Take for instance, the Oracle® database. When an Oracle® database instance is online, there is an SGA (system global area) which handles all the running mechanics of the database. SGA exists in the memory of the compute along with transaction logs, tablespaces, and open files that represent the Oracle® database instance. From time to time, often measured in seconds, the state of the Oracle® instance and the data it is processing have to be synched to non-volatile, persistent storage. This commit is important to ensure the integrity of the data at all times.

Continue reading

The burgeoning world of NVMe

When I wrote this article “Let’s smoke this storage peace pipe” 5 years ago, I quoted:

NVMe® and NVM®eF‰, as it evolves, can become the Great Peacemaker and bringing both divides and uniting them into a single storage fabric.

I envisioned NVMe® and NVMe®oF™ setting the equilibrium at the storage architecture level, finishing the great storage fabric into one. This balance in the storage ecosystem at the storage interface specifications and language-protocol level has rapidly unifying storage today, and we are already seeing the end-to-end NVMe paths directly from the PCIe bus of one host to another, via networks over Ethernet (with RoCE, iWARP, and TCP flavours) and Fibre Channel™. Technically we can have an end point device, example a tablet, talking the same NVMe language to its embedded storage as well as a cloud NVMe storage in an exascale storage far, far away. In the past, there were just too many bridges, links, viaducts, aqueducts, bypasses, tunnels, flyovers to cross just to deliver a storage command, or a data in a formats, encased and encoded (and decoded) in so many different ways.

Colours in equilibrium, like the rainbow

Simple basics of NVMe®

SATA (Serial Attached ATA) and SAS (Serial Attached SCSI) are not optimized for solid state devices. besides legacy stuff like AHCI (Advanced Host Controller Interface) in SATA, and archaic SCSI-3 primitives in SAS, NVM® has so much to offer. It can achieve very high bandwidth and support 65,535 I/O queues, each with a queue depth of 65,535. The queue depth alone is a massive jump compared to SAS which has a queue depth limit of 256.

A big part of this is how NVMe® handles I/O processing. It has a submission queue (SQ) and a completion queue (CQ), and together they are know as a Queue Pair (QP). The NVMe® controller handles tens of thousands at I/Os (reads and writes) simultaneously, alerted to switch between each SQ and CQ very quickly using the MSI or MSI-X interrupt. Think of MSI and MSI-X as a service bell, a hardware register that informs the NVM® controller when there are requests in the SQ, and informs the hosts that there are completed requests in the CQ. There will be plenty of “dings” by the MSI-X service register but the NVMe® controller can perform it very well, with some smart interrupt coalescing.

NVMe I/O processing

NVMe® 1.1, as I recalled, used to be have 3 admin commands and 10 base commands, which made it very lightweight compared to SCSI-3. However, newer commands were added to NVMe® 2.0 specifications included command sets fo key-value operations and zoned named space.

Continue reading

The future of Fibre Channel in the Cloud Era

The world has pretty much settled that hybrid cloud is the way to go for IT infrastructure services today. Straddled between the enterprise data center and the infrastructure-as-a-service in public cloud offerings, hybrid clouds define the storage ecosystems and architecture of choice.

A recent Blocks & Files article, “Broadcom server-storage connectivity sales down but recovery coming” caught my attention. One segment mentioned that the server-storage connectivity sales was down 9% leading me to think “Is this a blip or is it a signal that Fibre Channel, the venerable SAN (storage area network) protocol is on the wane?

Fibre Channel Sign

Thus, I am pondering the position of Fibre Channel SANs in the cloud era. Where does it stand now and in the near future? Continue reading

Windows SMB synchronous writes with OpenZFS

Sometimes I get really pissed off with myself because I have taken a bigoted view, and ended up with eggs on my face. The past week was like that, and the problem was gnawing me on the inside all week, because I was determined to balance my equilibrium by finding the answer.

Early in the week, I was having a conversation with a potential customer. It evolved around the missing 10 seconds or so of the video footage between the users of a popular video editing software. The company had 70% Windows users, and 30% users on the Mac, both sides accessing the NAS device. The issue was the editors on the Windows side will store the raw and edited files to the NAS, but when the Mac users read them, they will often find 10 seconds or so of the stored video files missing.

The likeliest culprit of this problem is the way the SMB protocol write I/O behaves in Windows and in MacOS. Windows SMB, by default, writes I/O asynchronously while SMB on MacOS writes I/O synchronously.

I had a strong conviction I had the answer to this issue but this was not a TrueNAS®, It was another brand of NAS that I did not have knowledge of, and so, I left the conversation feeling quite embarrassed because I had the answer only on the TrueNAS® server side, not on the Windows client side. Bigotry blinded me. Hmmph! 

SMB (Server Message Block) client-server model

Continue reading

Layers in Storage – For better or worse

Storage arrays and storage services are built upon by layers and layers beneath its architecture. The physical components of hard disk drives and solid states are abstracted into RAID volumes, virtualized into other storage constructs before they are exposed as shares/exports, LUNs or objects to the network.

Everyone in the storage networking industry, is cognizant of the layers and it is the foundation of knowledge and experience. The public cloud storage services side is the same, albeit more opaque. Nevertheless, both have layers.

In the early 2000s, SNIA® Technical Council outlined a blueprint of the SNIA® Shared Storage Model, a framework describing layers and properties of a storage system and its services. It was similar to the OSI 7-layer model for networking. The framework helped many industry professionals and practitioners shaped their understanding and the development of knowledge in their respective fields. The layering scheme of the SNIA® Shared Storage Model is shown below:

SNIA Shared Storage Model – The layering scheme

Storage vendors layering scheme

While SNIA® storage layers were generic and open, each storage vendor had their own proprietary implementation of storage layers. Some of these architectures are simple, but some, I find a bit too complex and convoluted.

Here is an example of the layers of the Automated Volume Management (AVM) architecture of the EMC® Celerra®.

EMC Celerra AVM Layering Scheme

I would often scratch my head about AVM. Disks were grouped into RAID groups, which are LUNs (Logical Unit Numbers). Then they were defined as Celerra® dvols (disk volumes), and stripes of the dvols were consolidated into a storage pool.

From the pool, a piece of a storage capacity construct, called a slice volume, were combined with other slice volumes into a metavolume which eventually was presented as a file system to the network and their respective NAS clients. Explaining this took an effort because I was the IP Storage product manager for EMC® between 2007 – 2009. It was a far cry from the simplicity of NetApp® ONTAP 7 architecture of RAID groups and volumes, and the WAFL® (Write Anywhere File Layout) filesystem.

Another complicated layered framework I often gripe about is Ceph. Here is a look of how the layers of CephFS is constructed.

Ceph Storage Layered Framework

I work with the OpenZFS filesystem a lot. It is something I am rather familiar with, and the layered structure of the ZFS filesystem is essentially simpler.

Storage architecture mixology

Engineers are bizarre when they get too creative. They have a can do attitude that transcends the boundaries of practicality sometimes, and boggles many minds. This is what happens when they have their own mixology ideas.

Recently I spoke to two magnanimous persons who had the idea of providing Ceph iSCSI LUNs to the ZFS filesystem in order to use the simplicity of NAS file sharing capabilities in TrueNAS® CORE. From their own words, Ceph NAS capabilities sucked. I had to draw their whole idea out in a Powerpoint and this is the architecture I got from the conversation.

There are 3 different storage subsystems here just to provide NAS. As if Ceph layers aren’t complicated enough, the iSCSI LUNs from Ceph are presented as Cinder volumes to the KVM hypervisor (or VMware® ESXi) through the Cinder driver. Cinder is the persistent storage volume subsystem of the Openstack® project. The Cinder volumes/hypervisor datastore are virtualized as vdisks to the respective VMs installed with TrueNAS® CORE and OpenZFS filesystem. From the TrueNAS® CORE, shares and exports are provisioned via the SMB and NFS protocols to Windows and Linux respectively.

It works! As I was told, it worked!

A.P.P.A.R.M.S.C. considerations

Continuing from the layered framework described above for NAS, other aspects beside the technical work have to be considered, even when it can work technically.

I often use a set of diligent data storage focal points when considering a good storage design and implementation. This is the A.P.P.A.R.M.S.C. Take for instance Protection as one of the points and snapshot is the technology to use.

Snapshots can be executed at the ZFS level on the TrueNAS® CORE subsystem. Snapshots can be trigged at the volume level in Openstack® subsystem and likewise, rbd snapshots at the Ceph subsystem. The question is, which snapshot at which storage subsystem is the most valuable to the operations and business? Do you run all 3 snapshots? How do you execute them in succession in a scheduled policy?

In terms of performance, can it truly maximize its potential? Can it churn out the best IOPS, and deliver at wire speed? What is the latency we can expect with so many layers from 3 different storage subsystems?

And supporting this said architecture would be a nightmare. Where do you even start the troubleshooting?

Those are just a few considerations and questions to think about when such a layered storage architecture along. IMHO, such a design was over-engineered. I was tempted to say “Just because you can, doesn’t mean you should

Elegance in Simplicity

Einstein (I think) quoted:

Einstein’s quote on simplicity and complexity

I am not saying that having too many layers is wrong. Having a heavily layered architecture works for many storage solutions out there, where they are often masked with a simple and intuitive UI. But in yours truly point of view, as a storage architecture enthusiast and connoisseur, there is beauty and elegance in simple designs.

The purpose here is to promote better understanding of the storage layers, and how they integrate and interact with each other to deliver the data services to the network. In the end, that is how most storage architectures are built.

 

Kubernetes Persistent Storage Managed Well

[ Disclosure: This is a StorPool Storage sponsored blog ]

StorPool Storage – Distributed Storage

There is a rapid adoption of Kubernetes in the enterprise and in the cloud. The push for digital transformation to modernize businesses for a cloud native world in the next decade has lifted both containerized applications and the Kubernetes container orchestration platform to an unprecedented level. The application landscape, especially the enterprise, is looking at Kubernetes to address these key areas:

  • Scale
  • High performance
  • Availability and Resiliency
  • Security and Compliance
  • Controllable Costs
  • Simplified

The Persistent Storage Question

Enterprise applications such as relational databases, email servers, and even the cloud native ones like NoSQL, analytics engines, demand a single data source of truth. Fundamentals properties such as ACID (atomicity, consistency, isolation, durability) and BASE (Basic Availability, Soft State, Eventual Consistency) have to have persistent storage as the foundational repository for the data. And thus, persistent storage have rallied under Container Storage Interface (CSI), and fast becoming a de facto standard for Kubernetes. At last count, there are more than 80 CSI drivers from 60+ storage and cloud vendors, each providing block-level storage to Kubernetes pods.

However, at this juncture, Kubernetes is still very engineering-centric. Persistent storage is equally as challenging, despite all the new developments and hype around it.

Continue reading

A Paean to NFS

It is certainly encouraging to see both NAS protocols, NFS and SMB, featured well in the latest VMware® vSAN 7 Update 1 release. The NFS v3 and v4.1 support was already in vSAN 7.0 when it was earlier announced as part of its Native File Services for vSAN. But some years ago, NFS was not always the primary storage protocol of choice. SAN protocols, Fibre Channel and iSCSI, were almost always designated to serve enterprise applications. At the client side, Windows became prominent, and the SMB/CIFS protocol dominated the landscape of the desktop. This further pushed NFS into the back closet.

NFS or Network File System has its naysayers. The venerable, but often maligned distributed network file protocol is 36 years today. In storage vendors such as NetApp®, VAST Data, Pure Storage FlashBlade, and Dell EMC Isilon, NFS is still positioned as the primary file protocol for manufacturing testers on the shop floor, EDA/eCAD applications, seismic and subsurface applications in Oil & Gas and many more. In another development, just like its presence in the vSAN Native Services,, NFS has also quietly embedded itself into many storage platforms to serve the data platform services within the respective framework itself.

And I have experienced NFS from the client side to the enterprise applications and more, and I take this opportunity to pay tribute.

NFS (Network File System) client server network

NFS (Network File System) client server network

Continue reading

FreeNAS 11.2 & 11.3 eBook

[ Full disclosure: I work for iXsystems™ Inc. This eBook was 3/4 completed when I joined on July 1, 2020 ]

I am releasing my FreeNAS™ eBook today. It was completed about 4 weeks ago, but I wanted the release date to be significant which is August 31, 2020.

FreeNAS logo

Why August 31st? Because today is Malaysia’s Independence Day.

Why the book?

I am an avid book collector. To be specific, IT and storage technology related books. Since I started working on FreeNAS™ several years ago, I wanted to find a book to learn. But the FreeNAS™ books in the market are based on an old version of FreeNAS™. And the FreeNAS™ documentation is a User Guide where it explains every feature without going deeper with integration of real life networking services, and situational applications such as SMB or NFS client configuration.

Since I have been doing significant amount of feature “testings” of FreeNAS™ from version 9.10 till the present version 11,3 on Virtualbox™, I have decided to fill that gap. I have decided to write a cookbook-style FreeNAS™ on Virtualbox™ that covers most of the real-life integration work with various requirements including Active Directory, cloud integration and so on. All for extending beyond the FreeNAS™ documentation.

Continue reading

Tiger Bridge extending NTFS to the cloud

[Disclosure: I was invited by GestaltIT as a delegate to their Storage Field Day 19 event from Jan 22-24, 2020 in the Silicon Valley USA. My expenses, travel, accommodation and conference fees were covered by GestaltIT, the organizer and I was not obligated to blog or promote the vendors’ technologies to be presented at this event. The content of this blog is of my own opinions and views]

The NTFS File System has been around for more than 3 decades. It has been the most important piece of the Microsoft Windows universe, although Microsoft is already replacing it with ReFS (Resilient File System) since Windows Server 2012. Despite best efforts from Microsoft, issues with ReFS remain and thus, NTFS is still the most reliable and go-to file system in Windows.

First reaction to Tiger Technology

When Tiger Technology was first announced as a sponsor to Storage Field Day 19, I was excited of the company with such a cool name. Soon after, I realized that I have encountered the name before in the media and entertainment space.


Continue reading

The waning light of OpenStack Swift

I was at the 9th Openstack Malaysia anniversary this morning, celebrating the inception of the OpenInfra brand. The OpenInfra branding, announced almost a year ago, represented a change of the maturing phase of the OpenStack project but many have been questioning its growing irrelevance. The foundational infrastructure components – Compute (Nova), Image (Glance), Object Storage (Swift) – are being shelved further into the back closet as the landscape evolved in recent years.

The writing is on the wall

Through the storage lens, I already griped about the conundrum of OpenStack storage in Malaysia in last year’s 8th anniversary. And at the thick of this conundrum is OpenStack Swift. The granddaddy of OpenStack storage has not gotten much attention from technology vendors and service providers alike. For one, storage vendors have their own object storage offering, and has little incentive to place OpenStack Swift into their technology development. Continue reading