Ryussi MoSMB – High performance SMB

I am back in the Silicon Valley as a Storage Field Day 12 delegate.

One of the early presenters was Ryussi, who was sharing a proprietary SMB server implementation of Linux and Unix systems. The first thing which comes to my mind was why not SAMBA? It’s free; It works; It has the 25 years maturity. But my experience with SAMBA, even in the present 4.x, does have its quirks and challenges, especially in the performance of large file transfers.

One of my customers uses our FreeNAS box. It’s a 50TB box for computer graphics artists and a rendering engine. After running the box for about 3 months, one case escalated to us was the SMB shares couldn’t be mapped all of a sudden. All the Windows clients were running version 10. Our investigation led us to look at the performance of SMB in the SAMBA 4 of FreeNAS.

This led to other questions such as the vfs_aio_pthread, FreeBSD/FreeNAS implementation of asynchronous I/O to overcome the performance weaknesses of the POSIX AsyncIO interface. The FreeNAS forum is flooded with sightings of missing SMB service that during large file transfer. Without getting too deep into the SMB performance issue, we decided to set the “Server Minimum Protocol” and “Server Maximum Protocol” to be SMB 2.1. The FreeNAS box at the customer has been stable now for the past 5 months.

Continue reading

The big boys better be flash friendly

An interesting article came up in the news this week. The article, from the ever popular The Register, mentioned 3 up and rising storage stars, Nimble Storage, Tintri and Tegile, and their assault on a flash strategy “blind spot” of the big boys, notably EMC and NetApp.

I have known about Nimble Storage and Tintri for a couple of years now, and I did take some time to read up on their storage technology offering. Tegile is new to me when it appeared on my radar after SearchStorage.com announced as the Gold Winner of the enterprise storage category for 2012.

The Register article intriqued me because it implied that these traditional storage vendors such as EMC and NetApp are probably doing a “band-aid” when putting together their flash storage strategy. And typically, I see these strategic concepts introduced by these 2 vendors:

  1. Have a server-side cache strategy by putting a PCIe card on the hosting server
  2. Have a network-based all-flash caching area
  3. Have a PCIe-based flash card on the storage system
  4. Have solid state drives (SSDs) in its disk shelves enclosures

In (1), EMC has VFCache (the server side caching software has been renamed to XtremSW Cache and under repackaging with the Xtrem brand name) and NetApp has it FlashAccel solution. Previously, as I was informed, FlashAccel was using the FusionIO ioTurbine solution but just days ago, NetApp expanded the LSI Nytro WarpDrive into its FlashAccel solution as well. The main objective of a server-side caching strategy using flash is to accelerate mostly read-based I/O operations for specific application workloads at the server side.

Continue reading

Protogon File System

I was out shopping yesterday and I was tempted to have lunch at Bar-B-Q Plaza, a popular Thai, Japanese-style hot plate barbeque restaurant in this neck of the woods. The mascot of this restaurant is Bar-B-Gon, a dragon-like character and it is obviously a word play of barbeque and dragon.

As I was reading the news this morning about the upcoming Windows Server 8 launch, I found out that ever popular, often ridiculed NTFS (NT File System) of Windows will be going away. It will be replaced by Protogon, a codename for the new file system that Microsoft is about to release. Protogon? A word play of prototype and dragon?

The new file system, with backward compatibility with NTFS, will be called ReFS or Resilient File System. And the design objectives of what Microsoft calls “next generation” file system are clear and adept to the present day requirements. I notably mentioned present day requirements for a reason, because when I went through the key features of ReFS, the concepts and the ideas are not exactly “next generation“. Many of these features are already present with most storage vendors we know of, but perhaps for the people in the Windows world, these features might sound “next generation” to them.

ReFS, to me, is about time. NTFS has been around for a long, long time. It was first known in the wild in the 1993, and gain prominence and wide acceptance in Windows 2000 as the “enterprise-ready” file system. Indeed it was, because that was the time Microsoft Windows started its dominance into the data centers when the Unix vendors were still bickering about their version of open standards. Active Directory (AD) and NTFS were the 2 key technologies that slowly, but surely, removed Unix’s strengths in the data centers.

But over the years, as the storage networking technologies like SAN and NAS were developing and maturing, I see the NTFS being little developed to meet the strengths of these storage networking technologies and relevant protocols in the data world. When I did  a little bit of system administration on Windows (2000, 2003 notably), I could feel that NTFS was developed with direct-attached storage (DAS) or internal disks in mind. Definitely not full taking advantage of the strengths of Fibre Channel or iSCSI SAN. It was only in Windows Server 2008, that I felt Microsoft finally had enough pussyfooting with SAN and NAS, and introduced a more decent disk storage management that incorporates features that works well natively with SAN. Now, Microsoft can no longer sit quietly without acknowledging the need to build enterprise-ready technologies related to storage networking and data management. And the core in the new Microsoft Windows Server 8 engine for that is the ReFS.

One of the key technology objectives in the design of ReFS is backward compatibility. Windows has a huge market to address and they cannot just shove NTFS away. The way they did was to maintain the upper level API and file semantics and having a new core file system engine as shown in the diagram below:

ReFS is positioned with resiliency in mind. Here are a few resilient features:

  • Ability to isolate fault and perform data salvation on parts of the file system without taking the entire file system or volume offline. The goal of REFS here is to be ONLINE and serving data all the time!
  • Checksumming data and metadata for integrity. It verifies all data, and in some cases, auto-correcting corrupted data
  • Optional integrity streams that ensures protection for all forms of file-level data corruption. When enabled, whenever a file is changed, the modified copy is written to a different area of the disk than that of the original file. This way, even if the write operation is interrupted and the modified file is lost, the original file is still intact. (Doesn’t this sounds like COW with snapshots?) When combined with Storage Spaces (we will talk about this later), which can store a copy of all files in a storage array on more than one physical disk, ReFS gives Windows a way to automatically find and open an uncorrupted version of a file In the event that a file on one of the physical disks becomes corrupted. Microsoft does not recommend integrity streams for applications or systems with a specific type of storage layout or applications which want better control in the disk storage, for example databases.
  • Data scrubbing for latent disk errors. There is an tool, integrity.exe which runs and manages the data scrubbing and integrity policies. The file attribute, FILE_ATTRIBUTE_NO_SCRUB_DATA, will allow certain applications to skip this options and have these applications control integrity policies beyond what ReFS has to offer.
  • Shared storage pools across machines for additional fault tolerance and load balancing (ala Oracle RAC perhaps?)
  • Protection against bit rot. Silent data corruption, which I have blogged about many, many moons ago.

End-to-end resilient architecture is the goal in mind.

From a file structure standpoint, here’s how ReFS looks like:

ReFS is Copy-on-Write (COW). As you know, I am a big fan of any file systems but COW is one that I am most familiar with. NetApp’s Data ONTAP, Oracle Solaris, ZFS and the upcoming Linux BTRFS are all implementations of COW. Similar to BTRFS, ReFS uses a B+ tree implementation and as described in Wikipedia,

ReFS uses B+ trees for all on-disk structures including metadata and file data. The file size, total volume size, number of files in a directory and number of directories in a volume are limited by 64-bit numbers, which translates to maximum file size of 16 Exbibytes, maximum volume size of 1 Yobibyte (with 64 KB clusters), which allows large scalability with no practical limits on file and directory size (hardware restrictions still apply). Metadata and file data are organized into tables similar to relational database. Free space is counted by a hierarchal allocator which includes three separate tables for large, medium, and small chunks. File names and file paths are each limited to a 32 KB Unicode text string.

In ReFS, Microsoft introduces Storage Spaces. And the concept is very, very similar to what ZFS is, with the seamless implementation of a volume manager, RAID management, and highly resilient file system. And ZFS is 10 years old. So much for ReFS being “next generation“.  But here is a series of screenshots of how Storage Spaces looks like:

And similar to this “flexible volume management” ala ONTAP FlexVol and ZFS file systems, you can add disk drives on the fly, and grow your volumes online and real time.

ReFS inherits many of the NTFS features as it inches towards the Windows Server 8 launch date. Some of the features mentioned were the BitLocker encryption, Access Control List (ACL) for security (naturally), Symbolic Links, Volume Snapshots, File IDs and Opportunistic Locking (Oplocks).

ReFS is intended to scale to as what Microsoft says, “to extreme limits“. Here is a table describing those limits:

ReFS new technology will certainly bring Windows to the stringent availability and performance requirements of modern day file systems, but the storage networking world is also evolving into the cloud computing space. Object-based file systems are also getting involved as market trends dictate new requirements and file systems, in order to survive, must continue to evolve.

Microsoft’s file system, NTFS took a long time to come to this present version, ReFS, but can Microsoft continue to innovate to change the rules of the data storage game? We shall see …

btrfs – a better butter from a better COW?

There’s better a lot of chatter about the default file system in the recently released Fedora 16. Fedora 16 was released on November 8, 2011. For months, there were rumours that the default file system for Fedora 16 will be btrfs (btree file system, or better known as butter-FS). But after the release, the default file system of Fedora 16 is still ext4, much to the dismay of many. btrfs holds a lot of potential because in the space of less than 4 years, it has moved up the hierarchy to be one of the top file systems in the Linux ecosystem.

File systems in Linux has not seen had a knight in shining armour for the long time. The file systems kings of the Linux world were ext2/3 for the RedHat and Debian flavoured distros and reiserfs for the SuSE  flavoured distros. ext4 is now default in most Linux distros but the concept of  ext2/3/4 has not changed much since its inception decades ago.

At the same time, reiserfs  had a lot of promise as well, but its development and progress have lost it lustre after its principal developer, Han Reiser was convicted of the murder of his wife a few years ago. If you are KPC (Malaysian Chinese colloquialism, meaning busybody), you can read the news here.

btrfs is going to be the new generation of file systems for Linux and even Ted T’so, the CTO of Linux Foundation and principal developer admitted that he believed btrfs is the better direction because “it offers improvements in scalability, reliability, and ease of management”.

For those who has studied computer science, B-Tree is a data structure that is used in databases and file systems. B-Tree is an excellent data structure to store billions and billions of objects/data and is able to provide fast data retrieval in logarithmic time. And the B-Tree implementation is already present in some of the file systems such as JFS, XFS and ReiserFS. However, these file systems are not shadow-paging filesystems (popularly known as copy-on-write file systems).

You see, B-Tree, in its native form of implementation, is very incompatible with COW file systems. In fact, the implementation was thought of impossible, until someone by the name of Ohad Rodeh came along. He presented a paper in Usenix FAST ’07 which described the challenges of combining the B-Tree concept with shadow paging file systems. The solution, as he described it, was to introduce insert/remove key into the tree structure, and removing the dependency of intra-leaf linking.

Chris Mason, one of the developers of reiserfs, took Ohad’s idea and created a shadow-paging file system based on the B-Tree idea.

Traditional file systems tend to follow the idea of the Berkeley Fast File Systems. Different cylinder groups have its own inode, bitmap and disk blocks. The used space in one cylinder group cannot be shared to another cylinder group, resulting in wastage. At the same time, performance can be an issue as the disk read/head frequently have to move to the inodes to find out where the next used or free blocks will be. In a way, it looks like the diagram below.

Chris Mason’s idea of btrfs made the file system looked like this:

Today, a quick check in btrfs wiki page, shows that the main Btrfs features available at the moment include:

  • Extent based file storage
  • 2^64 byte == 16 EiB maximum file size
  • Space-efficient packing of small files
  • Space-efficient indexed directories
  • Dynamic inode allocation
  • Writable snapshots, read-only snapshots
  • Subvolumes (separate internal filesystem roots)
  • Checksums on data and metadata
  • Compression (gzip and LZO)
  • Integrated multiple device support
  • RAID-0, RAID-1 and RAID-10 implementations
  • Efficient incremental backup
  • Background scrub process for finding and fixing errors on files with redundant copies
  • Online filesystem defragmentation

And Chris Mason and his team have still plenty more to offer for btrfs. It will likely be the default file system in Fedora 17 and at the rate it is going, could be the file system of choice for RedHat, Debian and SuSE distros very soon.

There’s been a lot of comparisons between btrfs and ZFS, since both are part of Oracle now. ZFS is obviously a much more mature file system, with more enterprise features and more robust, (Incidentally, ZFS just celebrated its 10th birthday on Halloween 2011 – see Matt Ahren’s blog) and the btrfs is the rising star in the Linux world. But at this moment, the 2 file systems are set apart in their market positioning and deployment.

ZFS is based on the CDDL (Common Development and Distribution License) while btrfs is based on GNU GPL, open source licensing. There are controversies surrounding CDDL licensing scheme, while is incompatible with GNU GPL scheme.

Oracle can count itself very lucky to have 2 of the most promising and prominent COW file systems around. It will be interesting to see what Oracle will do next. As a proponent of innovation, community and sharing, I sincerely hope that both file systems will continue to thrive in Oracle’s brutal, sales-driven organization. We certainly don’t want to see controversies about dual ownership BS of Oracle and mySQL that could end in 2015.