The engineering of Elastifile

[Preamble: I was a delegate of Storage Field Day 12. My expenses, travel and accommodation were paid for by GestaltIT, the organizer and I was not obligated to blog or promote the technologies presented in this event]

When it comes to large scale storage capacity requirements with distributed cloud and on-premise capability, object storage is all the rage. Amazon Web Services started the object-based S3 storage service more than a decade ago, and the romance with object storage started.

Today, there are hundreds of object-based storage vendors out there, touting features after features of invincibility. But after researching and reading through many design and architecture papers, I found that many object-based storage technology vendors began to sound the same.

At the back of my mind, object storage is not easy when it comes to most applications integration. Yes, there is a new breed of cloud-based applications with RESTful CRUD API operations to access object storage, but most applications still rely on file systems to access storage for capacity, performance and protection.

These CRUD and CRUD-like APIs are the common semantics of interfacing object storage platforms. But many, many real-world applications do not have the object semantics to interface with storage. They are mostly designed to interface and interact with file systems, and secretly, I believe many application developers and users want a file system interface to storage. It does not matter if the storage is on-premise or in the cloud.

Let’s not kid ourselves. We are most natural when we work with files and folders.

Implementing object storage also denies us the ability to optimally utilize Flash and solid state storage on-premise when the compute is in the cloud. Similarly, when the compute is on-premise and the flash-based object storage is in the cloud, you get a mismatch of performance and availability requirements as well. In the end, there has to be a compromise.

Another “feature” of object storage is its poor ability to handle transactional data. Most of the object storage do not allow modification of data once the object has been created. Putting a NAS front (aka a NAS gateway) does not take away the fact that it is still object-based storage at the very core of the infrastructure, regardless if it is on-premise or in the cloud.

Resiliency, latency and scalability are the greatest challenges when we want to build a true globally distributed storage or data services platform. Object storage can be resilient and it can scale, but it has to compromise performance and latency to be so. And managing object storage will not be as natural as to managing a file system with folders and files.

Enter Elastifile.

Continue reading

Captain Dynamo Storage System

My research on file systems brought me to an very interesting piece of article. It is titled “Dynamo: Amazon’s Highly Available Key-Value Store” dated 2007.

Yes, this is an internal storage systems designed and developed in Amazon to scale and support Amazon Web Services (AWS). It is a very complex piece of technology and the paper is highly technical (not for the faint of heart). And of all places, Amazon is probably the last place you think you would find such smart technology, but it’s true. AWS engineers are slowly revealing the many of their innovations (think Amazon Silk browser technology).

And it appears that many of the latest cloud-based computing and services companies such as Amazon, Google and many others have been developing new methods of storing data objects. These methods are very different from the traditional methods of storing data, and many are no longer adopting the relational database model (RDBMS) to scale their business.

The traditional 3-tier architecture often adopted by web-based (before the advent of “cloud”), is evolving. As shown in the diagram below:

the foundation tier is usually a relational database (or a distributed relational database), communicating with the back-end storage (usually a SAN).

All that is changing because the relational database model is not keeping up with the tremendous pace of the proliferation of web-based and cloud-based objects or unstructured data. As explained by Alex Iskold, a writer of ReadWriteWeb, there are scalability issues with the conventional relational database.


Before I get to the scalability issues mentioned in the above diagram, let me set the floor for discussion.

For theoretical schoolers of relational database, the term ACID defines and guarantees the transactional reliability of relational databases. ACID stands for Atomicity, Consistency, Isolation and Durability. According to Wikipedia, “transactions provide an “all-or-nothing” proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.”

ACID has been the cornerstone of relational database from the very beginning. But as the demands of greater scalability and greater distribution of data, all 4 components of ACID – Atomicity, Consistency, Isolation, Durability – can no longer hold true. Hence, the CAP Theorem.

CAP Theorem (aka Brewer’s Theorem) stands for Consistency, Availability and Partition Tolerance. In the ACM (Association of Computing Machinery) conference in 2000, Eric Brewer of University of California, Berkeley delivered the theorem. It states that it is impossible for a distributed computer system (or a database system) to simultaneously guarantee all 3 components – Consistency, Availability and Partition Tolerance.

Therefore, as the database systems become more and more distributed in cyberspace, the ACID theorem begins to break down. All 4 components of ACID cannot be guaranteed simultaneously anymore as the database systems begin to become more and more distributed.

So when we get back to the diagram, both the concepts on left and right – Master/Slave OR Multiple Peers – will put a tremendous strain on the single, non-distributed relational database.

New data models are surfacing to handling the very distributed data sets. Distributed object-based  “file systems” and NoSQL type of databases are some of the unconventional data storage “systems” that are beginning to surface as viable alternatives to the relational database method in cyberspace. And one of them is the Amazon Dynamo Storage System. (ADSS)

ADSS is a highly available, Amazon-proprietary key-value distributed data store. ADSS has both the properties of distributed hash table and a database and it is used internally to power various Cloud Services in Amazon Web Services (AWS).


It behaves like a relational database where it stores data objects to be retrieved. However, the data objects are not stored in a table format of a conventional relational database. Instead, the data is stored in a distributed hash table and data content or value is retrieved with a key, hence a key-value data model.

The data content is stored and retrieved through a simple put and get interface, much like how RESTful would do it. From the article in ReadWriteWeb, here’s how Dynamo works:

  • Physical nodes are thought of as identical and organized into a ring.
  • Virtual nodes are created by the system and mapped onto physical nodes, so that hardware can be swapped for maintenance and failure.
  • The partitioning algorithm is one of the most complicated pieces of the system, it specifies which nodes will store a given object.
  • The partitioning mechanism automatically scales as nodes enter and leave the system.
  • Every object is asynchronously replicated to N nodes.
  • The updates to the system occur asynchronously and may result in multiple copies of the object in the system with slightly different states.
  • The discrepancies in the system are reconciled after a period of time, ensuring eventual consistency.
  • Any node in the system can be issued a put or get request for any key

The Dynamo architecture addresses the CAP Theorem well. It is highly available, where nodes, either physical or virtual,  can be easily swapped without affected the storage services. It is also high performance, nodes (again physical or virtual) can be added to boost the performance. The high performance and highly available components addresses the “A” piece of CAP.

Its distributed nature also allows it to scale to billions and billions of data objects and hence meets the “P” requirement of CAP. The Partitioning Tolerance is definitely there.

However, as stated by CAP Theorem, you can’t have all 3 happening at the same time. Therefore, the “C” or Consistency piece of CAP has to be compromised. That is why Dynamo has been labeled an “eventually consistency” storage system.

As data is stored into ADSS, the changes of the data is propogated and will be asynchronously replicated to other nodes in the system, eventually making all the data objects and its value consistent. However, given the speed of things in cyberspace and the nature of most Cloud Computing services, the consistency piece could be difficult to accomplish and that is OK because in most of the transactions that are distributed, inconsistency is acceptable.

So that’s a bit about the Amazon Dynamo. Alas, we may never get our grubby hands on this piece of cool data storage and management technology, but knowing that Dynamo is powering AWS and its business is an eye-opener for us into the realm of a new technology evolution.