Daily Archives: October 1, 2011
I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!
I have been studying both similar Copy-on-Write (COW) file systems at the data structure level for a while now and I strongly believe ZFS is a better implementation of the COW file systems (also known as “shadow-paging” file system) than WAFL. How are both similar and how are both different? The angle we are looking at is not performance but about resiliency and reliability.
(Note: btrfs or “Butter File System” is another up-and-coming COW file system under GPL license and is likely to be the default file system for the coming Fedora 16)
In Computer Science, COW file system are tree-like data structures as shown below. They are different than the traditional Berkeley Fast File System data structure as shown below:
As some of you may know, Berkeley Fast File System is the foundation of some modern day file systems such as Windows NTFS, Linux ext2/3/4, and Veritas VxFS.
COW file system is another school of thought and this type of file system is designed in a tree-like data structure.
In a COW file system or more rightly named shadow-paging file system, the original node of the data block is never modified. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node and that parent node is linked to a higher parent node and so on all the way to the top-most root node, each parent and higher-parent nodes are modified as it traverses through the tree ending at the root node.
The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.
As each data block of either the leaf node (the last node in the tree) or the parent nodes are being modified, pointers to either the original data blocks or the copied data blocks are modified accordingly relative to the original tree structure, until the last root node at the top of the shadow tree is modified. Then, the COW file system commit is considered complete. Take note that the entire process of changing pointers and modifying copies of the nodes of the data blocks is done is a single I/O.
The root at the top for ZFS is called uberblock and called fsinfo in WAFL. Because an exact shadow of the tree-like file system is created when the data blocks are modified, this also gives birth to how snapshots are created in a COW file system. It’s all about pointers, baby!
Here’s how it looks like with the original data tree and the snapshot data tree once the shadow paging modifications are complete.
However, there are a few key features from the data integrity and reliability point of view where ZFS is better than WAFL. Let me share that with you.
In a nutshell, ZFS is a layered architecture that looks like this
The Data Management Unit (DMU) layer is one implementation that ensures stronger data integrity. The DMU maintains a checksum on the data in each data block by storing the checksum in the parent’s blocks. Thus if something is messed up in the data block (possibly by Silent Data Corruption), the checksum in the parent’s block will be able to detect it and also repair the data corruption if there is sufficient data redundancy information in the data tree.
WAFL will not be able to detect such data corruptions because the checksum is applied at the disk block level and the parity derived during the RAID-DP write does not flag this such discrepancy. An old set of slides I found portrayed this comparison as shown below.
Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks stores 3 copies of the metadata and this allows the recovery of lost metadata even if 2 copies of the metadata are deleted.
Therefore, the ability of ZFS to survive data corruption, metadata deletion is stronger when compared to WAFL .This is not discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern day file systems.
There are many other features within ZFS that have improved upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is a superset implementation of the traditional RAID-5 but with a different twist. Instead of using fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2 uses a dynamic variable stripe width. This addressed the parity RAID-4/5 “write hole” flaw, where incomplete or partial stripes will result in a “hole” that leads to file system fragmentation. RAID-Z/Z2 address this by filling up all blocks with variable stripe width. A parity can be calculated and assigned with any striped width, as shown below.
Other really cool stuff are Hybrid Storage Pool and the ability to create software-based caching using fast disk drives such as SSDs. This approach of creating ReadZilla (read caching) and LogZilla (write caching) eliminates the need for proprietary NVRAM as implemented in NetApp’s WAFL.
The only problem is, despite the super cool features of ZFS, most Oracle (not Sun) sales does not have much clue how to sell ZFS storage. NetApp, with its well trained and tuned, sales force is beating Oracle to pulp.