ONTAP vs ZFS

I have to get this off my chest. Oracle’s Solaris ZFS is better than NetApp’s ONTAP WAFL! There! I said it!

I have been studying these two similar Copy-on-Write (COW) file systems at the data-structure level for a while now, and I strongly believe ZFS is a better implementation of the COW file system (also known as a “shadow-paging” file system) than WAFL. How are they similar, and how are they different? The angle we are looking at is not performance but resiliency and reliability.

(Note: btrfs, the “B-tree File System” (often pronounced “Butter FS”), is another up-and-coming COW file system under the GPL license, and is slated to be the default file system for the upcoming Fedora 16)

In computer science terms, COW file systems are tree-like data structures, as shown below. They are different from the traditional Berkeley Fast File System data structure:

As some of you may know, the Berkeley Fast File System is the foundation of some modern-day file systems such as Linux ext2/3/4 and Veritas VxFS, and its design influenced Windows NTFS.

The COW file system is another school of thought: this type of file system is designed as a tree-like data structure.

In a COW file system, or more rightly a shadow-paging file system, the original node of the data block is never modified in place. Instead, a copy of the node is created and that copy is modified, i.e. a shadow of the original node is created and modified. Since the node is linked to a parent node, and that parent node to a higher parent node, and so on all the way up to the top-most root node, each parent and higher parent node is copied and modified in turn as the change traverses up the tree, ending at the root node.

The diagram below shows the shadow-paging process in action as modifications of the node copy and its respective parent node copies traverse to the top of the tree data structure. The diagram is from ZFS but the same process applies to WAFL as well.

As the data blocks of a leaf node (the last node in the tree) or of the parent nodes are modified, pointers to either the original data blocks or the copied data blocks are updated accordingly, relative to the original tree structure, until the root node at the top of the shadow tree is updated. Only then is the COW file system commit considered complete. Take note that the entire process of changing pointers and modifying copies of the nodes becomes visible atomically, in a single I/O of the root block.

The root at the top of the tree is called the uberblock in ZFS and fsinfo in WAFL. Because an exact shadow of the tree-like file system is created when the data blocks are modified, this also gives birth to how snapshots are created in a COW file system: a snapshot is simply the old root, kept alive. It’s all about pointers, baby!

Here’s how it looks with the original data tree and the snapshot data tree once the shadow-paging modifications are complete.
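For the programmatically inclined, the path-copy mechanism described above (and why the old tree survives as a snapshot) can be sketched in a few lines. This is a toy model, not actual ZFS or WAFL code; the `Node` and `cow_update` names are my own inventions for illustration.

```python
# Toy sketch of shadow paging (copy-on-write), not ZFS/WAFL internals.
# Nodes are immutable; an update copies the leaf and every ancestor up
# to the root, then the new root is "committed" in one pointer swap.

class Node:
    def __init__(self, value=None, children=None):
        self.value = value               # data for leaf nodes
        self.children = children or {}   # name -> Node for interior nodes

def cow_update(node, path, value):
    """Return a NEW root with `path` set to `value`; `node` is untouched."""
    if not path:                         # reached the leaf: shadow it
        return Node(value=value)
    name, rest = path[0], path[1:]
    child = node.children.get(name, Node())
    new_children = dict(node.children)   # shadow the parent too...
    new_children[name] = cow_update(child, rest, value)  # ...with one new child
    return Node(children=new_children)

old_root = cow_update(Node(), ["dir", "file"], "v1")
new_root = cow_update(old_root, ["dir", "file"], "v2")   # the "commit"

# The old tree is fully intact: keeping old_root around IS the snapshot.
assert old_root.children["dir"].children["file"].value == "v1"
assert new_root.children["dir"].children["file"].value == "v2"
```

Nothing under `old_root` was touched by the second update, which is exactly why snapshots in a COW file system are nearly free: they are just retained root pointers.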

However, there are a few key features, from the data-integrity and reliability point of view, where ZFS is better than WAFL. Let me share them with you.

In a nutshell, ZFS is a layered architecture that looks like this:

The Data Management Unit (DMU) layer is one implementation detail that ensures stronger data integrity. The DMU maintains a checksum of the data in each data block by storing that checksum in the block’s parent. Thus, if something is messed up in a data block (possibly by silent data corruption), the checksum held in the parent block will detect it, and can also repair the corruption if there is sufficient redundancy information in the data tree.

WAFL will not be able to detect such data corruptions because the checksum is applied at the disk-block level, and the parity derived during the RAID-DP write does not flag such a discrepancy. An old set of slides I found portrayed this comparison, as shown below.
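The difference in where the checksum lives can be sketched as follows. This is a simplification of the idea, not ZFS or WAFL code: a checksum stored alongside a block validates whatever the disk happens to return, while a checksum stored in the parent validates what the block should contain.

```python
# Sketch (assumed simplification, not actual ZFS/WAFL code) of why a
# checksum stored in the PARENT block catches corruptions that a
# checksum stored WITH the data may miss.
import zlib

def cksum(data: bytes) -> int:
    return zlib.crc32(data)

# Self-checksummed block: if the disk silently returns a stale but
# internally consistent block (e.g. a lost write), the embedded
# checksum still matches and the corruption goes unnoticed.
stale_block = b"old contents"                      # written long ago
stale_record = (stale_block, cksum(stale_block))
data, stored = stale_record
assert stored == cksum(data)       # passes: the staleness is invisible

# Parent-stored checksum: the parent remembers what the child SHOULD
# contain, so a stale or misdirected block is caught on read.
expected_block = b"new contents"
parent_checksum = cksum(expected_block)            # kept in the parent node
assert parent_checksum != cksum(stale_block)       # mismatch -> detected
```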

Another cool feature that addresses data resiliency is the implementation of ditto blocks. Ditto blocks store up to three copies of the metadata, which allows the recovery of lost metadata even if two of the copies are deleted or corrupted.
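A toy sketch of the ditto-block idea (the function names and flat tuple format are my own, not the on-disk layout): write the metadata three times, and on read accept the first copy whose checksum still verifies.

```python
# Toy sketch of ditto blocks (assumed simplification, not ZFS internals):
# metadata is written in three copies; a read succeeds if ANY copy is good.
import zlib

def write_ditto(meta: bytes):
    record = (meta, zlib.crc32(meta))
    return [record, record, record]      # three independent copies

def read_ditto(copies):
    for data, stored in copies:
        if data is not None and zlib.crc32(data) == stored:
            return data                  # first intact copy wins
    raise IOError("all ditto copies lost or corrupt")

copies = write_ditto(b"inode table v7")
copies[0] = (None, 0)                    # copy 1: deleted
copies[1] = (b"garbage", copies[1][1])   # copy 2: silently corrupted
assert read_ditto(copies) == b"inode table v7"   # survives losing 2 of 3
```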

Therefore, the ability of ZFS to survive data corruption and metadata deletion is stronger when compared to WAFL. This is not to discredit NetApp’s WAFL. It is just that ZFS was built with stronger features to address the issues we have with storing data in modern-day file systems.

There are many other features within ZFS that improve upon NetApp’s WAFL. One such feature is the implementation of RAID-Z/Z2/Z3. RAID-Z is a superset of traditional RAID-5 but with a different twist: instead of using a fixed stripe width like RAID-4 or RAID-DP, RAID-Z/Z2 uses a dynamic, variable stripe width. This addresses the parity RAID-4/5 “write hole” flaw, where an interrupted update of a partial stripe can leave data and parity inconsistent; because every RAID-Z write is a full (variable-width) stripe write, there are no partial stripe updates to leave such a hole. Parity can be calculated and assigned for any stripe width, as shown below.
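The variable-stripe-width idea can be sketched with simple XOR parity. This is a heavily simplified toy, not the RAID-Z implementation: each logical write becomes its own full stripe, with however many data columns it needs plus one parity column.

```python
# Toy sketch of single-parity striping with a VARIABLE stripe width
# (RAID-Z-style idea, heavily simplified). Every write is a complete
# stripe, so no stripe is ever left partially written.
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_stripe(chunks):
    """chunks: equal-size byte columns for ONE write; appends XOR parity."""
    return chunks + [reduce(xor, chunks)]    # stripe width varies per write

def rebuild(stripe, lost):
    """Recover column `lost` by XORing all surviving columns."""
    return reduce(xor, (c for i, c in enumerate(stripe) if i != lost))

small = make_stripe([b"AA", b"BB"])          # 2 data columns + parity
large = make_stripe([b"AA", b"BB", b"CC"])   # 3 data columns + parity
assert rebuild(small, 0) == b"AA"            # lost disk 0, recovered
assert rebuild(large, 2) == b"CC"            # works at any stripe width
```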

Other really cool stuff includes the Hybrid Storage Pool and the ability to create software-based caching using fast drives such as SSDs. This approach of creating a ReadZilla (read caching, the L2ARC) and a LogZilla (write caching, the separate ZFS intent log) eliminates the need for proprietary NVRAM as implemented in NetApp’s WAFL.
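The hybrid read-caching idea can be sketched as a small two-tier store. This is a toy model, not the ARC/L2ARC code; the class name and the simple LRU policy are my assumptions for illustration.

```python
# Toy sketch of the Hybrid Storage Pool read path (assumed simplification):
# reads are served from a fast SSD-backed cache when possible, falling back
# to the slow data pool; a capped LRU stands in for the SSD capacity.
from collections import OrderedDict

class HybridReadCache:
    def __init__(self, ssd_capacity, backing_store):
        self.ssd = OrderedDict()          # stands in for the SSD read cache
        self.capacity = ssd_capacity
        self.disk = backing_store         # stands in for the SATA data pool
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.ssd:          # fast path: SSD cache hit
            self.ssd.move_to_end(block_id)
            self.hits += 1
            return self.ssd[block_id]
        self.misses += 1                  # slow path: go to the data pool
        data = self.disk[block_id]
        self.ssd[block_id] = data         # promote into the cache
        if len(self.ssd) > self.capacity:
            self.ssd.popitem(last=False)  # evict least-recently-used block
        return data

pool = {n: f"block-{n}" for n in range(10)}
cache = HybridReadCache(ssd_capacity=4, backing_store=pool)
for n in [0, 1, 0, 1, 2, 0]:
    cache.read(n)
assert (cache.hits, cache.misses) == (3, 3)
```

Because the cache is ordinary software over ordinary SSDs, any x86 box can play this role; that is the appeal over battery-backed proprietary NVRAM.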

The only problem is, despite the super cool features of ZFS, most of the Oracle (not Sun) sales force does not have much of a clue about how to sell ZFS storage. NetApp, with its well-trained and well-tuned sales force, is beating Oracle to a pulp.


About cfheoh

I am a technology blogger with 20 years of IT experience. I write heavily on technologies related to storage networking and data management because that is my area of interest and expertise. I introduce technologies with the objective of getting readers to *know the facts*, and to use that knowledge to cut through the marketing hype, FUD (fear, uncertainty and doubt) and other fancy stuff. Only then will there be progress. I am involved in SNIA (Storage Networking Industry Association) and am presently the Chairman of SNIA Malaysia. My day job is to run 2 companies - Storage Networking Academy and ZedFS Systems. Storage Networking Academy provides open-technology courses in storage networking (foundation to advanced) as well as cloud computing. We strive to ensure vendor neutrality as much as we can in our offerings. At ZedFS Systems, we offer both storage infrastructure and a software solution called Zed OpenStorage. We developed Zed OpenStorage based on the Solaris kernel, the ZFS file system and DTrace storage analytics.

Posted on October 1, 2011, in Filesystems, Reliability. 17 Comments.

  1. Does ZFS work well with SSDs?

    • Hello Tham

      SSDs are perfect for ZFS because with HSP (Hybrid Storage Pools), you can create both a ReadZilla with read-biased SSDs (NAND flash) and a LogZilla with write-biased SSDs (DRAM-based, or NAND flash with supercapacitors), leaving SATA drives for the data volumes.

      This means that you can easily build an x86-based server into a ZFS storage appliance without proprietary hardware such as the NVRAM in NetApp or the LCC in EMC VNX. This drives the cost down and makes the appliance significantly cheaper to integrate.

      Hope this helps.

      Thanks
      /Chin Fah

  2. Can a zpool be extended on the fly?

    • Hi Tham,

      Yes. You can add another vdev (a mirror, raidz or raidz2 set) to an existing pool, e.g. (with `mypool` as a placeholder pool name):

      # zpool add mypool raidz cXtXdX cXtXdX cXtXdX

      and the zpool capacity grows on the fly. You can also add vdevs of different RAID levels to the same zpool, but some best-practice considerations have to be in place.

      Thanks
      /Chin Fah

  3. And the coolest feature of ZFS (which is not mentioned): it does software dynamic (thin) provisioning too! Not to mention it’s 128-bit.

    You can carve a zpool into multiple mount paths and add disks on the fly when they’re running out of space. No more orphaned or leftover disk space at /var, /tmp or /opt.

    By far the most impressive filesystem I’ve worked with – but….limited to Solaris (and Linux?) at the moment.

    • Hi Alex

      You are spot on but I am curious how are you involved with ZFS. Is your company using it?

      The reason I asked is that not many people use Sun storage solutions, let alone ZFS. I wish there were more people like you out there appreciating how awesome ZFS is.

      See you on Thursday. Thanks
      /Chin Fah

  4. Kristoffer Egefelt

    Hi Chin,

    I think the primary issue in creating an open-source filer based on ZFS, or maybe even btrfs, is finding hardware that supports redundant controllers.

    I’m trying to create a redundant storage system for VM storage, and am evaluating Gluster and DRBD.

    Do you have any input on this?

    Thanks
    Kristoffer

    • Hi Kristoffer

      Thanks for reading my blog. I am surprised it got all the way to you.

      Yes, I agree with you. This has been our challenge since we started shipping our ZFS storage appliance 3 months ago. My partner and I, he being the more technical one, are very much Sun/Solaris inclined. We started testing with Sun Cluster 3.2 (or whatever was left of it after being devoured by Oracle), and we still do not have a working feature yet. Right now, we position replication as an availability feature, and address customers that do not strictly need clustering.

      We have experimented with Oracle Solaris, OpenIndiana, Illumos, and NexentaCore but have yet to settle the clustering portion. Our hardware partner, Supermicro, has been getting feedback from us.

      I like Gluster. I watched a webcast of it and I was impressed. But I have yet to test it out. My partner is kinda like a Solaris bigot, so there are personal challenges as well when it comes to developing our product roadmap. Ha ha.

      Thanks and all the best
      /Chin-Fah

      • Kristoffer Egefelt

        It’s a good blog and you write about a lot of stuff that I google 😉

        Sounds exciting about your own storage appliances – I hope it’s going well.

        If you have any ideas for the clustering part let me know, I have around 12 months of testing before I must decide, so I have time.

        My main problem is that if I want the redundancy of a NetApp filer, I need to double the storage servers/hard drives to get cluster functionality with Gluster, NexentaCore etc.

        This way, the cost of a NetApp is not that much higher than buying commodity hardware, because I have to buy two of everything, which also doubles the power costs.

        I guess that two RAID controllers connected to the same disks is too complicated for commodity hardware? Are you aware of any HW providers offering this kind of system?

        Thanks
        Kristoffer

      • Hi Kristoffer,

        Sorry for the late reply. It’s been a busy week.

        The clustering bit is something that we get questioned about in the field. We want to keep the clustering on the HW side as simple as possible, without proprietary HW. NetApp uses proprietary NVRAM and the Intel VI interconnect for their clustering. We don’t want that.

        We want to just use a Gigabit or 10Gigabit Interconnect without any special internal PCIe card. We were using Solaris OpenHA for a while but Oracle is really killing the whole open-source thing.

        We were thinking of a newer approach to clustering, a la Oracle RAC. HP LeftHand (P4000) has an interesting concept called Network RAID, which bypasses the dedicated-interconnect concept. The clustering is based on network nodes, which I believe is easier to do. We are still arguing about it ;-(

        If I come across any specific HW providers providing dual RAID controllers, I will let you know.

        Have a good weekend

        /Chin-Fah

  5. I have a question about WAFL:

    WAFL stores metadata in files, unlike traditional UFS, which stores inodes separately in fixed disk blocks.
    UFS (and similar file systems) can thus keep these inode blocks close to the data blocks (hence reducing the seek time for data-metadata access), whereas in WAFL, since the metadata is stored in separate files, those files will occupy a few particular blocks, so metadata and data blocks can’t be close to each other, thereby increasing the seek time.
    Thus, from what I can understand, storing metadata in files should have a worse effect on performance.
    But it is known that WAFL performs well for large storage compared to many other file systems.

    Does this have something to do with other aspects of the NetApp appliance, or am I wrong in the above reasoning?

    I’ll be really grateful if you could answer it.

    Thanks in advance!

    • Hi Piyush

      There are a few things that help with your concern. Asynchronous writes and the caching of the metadata files in WAFL, coupled with the design of shadow-paging file systems, help overcome the seek and latency concerns relative to traditional file systems based on the Fast File System design. It is pretty much the same in ZFS, because the framework of the file system is based on shadow paging. The new file system coming in Windows Server 8, ReFS, is also based on a shadow-paging design.

      Obviously, there is no perfect file system, but modern-day file system designs are leaning towards shadow paging for apparent reasons.

      Hope this helps.

      Thanks
      /Chin-Fah

  6. ZFS lets you add more drives as more vdevs, yes, but it won’t redistribute existing data to take advantage of the new spindles. That would require block-pointer rewrite, which has been talked about for at least three years, but not much has happened. With earlier versions of zpool (on OpenIndiana, up to 12 or 18 months ago), adding new drives to a system whose existing vdevs were already pretty full would result in a rather dramatic performance penalty. This has since been somewhat mended by avoiding writes to full vdevs, but a real solution is still somewhere in the queue. To extend a zpool without adding more overhead, either destroy the pool, recreate it and restore, or replace each drive in the pool with a larger one (given zpool autoexpand=on). The latter keeps the configuration and allows for better performance.

    roy

