A very interesting report surfaced in front of me today. It is Information Week’s IT Pro ranking of Data Deduplication vendors, made available just a few weeks ago, and it gives an overview of the dedupe market so far.
It surveyed over 400 IT professionals from various industries, with companies ranging from fewer than 50 employees to over 10,000 employees and revenues from less than USD5 million to USD1 billion. Overall, it had a good mix of respondents, and the results were quite interesting.
It surveyed 2 segments:
- Overall performance – product reliability, product performance, acquisition costs, operations costs etc.
- Technical features – replication, VTL, encryption, iSCSI and FCoE support etc.
More than a year after Dell acquired Ocarina Networks, the technology finally surfaced last week in the form of the Dell DX Object Storage 6000G SCN (Storage Compression Node).
Ocarina is a content-aware storage optimization engine, and their solution is one of the best I have seen out there. Its unique ECOsystem technology, as described in the diagram below, is impressive.
Unlike most deduplication and compression solutions out there, the Ocarina Networks solution takes storage optimization a step further. Ocarina works at the file level, and given the crazy, crazy growth of unstructured files in the NAS space, the web and the clouds, storage optimization is a priority that has to be addressed immediately. It takes a 3-step process – Extract, Correlate and Optimize.
Today’s files are no longer a flat structure of a single object but more of a compound file, where many objects are amalgamated from different sources. Microsoft Office is a perfect example of this. An Excel file would consist of objects from Windows Metafile Formats, XML objects, OLE (Object Linking and Embedding) Compound Storage Objects and so on. (Note: That’s just Microsoft’s way of retaining monopolistic control). Similarly, a web page is a compound of XML, HTML, Flash, ASP and PHP object codes.
In Step 1, the technology takes a file and breaks it down into its basic components. It is kind of like taking a car apart down to its nuts and bolts and laying out every bit on the gravel porch. That is the “Extraction” process, and it decodes each file to get at its fundamental components.
Once the compound file object is “extracted”, identified and indexed, each fundamental object is Correlated in Step 2. The correlation is executed within the file and across all files under the purview of Ocarina. Matching and duplicated objects are flagged and deduplicated. The deduplication is done at the byte level, unlike most deduplication solutions, which operate at the block level. This deeper and more granular approach further reduces the storage capacity required, making Ocarina one of the most efficient storage optimization solutions currently available. That is why Ocarina can efficiently reduce the size of even zipped and highly encoded files.
It takes this storage optimization even further in Step 3. It applies content-aware compactors for each fundamental object type, uniquely compressing each object further. That means that there are specialized compactors for PDF objects, ZIP objects and so on. They even have compactors for Oil & Gas seismic files. At the time I was exposed to Ocarina Networks and evaluating it, it had 600+ unique compactors.
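To make the three steps a little more concrete, here is a toy sketch in Python. It is not Ocarina's implementation. It assumes a compound file is a ZIP container (as the modern Office formats are), it correlates whole components by SHA-1 hash rather than at the byte level, and it uses zlib as a stand-in for a content-aware compactor. The function names (make_container, extract, correlate, optimize) and the sample file names are made up for illustration.

```python
import hashlib
import io
import zipfile
import zlib


def make_container(members):
    """Build an in-memory ZIP archive standing in for a compound file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def extract(container):
    """Step 1 (Extract): decode the container into its component objects."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def correlate(files):
    """Step 2 (Correlate): index components by content hash, within and across files."""
    store = {}      # digest -> component bytes, kept only once
    manifest = {}   # file name -> {component name: digest}
    for file_name, container in files.items():
        manifest[file_name] = {}
        for component_name, data in extract(container).items():
            digest = hashlib.sha1(data).hexdigest()
            store.setdefault(digest, data)
            manifest[file_name][component_name] = digest
    return store, manifest


def optimize(store):
    """Step 3 (Optimize): compress each unique component (zlib as a placeholder compactor)."""
    return {digest: zlib.compress(data, 9) for digest, data in store.items()}


# Two hypothetical spreadsheets that embed the same logo image.
logo = b"\x89PNG" + bytes(range(256)) * 8
files = {
    "a.xlsx": make_container({"media/logo.png": logo, "sheet1.xml": b"<row>1</row>" * 50}),
    "b.xlsx": make_container({"media/logo.png": logo, "sheet1.xml": b"<row>2</row>" * 50}),
}
store, manifest = correlate(files)
packed = optimize(store)
```

The shared logo is stored once even though each container had already compressed it separately, which is the general reason decoding a file before looking for duplicates can help with zipped content.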
After Dell bought Ocarina in July 2010, the whole of Ocarina went into stealth mode. Many had already predicted that the Ocarina technology would be integrated and embedded into Dell’s primary storage solutions, Compellent and EqualLogic. It is not there yet, but will likely be soon.
Meanwhile, the first glimpse of Ocarina comes as a gateway solution to the Dell DX6000 Object Storage. DX Object Storage is a technology which Dell has OEMed from Caringo. DX6000 Object Storage (I did not read in depth) follows the concept of the old EMC Centera, but with a much newer approach based on XML and HTTP REST. It has a published open API, and Dell is getting ISV partners to develop their applications to interact with the DX6000. Commvault, EMC, Symantec and StoredIQ are some of the ISV partners working closely with Dell.
(24/10/2011: Editor note: Previously I associated Dell DX6000 Object Storage with Exanet. I was wrong and I would like to thank Jim Dtuton of Caringo for pointing out my mistake)
Ocarina’s first mission is to reduce the big, big capacities in the Big Data space of the DX6000 Object Storage, and the Ocarina ECOsystem technology looks like a good bet for Dell as a key technology differentiator.
One of the things that peeved me at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.
I have always placed both deduplication and compression into a bucket I called “Data Reduction”. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both attempt, and succeed, to reduce the capacity required to store a given amount of data, and this translates into benefits in storage management and on the network. With a smaller data set, less processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, since you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently, compared to its original state and size, making Disaster Recovery more feasible and so on.
I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of a file, deduplication attempts to identify similarities (duplicates), store one copy of that object and have the others reference that single copy. The differentiation methods commonly used are hashing and delta differential. In hashing, MD5 and SHA-1 are the popular hashing algorithms used, while in delta differentials, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.
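The hashing approach can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation. It assumes fixed-size 4 KB chunks (real products often use variable-size chunking) and SHA-1 as the fingerprint, and the names dedupe, rehydrate and CHUNK_SIZE are my own.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def dedupe(data, chunk_size=CHUNK_SIZE):
    """Split data into chunks, keep one copy per unique chunk, and record references."""
    store = {}   # fingerprint -> chunk bytes (each unique chunk stored once)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).digest()
        store.setdefault(fingerprint, chunk)  # a duplicate chunk is not stored again
        recipe.append(fingerprint)
    return store, recipe


def rehydrate(store, recipe):
    """Rebuild the original data by following the references in order."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Four chunks of input, but only two of them are unique.
sample = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
store, recipe = dedupe(sample)
```

The recipe is what makes restore slower than a plain read: every reference has to be looked up and stitched back together.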
There are many factors involved in deduplication: the types of data, the processing power required to do the deduplication task, the processing throughput and so on, all resulting in different deduplication ratios and different times required to complete the process. I am not going to delve into that, as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.
Meanwhile, compression (especially most commercial compression technology) is based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because, factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real-life object.
The technique works like this.
- When being encoded, a bit/byte or a set of bytes is compared to a “dictionary”, which is a pool of “words” in a data structure maintained by the encoding technology
- If a match is found, the bit/byte or set of bytes is substituted by a “word”, usually a much shorter (hence smaller) representation of the bytes being encoded.
- As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. When the dictionary is simply a window over the most recently encoded bytes, this is popularly known as the sliding window implementation.
- The end result is the data is highly encoded (heavily replaced) by “dictionary words” and of a much smaller size.
One of the most heavily implemented compression techniques is based on the theory and methodology introduced by Lempel and Ziv and further enhanced by Welch, giving the Lempel-Ziv-Welch (LZW) trio. A very good explanation of the LZ method can be found here.
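As a rough illustration of dictionary coding, here is a small textbook-style LZW encoder and decoder in Python. It is a sketch under simplifying assumptions: codes are kept as plain integers in a list (no bit packing) and the dictionary is allowed to grow without limit, which real implementations do not do. The function names are my own.

```python
def lzw_encode(data):
    """Encode bytes into a list of integer codes using a growing dictionary."""
    dictionary = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate              # keep extending the match
        else:
            codes.append(dictionary[current])  # emit the "word" for the longest match
            dictionary[candidate] = next_code  # learn a new "word" from bytes already seen
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes):
    """Rebuild the original bytes, reconstructing the same dictionary on the fly."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = previous + previous[:1]  # code refers to the word being defined right now
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_encode(sample)
```

The decoder never receives the dictionary. It rebuilds it from the codes themselves, which is why only the encoded stream needs to be stored.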
Both deduplication and compression have the same objective – that is to reduce the data size for more efficient storage. They approach it from different angles, but they are by no means mutually exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.
Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup), because plenty of duplicates can be found there, while in primary data sets (the data in production), deduplication and compression are deployed either individually or one after the other. Deduplication is usually run as Step 1 and then compression is run as Step 2.
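The two-step ordering can be sketched as follows. This is only an illustration of running dedupe first and compression second, assuming fixed 4 KB chunks, SHA-1 fingerprints and zlib as the compressor. The names reduce_data and restore are hypothetical.

```python
import hashlib
import zlib


def reduce_data(data, chunk_size=4096):
    """Step 1: dedupe at the chunk level. Step 2: compress each unique chunk."""
    store = {}   # fingerprint -> compressed unique chunk
    recipe = []  # ordered fingerprints describing the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).digest()
        if fingerprint not in store:
            # Only chunks that survive deduplication are compressed.
            store[fingerprint] = zlib.compress(chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe):
    """Rehydrate: decompress each referenced chunk and stitch them back in order."""
    return b"".join(zlib.decompress(store[fingerprint]) for fingerprint in recipe)
```

Running dedupe first means the compressor only ever sees each unique chunk once, so no CPU is spent compressing data that would have been discarded anyway.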
So far, the only one that has impressed me for primary data reduction is Ocarina Networks, which uses a 3-step approach: dedupe, compress and then apply specialized compactors to reduce the data even more. I have seen Ocarina reduce Schlumberger Geoframe and Petrel seismic data by more than 50%. That was impressive!
With my bothered state satisfied, I guess saying “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology”, but I am not complaining as much as I did before.