Category Archives: Deduplication
I am a bit surprised that primary storage deduplication has not taken off in a big way, unlike the times when the buzz of deduplication first came into being about 4 years ago.
When the first deduplication solutions came out, they were aimed squarely at the backup data space. Now more popularly known as secondary data deduplication, the technology has reduced the inefficiencies of backup and helped spark the frenzy of adulation for companies like Data Domain, Exagrid, Sepaton and Quantum a few years ago. The software vendors were not left out either. Symantec, Commvault, and everyone else in town had data deduplication for backup and archiving.
It was no surprise that EMC battled NetApp and finally won the rights to acquire Data Domain for USD$2.4 billion in 2009. Today, in my opinion, the landscape of secondary data deduplication has pretty much settled and matured. Practically everyone has some sort of secondary data deduplication technology or solution in place.
But the talk of primary data deduplication hardly causes a ripple now compared with a few years ago, especially here in Malaysia. Yeah, the IT crowd is pretty fickle that way because most tend to follow the trend of the moment. Last year it was Cloud Computing and now the big buzzword is Big Data.
We are here to look at technologies to solve problems, folks, and primary data deduplication technology solutions should be considered in any IT planning. And it is our job as storage networking professionals to continue to advise customers about what is relevant to their business and what addresses their pain points.
I get a bit cheesed off that companies like EMC or HDS continue to spend their marketing dollars on hyping the trends of the moment rather than using some of their funds to promote good technologies, such as primary data deduplication, that solve real-life problems. The same goes for most IT magazines, publications and other communication media, which rarely give space to technologies that solve problems on the ground and just harp on hype, fuzz and buzz. It gets a bit too ordinary (and mundane) when they are trying too hard to be extraordinary because everyone is basically talking about the same freaking thing at the same time, over and over again. (Hmmm … I think I am speaking off topic now .. I better shut up!)
We are facing an avalanche of data. The other day, the CEO of Nexenta used the term “data tsunami”, but whatever term is used does not matter. There is too much data. Secondary data deduplication solved one part of the problem and now it’s time to talk about the other part, which is data in primary storage, hence primary data deduplication.
What is out there? Who’s doing what in terms of primary data deduplication?
NetApp has had their A-SIS (now NetApp Dedupe) for years and they are good in my books. They talk to customers about the benefits of deduplication on their FAS filers. (Side note: I am seeing more benefits of using data compression in primary storage but I am not going to go there in this entry). EMC has had primary data deduplication in their Celerra for years but they hardly talk about it. It’s on their VNX as well but again, nobody in EMC ever speaks about their primary deduplication feature.
I have always loved Ocarina Networks’ ECO technology, but Dell doesn’t seem to give a hoot about Ocarina since the acquisition in 2010. The technology surfaced a few months ago in the Dell DX6000G Storage Compression Node for its Object Storage Platform, but then again, all Dell talks about is its Fluid Data Architecture from the Compellent division. Hey Dell, you guys are so one-dimensional! Ocarina is a wonderful gem in their jewel case, and yet all their storage guys talk about is Compellent and EqualLogic.
Moving on … I ought to knock Oracle on the head too. ZFS has great data deduplication technology that is meant for primary data and a couple of years back, Greenbytes took that and made a solution out of it. I don’t follow what Greenbytes is doing nowadays but I do hope that the big wave of primary data deduplication will rise for companies such as Greenbytes to take off in a big way. No thanks to Oracle for ignoring another gem in ZFS and wasting their resources on pre-sales (in Malaysia) and partners (in Malaysia) that hardly know much about the immense power of ZFS.
But an unexpected source, Microsoft, could help trigger greater interest in primary data deduplication. I have just read that the next version of Windows Server OS will have primary data deduplication integrated into NTFS. The feature will be available in Windows Server 8 and the architectural view is shown below:
The primary data deduplication in NTFS will be a feature add-on for Windows Server users. It is implemented as a filter driver on a per volume basis, with each volume a complete, self describing unit. It is cluster aware, and fully crash consistent on all operations.
The technology is Microsoft’s own, built from scratch, and it will help position Hyper-V as a strong enterprise choice in its battle with VMware for the server virtualization space. Mind you, VMware already has a big, big lead and this is a do-or-die move for Microsoft just to keep Hyper-V playing catch-up. Otherwise, the gap between Microsoft and VMware in the server virtualization space will be even greater.
I don’t have the full details of this but I read that the NTFS primary deduplication chunk sizes will be between 32KB and 128KB and that it will be post-processing, meaning the data is written first and deduplicated afterwards.
With Microsoft introducing their technology soon, I hope primary data deduplication will get some deserving accolades because I think most companies are really not doing justice to the great technologies that they have in their jewel cases. And I hope Microsoft, with all its marketing savviness and adeptness, will do some justice to a technology that solves real-life data problems.
I bid you good luck – Primary Data Deduplication! You deserve better.
A very interesting report surfaced in front of me today. It is Information Week’s IT Pro ranking of Data Deduplication vendors, just made available a few weeks ago, and it is the overview of the dedupe market so far.
It surveyed over 400 IT professionals from various industries with companies ranging from less than 50 employees to over 10,000 employees and revenues of less than USD5 million to USD1 billion. Overall, it had a good mix of respondents. But the results were quite interesting.
It surveyed 2 segments:
- Overall performance – product reliability, product performance, acquisition costs, operations costs etc.
- Technical features – replication, VTL, encryption, iSCSI and FCoE support etc.
I promised last week that I would look deeper into HP StoreOnce technology and I did. As I mentioned in my previous blog, HP StoreOnce technology is now embedded in its D2D series of secondary, target backup devices that do the job with no fuss and no fancy bells and whistles.
Here’s the lineup of the present HP D2D solutions.
HP Malaysia has constantly reminded me that their D2D deduplication solution is much more price competitive than their competitors’ and this is something you, the readers, have to find out on your own. But I do believe that it is. Unfortunately they did not have the first mover’s advantage when Data Domain took the industry by storm in 2009, since HP StoreOnce was only launched with much fanfare last year in June 2010. Despite that, there is still plenty of room in the IT market to grow, especially among HP’s huge set of customers.
Without the first mover’s advantage, HP StoreOnce has to differentiate itself from existing competitors such as EMC Data Domain and Quantum. Labeling their deduplication technology as version 2.0 (whereas the competitors are still at “Version 1.0”?), HP StoreOnce banks on 3 key technologies. They are
- Sparse Indexing
- Intelligent Block Size Management
- Reduction in Disk Fragmentation
Out of these 3, sparse indexing is the most interesting but I will save the best for last. Let’s start with Intelligent Block Size Management.
HP StoreOnce uses a variable chunking method with a smaller granularity of 4K, managed intelligently, thus achieving a higher deduplication ratio than competitors that use either a fixed chunking method or a variable chunking method with larger block sizes in the range of 8K to 32K. HP Labs’ testing reveals that the space savings were significant when compared with others.
Below is a set of results for a PowerPoint presentation and you can see for yourself.
(NOTE: Please note that the savings/deduplication ratio can vary widely and can range from good to bad for different types of data. Video and image files are highly encoded. Seismic and geo-mapping files are highly compressed. It is very likely that most deduplication solutions cannot achieve a high reduction percentage with these types of files)
Point #2 is Reduction in Disk Fragmentation, which follows from the inherent benefits of Intelligent Block Size Management. Smaller chunks mean less wasted space, especially when the block size is 4K or lower. HP StoreOnce also uses an intelligent algorithm to place blocks that are perceived to be related close to one another. This “locality” helps make the retrieval and restore process faster and more efficient.
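To make the idea of variable chunking a little more concrete, here is a minimal sketch of content-defined chunking. This is not HP's algorithm, which I have not seen. The `GEAR` table, the size constants and the function name `variable_chunks` are all my own assumptions for illustration. The point is only that chunk boundaries are decided by the content itself rather than by fixed offsets.

```python
import random

# Hypothetical parameters, chosen for illustration only.
BOUNDARY_MODULUS = 4096  # once past MIN_SIZE, a cut is expected about every 4096 bytes
MIN_SIZE = 1024          # never cut a chunk shorter than this
MAX_SIZE = 16384         # always cut when a chunk grows this large

# Fixed table of random 32-bit values, one per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def variable_chunks(data: bytes):
    """Yield chunks whose boundaries depend on content, not on offsets."""
    mask = BOUNDARY_MODULUS - 1
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Rolling value over the most recent bytes; old bytes shift out.
        rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_SIZE and (rolling & mask) == 0
        if at_boundary or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # whatever is left becomes the last chunk
```

With fixed-size chunking, inserting a few bytes at the front of a file shifts every block and nothing matches any more. With a scheme like the one sketched here, the boundaries tend to fall back onto the same content after the insertion, so most chunks should still produce the same hashes.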
Sparse Indexing is where HP StoreOnce touts itself as a game changer. Today’s data is already as massive as a mountain, and it is getting bigger and growing faster. With “Version 1.0” deduplication, the hashes created are stored either in memory or on disk. Hashes are used to identify unique data blocks, but the avalanche of data (especially unstructured data) means that most deduplication solutions are generating more and more hashes, making hash lookups in most Version 1.0 solutions sluggish.
Sparse Indexing addresses this hash problem (by the way, HP StoreOnce uses the SHA-1 hash) by intelligently sampling a small number of chunks and creating a very fast index lookup mechanism that stays in the system’s memory all the time. As the engineers at HP Labs put it:
Instead of holding every index item in RAM ready for comparison, the HP team keeps just one in every hundred or so items in RAM and puts the rest onto a hard drive. Duplicate data almost always arrives in bursts. In other words, if one chunk of the arriving stream is a duplicate, it is very likely that many following chunks are duplicates. Sparse indexing takes advantage of this phenomenon by storing the sequence of hashes of the stored chunks next to each other on disk. As a result, a ‘hit’ in the sample RAM index can direct the system to an area of the disk where many duplicates are likely to be found.
Sparse Indexing is not unique in the industry, but the engineers at HP Labs have put their thinking hats on and applied it to improve the searching and lookup of hashes in the StoreOnce deduplication technology.
Further savings are also achieved when the deduped data is compressed with the LZ (Lempel-Ziv) compression method before it is stored on disk.
The HP StoreOnce technology was 100% concocted in the renowned HP Labs and, according to sources, this technology will indeed permeate across the whole HP StorageWorks (HP has since renamed it to HP Storage) line. With this strategy, HP hopes to address the “fragmented and complicated” (as quoted by HP) deduplication and data protection strategy across the enterprise. By “fragmented and complicated”, they mean that the deduplicated data constantly has to be rehydrated and deduped again as the data moves across different IT devices and functions.
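For those who like to see the idea in code, here is a rough sketch of how I understand the description from HP Labs. It is only a toy under my own assumptions: the class name `SparseIndexStore`, the `SAMPLE_RATE` value and the Python list standing in for on-disk hash sequences are all hypothetical, and this is certainly not the StoreOnce implementation.

```python
import hashlib

SAMPLE_RATE = 64  # assumed: keep roughly 1 in 64 hashes in RAM


class SparseIndexStore:
    def __init__(self):
        self.ram_index = {}     # sampled hash -> segment id (small, stays in memory)
        self.segments = []      # stand-in for hash sequences stored together on disk
        self.stored_chunks = 0  # chunks actually written

    @staticmethod
    def _is_sampled(digest: bytes) -> bool:
        # Deterministic sampling: the hash value itself decides.
        return int.from_bytes(digest[:4], "big") % SAMPLE_RATE == 0

    def ingest(self, chunks) -> int:
        """Store one incoming segment of chunks; return how many were new."""
        digests = [hashlib.sha1(c).digest() for c in chunks]
        # A hit in the small RAM index points at a segment on "disk".
        candidates = {self.ram_index[d] for d in digests
                      if self._is_sampled(d) and d in self.ram_index}
        known = set()
        for segment_id in candidates:
            known.update(self.segments[segment_id])  # one sequential read
        new = 0
        for d in digests:
            if d not in known:
                known.add(d)
                new += 1
        # Record the new segment and index only its sampled hashes.
        segment_id = len(self.segments)
        self.segments.append(digests)
        for d in digests:
            if self._is_sampled(d):
                self.ram_index[d] = segment_id
        self.stored_chunks += new
        return new
```

The trade-off in a scheme like this is that a duplicate can be missed if none of the sampled hashes in the incoming segment hit the RAM index, so a few duplicates may get stored again. In exchange, the index in memory stays tiny compared with holding every hash.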
In a perfect world, HP wants their StoreOnce technology to be like the diagram below.
However, one very interesting fact that I found was HP does not believe that primary storage deduplication is a good idea. They claim that it complicates the whole thing. Whether HP likes it or not, NetApp has been dishing out primary storage deduplication for several years now and you don’t see their customers unhappy with NetApp about this feature.
In one of the HP Business whitepapers I read, one of the takeaways was
I was like, “Whoa! What’s this?”. I felt bemused about what was mentioned in the whitepaper. After all the best claims of the HP StoreOnce technology, I can’t help but think that this could be a banana skin on the pavement for HP.
One of the things that peeved me at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.
I have always placed both deduplication and compression into a bucket I call “Data Reduction”. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both aim to reduce the capacity required to store a given amount of data, and this translates into benefits in storage management and on the network. With a smaller data set, less processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, though, because you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently than in its original state and size, making Disaster Recovery more feasible, and so on.
I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of a file, deduplication attempts to identify similarities (duplicates), store one copy of that object and have the others reference that single copy. The differentiation methods commonly used are hashing and delta differencing. In hashing, MD5 and SHA-1 are the popular hashing algorithms used, while in delta differencing, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.
There are many factors involved in deduplication: the types of data, the processing power required to do the deduplication task, the processing throughput and so on, all resulting in different deduplication ratios and different times required to complete the process. I am not going to delve into that as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.
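The hashing approach can be sketched in a few lines. This is a bare-bones illustration with fixed-size chunks and SHA-1, not any vendor's implementation, and the function names `dedupe` and `rehydrate` are made up for this example.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}   # hash -> the single stored copy of that chunk
    recipe = []  # ordered list of hashes needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha1(chunk).hexdigest()
        store.setdefault(key, chunk)  # duplicates are not stored again
        recipe.append(key)            # ...they only add a reference
    return store, recipe


def rehydrate(store, recipe) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store[key] for key in recipe)
```

Five 4K chunks where four of them are identical end up as two stored chunks plus a recipe of five references, and `rehydrate` gives back the original bytes.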
Meanwhile, compression (especially most commercial compression technology) is based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because, factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real life object.
The technique works like this.
- When being encoded, a bit/byte or a set of bytes is compared to a “dictionary”, which is a pool of “words” in a data structure maintained by the encoding technology
- If a match is found, the bit/byte or set of bytes is substituted by a “word”, usually a much shorter (hence smaller) representation of the bytes being encoded.
- As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. This is popularly known as the sliding window implementation.
- The end result is the data is highly encoded (heavily replaced) by “dictionary words” and of a much smaller size.
One of the most heavily implemented compression techniques is based on the theory and methodology introduced by Lempel-Ziv and further enhanced by the Lempel-Ziv-Welch trio. A very good explanation of the LZ method can be found here.
Both deduplication and compression have the same objective: to reduce the data size for more efficient storage. They approach it from different angles, but they are by no means mutually exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.
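The steps above can be sketched with a small Lempel-Ziv-Welch encoder and decoder. This is the textbook LZW variant, which grows an explicit dictionary as it goes instead of using a sliding window, and it outputs a plain list of integer codes rather than a packed bit stream. The function names are my own.

```python
def lzw_encode(data: bytes) -> list:
    """Encode bytes into a list of dictionary codes (textbook LZW)."""
    dictionary = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                    # keep extending the match
        else:
            codes.append(dictionary[current])      # emit the longest known "word"
            dictionary[candidate] = next_code      # learn a new, longer "word"
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes: list) -> bytes:
    """Rebuild the original bytes; the dictionary is reconstructed on the fly."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            entry = previous + previous[:1]        # word defined by this very step
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)
```

Feeding it the classic test string `TOBEORNOTTOBEORTOBEORNOT` (24 bytes) produces 16 codes, and the later codes are the learned "dictionary words" such as `TO`, `BE` and `TOB`. Note that the decoder never needs the dictionary shipped to it, because it rebuilds the same one from the codes.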
Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup) because you can find plenty of duplicates while in primary data sets (the data in production), deduplication and compression are deployed, either in a singular fashion or one after another. Deduplication is usually run as Step 1 and then Compression is run in Step 2.
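Here is a small sketch of that two-step order, dedupe first and compress second. I am using Python's `zlib` (DEFLATE, which belongs to the Lempel-Ziv family) purely as a stand-in for whatever compression a given product uses, and `reduce_data` and `restore` are hypothetical names of my own.

```python
import hashlib
import zlib


def reduce_data(data: bytes, chunk_size: int = 4096):
    """Step 1: dedupe fixed-size chunks. Step 2: compress each unique chunk."""
    store = {}        # hash -> compressed copy of a unique chunk
    recipe = []       # ordered hashes needed to rebuild the data
    unique_bytes = 0  # size after dedupe, before compression
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        key = hashlib.sha1(chunk).digest()
        if key not in store:
            unique_bytes += len(chunk)
            store[key] = zlib.compress(chunk)  # only unique chunks get compressed
        recipe.append(key)
    return store, recipe, unique_bytes


def restore(store, recipe) -> bytes:
    """Rehydrate: decompress each referenced chunk and stitch them together."""
    return b"".join(zlib.decompress(store[key]) for key in recipe)
```

Comparing `len(data)`, `unique_bytes` and the total size of the values in `store` shows how much each step contributes. Running the steps in this order also means the compressor only has to work on chunks that survived deduplication.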
So far, the only one that has impressed me for primary data reduction is Ocarina Networks, which uses a 3-step approach: dedupe, compress and then specialized compactors to reduce the data even more. I have seen Ocarina reduce Schlumberger Geoframe and Petrel seismic data by more than 50%. That was impressive!
With my bother laid to rest, I guess the saying “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology” but I am not complaining as much as I did before.
I had the privilege to attend HP’s D2D workshop yesterday, thanks to the invitation of my old friend, Mr. CC Chung. He is Malaysia’s HP StorageWorks Division Country Manager.
I am allowed to assess their D2D solution without fear or favour (I think) and the plush sling bag door gift has nothing to do with my assessment (what do you think? Ha, ha) So here goes.
I based my assessment on these criteria (something I picked up when I was mucking around with Data Domain for 3 months at MTech Security some years ago). The criteria are
- Hash-based chunking granularity vs Single Instance Store (à la EMC Centera)
- Inline or post-processing
- Source-based or target-based deduplication
- Forward or reverse referencing (though it has little significance – for now)
- Global or Local Deduplication
First of all, most people would ask about how well it dedupes and the technical guy’s answer would be “It depends …“. The sales would probably say “YMMV” (can anyone tell me what this acronym is for?). I believe the advertised rate is 20:1, pretty realistic because as we know in the deduplication world, the longer the data is retained, the higher the ratio can get. It also depends on the type of data to be deduped.
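To see why retention pushes the ratio up, here is a back-of-the-envelope model. It rests on assumptions I am making up for illustration: every backup is a full backup of the same size, only a fixed fraction of the data (`change_rate`) is new each time, and compression is ignored. Real data will not behave this neatly.

```python
def dedupe_ratio(backups: int, change_rate: float) -> float:
    """Logical data divided by stored data after `backups` full backups.

    Assumes equal-sized full backups where only `change_rate` of each
    backup is new compared with the previous one (illustrative model).
    """
    logical = backups                         # in units of one full backup
    stored = 1 + (backups - 1) * change_rate  # first full plus the changes
    return logical / stored


# Hypothetical 2% change between backups, kept for 7, 30 and 90 backups.
for kept in (7, 30, 90):
    print(kept, "backups:", round(dedupe_ratio(kept, 0.02), 1), ": 1")
```

Under these assumptions, 30 retained full backups with a 2% change rate work out to roughly 19:1, which is in the neighbourhood of the advertised 20:1. Raise the change rate, as you would expect with a busy database, and the ratio drops quickly.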
And of course, one of the participants (there are always skeptics) was bickering about how his customer was complaining that the deduplication ratio for a SQL database was lower than what was advertised. My take on this matter: both the customer and the reseller are at fault! The customer happily took what the sales/pre-sales guy said verbatim and expected fantastic results. The reseller did not know the D2D solution well enough and therefore screwed the customer by quoting numbers that were realistic, but for the wrong data type.
To me, as Justin (the HP Solution Architect) was presenting the HP D2D solution, I was ticking my check boxes for these criteria. And in my opinion, the HP D2D solution does the job. HP was telling the attendees that they will be surprised to know the end pricing for the D2D solution. I never got to know the figures and I never asked. But when compared to the king of the deduplication devices, Data Domain, it is likely to be lower.
So, here are the ticks for the HP D2D solution:
- In-line deduplication
- Target-based (of course)
- Hash-based chunking with variable length for deduplication granularity
- Local Deduplication
They have several models ranging from the entry-level 2500 series to the 4100 and the 4300 series. After that, HP has another disparate deduplication solution meant for the higher end market called the VLS, and it was not presented in the workshop.
The D2D can be both a VTL and a NAS target dedupe device and the browser-based management GUI was simple and uncluttered. What interested me most was the HP StoreOnce technology, although I did not dig deeper into it. I found a nice video (below) that shows a whiteboarding session for HP StoreOnce.
I promise to look deeper into it in a few days’ time. This week has been such a muck for me but overall, it has been turning out well at the end of the day.
Another thing that was interesting was its sparse indexing for the hashes, although there are some dedupe vendors already doing the same thing. But, if you know me, I will research this for the knowledge and benefit of all.
After the workshop, HP was so kind to give me an update about their Converged vision, how LeftHand, IBRIX, and 3PAR fit into their strategy and more importantly, their story to the storage market. I will speak more about this in the future. Of course, I will not reveal what’s in store for the future of the D2D solution, but all I can say is, I left the workshop feeling that the solution will do what it is supposed to, nothing more, nothing less. And I meant it in a good way.
I still reserve my opinions about HP because a lot of their storage business is still attached to the server side, but hopefully, with the P4000 and P6000 workshops coming up, my opinions may change a little.