The next all-Flash product in my review list is SolidFire. Immediately, the niche that SolidFire is trying to carve out is obvious. It’s not for regular commercial customers. It is meant for Cloud Service Providers, because the features and the technology that they have innovated are quite cloud-intended.
Are they solid (pun intended)? Well, if they have managed to secure a Series B funding of USD$25 million (total of USD$37 million overall) from VCs such as NEA and Valhalla, and also angel investors such as Frank Slootman (ex-Data Domain CEO) and Greg Papadopoulus(ex-Sun Microsystems CTO), then obviously there is something more than meets the eye.
The one thing I got while looking up SolidFire is there is probably a lot of technology and innovation behind their Nodes and their Element OS. They hold their cards very, very close to their chest, and I couldn’t not get much good technology related information from their website or in Google. But here’s a look of how the SolidFire is like:
The SolidFire only has one product model, and that is the 1U SF3010. The SF3010 has 10 x 2.5″ 300GB SSDs giving it a raw total of 3TB per 1U. The minimum configuration is 3 nodes, and it scales to 100 nodes. The reason for starting with 3 nodes is of course, for redundancy. Each SF3010 node has 8GB NVRAM and 72GB RAM and sports 2 x 10GbE ports for iSCSI connectivity, especially when the core engineering talents were from LeftHand Networks. LeftHand Networks product is now HP P4000. There is no Fibre Channel or NAS front end to the applications.
Each node runs 2 x Intel Xeon 2.4GHz 6-core CPUs. The 1U height is important to the cloud provider, as the price of floor space is an important consideration.
Aside from the SF3010 storage nodes, the other important ingredient is their SolidFire Element OS.
Cloud storage needs to be available. The SolidFire Helix Self-Healing data protection is a feature that is capable of handling multiple concurrent failures across all levels of their storage. Data blocks are replicated randomly but intelligently across all storage nodes to ensure that the failure or disruption of access to a particular data block is circumvented with another copy of the data block somewhere else within the cluster. The idea is not new, but effective because solutions such as EMC Centera and IBM XIV employ this idea in their data availability. But still, the ability for self-healing ensures a very highly available storage where data is always available.
To address the efficiency of storage, having 3TB raw in the SF3010 is definitely not sufficient. Therefore, the Element OS always have thin provision, real-time compression and in-line deduplication turned on. These features cannot be turned off and operate at a fine-grained 4K blocks. Also important is the intelligence to reclaim of zeroed blocks, no-reservation, and no data movement in these innovations. This means that there will be no I/O impact, as claimed by SolidFire.
But the one feature that differentiates SolidFire when targeting storage for Cloud Service Providers is their guaranteed volume level Quality of Service (QOS). This is important and SolidFire has positioned their QOS settings into an advantage. As best practice, Cloud Service Providers should always leverage the QOS functionality to improve their storage utilization
The QOS has:
- Minimum IOPS – Lower IOPS means lower performance priority (makes good sense)
- Maximum IOPS
- Burst IOPS – for those performance spikes moments
- Maximum and Burst MB/sec
Together with the Storage Nodes and the Element OS, the whole package is aimed towards a more significant storage platform for Cloud Service Providers(CSPs). Storage has always been a tricky component in Cloud Computing (despite what all the storage vendors might claim), but SolidFire touts that their solution focuses on what matters most for CSPs.
CSPs would want to maximize their investment without losing their edge in the cloud offerings to their customers. SolidFire lists their benefits in these 3 areas:
The edge in cloud storage is definitely solid for SolidFire. Their ability to leverage on their position and steering away from other all-Flash vendors’ battlezone could all make sense, as they aim to gain market share in the Cloud Service Provider space. I only wish they can share more about their technology online.
Fortunately, I found a video by SolidFire’s CEO, Dave Wright which gives a great insight about SolidFire’s technology. Have a look (it’s almost 2 hour long):
[2 hours later]: Phew, I just finished the video above and the technology is solid. Just to summarize,
- No RAID (which is a Godsend for service providers)
- Aiming for USD5.00 or less per Gigabyte (a good number!)
- General availability in Q1 2012
Lots of confidence about the superiority of their technology, as portrayed by their CEO, Dave Wright.
Solid? Yes, Solid!
Nowadays, the capacity of the hard disk drives (HDDs) are really big. 3TB is out and 4TB is in the horizon. What’s next?
For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient and with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.
If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, it will also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID it with RAID-0 (not a good idea but to save costs, most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about lack of storage experience.
And RAID isn’t really keeping up with the tremendous growth of HDD’s capacity as well. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue provide the LUN or volume reliability and data availability because it just takes too damn long to rebuild the volume after the failure of a disk.
Back in the days where HDDs were less than 500GB, RAID-5 would still hold up but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.
Experts have been speaking about parity-declustering, but that’s something that a few vendors have right now. Panasas, founded by one of forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Cargenie-Mellon University’s Parallel Data Lab (PDL) presented a paper about parity-declustering more than 10 years ago.
Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.
Here are a few ways you can consider to put your storage on a diet.
- Thin Provisioning
- Storage Tiering
- Tapes and SSDs
I promised last week I will look deeper into HP StoreOnce technology and I did. As I mentioned in my previous blog, HP StoreOnce technology now embedded in its D2D series of secondary, target backup devices that does the job with no fuss and no fancy bells and whistles.
Here’s the lineup of the present HP D2D solutions.
HP Malaysia has constantly reminded me that their D2D deduplication solution is much more price competitive than their competitors and this is something you, the readers, have to find out on your own. But I do believe that they are. Unfortunately they did not have the first mover’s advantage when Data Domain took the industry by storm in 2009, since HP StoreOnce was only launched with much fanfare last year in June 2010. Despite that, there still plenty of room in the IT market to grow, especially in HP’s huge set of customers.
Without the first movers advantage, HP StoreOnce has to differentiate itself from the existing competitors such as EMC Data Domain and Quantum. Labeling their deduplication technology as version 2.0 (whereas the competitors are still at “Version 1.0”?), HP StoreOnce banks on 3 key technologies. They are
- Sparse Indexing
- Intelligent Block Size Management
- Reduction in Disk Fragmentation
Out of these 3, sparse indexing is the most interesting but I will save the best from last. Let’s start with Intelligent Block Size Management.
HP StoreOnce uses a variable chunking method with a smaller granularity of 4K in size and this is managed intelligently, thus achieving a higher deduplication ratio compared to its competitors which either uses a fixed chunking method or with a variable chunking method of larger block sizes in the range of 8K to 32K. The HP Lab’s testing reveals that the space savings was significant when compared with others.
Below are a set of results for a PowerPoint presentation and you can see for yourself.
(NOTE: Please note that the savings/deduplication ratio can be very different and can range from good to bad for different types of data. Video and images files are highly encoded. Seismic and geo-mapping files are highly compressed. It is very likely that most deduplication solutions cannot achieve a high percentage with these types of files)
Point #2 talks about Reduction in Disk Fragmentation. The inherent benefits from Intelligent Block Size Management brings about the Reduction in Disk Fragmentation. The smaller chunks means lesser space wastage, especially when the block size is 4K or lower. HP StoreOnce also uses an intelligent algorithm to place the blocks that are perceived to be related close to one another. Hence this “locality” presence helps and the retrieval and restore process will be faster and more efficient.
Sparse Indexing is where HP StoreOnce touts to be a game changer. Today’s data is already as massive as a mountain, and it’s going to get bigger and growing faster. Using “Version 1.0” type of deduplication, the hashes created are stored in either memory or on disks. However, the massive data sets (especially unstructured data) are already producing massive amounts of hashes. Hashes are used to identify unique data blocks but the avalanche of unstructured data means that most deduplication solutions are generating more and more hashes, making most Version 1.0s hashes sluggish and difficult to retrieve.
Sparse Indexing addresses this hash problem (by the way, HP StoreOnce uses SHA-1 hash) by intelligently sampling a small chunks and creating a very fast index lookup mechanism that stays in the system’s memory all the time. As the engineers at HP Labs put it
Instead of holding every index item in RAM ready for comparison, the HP team keeps just one in every hundred or so items in RAM and puts the rest onto a hard drive. Duplicate data almost always arrives in bursts. In other words, if one chunk of the arriving stream is a duplicate, it is very likely that many following chunks are duplicates. Sparse indexing takes advantage of this phenomenon by storing the sequence of hashes of the stored chunks next to each other on disk. As a result, a ‘hit’ in the sample RAM index can direct the system to an area of the disk where many duplicates are likely to be found.
Sparse Indexing is not unique in the industry, but the engineers at HP Labs have put their thinking hats on and applied it to improve the search and looking up of the hashes in the StoreOnce deduplication technology.
Further savings are also achieved when the deduped data is compressed with the LZ (Lempel-Ziv) compression method before it is stored into the disks.
The HP StoreOnce technology is 100% fully concocted in the renown HP Labs and according to sources, this technology will indeed permeate across all HP StorageWorks (HP has since renamed it to HP Storage) line. With this strategy, HP hopes to address the “fragmented and complicated” (as quoted by HP) deduplication and data protection strategy across the enterprise. By “fragmented and complicated”, they mean that the deduplicated data constant has to be rehydrated and deduped again as the data moves across different IT devices and functions.
In a perfect world, HP wants their StoreOnce technology to be like the diagram below.
However, one very interesting fact that I found was HP does not believe that primary storage deduplication is a good idea. They claim that it complicates the whole thing. Whether HP likes it or not, NetApp has been dishing out primary storage deduplication for several years now and you don’t see their customers unhappy with NetApp about this feature.
In one of the HP Business whitepapers I read, one of the takeaways was
I was like, “Whoa! What’s this?”. I felt bemused about what was mentioned in the whitepaper. After all the best claims of the HP StoreOnce technology, I can’t help but to think that this could be a banana skin on the pavement for HP.
One of the things that peeved at the HP D2D Workshop a few days ago was this heading in the HP PowerPoint slides – “Deduplication – a fancy form of Compression”. Somehow it bothered me.
I have always placed both deduplication and compression into a bucket I called “Data Reduction“. Some vendors might call it Storage Economics, spinning it in a cooler manner. Either way, both attempt and succeed to reduce the capacity required to store the amount of data and this translates into benefits in storage management and network. With a smaller data set, lesser processing and capacity are required, likely speeding up the performance of the storage array. At the same time, the primary data backup set (you know, the data that you back up every night?) becomes smaller, making backup and restore faster (not necessarily, but you have to rehydrate the data from its reduced state). Another obvious benefit is the ability to transfer the smaller data set over the network more efficiently, compared to its original state and size, making Disaster Recovery more possible and so on.
I have always known that deduplication works with data objects using a differential method. Whether the data object is a file or a chunk of the file, deduplication attempts to differentiate similarities (duplicates), and store one copy of that object and have others referencing to the single object. The differentiation methods commonly used are hashing and delta differential. In hashing, MD-5 and SHA-1 are the popular hashing algorithms used, while in delta differentials, the data objects are compared (usually in a scrutinizing manner) to find the differences. The duplicates or similarities are discarded.
There are many factors involved in deduplication. It could be the types of data, the processing power required to do the deduplication task, and throughput of processing and so on and resulting in the different deduplication ratio and time required to complete the process. I am not going to delve into that as there are many vendors who will be able to articulate this, such as EMC Data Domain, HP D2D/VLS with its StoreOnce technology, Exagrid, Sepaton, Dell Ocarina Networks, NetApp, EMC Centera, CommVault Simpana, Symantec PureDisk, Symantec NetBackup, EMC Avamar and many more.
Meanwhile, compression (especially most commercial compression technology) are based on dictionary coding, a lossless data reduction algorithm. Note that I am using the term encoding rather than compression because factually, encoding is the right word. You can’t squeeze the data into a smaller size like you do with a real life object.
The technique works like this.
- When being encoded, a bit/byte or a set of bytes are compared to a “dictionary” which is a pool of “words” in a data structure maintained by the encoding technology
- If a match is found, the bit/byte or set of bytes is substituted by an “word”, usually a much shorter (hence smaller size) representation form of the bytes being encoded.
- As the encoding process continues, more “dictionary words” are built into the “dictionary” based on the bytes already encoded. This is popularly known as the sliding window implementation.
- The end result is the data is highly encoded (heavily replaced) by “dictionary words” and of a much smaller size.
One of the heavily implemented compression technique is based on the theory and methodology introduced by Lempel-Ziv and further enhanced by the Lempel-Ziv-Welch trio. A very good explanation of LZ method can be found here.
Both deduplication and compression have the same objective – that is to reduce the data size for more efficient storage. But both approach it from a different angle but they are by no means, exclusive. Both can be used to complement each other and further reduce the capacity required to store the data.
Deduplication usually works with larger data objects (chunks, files etc) while compression works harder at the lower level (byte range level). Deduplication is heavily deployed in secondary data sets (or backup) because you can find plenty of duplicates while in primary data sets (the data in production), deduplication and compression are deployed, either in a singular fashion or one after another. Deduplication is usually run as Step 1 and then Compression is run in Step 2.
So far, the only one that has impressed me for the primary data reduction is Ocarina Networks, which uses a 3 step approach in dedupe, compress and using specialized compactors to reduce the data even more. I have seen the ability of Ocarina reducing Schlumberger Geoframe and Petrel seismic data to more than 50%. That was impressive!
Having my bothered state satisfied, I guess having the say of “Deduplication – a fancy form of Compression” is someone else’s cup of tea. I would rather say “Deduplication – a fancy form of Data Reduction Technology” but I am not complaining as much I did before.
I had the privilege to attend HP’s D2D workshop yesterday, thanks to the invitation of my old friend, Mr. CC Chung. He is Malaysia’s HP StorageWorks Division Country Manager
I am allowed to assess their D2D solution without fear or favour (I think) and the plush sling bag door gift has nothing to do with my assessment (what do you think? Ha, ha) So here goes.
I based my assessment from these criteria (something I picked up when I was mucking around with Data Domain for 3 months at MTech Security some years ago). The criteria are
- Hash-based chunking granularity vs Single Instance Store (ala-EMC Centera)
- Inline or post-processing
- Source-based or target-based deduplication
- Forward or reverse referencing (though it has little significance – for now)
- Global or Local Deduplication
First of all, most people would ask about how well it dedupes and the technical guy’s answer would be “It depends …“. The sales would probably say “YMMV” (can anyone tell me what this acronym is for?). I believe the advertised rate is 20:1, pretty realistic because as we know in the deduplication world, the longer the data is retained, the higher the ratio can get. It also depends on the type of data to be deduped.
And of course, one of the participants (there are always skeptics) was bickering about how his customer was complaining that the deduplication ratio for a SQL database was lower than what was advertised. My take on this matter – Both the customer and the reseller are at fault! The customer happily took what the sales/pre-sales guy said in verbatim and expected fantastic results. The reseller was ill-equipped to know the D2D solution well and therefore, screwed the customer with realistic numbers for the wrong data type.
To me, as Justin (the HP Solution Architect) was presenting the HP D2D solution, I was ticking my check boxes for these criteria. And in my opinion, the HP D2D solution does the job. HP was telling the attendees that they will be surprised to know the end pricing for the D2D solution. I never got to know the figures and I never asked. But when compared to the king of the deduplication devices, Data Domain, it is likely to be lower.
So, here are the ticks to the HP D2D solution
- In-line deduplication
- Target-based (of course)
- Hash-based chunking with variable length for deduplication granularity
- Local Deduplication
They have several models ranging from the entry-level 2500 series to the 4100 and the 4300 series. After that, HP has another disparate deduplication solution meant for the higher end market called the VLS, and it was not presented in the workshop.
The D2D can be both a VTL and a NAS target dedupe device and the browser-based management GUI was simple and uncluttered. But what interested me was the HP StoreOnce technology, but I did not dig deeper into it. I found a nice video (below) to show a whiteboarding session for HP StoreOnce.
I promised to look deeper into it in a few days time. This week has been such a muck for me but overall, it has been turning up well at the end of the day.
Another thing that was interesting was its sparse indexing for the hashes and there were some dedupe vendors already doing the same thing. But, if you know me, I will research this for knowledge and benefit of all.
After the workshop, HP was so kind to give me an update about their Converged vision, how LeftHand, IBRIX, and 3PAR fit into their strategy and more importantly, their story to the storage market. I will speak more about this in the future. Of course, I will not reveal what’s in store for the future of the D2D solution, but all I can say is, I left the workshop feeling that the solution will do what it is supposed to, nothing more, nothing less. And I meant it in a good way.
I still reserve my opinions about HP because a lot of their storage business are still attached to the server side but hopefully with the upcoming P4000 and P6000 workshops coming up, my opinions may change a little.