Category Archives: Big Data
I am a bit surprised that primary storage deduplication has not taken off in a big way, unlike the times when the buzz of deduplication first came into being about 4 years ago.
When the first deduplication solutions first came out, it was particularly aimed at the backup data space. It is now more popularly known as secondary data deduplication, the technology has reduced the inefficiencies of backup and helped sparked the frenzy of adulation of companies like Data Domain, Exagrid, Sepaton and Quantum a few years ago. The software vendors were not left out either. Symantec, Commvault, and everyone else in town had data deduplication for backup and archiving.
It was no surprise that EMC battled NetApp and finally won the rights to acquire Data Domain for USD$2.4 billion in 2009. Today, in my opinion, the landscape of secondary data deduplication has pretty much settled and matured. Practically everyone has some sort of secondary data deduplication technology or solution in place.
But then the talk of primary data deduplication hardly cause a ripple when compared a few years ago, especially here in Malaysia. Yeah, the IT crowd is pretty fickle that way because most tend to follow the trend of the moment. Last year was Cloud Computing and now the big buzz word is Big Data.
We are here to look at technologies to solve problems, folks, and primary data deduplication technology solutions should be considered in any IT planning. And it is our job as storage networking professionals to continue to advise customers about what is relevant to their business and addressing their pain points.
I get a bit cheesed off that companies like EMC, or HDS continue to spend their marketing dollars on hyping the trends of the moment rather than using some of their funds to promote good technologies such as primary data deduplication that solve real life problems. The same goes for most IT magazines, publications and other communications mediums, rarely giving space to technologies that solves problems on the ground, and just harping on hypes, fuzz and buzz. It gets a bit too ordinary (and mundane) when they are trying too hard to be extraordinary because everyone is basically talking about the same freaking thing at the same time, over and over again. (Hmmm … I think I am speaking off topic now .. I better shut up!)
We are facing an avalanche of data. The other day, the CEO of Nexenta used the word “data tsunami” but whatever terms used do not matter. There is too much data. Secondary data deduplication solved one part of the problem and now it’s time to talk about the other part, which is data in primary storage, hence primary data deduplication.
What is out there? Who’s doing what in term of primary data deduplication?
NetApp has their A-SIS (now NetApp Dedupe) for years and they are good in my books. They talk to customers about the benefits of deduplication on their FAS filers. (Side note: I am seeing more benefits of using data compression in primary storage but I am not going to there in this entry). EMC has primary data deduplication in their Celerra years ago but they hardly talk much about it. It’s on their VNX as well but again, nobody in EMC ever speak about their primary deduplication feature.
I have always loved Ocarina Networks ECO technology and Dell don’t give much hoot about Ocarina since the acquisition in 2010. The technology surfaced a few months ago in Dell DX6000G Storage Compression Node for its Object Storage Platform, but then again, all Dell talks about is their Fluid Data Architecture from the Compellent division. Hey Dell, you guys are so one-dimensional! Ocarina is a wonderful gem in their jewel case, and yet all their storage guys talk about are Compellent and EqualLogic.
Moving on … I ought to knock Oracle on the head too. ZFS has great data deduplication technology that is meant for primary data and a couple of years back, Greenbytes took that and made a solution out of it. I don’t follow what Greenbytes is doing nowadays but I do hope that the big wave of primary data deduplication will rise for companies such as Greenbytes to take off in a big way. No thanks to Oracle for ignoring another gem in ZFS and wasting their resources on pre-sales (in Malaysia) and partners (in Malaysia) that hardly know much about the immense power of ZFS.
But an unexpected source coming from Microsoft could help trigger greater interest in primary data deduplication. I have just read that the next version of Windows Server OS will have primary data deduplication integrated into NTFS. The feature will be available in Windows 8 and the architectural view is shown below:
The primary data deduplication in NTFS will be a feature add-on for Windows Server users. It is implemented as a filter driver on a per volume basis, with each volume a complete, self describing unit. It is cluster aware, and fully crash consistent on all operations.
The technology is Microsoft’s own technology, built from scratch and will be working to position Hyper-V as an strong enterprise choice in its battle for the server virtualization space with VMware. Mind you, VMware already has a big, big lead and this is just something that Microsoft must do-or-die to keep Hyper-V playing catch-up. Otherwise, the gap between Microsoft and VMware in the server virtualization space will be even greater.
I don’t have the full details of this but I read that the NTFS primary deduplication chunk sizes will be between 32KB to 128KB and it will be post-processing.
With Microsoft introducing their technology soon, I hope primary data deduplication will get some deserving accolades because I think most companies are really not doing justice to the great technologies that they have in their jewel cases. And I hope Microsoft, with all its marketing savviness and adeptness, will do some justice to a technology that solves real life’s data problems.
I bid you good luck – Primary Data Deduplication! You deserved better.
My research on file systems brought me to an very interesting piece of article. It is titled “Dynamo: Amazon’s Highly Available Key-Value Store” dated 2007.
Yes, this is an internal storage systems designed and developed in Amazon to scale and support Amazon Web Services (AWS). It is a very complex piece of technology and the paper is highly technical (not for the faint of heart). And of all places, Amazon is probably the last place you think you would find such smart technology, but it’s true. AWS engineers are slowly revealing the many of their innovations (think Amazon Silk browser technology).
And it appears that many of the latest cloud-based computing and services companies such as Amazon, Google and many others have been developing new methods of storing data objects. These methods are very different from the traditional methods of storing data, and many are no longer adopting the relational database model (RDBMS) to scale their business.
The traditional 3-tier architecture often adopted by web-based (before the advent of “cloud”), is evolving. As shown in the diagram below:
the foundation tier is usually a relational database (or a distributed relational database), communicating with the back-end storage (usually a SAN).
All that is changing because the relational database model is not keeping up with the tremendous pace of the proliferation of web-based and cloud-based objects or unstructured data. As explained by Alex Iskold, a writer of ReadWriteWeb, there are scalability issues with the conventional relational database.
Before I get to the scalability issues mentioned in the above diagram, let me set the floor for discussion.
For theoretical schoolers of relational database, the term ACID defines and guarantees the transactional reliability of relational databases. ACID stands for Atomicity, Consistency, Isolation and Durability. According to Wikipedia, “transactions provide an “all-or-nothing” proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.”
ACID has been the cornerstone of relational database from the very beginning. But as the demands of greater scalability and greater distribution of data, all 4 components of ACID – Atomicity, Consistency, Isolation, Durability – can no longer hold true. Hence, the CAP Theorem.
CAP Theorem (aka Brewer’s Theorem) stands for Consistency, Availability and Partition Tolerance. In the ACM (Association of Computing Machinery) conference in 2000, Eric Brewer of University of California, Berkeley delivered the theorem. It states that it is impossible for a distributed computer system (or a database system) to simultaneously guarantee all 3 components – Consistency, Availability and Partition Tolerance.
Therefore, as the database systems become more and more distributed in cyberspace, the ACID theorem begins to break down. All 4 components of ACID cannot be guaranteed simultaneously anymore as the database systems begin to become more and more distributed.
So when we get back to the diagram, both the concepts on left and right – Master/Slave OR Multiple Peers – will put a tremendous strain on the single, non-distributed relational database.
New data models are surfacing to handling the very distributed data sets. Distributed object-based ”file systems” and NoSQL type of databases are some of the unconventional data storage “systems” that are beginning to surface as viable alternatives to the relational database method in cyberspace. And one of them is the Amazon Dynamo Storage System. (ADSS)
ADSS is a highly available, Amazon-proprietary key-value distributed data store. ADSS has both the properties of distributed hash table and a database and it is used internally to power various Cloud Services in Amazon Web Services (AWS).
It behaves like a relational database where it stores data objects to be retrieved. However, the data objects are not stored in a table format of a conventional relational database. Instead, the data is stored in a distributed hash table and data content or value is retrieved with a key, hence a key-value data model.
- Physical nodes are thought of as identical and organized into a ring.
- Virtual nodes are created by the system and mapped onto physical nodes, so that hardware can be swapped for maintenance and failure.
- The partitioning algorithm is one of the most complicated pieces of the system, it specifies which nodes will store a given object.
- The partitioning mechanism automatically scales as nodes enter and leave the system.
- Every object is asynchronously replicated to N nodes.
- The updates to the system occur asynchronously and may result in multiple copies of the object in the system with slightly different states.
- The discrepancies in the system are reconciled after a period of time, ensuring eventual consistency.
- Any node in the system can be issued a put or get request for any key
The Dynamo architecture addresses the CAP Theorem well. It is highly available, where nodes, either physical or virtual, can be easily swapped without affected the storage services. It is also high performance, nodes (again physical or virtual) can be added to boost the performance. The high performance and highly available components addresses the “A” piece of CAP.
Its distributed nature also allows it to scale to billions and billions of data objects and hence meets the “P” requirement of CAP. The Partitioning Tolerance is definitely there.
However, as stated by CAP Theorem, you can’t have all 3 happening at the same time. Therefore, the “C” or Consistency piece of CAP has to be compromised. That is why Dynamo has been labeled an “eventually consistency” storage system.
As data is stored into ADSS, the changes of the data is propogated and will be asynchronously replicated to other nodes in the system, eventually making all the data objects and its value consistent. However, given the speed of things in cyberspace and the nature of most Cloud Computing services, the consistency piece could be difficult to accomplish and that is OK because in most of the transactions that are distributed, inconsistency is acceptable.
So that’s a bit about the Amazon Dynamo. Alas, we may never get our grubby hands on this piece of cool data storage and management technology, but knowing that Dynamo is powering AWS and its business is an eye-opener for us into the realm of a new technology evolution.
By now, I believe most of you in the storage networking world would have heard of Hadoop. Hadoop was created by Doug Cutting, while he and his team was working on an open source web search engine called Nutch. The easily recognized little yellow elephant, Hadoop, was Doug Cutting’s son toy, which he made as Hadoop’s mascot. Pretty cool!
And today, Hadoop has become THE platform for Big Data applications. Why?
As I have mentioned before, everything that we do or don’t do, generates data, either as a direct product or in-direct product. I am blogging right now and I am creating data. I was in Singapore the whole of this week and everywhere I go in the MRT stations, I am being watched by the video cameras they have at the station. A new friend in class said that Singapore is the second most “watched” city after London, where there are video cameras mounted everywhere, either discreetly or indiscreetly. And that’s just video data. And there’s plenty of other human activities that generate tons and tons of data.
IDC Digital Universe Report for 2011 said that we have generated 1.8ZB (zettabyte) of data this year alone. I mentioned in my previous blog that this is a gold mine and companies are scrambling to tap on massive amount of data. Extracting valuable information to anticipate the next trend or predict that next evolution in human preference is akin to the Gold Rush in the wild, wild west in the late 19th century. Folks, Big Data is going to be this generation’s “Digital Gold Rush”.
Sieving, filtering and processing gazillions of data (more unstructured than structured) will not work in defined, well-formatted relational databases. The data model of relational databases will simply break down. And of course, there are different schools of thoughts of different data models, but the Hadoop model seems to be gaining momentum and mind share of data scientists. That is because of Hadoop’s capability to deal with massive unstructured data, processing it and producing results in a small amount of time.
One way to process the pool of massive data is parallel programming. In parallel programming, multi-threading is commonly deployed to achieve the performance and effects of programming. But implementing multi-threading in parallel programming is difficult. Developers often has to deal with LWP (lightweight processes), semaphores, shared memory, mutex (mutually exclusive) locking and so on. Hence this style of programming works with different states on shared data, often resulting in different results in different states, even when using the same programming expression.
Hadoop belongs to another school of programming known as functional programming, where the different states on shared data concept is removed. With that in mind, the dependency on different states is also removed, resulting in a much easier and simpler parallel programming implementation. Hadoop borrows ideas from the MapReduce software framework made well known by Google and the Google File System.
Before, we get to know Hadoop, we must know MapReduce. MapReduce is a framework which allows very large data sets to be processed with a very large set of computer nodes in a cluster. Typically the computational processing is executed in a distributed fashion, spread across many computer nodes and final results are consolidated from the sub-results of these distributed processing nodes.
According to Wikipedia, the 2 key functions of Map Reduce are map() and reduce(). That’s pretty obvious. The extract below was taken from the Wikipedia definition, and explains both functions very well.
“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
The diagram below probably can simplify the concept of MapReduce to the readers.
Hadoop is one of the open-source implementations of MapReduce. It is one of the projects of Apache Foundation, and the project has sparked a brand-new niche of data search, data management and data science. The diagram below will allow our readers to juxtapose MapReduce and Hadoop, and comparing them in the simplest fashion.
Hadoop primary development platform is Java. Hadoop’s architecture consists mainly of 2 components – Hadoop Common and a Hadoop-compatible file system, as shown in the diagram below.
Hadoop MapReduce layer above is the file/object access interface to the Hadoop-compatible file system below. HDFS is Hadoop Distributed File System is just one of a few Hadoop-compatible file systems. Other file systems include:
- Amazon S3 File System as part of the Amazon EC2 Infrastructure-as-a-Service (IaaS) cloud platform
- CloudStore – a similar Hadoop-like implementation using C++ and also inspired by Google File System
- FTP file systems
- HTTP and HTTPS read-only file systems
- Any file systems accessible with the file:// URL nomenclature
But the main engine of Hadoop is in the MapReduce layer. The 2 core components in this layer is JobTracker and TaskTracker. Both has their own individual roles to play and collectively, they are key cogs in the Hadoop distributed data processing model.
Below are extract I picked up from Wikipedia.
JobTracker submits MapReduce jobs to client applications. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser. Jetty is a Java-based HTTP server, among other things
JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
By default Hadoop uses first-in, first-out (FIFO), and optional 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS (Quality of Service) for production jobs. The fair scheduler has three basic concepts.
- Jobs are grouped into Pools.
- Each pool is assigned a guaranteed minimum share.
- Excess capacity is split between jobs.
By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.
- Jobs are submitted into queues.
- Queues are allocated a fraction of the total resource capacity.
- Free resources are allocated to queues beyond their total capacity.
- Within a queue a job with a high level of priority will have access to the queue’s resources.
I took most the extract below from Wikipedia, and I don’t claim to be a knowledgeable person on Hadoop. All the credits go to Wikipedia editors to put Hadoop in layman terms.
Hadoop has certainly won the hearts of the new digital gold rush, Big Data and is slowly becoming a force to be reckoned with among data scientists. Hadoop implementations are powering new frontiers in processing and mining the ever growing data capacity, giving solution providers a simple programming methodology and data model to gain more insights into the vast seas of data and information.
Hadoop has many fans, and slowly becoming the data platform for large companies such as Yahoo!, Facebook, IBM, Amazon, Apple, eBay and many more. Facebook even claims to have the largest Hadoop clusters in the world, growing to 30PB in July of 2011.
This little yellow elephant is going places and one to watch out for.
Big data is Big Business these days. IDC predicts that between 2012 and 2020, the spending on big data solution will account for 80% of IT spending and growing at 18% per annum. EMC predicts that the big data is worth USD$70 billion! That’s a very huge market.
We generate data, and plenty of it. In the IDC Digital Universe Report for 2011 (sponsored by EMC), approximately 1.8 zettabytes of data will be created and replicated in 2011. How much is 1 zettabyte, you say? Look at the conversion below:
1 zettabyte = 1 billion terabytes
That’s right, folks. 1 billion terabytes!
And this “mountain” of data and information is a Goldmine of goldmines, and companies around the world are scrambling to tap on this treasure chest. According to Wikibon, big data has the following characteristics:
- Very large, distributed aggregations of loosely structured data – often incomplete and inaccessible
- Petabytes/exabytes of data
- Millions/billions of people
- Billions/trillions of records
- Loosely-structured and often distributed data
- Flat schemas with few complex interrelationships
- Often involving time-stamped events
- Often made up of incomplete data
- Often including connections between data elements that must be probabilistically inferred
But what is relevant is not the definition of big data, but rather what you get from the mountain of information generated. The ability to “mine” the information from big data, now popularly known as Big Data Analytics, has sparked a new field within the data storage and data management industry. This is called Data Science. And companies and enterprises that are able to effectively use the new data from Big Data will win big in the next decade. Activities such as
- Business decision making
- Gain competitive advantage
- Drive productivity growth in relevant industry segments
- Understanding consumer and business behavioural patterns
- Knowing buying decisions and business cycles
- Yielding new innovation
- Reveal customer insights
- much, much more
will drive a whole new paradigm that shall be known as Data Science.
And EMC, having purchased Greenplum more than a year ago, has started their Data Computing Products Division immediately after the Greenplum acquisition. And in October of 2010, EMC announced their Greenplum Data Computing Appliance with some impressive numbers. Using 2 configurations of their appliance, noted below:
Below are 2 tables of the Greenplum performance benchmarks:
That’s what these big data appliance is able. The ability to load billions of either structured or unstructured files or objects in mere minutes is what drives the massive adoption of Big Data.
And a few days, EMC announced their Greenplum Unified Analytics Platform (UAP) which comprises of 3 Greenplum components:
- A relational database for structured data
- An enterprise Hadoop engine for the analysis and processing of unstructured data
- Chorus 2.0, which is a social media collaboration tool for data scientists
The diagram below summarizes the UAP solution:
Greenplum is certainly ahead of the curve. Competitors like IBM Netezza, Teradata and Oracle Exalogic are racing to be ahead but Greenplum is one of the early adopters of a single platform for big data. Having a consolidation platform will not only reduce costs (integration of all big data components usually incurs high professional services’ fees) but will also reduce the barrier to entry to big data, thus further accelerating the adoption of big data.
Big Data is still very much at its infancy and EMC is pushing to establish its footprint in this space. EMC Education has already announce the general availability of courses related to big data last week and also the EMC Data Science Architect (EMC DSA) certification. Greenplum is enjoying the early sweetness of the Big Data game and there will be more to come. I am certainly looking forward to share more on this plum (pun intended ) of the data storage and data management excitement.
IBM claims that we are responsible of for creating 2.5 quintillion bytes of data every day. How much is 1 quintilion?
According to the web,
1 quintillion = 1,000,000,000,000,000,000
After billion, it is trillion, then quadrillion, and then quintillion. That’s what 1 quintillion is, with 18 zeroes!
These data comes from everything from social networking updates, meteorology (weather reports), remote sensing maps (Google Maps, GPS, Geographical Information Systems), photos (Flickr), videos (YouTube), Internet search (Google) and so on. The big data terminology, according to Wikipedia, is data that are too large to be handled and processed by conventional data management tools. This presents a new set of difficulties when it comes to collected these data, storing them and sharing them. Indexing and searching big data would require special technologies to be able to mine and extract valuable information from big data datasets, within an acceptable period of time.
According to Wiki, “Technologies being applied to big data include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.” That is why EMC has paid big money to acquire GreenPlum and IBM acquired Netezza. Traditional data warehousing players such Teradata, Oracle and Ingres are in the picture as well, setting a collision course between the storage and infrastructure companies and the data warehousing solutions companies.
The 2010 Gartner Magic Quadrant has seen non-traditional players such as IBM/Netezza and EMC/Greenplum, in its leaders quadrant.
And the key word that is already on everyone’s lips is “ANALYTICS“.
The ability to extract valuable information that helps determines what the next future trend is and personalized profiling will be something that may already arrived as companies are clamouring to get more and more out of our personalities so that they can sell you more of their wares.
Meteorological organizations are using big data analytics to find out about weather patterns and climate change. Space exploration becomes more acute and precise from the tons and tons of data collected from space explorations. Big data analytics are also helping pharmaceutical companies develop new biological and pharmaceutical breakthroughs. And the list goes on.
I am a new stranger into big data and I do not proclaim to know a lot. But terms such as scale-out NAS, distributed file systems, grid computing, massively parallel processing are certainly bringing the data storage world into a new frontier, and it is something we as storage professionals have to adapt to. I am eager to learn and know more about big data. It is a big headache but change is inevitable.