Category Archives: Big Data

Primary Dedupe where are you?

Posted by cfheoh

I am a bit surprised that primary storage deduplication has not taken off in a big way, unlike the times when the buzz of deduplication first came into being about 4 years ago.

When the first deduplication solutions came out, they were aimed squarely at the backup data space. Now more popularly known as secondary data deduplication, the technology has reduced the inefficiencies of backup and helped spark the frenzy of adulation for companies like Data Domain, Exagrid, Sepaton and Quantum a few years ago. The software vendors were not left out either. Symantec, Commvault, and everyone else in town had data deduplication for backup and archiving.

It was no surprise that EMC battled NetApp and finally won the rights to acquire Data Domain for USD$2.4 billion in 2009. Today, in my opinion, the landscape of secondary data deduplication has pretty much settled and matured. Practically everyone has some sort of secondary data deduplication technology or solution in place.

But talk of primary data deduplication hardly causes a ripple now compared to a few years ago, especially here in Malaysia. Yeah, the IT crowd is pretty fickle that way because most tend to follow the trend of the moment. Last year it was Cloud Computing and now the big buzzword is Big Data.

We are here to look at technologies to solve problems, folks, and primary data deduplication solutions should be considered in any IT planning. And it is our job as storage networking professionals to continue to advise customers about what is relevant to their business and what addresses their pain points.

I get a bit cheesed off that companies like EMC or HDS continue to spend their marketing dollars on hyping the trends of the moment rather than using some of their funds to promote good technologies such as primary data deduplication that solve real-life problems. The same goes for most IT magazines, publications and other communications media, which rarely give space to technologies that solve problems on the ground and just harp on hype, fuzz and buzz. It gets a bit too ordinary (and mundane) when they are trying too hard to be extraordinary because everyone is basically talking about the same freaking thing at the same time, over and over again. (Hmmm … I think I am speaking off topic now .. I better shut up!)

We are facing an avalanche of data. The other day, the CEO of Nexenta used the term “data tsunami”, but whichever term is used does not matter. There is too much data. Secondary data deduplication solved one part of the problem and now it’s time to talk about the other part, which is data in primary storage, hence primary data deduplication.

What is out there? Who’s doing what in terms of primary data deduplication?

NetApp has had A-SIS (now NetApp Dedupe) for years and they are good in my books. They talk to customers about the benefits of deduplication on their FAS filers. (Side note: I am seeing more benefits of using data compression in primary storage but I am not going to go there in this entry.) EMC has had primary data deduplication in their Celerra for years but they hardly talk about it. It’s on their VNX as well but again, nobody in EMC ever speaks about their primary deduplication feature.

I have always loved Ocarina Networks’ ECO technology, but Dell doesn’t seem to give a hoot about Ocarina since the acquisition in 2010. The technology surfaced a few months ago in the Dell DX6000G Storage Compression Node for its Object Storage Platform, but then again, all Dell talks about is their Fluid Data Architecture from the Compellent division. Hey Dell, you guys are so one-dimensional! Ocarina is a wonderful gem in their jewel case, and yet all their storage guys talk about are Compellent and EqualLogic.

Moving on … I ought to knock Oracle on the head too. ZFS has great data deduplication technology that is meant for primary data and a couple of years back, Greenbytes took that and made a solution out of it. I don’t follow what Greenbytes is doing nowadays but I do hope that the big wave of primary data deduplication will rise for companies such as Greenbytes to take off in a big way. No thanks to Oracle for ignoring another gem in ZFS and wasting their resources on pre-sales (in Malaysia) and partners (in Malaysia) that hardly know much about the immense power of ZFS.

But an unexpected source, Microsoft, could help trigger greater interest in primary data deduplication. I have just read that the next version of the Windows Server OS will have primary data deduplication integrated into NTFS. The feature will be available in Windows Server 8 and the architectural view is shown below:

The primary data deduplication in NTFS will be a feature add-on for Windows Server users. It is implemented as a filter driver on a per volume basis, with each volume a complete, self describing unit. It is cluster aware, and fully crash consistent on all operations.

The technology is Microsoft’s own, built from scratch, and it will help position Hyper-V as a strong enterprise choice in its battle for the server virtualization space with VMware. Mind you, VMware already has a big, big lead and this is a do-or-die move that Microsoft must make just to keep Hyper-V playing catch-up. Otherwise, the gap between Microsoft and VMware in the server virtualization space will be even greater.

I don’t have the full details of this but I read that the NTFS primary deduplication chunk sizes will be between 32KB and 128KB and that it will be post-processing, meaning data is written to disk first and deduplicated afterwards.
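To make chunk-level deduplication a bit more concrete, here is a minimal sketch in Python. It is not how the NTFS feature is implemented. It assumes fixed-size 64KB chunks (a value I picked from inside the 32KB to 128KB range) and SHA-256 fingerprints, and the names `ChunkStore`, `put`, `get` and `stored_bytes` are made up for illustration.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumption: fixed 64KB chunks, inside the 32KB-128KB range


class ChunkStore:
    """Toy deduplicating store (hypothetical): identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data):
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def get(self, name):
        # Rebuild the file by following its list of fingerprints.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        # Physical bytes actually kept, after duplicates are removed.
        return sum(len(chunk) for chunk in self.chunks.values())
```

Store two files that share chunks and `stored_bytes()` grows only by the chunks that are new. A post-processing design would run this kind of pass over data that is already on disk rather than in the write path.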

With Microsoft introducing their technology soon, I hope primary data deduplication will get some deserved accolades because I think most companies are really not doing justice to the great technologies that they have in their jewel cases. And I hope Microsoft, with all its marketing savviness and adeptness, will do some justice to a technology that solves real-life data problems.

I bid you good luck, Primary Data Deduplication! You deserve better.

Posted in Big Data, Cloud, Commvault, Data, Deduplication, Dell, EMC, Falconstor, HP, IBM, NetApp, Oracle

Tags: Commvault, data reduction, Dell, EMC, Greenbytes, Microsoft, NetApp, Nexenta, Ocarina, Oracle, primary storage deduplication, Symantec, ZFS

Captain Dynamo Storage System

Dec 23

Posted by cfheoh

My research on file systems brought me to a very interesting article. It is titled “Dynamo: Amazon’s Highly Available Key-Value Store” and dated 2007.

Yes, this is an internal storage system designed and developed at Amazon to scale and support Amazon Web Services (AWS). It is a very complex piece of technology and the paper is highly technical (not for the faint of heart). And of all places, Amazon is probably the last place you would think to find such smart technology, but it’s true. AWS engineers are slowly revealing many of their innovations (think Amazon Silk browser technology).

And it appears that many of the latest cloud-based computing and services companies such as Amazon, Google and many others have been developing new methods of storing data objects. These methods are very different from the traditional methods of storing data, and many are no longer adopting the relational database model (RDBMS) to scale their business.

The traditional 3-tier architecture often adopted by web-based applications (before the advent of “cloud”) is evolving. As shown in the diagram below, the foundation tier is usually a relational database (or a distributed relational database), communicating with the back-end storage (usually a SAN).

All that is changing because the relational database model is not keeping up with the tremendous pace of the proliferation of web-based and cloud-based objects or unstructured data. As explained by Alex Iskold, a writer of ReadWriteWeb, there are scalability issues with the conventional relational database.

Before I get to the scalability issues mentioned in the above diagram, let me set the floor for discussion.

For students of relational database theory, the term ACID defines and guarantees the transactional reliability of relational databases. ACID stands for Atomicity, Consistency, Isolation and Durability. According to Wikipedia, “transactions provide an “all-or-nothing” proposition, stating that each work-unit performed in a database must either complete in its entirety or have no effect whatsoever. Further, the system must isolate each transaction from other transactions, results must conform to existing constraints in the database, and transactions that complete successfully must get written to durable storage.”

ACID has been the cornerstone of relational databases from the very beginning. But as the demands for greater scalability and greater distribution of data grow, it becomes difficult to make all 4 components of ACID (Atomicity, Consistency, Isolation, Durability) hold true at once. Hence, the CAP Theorem.

CAP Theorem (aka Brewer’s Theorem) stands for Consistency, Availability and Partition Tolerance. At an ACM (Association for Computing Machinery) conference in 2000, Eric Brewer of University of California, Berkeley presented the theorem. It states that it is impossible for a distributed computer system (or a database system) to simultaneously guarantee all 3 components: Consistency, Availability and Partition Tolerance.

Therefore, as database systems become more and more distributed in cyberspace, the ACID guarantees begin to break down. All 4 components of ACID can no longer be guaranteed simultaneously.

So when we get back to the diagram, both the concepts on left and right – Master/Slave OR Multiple Peers – will put a tremendous strain on the single, non-distributed relational database.

New data models are surfacing to handle these very distributed data sets. Distributed object-based “file systems” and NoSQL type of databases are some of the unconventional data storage “systems” that are beginning to surface as viable alternatives to the relational database method in cyberspace. And one of them is the Amazon Dynamo Storage System (ADSS).

ADSS is a highly available, Amazon-proprietary, distributed key-value data store. ADSS has the properties of both a distributed hash table and a database, and it is used internally to power various Cloud Services in Amazon Web Services (AWS).

It behaves like a relational database where it stores data objects to be retrieved. However, the data objects are not stored in a table format of a conventional relational database. Instead, the data is stored in a distributed hash table and data content or value is retrieved with a key, hence a key-value data model.

The data content is stored and retrieved through a simple put and get interface, much like a RESTful interface would do it. From the article in ReadWriteWeb, here’s how Dynamo works:

Physical nodes are thought of as identical and organized into a ring.
Virtual nodes are created by the system and mapped onto physical nodes, so that hardware can be swapped for maintenance and failure.
The partitioning algorithm is one of the most complicated pieces of the system, it specifies which nodes will store a given object.
The partitioning mechanism automatically scales as nodes enter and leave the system.
Every object is asynchronously replicated to N nodes.
The updates to the system occur asynchronously and may result in multiple copies of the object in the system with slightly different states.
The discrepancies in the system are reconciled after a period of time, ensuring eventual consistency.
Any node in the system can be issued a put or get request for any key
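The ring, virtual nodes and replication to N nodes described above can be sketched with consistent hashing. This is only an illustration and not Amazon’s code: the class name `Ring`, the `vnodes` and `replicas` parameters and the node names are my own assumptions, and MD5 is used here simply as a convenient way to spread keys around the ring.

```python
import bisect
import hashlib


def _hash(key):
    # Map any string onto a large integer position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8, replicas=3):
        self.replicas = replicas
        self.num_nodes = len(set(nodes))
        # Each physical node owns several virtual positions on the ring.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    def preference_list(self, key):
        """Walk clockwise from the key and collect N distinct physical nodes."""
        start = bisect.bisect(self.positions, _hash(key)) % len(self.ring)
        wanted = min(self.replicas, self.num_nodes)
        owners = []
        i = start
        while len(owners) < wanted:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners
```

Calling `Ring(["node-a", "node-b", "node-c", "node-d"]).preference_list("cart:12345")` returns the three physical nodes that would hold copies of that key. Because only the virtual positions of a joining or leaving node change, most keys stay where they are when hardware is swapped.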

The Dynamo architecture addresses the CAP Theorem well. It is highly available: nodes, either physical or virtual, can be easily swapped without affecting the storage services. It is also high performance: nodes (again physical or virtual) can be added to boost performance. The high performance and high availability address the “A” piece of CAP.

Its distributed nature also allows it to scale to billions and billions of data objects and hence meets the “P” requirement of CAP. Partition tolerance is definitely there.

However, as stated by CAP Theorem, you can’t have all 3 happening at the same time. Therefore, the “C” or Consistency piece of CAP has to be compromised. That is why Dynamo has been labeled an “eventually consistent” storage system.

As data is stored into ADSS, the changes to the data are propagated and asynchronously replicated to other nodes in the system, eventually making all copies of the data objects and their values consistent. However, given the speed of things in cyberspace and the nature of most Cloud Computing services, immediate consistency could be difficult to accomplish, and that is OK because in most distributed transactions temporary inconsistency is acceptable.
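Here is a small sketch of what “eventually consistent” means in practice. The Dynamo paper reconciles divergent copies with vector clocks; to keep things short I have swapped in a simpler last-write-wins rule based on a version number, so treat `Replica` and `reconcile` as hypothetical names for illustration only.

```python
class Replica:
    """Toy replica (hypothetical): keeps the newest version it has seen per key."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, value, version):
        current = self.data.get(key)
        # Keep the write only if it is newer than what this replica holds.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def get(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def reconcile(replicas):
    """Anti-entropy pass: push the newest version of every key to all replicas."""
    newest = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in newest.items():
            replica.put(key, value, version)
```

Between a write and the next `reconcile` pass, different replicas can return different answers for the same key. After the pass, they all agree.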

So that’s a bit about the Amazon Dynamo. Alas, we may never get our grubby hands on this piece of cool data storage and management technology, but knowing that Dynamo is powering AWS and its business is an eye-opener for us into the realm of a new technology evolution.

Posted in Amazon, Big Data, Data, Object Storage

A little yellow elephant

Dec 17

Posted by cfheoh

By now, I believe most of you in the storage networking world would have heard of Hadoop. Hadoop was created by Doug Cutting while he and his team were working on an open source web search engine called Nutch. The easily recognized little yellow elephant, Hadoop, was a toy belonging to Doug Cutting’s son, and it became Hadoop’s mascot. Pretty cool!

And today, Hadoop has become THE platform for Big Data applications. Why?

As I have mentioned before, everything that we do or don’t do generates data, either as a direct or an indirect product. I am blogging right now and I am creating data. I was in Singapore the whole of this week and everywhere I went in the MRT stations, I was being watched by the video cameras they have at the station. A new friend in class said that Singapore is the second most “watched” city after London, where there are video cameras mounted everywhere, either discreetly or indiscreetly. And that’s just video data. There are plenty of other human activities that generate tons and tons of data.

IDC Digital Universe Report for 2011 said that we have generated 1.8ZB (zettabytes) of data this year alone. I mentioned in my previous blog that this is a gold mine and companies are scrambling to tap into this massive amount of data. Extracting valuable information to anticipate the next trend or predict that next evolution in human preference is akin to the Gold Rush in the wild, wild west in the late 19th century. Folks, Big Data is going to be this generation’s “Digital Gold Rush”.

Sieving, filtering and processing gazillions of bytes of data (more unstructured than structured) will not work in defined, well-formatted relational databases. The data model of relational databases will simply break down. And of course, there are different schools of thought on data models, but the Hadoop model seems to be gaining momentum and mind share among data scientists. That is because of Hadoop’s capability to deal with massive unstructured data, processing it and producing results in a small amount of time.

One way to process the pool of massive data is parallel programming. In parallel programming, multi-threading is commonly deployed to achieve the required performance. But implementing multi-threading is difficult. Developers often have to deal with LWPs (lightweight processes), semaphores, shared memory, mutex (mutual exclusion) locking and so on. This style of programming works with changing state on shared data, so the same expression can produce different results depending on the state at the time it runs.

Hadoop belongs to another school of programming known as functional programming, where the concept of different states on shared data is removed. With that, the dependency on state is also removed, resulting in a much simpler parallel programming implementation. Hadoop borrows ideas from the MapReduce software framework made well known by Google, and from the Google File System.

Before we get to know Hadoop, we must know MapReduce. MapReduce is a framework which allows very large data sets to be processed with a very large set of computer nodes in a cluster. Typically the computational processing is executed in a distributed fashion, spread across many computer nodes, and the final results are consolidated from the sub-results of these distributed processing nodes.

According to Wikipedia, the 2 key functions of MapReduce are map() and reduce(). That’s pretty obvious. The extract below was taken from the Wikipedia definition, and explains both functions very well.

“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
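The classic way to see the two steps is a word count. The sketch below runs everything in one Python process, so there is no master node or worker nodes, and the function names `map_words`, `shuffle`, `reduce_counts` and `word_count` are my own. A real MapReduce framework would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict


def map_words(line):
    # Map step: emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    # Reduce step: combine the values of one key into a single answer.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())
```

Because `map_words` only looks at its own line and `reduce_counts` only looks at one key, neither depends on shared state, which is what makes the work easy to spread across nodes.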

The diagram below probably can simplify the concept of MapReduce to the readers.

Hadoop is one of the open-source implementations of MapReduce. It is one of the projects of the Apache Software Foundation, and the project has sparked a brand-new niche of data search, data management and data science. The diagram below will allow our readers to juxtapose MapReduce and Hadoop and compare them in the simplest fashion.

Hadoop’s primary development platform is Java. Hadoop’s architecture consists mainly of 2 components, Hadoop Common and a Hadoop-compatible file system, as shown in the diagram below.

The Hadoop MapReduce layer above is the file/object access interface to the Hadoop-compatible file system below. HDFS, the Hadoop Distributed File System, is just one of a few Hadoop-compatible file systems. Other file systems include:

Amazon S3 File System as part of the Amazon EC2 Infrastructure-as-a-Service (IaaS) cloud platform
CloudStore – a similar Hadoop-like implementation using C++ and also inspired by Google File System
FTP file systems
HTTP and HTTPS read-only file systems
Any file systems accessible with the file:// URL nomenclature

But the main engine of Hadoop is in the MapReduce layer. The 2 core components in this layer are the JobTracker and the TaskTracker. Both have their own individual roles to play and collectively, they are key cogs in the Hadoop distributed data processing model.

Below are extracts I picked up from Wikipedia.

Client applications submit MapReduce jobs to the JobTracker. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser. Jetty is a Java-based HTTP server, among other things.

JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Scheduling

By default Hadoop uses first-in, first-out (FIFO) scheduling, with 5 optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, adding the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS (Quality of Service) for production jobs. The fair scheduler has three basic concepts.

Jobs are grouped into Pools.
Each pool is assigned a guaranteed minimum share.
Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
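As a rough illustration of the guaranteed-minimum-plus-excess idea, here is a sketch that hands out slots to pools. It is not the Hadoop fair scheduler’s actual algorithm (which shares excess capacity between jobs and tracks map and reduce slots separately). The function `allocate_slots`, the pool names and the numbers are hypothetical, and it assumes the minimum shares add up to no more than the total number of slots.

```python
def allocate_slots(total_slots, pools):
    """pools maps a pool name to (min_share, demand), both counted in slots."""
    # First pass: every pool gets its guaranteed minimum, capped by its demand.
    allocation = {
        name: min(min_share, demand) for name, (min_share, demand) in pools.items()
    }
    spare = total_slots - sum(allocation.values())
    # Second pass: hand out spare slots one at a time, round-robin,
    # to pools that still have unmet demand.
    while spare > 0:
        hungry = [
            name for name, (_, demand) in pools.items() if allocation[name] < demand
        ]
        if not hungry:
            break
        for name in hungry:
            if spare == 0:
                break
            allocation[name] += 1
            spare -= 1
    return allocation
```

With 10 slots and pools `{"production": (4, 8), "adhoc": (2, 2), "default": (0, 5)}`, production and adhoc first receive their minimums of 4 and 2, and the 4 spare slots are then shared between production and default.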

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.

Jobs are submitted into queues.
Queues are allocated a fraction of the total resource capacity.
Free resources are allocated to queues beyond their total capacity.
Within a queue a job with a high level of priority will have access to the queue’s resources.

I took most of the extracts above from Wikipedia, and I don’t claim to be a knowledgeable person on Hadoop. All the credit goes to the Wikipedia editors for putting Hadoop in layman’s terms.

Hadoop has certainly won hearts in the new digital gold rush that is Big Data, and is slowly becoming a force to be reckoned with among data scientists. Hadoop implementations are powering new frontiers in processing and mining the ever growing data capacity, giving solution providers a simple programming methodology and data model to gain more insights into the vast seas of data and information.

Hadoop has many fans, and is slowly becoming the data platform for large companies such as Yahoo!, Facebook, IBM, Amazon, Apple, eBay and many more. Facebook even claims to have the largest Hadoop cluster in the world, growing to 30PB in July of 2011.

This little yellow elephant is going places and is one to watch out for.

Posted in Analytics, Big Data, Data, Filesystems, Hadoop

Greenplum looking mighty sweet

Dec 16

Posted by cfheoh

Big data is Big Business these days. IDC predicts that between 2012 and 2020, spending on big data solutions will account for 80% of IT spending, growing at 18% per annum. EMC predicts that the big data market is worth USD$70 billion! That’s a huge market.

We generate data, and plenty of it. In the IDC Digital Universe Report for 2011 (sponsored by EMC), approximately 1.8 zettabytes of data will be created and replicated in 2011. How much is 1 zettabyte, you say? Look at the conversion below:

                    1 zettabyte = 1 billion terabytes

That’s right, folks. 1 billion terabytes!
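The conversion is easy to check with a few lines of Python. This sketch assumes decimal (SI) units, where 1 terabyte is 10^12 bytes and 1 zettabyte is 10^21 bytes. The variable names are just for illustration.

```python
TERABYTE = 10 ** 12   # bytes, decimal (SI) units
ZETTABYTE = 10 ** 21  # bytes, decimal (SI) units

# How many terabytes fit in one zettabyte.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# The 1.8 zettabytes from the IDC report, expressed in terabytes
# (integer arithmetic: 18/10 of a zettabyte).
digital_universe_2011_tb = 18 * terabytes_per_zettabyte // 10
```

So the 1.8 zettabytes quoted for 2011 works out to 1.8 billion terabytes.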

And this “mountain” of data and information is a Goldmine of goldmines, and companies around the world are scrambling to tap into this treasure chest. According to Wikibon, big data has the following characteristics:

Very large, distributed aggregations of loosely structured data – often incomplete and inaccessible
Petabytes/exabytes of data
Millions/billions of people
Billions/trillions of records
Loosely-structured and often distributed data
Flat schemas with few complex interrelationships
Often involving time-stamped events
Often made up of incomplete data
Often including connections between data elements that must be probabilistically inferred

But what is relevant is not the definition of big data, but rather what you get from the mountain of information generated. The ability to “mine” the information from big data, now popularly known as Big Data Analytics, has sparked a new field within the data storage and data management industry. This is called Data Science. And companies and enterprises that are able to effectively use the new data from Big Data will win big in the next decade. Activities such as

Making business decisions
Gaining competitive advantage
Driving productivity growth in relevant industry segments
Understanding consumer and business behavioural patterns
Knowing buying decisions and business cycles
Yielding new innovation
Revealing customer insights
and much, much more

will drive a whole new paradigm that shall be known as Data Science.

And EMC, having purchased Greenplum more than a year ago, started their Data Computing Products Division immediately after the acquisition. In October of 2010, EMC announced their Greenplum Data Computing Appliance with some impressive numbers, using the 2 configurations of the appliance noted below:

Below are 2 tables of the Greenplum performance benchmarks:

That’s what these big data appliances are able to do. The ability to load billions of either structured or unstructured files or objects in mere minutes is what drives the massive adoption of Big Data.

And a few days ago, EMC announced their Greenplum Unified Analytics Platform (UAP), which comprises 3 Greenplum components:

A relational database for structured data
An enterprise Hadoop engine for the analysis and processing of unstructured data
Chorus 2.0, which is a social media collaboration tool for data scientists

The diagram below summarizes the UAP solution:

Greenplum is certainly ahead of the curve. Competitors like IBM Netezza, Teradata and Oracle Exalogic are racing to be ahead but Greenplum is one of the early adopters of a single platform for big data. Having a consolidation platform will not only reduce costs (integration of all big data components usually incurs high professional services’ fees) but will also reduce the barrier to entry to big data, thus further accelerating the adoption of big data.

Big Data is still very much in its infancy and EMC is pushing to establish its footprint in this space. Last week, EMC Education already announced the general availability of courses related to big data and also the EMC Data Science Architect (EMC DSA) certification. Greenplum is enjoying the early sweetness of the Big Data game and there will be more to come. I am certainly looking forward to sharing more on this plum (pun intended ;-)) of the data storage and data management excitement.

Posted in Big Data, EMC

Tags: big data, data analytics, data science, EMC, Greenplum, Hadoop, IBM Netezza, Oracle Exalogic, Teradata, Unified Analytics Platform, unstructured data

What should be a Cloud Storage?

Dec 8

Posted by cfheoh

For us filesystem guys, NAS is the way to go. We are used to storing files in network file systems via the NFS and CIFS protocols and treating the NAS storage array like a refrigerator: taking stuff out and putting stuff back in. All that is fine and well as long as the data is what I would term corporate data.

Corporate data is generated by employees, applications and users of the company and for a long time, the power of data creation lay in the hands of the enterprise. That is why storage solutions are designed to address the needs of the enterprise where the data is structured and well defined. How the data is stored, how it is formatted and how it is accessed form the “boundary” of how the data is used. Typically a database is used to “restrict” the data created so that the information can be retrieved and modified quickly. And of course, SAN guys will tell you to put the structured data of the database into their SAN.

For the unstructured data in the enterprise, NAS file systems hold that responsibility. Files such as documents and presentations have more loosely defined “boundaries”, and hence filesystems are a more natural fit for unstructured data. Filesystems are like a free-for-all container, able to store and provide access to any files in the enterprise.

But today, as Web 2.0 applications are already taking over the enterprise, the power of data creation does not necessarily lie in the hands of the enterprise applications and users. In fact, it is estimated that the percentage of enterprise data now has exceeded 50% of the enterprise’s total data capacity. With the proliferation of personal devices such as tablets, Blackberries, smart phones, PDAs and so on, individual contributors are generating plenty of data. This situation has been made more acute with Web 2.0 applications, such as Facebook, blogs, social networking, Twitter, Google Search and so on.

Unfortunately, file systems in the NAS category are still pretty much the traditional file systems, and they cannot meet the needs that call for a new type of file system. The paradigm is definitely shifting. The new unstructured data world needs a new storage concept. I would term this type of storage as “Cloud Storage” because it breaks down the traditional concepts of NAS.

So what basically defines a Cloud Storage? I already mentioned that the type of unstructured data has changed. And the new requirements for this unstructured data type are:

The unstructured data type is capable of being globally distributed.
There will be billions and billions of unstructured data objects created but each object, be it a Twitter tweet, or an uploaded mobile video, or even the clandestine data collected by CarrierIQ, can be accessed easily via a single namespace.
The storage file system foundation for this new unstructured data type is easily provisioned and managed. Look at Facebook. It is easy to set up and get going, and the user (and probably the data administrator) can easily manage the user interface and the platform.
For the service provider of Cloud Storage, the file system must be secure and support multi-tenancy and virtualization of storage resources.
There should be some form of policy-driven content management. That is why development platforms such as Joomla!, Drupal and WordPress are slowly becoming enterprise driven to address these unstructured data types.
The data is highly searchable and has a high degree of search optimization. A Google search does have a strong degree of intelligence and relevance to the data being searched, as well as generating tons of by-product data that feeds the need to understand the consumers or the users better. Hail Big Data!

So when I look at traditional NAS storage solutions such as NetApp or EMC VNX or BlueArc, I ask whether their NAS solutions have the capabilities to meet the requirements of this new unstructured data type.

Most of them, no matter how they package it, still rely on files as the granular object of storage. And today, most files may have some form of metadata such as file name, owner, size etc., but they DO NOT possess the capability of being content-aware. Here’s an example I want to show you:

The file properties (part of the file metadata) tell you about the file but little about the content of the file. Today, it requires more than that and the new unstructured data type should look more like this:

If you look at the diagram below, the object on the right (which is the new unstructured data type) displays much more information than a typical file in a NAS file system. This additional information becomes the fodder for other applications such as search engines, RSS feeds, robots and spiders and, of course, big data analytics.
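The difference between a plain file and a content-aware object can be sketched in a few lines of Python. The class names `FileEntry` and `StorageObject`, the `metadata` field and the `find_by_metadata` helper are all hypothetical; they only illustrate the idea of carrying extended metadata alongside the data.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    # What a traditional file system typically records about a file.
    name: str
    owner: str
    size: int


@dataclass
class StorageObject:
    # Same basics, plus free-form metadata that describes the content itself.
    name: str
    owner: str
    size: int
    metadata: dict = field(default_factory=dict)


def find_by_metadata(objects, key, value):
    """Return the objects whose extended metadata matches key == value."""
    return [obj for obj in objects if obj.metadata.get(key) == value]
```

A photo stored as `StorageObject("IMG_001.jpg", "alice", 2048, {"location": "Singapore"})` can then be found by what it is about, not just by its file name, which is the kind of fodder a search engine or an analytics job can use.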

Here’s another example of what I mean by this extended metadata. A Cloud Storage array is required to work with this new set of parameters and a new set of requirements.

There’s a new unstructured data type in town. Traditional NAS systems may not have the right features to work with this new paradigm.

Don’t be whitewashed by the fancy talk of storage vendors in town. Learn the facts, and find out what is really a Cloud Storage.

It’s time to think differently. It’s time to think of what should be a Cloud Storage.

Posted in Analytics, Big Data, Filesystems, Object Storage

Tags: big data, Cloud Storage, databases, metadata, structured data, unstructured data

Big data is big headache

Oct 28

Posted by cfheoh

IBM claims that we are responsible for creating 2.5 quintillion bytes of data every day. How much is 1 quintillion?

According to the web,

1 quintillion = 1,000,000,000,000,000,000

After billion, it is trillion, then quadrillion, and then quintillion. That’s what 1 quintillion is, with 18 zeroes!
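For a feel of the scale, the arithmetic can be checked with a few lines of Python. This is just a sketch: I am assuming decimal units (1 exabyte = 10^18 bytes), and the function names are my own.

```python
QUINTILLION = 10 ** 18           # 1 followed by 18 zeroes
DAILY_BYTES = 25 * 10 ** 17      # 2.5 quintillion bytes, the IBM figure
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def count_zeroes(power_of_ten):
    """Count the zeroes in a power of ten, e.g. 1000 has 3."""
    return len(str(power_of_ten)) - 1


def to_exabytes(num_bytes):
    """Convert bytes to decimal exabytes (1 EB = 10**18 bytes)."""
    return num_bytes / 10 ** 18


def bytes_per_second(daily_bytes):
    """Average rate if the daily total were spread evenly over the day."""
    return daily_bytes / SECONDS_PER_DAY


if __name__ == "__main__":
    print(count_zeroes(QUINTILLION))
    print(to_exabytes(DAILY_BYTES))
    print(bytes_per_second(DAILY_BYTES))
```

In other words, 2.5 quintillion bytes a day is 2.5 exabytes a day, or roughly 29 terabytes every second on average.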

This data comes from everything from social networking updates, meteorology (weather reports), remote sensing maps (Google Maps, GPS, Geographical Information Systems), photos (Flickr), videos (YouTube), Internet search (Google) and so on. Big data, according to Wikipedia, is data that is too large to be handled and processed by conventional data management tools. This presents a new set of difficulties when it comes to collecting the data, storing it and sharing it. Indexing and searching big data requires special technologies that can mine and extract valuable information from big data datasets within an acceptable period of time.

According to Wiki, “Technologies being applied to big data include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.” That is why EMC paid big money to acquire Greenplum and IBM acquired Netezza. Traditional data warehousing players such as Teradata, Oracle and Ingres are in the picture as well, setting a collision course between the storage and infrastructure companies and the data warehousing solutions companies.

The 2010 Gartner Magic Quadrant saw non-traditional players such as IBM/Netezza and EMC/Greenplum in its leaders quadrant.

And the key word that is already on everyone’s lips is “ANALYTICS”.

The ability to extract valuable information that helps determine the next trend, along with personalized profiling, may already have arrived, as companies are clamouring to get more and more out of our personalities so that they can sell us more of their wares.

Meteorological organizations are using big data analytics to find out about weather patterns and climate change. Space exploration becomes more accurate and precise thanks to the tons and tons of data collected from space missions. Big data analytics is also helping pharmaceutical companies develop new biological and pharmaceutical breakthroughs. And the list goes on.

I am new to big data and I do not claim to know a lot. But terms such as scale-out NAS, distributed file systems, grid computing and massively parallel processing are certainly bringing the data storage world into a new frontier, and it is something we as storage professionals have to adapt to. I am eager to learn and know more about big data. It is a big headache but change is inevitable.

Posted in Analytics, Big Data

1 Comment

Tags: analytics, big data, data warehouse, distributed file systems

Storage Gaga

Going Ga-ga over storage networking technologies ….
