Blog Archives

Kaminario who?

The name “Kaminario” intrigues me, and I don’t know what it means. It rolls off the tongue nicely until you say it a few times fast, and then your tongue gets twisted in a jiffy.

Kaminario is one of the few prominent startups in the all-flash storage space, having received USD$15 million in Series C funding from big-gun VCs Sequoia and Globespan Capital Partners in 2011. That brought their total funding to USD$34 million, and also brought them the attention of the storage market.

I am beginning my research into their technology and their product line, the K2, to see why they are special. I am looking for an angle that differentiates them, how they position themselves in the market, and why they deserved Series C funding.

Kaminario was founded in 2008, with their headquarters in Boston, Massachusetts. They have a strong R&D facility in Israel and, looking at their management lineup, they are headed by several personalities with an Israeli background.

All this shouldn’t be a problem to many, except for the fact that Malaysia doesn’t recognize Israel diplomatically and some companies here, especially the government, might have an issue with the Israeli link. But then again, we have a lot of hypocrites in Malaysian politics and I am not going to go there in my blog. It’s a waste of my time.

The key technology is Kaminario’s K2 SPEAR Architecture, and it defines a fundamental method to store and retrieve performance-sensitive data. Yes, since this is an all-Flash storage solution, performance numbers, speeds and feeds are the “weapons” to influence prospects with high performance requirements. Kaminario touts that their storage solution scales up to 1.5 million IOPS and 16GB/sec throughput, and indeed these are fantastic numbers when you compare them with conventional HDD-based storage platforms. But nowadays, if you are in the all-Flash game, everyone else is touting similar performance numbers as well. So, it is no biggie.

The secret sauce to the Kaminario technology is, of course, its architecture – SPEAR. SPEAR stands for Scale-out Performance Storage Architecture. While Kaminario states that their hardware is pretty much off-the-shelf, open industry standard, somehow under the covers, the SPEAR architecture could have incorporated some special, proprietary design in its hardware to maximize the SPEAR technology. Hence, I believe there is a reason why Kaminario chose a blade-based system in the enclosures of its rack. Here’s a look at their hardware offering:

The idea of using blades is a good one because blades offer integrated wiring, consolidation, simple plug-and-play, ease-of-support, N+1 availability and so on. But this can also put Kaminario in a position of all-blades or nothing. This is something some customers in Malaysia might have to get used to, because many would prefer their own racks. I could be wrong and let’s hope I am.

Each enclosure houses 16 blades, with N+1 availability. As I am going through Kaminario’s architecture, the word availability is becoming louder, and this could be how Kaminario differentiates itself from the rest. Yes, Kaminario has the performance numbers, but Kaminario also has a highly available (are we talking 6 nines?) architecture inherent within SPEAR. Of course, I have not done enough to compare Kaminario with the rest yet, but right now, availability isn’t something that most all-Flash startups trumpet loudly. I could be wrong but the message will become clearer when I go through my list of all-Flash players – SolidFire, PureStorage, Virident, Violin Memory and Texas Memory Systems.

Each of the blades can be either an ioDirector or a DataNode, and they are interconnected internally with 1/10 Gigabit ports, with at least one blade acting as a standby blade to the rest in a logical group of production blades. The 10 Gigabit connections are used for “data passing” between the blades for the purpose of load-balancing as well as spreading out the availability function for the data. The Gigabit connection is used for management.

In addition to that there is also a Fibre Channel piece that is fronting the K2 to the hosts in the SAN. Yes, this is an FC-SAN storage solution but since there was no mention of iSCSI, the IP-SAN capability is likely not there (yet).

 Here’s a look at the Kaminario SPEAR architecture:

The 2 key components are the ioDirector and the DataNode. A blade can either have a dedicated personality (either ioDirector or DataNode) or it can share both personalities in one blade. The minimum configuration is 2 blades running 2 ioDirectors, for redundancy reasons.

The ioDirector is the front-facing piece. It presents the K2 block-based LUNs to the SAN and has the intelligence to dynamically load balance both Reads and Writes while optimizing its resource utilization. The DataNode plays the role of fetching, storing, and backup, and is pretty much the back-end worker.

With this description, there are 2 layers in the SPEAR architecture. And interestingly, while I mentioned that Kaminario is an all-Flash storage player, it actually has HDDs as well. The HDDs do not participate in primary data serving; they serve as backup containers for the primary data in the SSDs, which can be MLC Flash or DRAM. This back-end backup layer comprising HDDs is what I was referring to earlier about availability. Kaminario is adding data availability as one of its differentiating features.

That’s the hardware layout of SPEAR, but the more important piece is its software, the SPEAR OS. It has 3 patent-pending capabilities, with not-so-cool names (which are trademarked):

  1. Automated Data Distribution
  2. Intelligent Parallel I/O Processing
  3. Self Healing Data Availability

The Automated Data Distribution of the SPEAR OS acts as a balancer. It dynamically and randomly (in a random equilibrium fashion, I think) spreads the data out over the storage capacity for efficiency, SSD longevity and, of course, balanced performance.

The second capability is Intelligent Parallel I/O Processing. The K2 architecture is essentially a storage grid. The internal 10 Gigabit interconnects basically tie all nodes (ioDirectors and DataNodes) together in a grid-like fashion for the best possible inter-node communications. The I/O Read and Write requests are parallelized across the nodes in the storage grid, giving the best average response and service times.

Last but not least is the Self Healing Data Availability, a capability to dynamically reconfigure accessibility to the data in the event of node failure(s). Kaminario claims no single point of failure, which is something I am very interested to know if given a chance to assess the storage a bit deeper. So far, that’s the information I am able to get to.

The Kaminario K2 product line comes in 3 models – D, F, and H.

D is for DRAM only and F is for Flash MLC only. The H model is a combination of both Flash and DRAM SSDs. Here’s how Kaminario addresses each of the 3 models:

Kaminario is one of the early all-Flash storage systems that has gained recognition in 2011. They have been named a finalist in both Storage Magazine and SearchStorage Storage Product of the Year competitions for 2011. This not only endorses a brand new market for solid state storage systems but validates an entirely new category in the storage networking arena.

Kaminario can be one to watch in 2012, along with others that I plan to review in the coming weeks. The battle for Flash racks is coming!

BTW, Dell is a reseller of Kaminario.


Battle of flash racks coming soon

The battle is probably already here. It has just begun for rack mounted flash-based or DRAM-based (or both) storage systems.

We have read in the news about the launch of EMC’s Project Lightning, and I wrote about it. EMC is already stirring up the competition, aiming its guns at FusionIO. Here’s a slide from EMC comparing their VFCache with FusionIO.

Not to be outdone, NetApp set out to douse the razzmatazz of EMC’s Lightning, announcing the future availability of their server-side flash software (no PCIe card), which will work with major host-based/server-side PCIe Flash cards. (FusionIO, heads up). Ah, in Sun Tzu’s Art of War, this is called helping your buddy fight the bigger enemy.

NetApp threw some FUD into the battle zone, claiming that EMC VFCache only supports 300GB while the NetApp flash software will support 2TB, NetApp multiprotocol, and VMware’s VMotion, DRS and HA (something that VFCache does not support now).

The battle of PCIe has begun.

The next battle will be for rack-mounted flash storage systems or appliances. EMC is following up with Project Thunder (because thunder comes after lightning), which is a flash-based storage system or appliance. Here’s a look at EMC’s preliminary information on Project Thunder.

And here’s how EMC is positioning different storage tiers in the diagram below (courtesy of VirtualGeek), glued together by EMC FAST (Fully Automated Storage Tiering) technology.

But EMC is not alone, as several prominent start-ups are already offering flash-based, rackmount storage systems.

In the battle ring, there is Kaminario K2 with SPEAR (Scale-out Performance Storage Architecture), Violin Memory with the Violin Switched Memory (VXM) architecture, Pure Storage with the Purity Operating Environment and SolidFire with Element OS, just to name a few. Of course, we should never discount the granddaddy of all flash-based storage – Texas Memory Systems RamSAN.

The whole motion of competition in this new arena is starting all over again and it’s exciting for me. There is so much to learn about newer, more innovative architectures and I intend to share more about these players in the coming blog entries. It is time to take notice because SSDs are dropping in price, FAST! And in 2012, I strongly believe that this is the next battle of the storage players, both established and start-ups.

Let the battle begin!

 

Lightning about to strike

Watch out for February 6th, 2012 folks! The Lightning is about to strike!

Yes, it is likely that EMC will be announcing their server-based, 8-lane PCIe Flash memory card early in February. The PCIe card was dubbed “Project Lightning” when it was first announced at EMC World in May last year. It represents EMC’s first foray into products that sit on the server side, giving the impression that EMC could be entering the server business. I blogged about this way back in September last year. As explained by the EMC folks, they are not going into the server business but rather “extending” their performance tiering into the server space. Think of it like an umbilical cord that sucks the server’s CPU processing power to give maximum performance boost for the EMC storage.

The card will sport an LSI Warp Drive Solid State Drive and come in 100/200/300GB capacities. Here’s a picture of what the Lightning card would look like:

The SSD is SLC (Single Level Cell) and is capable of delivering 150,000 random read IOPS based on 4K blocks and 190,000 random write IOPS. It can squeeze out 1.4GB/sec in read throughput. While it is not on par with the performance of Fusion-IO, it can definitely do well leveraging EMC’s huge customer base. Furthermore, PCIe-based Flash memory cards such as Fusion-IO will not be able to take advantage of the bridge that links the server and the storage, making them confined to the server’s resources. The advantage is definitely EMC’s when you explore the possibilities.

Here’s a view of a slide from Virtual Geek summarizing the Project Lightning:

The Lightning card is aimed at customers who demand the highest performance, even higher than Tier 0. It will be integrated with EMC’s FAST (Fully Automated Storage Tiering) technology and will be available for the VNX and VMAX platforms.

So watch out folks, because Lightning is about to strike soon!

Not all SSDs are the same

Happy Lunar New Year! The Chinese around the world ushered in the Year of the Water Dragon yesterday. To all my friends and family, and readers of my blog, I wish you a prosperous and auspicious Chinese New Year!

Over the holidays, I have been keeping up with the progress of Solid State Drives (SSDs). I am sure many of us are mesmerized by SSDs, and the storage vendors are touting the best that SSDs have to offer. But let me tell you one thing – you are probably getting the least of what the best SSDs have to offer. You might be puzzled why I say things like this.

Let me share with you a common sales pitch. Most (if not all) storage vendors will tout performance (usually IOPS) as the greatest benefit of SSDs. The performance numbers have to be compared to something, and that something is your regular spinning Hard Disk Drives (HDDs). Even the slowest SSDs are well over 10x faster than HDDs in terms of IOPS. A single SSD can churn out at least 5,000 IOPS, while the fastest 15,000 RPM HDDs churn out about 200 IOPS (depending on the HDD vendor). Therefore, even the slowest SSDs can be about 25x faster than the fastest HDDs, when measured in IOPS.
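As a quick sanity check on that comparison, the ratio can be worked out directly. This is only a back-of-envelope sketch: the function name is made up for illustration, and the two IOPS figures are the rough ones quoted in the pitch above, not measurements.

```python
def iops_speedup(ssd_iops: float, hdd_iops: float) -> float:
    """How many times more IOPS the SSD delivers than the HDD."""
    return ssd_iops / hdd_iops


# Rough figures from the sales pitch: a slow SSD versus a fast 15,000 RPM HDD.
slow_ssd_iops = 5_000
fast_hdd_iops = 200

speedup = iops_speedup(slow_ssd_iops, fast_hdd_iops)
print(f"A slow SSD is roughly {speedup:.0f}x a fast HDD in IOPS")
```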

But the intent of this blogger is to share with you more about SSDs. There’s more to know because SSDs are not built the same. There are write-biased SSDs and read-biased SSDs; there are SLC (single level cell) and MLC (multi level cell) SSDs and so on. How do you differentiate them if Vendor A touts their SSDs and Vendor B touts their SSDs as well? You are not comparing SSDs and HDDs anymore. How do you know what questions to ask when they show you their performance statistics?

SNIA has recently released a methodology called “Solid State Storage (SSS) Performance Testing Specifications (PTS)” that helps customers evaluate and compare SSD performance from a vendor-neutral perspective. There is also a whitepaper related to SSS PTS. This is something very important because we have to continue to educate the community about what is right and what is wrong.

In a recent webcast, the presenters from the SNIA SSS TWG (Technical Working Group) mentioned a few questions that I think we, as vendors and customers, should think about when working with an SSD sales pitch. I thought I would share them with you.

  • Was the performance testing done at the SSD device level or at the file system level?
  • Was the SSD pre-conditioned before the testing? If so, how?
  • Were the performance results taken at a steady state?
  • How much data was written during the testing?
  • Where was the data written to?
  • What data pattern was tested?
  • What was the test platform used to test the SSDs?
  • What hardware or software package(s) were used for the testing?
  • Was the HBA bandwidth, queue depth and other parameters sufficient to test the SSDs?
  • What type of NAND Flash was used?
  • What is the target workload?
  • What was the percentage weight of the mix of Reads and Writes?
  • Are there any warranty life design issues?

 

I thought that these questions were very relevant in understanding SSDs’ performance. And I also got to know that SSDs behave differently throughout the life stages of the device. From a performance point of view, there are 3 distinct performance life stages:

  • Fresh out of the box (FOB)
  • Transition
  • Steady State

 

 

As you can see from the graph below, an SSD fresh out of the box (FOB) displays considerable performance numbers. Over a period of time (the graph shows minutes), it transitions into a mezzanine stage of lower IOPS and finally normalizes to the state called the Steady State. The Steady State is the desirable test range that will give the most accurate IOPS numbers. Therefore, it is important that your storage vendor’s performance numbers are taken during this life stage.
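To make the idea of “steady state” a little more concrete, here is a minimal sketch of how one might check whether a series of IOPS measurements has settled. The function name, the 5-round window and the 20% range / 10% slope thresholds are my own illustrative assumptions, loosely modelled on the kind of criteria a test specification would define; check the SSS PTS document itself for the exact rules. The sample numbers are made up.

```python
from statistics import mean


def is_steady_state(iops_rounds, window=5, max_range=0.20, max_slope=0.10):
    """Return True if the last `window` rounds look settled.

    Assumed criteria (illustrative only):
      * max - min within the window is at most `max_range` of the average
      * the rise or fall of the best-fit line across the window is at
        most `max_slope` of the average
    """
    if len(iops_rounds) < window:
        return False
    y = list(iops_rounds[-window:])
    avg = mean(y)

    # Range check: how far the samples swing around the average.
    if max(y) - min(y) > max_range * avg:
        return False

    # Slope check: least-squares line through the window.
    x = list(range(window))
    x_avg = mean(x)
    slope = (sum((xi - x_avg) * (yi - avg) for xi, yi in zip(x, y))
             / sum((xi - x_avg) ** 2 for xi in x))
    drift = abs(slope) * (window - 1)
    return drift <= max_slope * avg


# Made-up IOPS per test round: FOB, then transition, then settling down.
samples = [90000, 70000, 50000, 30000, 21000,
           20500, 20000, 20200, 19900, 20100]

print("Steady after 5 rounds? ", is_steady_state(samples[:5]))
print("Steady after 10 rounds?", is_steady_state(samples))
```

The point of the sketch is simply that a number quoted from the first few rounds (the FOB stage) would fail this kind of check, which is why it is worth asking the vendor when the measurement was taken.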

Another consideration when understanding the SSDs’ performance numbers is what type of tests were used. The test could be done at the file system level or at the device level. As shown in the diagram below, the test numbers could be taken from many different elements through the stack of the data path.

 

Performance for cached data would give impressive numbers, but it is not accurate. File system performance will not be useful because the data travels through different layers, masking the true performance capability of the SSDs. Therefore, SNIA’s performance testing is based on a synthetic device-level test to achieve consistency and more accurate IOPS numbers.

There are many other factors used to determine the most relevant performance numbers. The SNIA PTS has 4 main test suites that address different aspects of the SSD’s performance. They are:

  • Write Saturation test
  • Latency test
  • IOPS test
  • Throughput test
The SSS PTS would be able to reveal which is the better SSD. Here’s a sample report on latency.

Once again, it is important to know this and not to take vendors’ numbers at face value. As the SSD market continues to grow, the responsibility lies on both sides of the fence – the vendor and the customer.

 

Betcha don’t encrypt your disks

At the Internet Alliance event this morning, someone from Computerworld gave me a copy of their latest issue. The headline was “Security Incidents Soar”, with the details of the half-year review by CyberSecurity Malaysia.

Typically, the usual incident lists revolve around spam, intrusions, fraud, viruses and so on. However, storage always seems to be missing. As I see it, storage security doesn’t sit well with the security guys. In fact, storage is never the sexy thing and it is usually the IPS, IDS, anti-virus and firewall that get the highlights. So, when we talk about storage security, there is so little to talk about. In fact, in my almost 20 years of experience, storage security was only brought up ONCE!

In security, the most valuable asset is data, and no matter where the data goes, it always lands on …. STORAGE! That is why storage security could be one of the most overlooked pieces in security. Fortunately, SNIA already has this covered. In SNIA’s Solid State Storage Initiative (SSSI), one aspect that was worked on was Self Encrypted Drives (SED).

SED is not new. As early as 2007, Seagate already marketed encrypted hard disk drives. In 2009, Seagate introduced enterprise-level encrypted hard disk drives. And not surprisingly, other manufacturers followed. Today, Hitachi, Toshiba, Samsung, and Western Digital have encrypted hard disk drives.

But there were prohibitive factors that dampened the adoption of self-encrypted drives. First of all, there was the cost. SEDs were expensive a few years ago. There was (and still is) a lack of understanding of the difference between hardware-based Self Encrypted Drives (SED) and software-based encryption. And as SEDs were manufactured, some had proprietary implementations, which did not help promote the adoption of SEDs.

As data travels from one infrastructure to another, data encryption can be implemented at different points. As the diagram below shows, encryption can be put in place at the software level, at the OS level, at the HBA, or in the network itself. It can also happen at the switch (network or fabric), at the storage array controller or at the hard disk level.

EMC’s multipathing software, PowerPath, has an encryption facility to ensure that data is encrypted on its way from the HBA to the EMC CLARiiON storage controllers.

The “bump-in-the-wire” appliance is a bridge device that applies encryption to the data before it reaches the storage. Recall that NetApp had a FIPS 140 certified product called Decru DataFort, which basically encrypted NAS and SAN traffic en route to the NetApp FAS storage array.

And according to SNIA SSSI member Tom Coughlin, SED makes more sense than software-based security. How does SED work?

First of all, SED works with 2 main keys:

  • Authentication Key (AK)
  • Drive Encryption Key (DEK)
The DEK is the most important component, because it is a symmetric key that encrypts and decrypts data on the HDDs or SSDs. This DEK is not for any Tom (sorry Tom), Dick and Harry. In order to gain access to the DEK, one has to be authenticated, and the authentication is completed by having the right authentication key (AK). Usually the AK is based on 128/256-bit AES or DES and the DEK is of a higher bit range. The diagram below shows the AK and DEK in action:

Because SED encryption occurs at the drive level, it is significantly simpler to implement, with lower costs as well. For software-based encryption, one has to set up some form of security architecture. IPSec comes to mind. This is not only more complex, but also more costly to implement. Since it is software, the risk of security compromise is higher, meaning the security model is less secure when compared to SED. The DEK of the SED does not leave the array, and if the DEK is implemented within the disk enclosure or the security module of an SoC (System-on-Chip), this makes it even more secure than software-based encryption. Also, the DEK is kept away from the CPU and memory, thus removing these components as a potential attack vector that could compromise the data on the disk drives.
Furthermore, software-based encryption takes up CPU cycles, thus slowing down overall performance. In the Tom Coughlin study, based on both SSDs and HDDs, SED outperforms software-based encryption every time. Here’s a table from that study:

Another security concern is data erasure. According to an old IBM study, about 90% of retired HDDs still have data that is readable. That means that the data erasure techniques used are either not implemented properly or simply not good enough. For us in the storage industry, an effective but time-consuming technique is to overwrite the entire disk with 1s before reusing it. But to hackers, there are ways to “undelete” these bits and make the data readable again.

SED provides crypto erasure that is both effective and very quick. Since the data encryption key (DEK) is used to encrypt and decrypt data, the DEK can be changed and renewed in a split second, making the existing content of the disk drive unreadable. The diagram below shows how crypto erasure works:
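The AK/DEK relationship and crypto erasure can be illustrated with a small toy model. Please treat this strictly as a sketch: the ToySED class and its methods are hypothetical names of mine, and the “cipher” is a simple XOR keystream built from SHA-256 so that the example runs with nothing but the Python standard library. A real SED uses AES in hardware and nothing like this should be used to protect actual data.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream (NOT real AES)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToySED:
    """Hypothetical drive: the DEK only ever exists wrapped by the AK."""

    def __init__(self, ak: bytes):
        self._ak_digest = hashlib.sha256(ak).digest()
        self._wrapped_dek = keystream_xor(ak, secrets.token_bytes(32))
        self._media = {}  # LBA -> ciphertext, as stored on the platters/flash

    def _unlock(self, ak: bytes) -> bytes:
        # Authentication: only the right AK can unwrap the DEK.
        if not secrets.compare_digest(hashlib.sha256(ak).digest(),
                                      self._ak_digest):
            raise PermissionError("wrong authentication key")
        return keystream_xor(ak, self._wrapped_dek)

    def write(self, ak: bytes, lba: int, plaintext: bytes) -> None:
        dek = self._unlock(ak)
        self._media[lba] = keystream_xor(dek + lba.to_bytes(8, "big"), plaintext)

    def read(self, ak: bytes, lba: int) -> bytes:
        dek = self._unlock(ak)
        return keystream_xor(dek + lba.to_bytes(8, "big"), self._media[lba])

    def crypto_erase(self, ak: bytes) -> None:
        # Replace the DEK; the old ciphertext stays on the media but can
        # no longer be decrypted.
        self._unlock(ak)
        self._wrapped_dek = keystream_xor(ak, secrets.token_bytes(32))


drive = ToySED(ak=b"my-authentication-key")
drive.write(b"my-authentication-key", 0, b"secret payroll data")
before = drive.read(b"my-authentication-key", 0)

drive.crypto_erase(b"my-authentication-key")
after = drive.read(b"my-authentication-key", 0)

print("Readable before erase:", before == b"secret payroll data")
print("Readable after erase: ", after == b"secret payroll data")
```

Notice that crypto_erase never touches the stored data at all. It only swaps the key, which is why it completes almost instantly regardless of the drive capacity.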

Data security is already at its highest alert and SEDs are going to be a key component in the IT infrastructure. The open and common standards are coming together, thanks to the efforts of many bodies including SNIA. At the same time, product certifications are coming up and, more importantly, the price of SEDs has come down to a level that is almost on par with normal, non-encrypted drives.

Hackers and data thieves are getting smarter all the time, and yet the security of the most important place, where the data rests, is the least considered. SNIA and other bodies hope to create more awareness and seek greater adoption of self encrypted drives. We hope you will help spread the word too. Betcha are thinking twice now about encrypting the data on your disk drives.

Having fun with your storage vendor and get the information to fit your data center

I was on my way to Singapore yesterday. At the departure lounge, I started reading “Data Center Storage” by Hubbert Smith (ISBN#: 978-1439834879) and I learned something very interesting immediately. Then my thoughts started stirring and I thought I would have a bit of fun with what I have learned from the book.

The single, most significant piece of the storage solution is the hard disk drive (HDD). Regardless of SAN or NAS protocols, the data is stored and served from the hard disk drives. And there are 4 key metrics of a HDD, which are

  • Price
  • Performance
  • Capacity
  • Power

As storage professionals, we are often challenged to deliver the best storage solution to meet the customer’s requirements. Therefore, it is not about providing the fastest IOPS or the best availability or the lowest price. It is about providing the best balance of the 4 key metrics above.

The 4 metrics are of little help when they stand alone, but if they are combined in relation to each other, you, as a customer, can obtain some measurable ratios that will be useful for sizing to a requirement, keeping the balance of the 4 key metrics better defined rather than getting fluff and BS from the storage vendor.

In the book, the following table was displayed and I found it to be extremely useful:

Key Ratios for HDDs

Ratio                Unit
Performance/Price    IOPS/$
Performance/Power    IOPS/watt
Capacity/Price       GB/$
Capacity/Power       GB/watt

The relational ratios (IOPS/$, IOPS/watt, GB/$ and GB/watt) are going to be useful in determining the right type of storage for the requirement. We will come back to this later. We begin our quest to obtain the information that we want – Performance, Capacity, Price, Power.

Capacity is the easy one because the size of the HDDs is a given fact.

IOPS for each type of HDDs is also easy to obtain. See table below:

Disk Type    RPM       IOPS Range
SATA         5,400     50-75
SATA         7,200     75-100
SAS/FC       10,000    100-125
SAS/FC       15,000    175-200
SSD          N/A       5,000-10,000

The wattage of each HDD is also quite easy. Just ask the vendor to give the specification of the HDDs.

The pricing part is where we can have a bit of fun with the storage vendor. Usually, storage vendors do not release the price of a single HDD in the quotation. The total price is lumped together with everything else, making it harder to decipher the price. So, what can the customer do?

Easy. Get 4-5 quotations from the storage vendor, each with a different type of HDD. This is the customer’s right. For example, I have created several fictitious quotations, each with a different type of HDD/SSD and pricing.

Quote #1 (SATA 7200 RPM)

Quote #2 (SAS 10,000 RPM)

Quote #3 (SAS 15,000 RPM)

Quote #4 (SSD)

From the 4 quotations, we cannot ascertain the true price of a single disk, but we can assume that the 12 HDDs/SSDs take up 50% of the entire quotation. With all things being equal, especially the quantity of 12, we can establish a very rough estimate of the price. Having fun asking the storage vendor to run around with the quotations is the added bonus.

But we can derive the following figures (rough estimates but useful when we apply them to the key ratios above)

1TB SATA = 3333.33; 300GB 10,000 RPM SAS = 5000.00; 300GB 15,000 RPM SAS = 6250.00; 100GB SSD = 10416.66
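The rule of thumb behind these figures (12 drives assumed to make up 50% of the quotation) boils down to a one-line formula. Here is a small sketch of it. The function name is mine, and the quote total of 80,000 is a hypothetical value picked only because it reproduces the 1TB SATA figure above; it is not taken from the actual fictitious quotations.

```python
def estimated_drive_price(quote_total: float,
                          drive_count: int = 12,
                          drive_share: float = 0.5) -> float:
    """Rough per-drive price, assuming the drives account for
    `drive_share` of the whole quotation."""
    return quote_total * drive_share / drive_count


# Hypothetical quote total, chosen to match the 1TB SATA estimate.
sata_quote_total = 80_000
sata_price = estimated_drive_price(sata_quote_total)
print(f"Estimated price per 1TB SATA drive: {sata_price:.2f}")
```

If you believe the drives are closer to 40% or 60% of the quotation, change drive_share and see how much the estimate moves. It is a crude number, but it is consistent across quotations, which is what matters for comparing drive types.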

When we juxtapose the information that we have collected, i.e. price, performance and capacity (ok, I am skipping power/watt because I am too lazy to find out), we come up with the table below:

 

 

In the boxed area, we can now easily determine which HDDs/SSDs give the best value for money in either Performance/$ or Capacity/$. The higher the key ratio, the better the value.
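For readers who want to redo the arithmetic themselves, the two key ratios can be computed from the rough prices derived earlier and the IOPS table. This is a sketch under stated assumptions: I use the low end of each IOPS range, I treat 1TB as 1,000GB, and the dictionary layout and function name are mine. The exact numbers in the original table may differ depending on which IOPS values were used.

```python
# Rough per-drive prices derived from the quotations (currency not stated).
# IOPS values are the LOW end of each range in the IOPS table (assumption).
drives = {
    "1TB SATA 7,200 RPM":   {"price": 3333.33,  "gb": 1000, "iops": 75},
    "300GB SAS 10,000 RPM": {"price": 5000.00,  "gb": 300,  "iops": 100},
    "300GB SAS 15,000 RPM": {"price": 6250.00,  "gb": 300,  "iops": 175},
    "100GB SSD":            {"price": 10416.66, "gb": 100,  "iops": 5000},
}


def key_ratios(drive: dict) -> dict:
    """Performance/Price and Capacity/Price for a single drive type."""
    return {
        "iops_per_dollar": drive["iops"] / drive["price"],
        "gb_per_dollar": drive["gb"] / drive["price"],
    }


ratios = {name: key_ratios(spec) for name, spec in drives.items()}

for name, r in ratios.items():
    print(f"{name:<22} IOPS/$ = {r['iops_per_dollar']:.4f}   "
          f"GB/$ = {r['gb_per_dollar']:.4f}")

best_performance_value = max(ratios, key=lambda n: ratios[n]["iops_per_dollar"])
best_capacity_value = max(ratios, key=lambda n: ratios[n]["gb_per_dollar"])
print("Best Performance/$:", best_performance_value)
print("Best Capacity/$:   ", best_capacity_value)
```

With these inputs, the SSD comes out on top for Performance/$ and the 1TB SATA drive for Capacity/$, which is the kind of trade-off the key ratios are meant to expose.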

From this, the customer can now determine methodically which type of disk to invest in, in order to get the best value.

This is just a very simplistic method to find the value of the storage solution to be purchased. Bear in mind that there are many other factors to consider as well, such as rack unit height, total power consumption, storage efficiency, data protection and many more.

I am not taking credit for what Hubbert Smith has proposed. All kudos to him, but I am using his method and applying it to what is relevant to us in the field.

In conclusion, the customer won’t be baffled and confused into thinking that they got the best deal at the lowest price or the fastest performance. This crude method can help turn perception into something that is more concrete and analytical. It’s time we, as customers, know our rights, know what we are buying into and have a bit of fun too with the storage vendor.

Copy-on-Write and SSDs – A better match than other file systems?

We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems are there to ensure a few key features:

  • Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
  • Simple naming convention for files
  • Performance in storing and retrieving files – hence our write and read I/Os
  • Resilience in restoring full or part of a file when there are discrepancies

In file system performance design, one of the most important factors is locality. By locality, I mean that the data blocks of a particular file should be as close to each other as possible. Hence, most file system designs that originated from the Berkeley Fast File System (BFFS) require the file system to seek to the data block to be modified to ensure locality, i.e. you try not to split up the contiguity of the data blocks. Seeking to the required data block takes time, but you are compensated with faster reads because the read-ahead feature allows you to read extra blocks ahead in anticipation that the data blocks are related.

In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present because the newly modified block is written somewhere else, not at the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminates the seek time and it also skips the read-modify-write action at the original location of the data block. Therefore, writes are likely to be faster.
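Here is a minimal sketch of the copy-on-write idea. It is a toy model of my own and has nothing to do with how WAFL or ZFS are actually implemented: the ToyCowStore class and its fields are hypothetical. The “disk” is an append-only list of physical blocks, and a block map translates the logical block numbers of a file to physical locations.

```python
class ToyCowStore:
    """Toy copy-on-write store: updates never overwrite a block in place."""

    def __init__(self, blocks):
        self.disk = list(blocks)                     # physical blocks
        self.block_map = list(range(len(blocks)))    # logical -> physical

    def update(self, logical_block: int, data: str) -> None:
        # Write the new data to a fresh location (no seek back to the
        # original block, no read-modify-write), then repoint the map.
        self.disk.append(data)
        self.block_map[logical_block] = len(self.disk) - 1

    def read_file(self):
        # Reads follow the map, so they may hop around the disk.
        return [self.disk[physical] for physical in self.block_map]


store = ToyCowStore(["a0", "a1", "a2", "a3"])
print("Layout before update:", store.block_map)

store.update(1, "b1")
print("Layout after update: ", store.block_map)
print("File contents:       ", store.read_file())
print("Old block still on disk:", store.disk[1])
```

After a single update, the physical layout of the file is no longer contiguous. That is the price paid for the quick write, and the old block left behind is also what makes snapshots cheap in this style of file system.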

However, the read portion will be slower because if you want to read a file, the file system has to go around looking for the data blocks because it lacks the locality. Therefore, as the COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make the file systems one of the better storage file systems in the market.

So, that’s Copy-on-Write file systems. But what about SSDs?

Solid State Drives (SSDs) will make enemies of file systems that tend to prefer locality. Remember that some file systems prefer their data blocks to be contiguous? Well, SSDs employ “wear-leveling” and require writes to be spread out as much as possible across the SSD device to reduce “wear-and-tear” and prolong the life of the device. That’s not good news because SSDs just told the file systems, “I don’t like locality and I will spread out the data blocks”.

NAND Flash SSDs (the common ones we find in the market, not DRAM-based SSDs) are funny creatures. When you write to SSDs, you must ERASE first, then WRITE AGAIN to the SSDs. This is the part that creates the wear-and-tear of the device. What I mean by ERASE first, WRITE AGAIN is described below:

  • Writing 1 –> 0 (OK, no problem)
  • Writing 0 –> 1 (not OK, because NAND Flash can’t do that)

So, what does the SSD do? It ERASES first, setting all the bits of the affected data blocks on the device to 1s, and then converts some of them to 0s. Crazy, isn’t it? The firmware in the SSD’s controller will also spread out the erase-and-then-write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.
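The erase-before-write rule and wear-leveling can be sketched in a few lines. Again, this is a toy model under my own assumptions: ToyNandBlock and pick_block_for_write are hypothetical names, each “page” is just one byte, and real SSD firmware is far more sophisticated than picking the least-erased block.

```python
class ToyNandBlock:
    """Toy NAND erase block: programming can only flip bits from 1 to 0."""

    def __init__(self, pages: int = 4):
        self.pages = [0xFF] * pages   # erased state: every bit is 1
        self.erase_count = 0

    def program(self, page: int, value: int) -> None:
        # Allowed only if no bit needs to go from 0 back to 1.
        if (self.pages[page] & value) != value:
            raise ValueError("needs a 0 -> 1 change; erase the block first")
        self.pages[page] = value

    def erase(self) -> None:
        # Erase resets the WHOLE block to 1s and costs one wear cycle.
        self.pages = [0xFF] * len(self.pages)
        self.erase_count += 1


def pick_block_for_write(blocks):
    """Naive wear-leveling: choose the block that has been erased least."""
    return min(blocks, key=lambda block: block.erase_count)


block = ToyNandBlock()
block.program(0, 0b10101010)      # 1 -> 0 only: fine
block.program(0, 0b10100000)      # clearing more bits: still fine

try:
    block.program(0, 0b11111111)  # would need 0 -> 1: refused
except ValueError as err:
    print("Write refused:", err)

block.erase()
block.program(0, 0b11111111)      # works after the erase
print("Erase cycles used on this block:", block.erase_count)

fresh = ToyNandBlock()
print("Next write goes to the fresh block:",
      pick_block_for_write([block, fresh]) is fresh)
```

Every rewrite that needs a 0 to become a 1 costs an erase cycle, so spreading those erases across all the blocks is what keeps any single block from wearing out early.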

Since SSDs shun locality and avoid keeping data blocks nearby, and Copy-on-Write file systems are already doing this because it is in their nature to write new data blocks somewhere else, the combination of a COW file system and SSDs seems like a very good fit. It even looks symbiotic because it is a case of “I help you; and you help me”.

From this perspective, the benefits of combining COW file systems and SSDs extend beyond the resiliency of the SSD device to performance as well. Since the data blocks are spread out at different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Makes sense, doesn’t it?

I have not learned about other file systems and how they behave with SSDs, but it is pretty clear that Copy-on-Write file systems work well with Solid State Devices. Have a good week ahead :-)!

What kind of IOPS and throughput do you get from RAID-5/6? – Part 2

In my previous blog entry, I mentioned the write penalty for RAID-5/6. This factor will figure heavily in the way we size the RAID-level for performance capacity planning.

It is difficult to ascertain what kind of IOPS and throughput are required for an application, especially a database, to run well with additional room to grow. A DBA or an application developer, I believe, would have adequate information to tell how many users the application can support, both average and peak, the transactions per second (TPS), the block size required for logs, database files and so on.

But as we are all aware, most of the time, this type of information is not readily available. So, coming from a storage angle, the storage administrator can advise the DBA or the application developer that the configured RAID group or volume or LUN is capable of delivering a certain number of IOPS and is able to achieve a certain throughput in MB/sec. These numbers come straight off the box itself. Of course, other factors such as HBA speed, the FC/iSCSI configurations, the network traffic and so on will affect the overall performance delivery to the application. But we can safely inform the DBA and/or the application developer that this is what the storage is delivering out of the box.

The building blocks of all storage RAID groups/volumes/LUNs are pretty much your hard disk drives (HDDs) and/or Solid State Drives (SSDs). The manufacturers of these disks will usually publish the IOPS and throughput of individual drives, but if this information is not available, we can construct the IOPS of an individual HDD from its seek and latency times.

For example, if the HDD has

average latency = 2.8 ms;          average read seek = 4.2 ms;              average write seek = 4.8 ms

then the IOPS can be calculated as

                                  1
         IOPS = ---------------------------------------
                (average latency) + (average seek time)

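Therefore, from the details above, and taking the average seek time as the mean of the read and write seeks, (4.2 + 4.8) / 2 = 4.5 ms,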

                    1
         IOPS = -------------------  = 136.986 IOPS
                (0.0028) + (0.0045)
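
The calculation above can be sketched in Python. The function name `hdd_iops` is just a label chosen for this example, and the seek time is assumed to be the mean of the read and write seeks.

```python
def hdd_iops(avg_latency_ms, avg_seek_ms):
    """Estimate IOPS for one drive as 1 / (average latency + average seek time)."""
    service_time_s = (avg_latency_ms + avg_seek_ms) / 1000.0  # milliseconds to seconds
    return 1.0 / service_time_s


avg_seek_ms = (4.2 + 4.8) / 2                     # mean of read and write seek = 4.5 ms
print(round(hdd_iops(2.8, avg_seek_ms), 3))       # 136.986
```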

That’s pretty simple, right? But of course, it is easier to just accept that a certain type of disk will have a range of IOPS as shown in the table below:

Disk Type   RPM      IOPS Range
SATA        5,400    50-75
SATA        7,200    75-100
SAS/FC      10,000   100-125
SAS/FC      15,000   175-200
SSD         N/A      5,000-10,000

The information in the table above is for reference only and is by no means very accurate, but it is good enough for us to determine the IOPS of a RAID group/volume/LUN. Let’s look at the RAID write penalty again in the table below:

RAID-level     Number of I/Os for Reads   Number of I/Os for Writes   RAID Write Penalty
0              1                          1                           1
1 (1+0, 0+1)   1                          2                           2
5              1                          4                           4
6              1                          6                           6

Next, we need to know the ratio of Reads vs Writes for that particular database or application. I mentioned earlier that for OLTP-type applications, we usually take a 2:1 or 3:1 ratio in favour of Reads.

To make things simpler, let’s assume we create a RAID-6 volume of 6 data disks and 2 parity disks in a RAID-6 (6+2) configuration. The disks used are SATA disks of 7,200 RPM, with each individual disk of 100 IOPS. Assume we are using a ratio of 2:1 in favour of Reads, which gives us 66.666% and 33.333% respectively for Reads and Writes.

Therefore, the combined IOPS of the 8 disks in the RAID-6 configuration is probably about 800 IOPS. However, because of the write penalty of RAID-6, the effective IOPS for the RAID-6 volume will be lower than that. Let’s do some calculation to see what happens:

1)  Read IOPS + Write IOPS = 800 IOPS

2)  (0.66666 x 800) + (0.33333 x 800) = 800 IOPS

3) Read IOPS will be 0.66666 x 800 = 533.328 IOPS

4) Write IOPS will be 0.33333 x 800 = 266.664 IOPS. However, since RAID-6 has a write penalty of 6, this number has to be divided by 6. 266.664/6 will be 44.444 IOPS for Writes

Therefore, what the RAID-6 volume is capable of is approximately 533 IOPS for Reads and 44 IOPS for Writes.
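
Here is a minimal Python sketch of the four steps above, following the same simplified method used in this post. The function name `effective_iops` and the dictionary name are assumptions made for the example, and the penalties come from the write penalty table.

```python
# RAID write penalties from the table in this post, keyed by RAID level.
RAID_WRITE_PENALTY = {"0": 1, "1": 2, "5": 4, "6": 6}


def effective_iops(disks, iops_per_disk, read_fraction, raid_level):
    """Split the raw IOPS by the read/write ratio, then divide writes by the penalty."""
    raw = disks * iops_per_disk                      # 8 x 100 = 800 in the example
    read_iops = raw * read_fraction
    write_iops = raw * (1 - read_fraction) / RAID_WRITE_PENALTY[raid_level]
    return read_iops, write_iops


reads, writes = effective_iops(disks=8, iops_per_disk=100, read_fraction=2 / 3, raid_level="6")
print(round(reads), round(writes))                   # 533 44
```

Changing `raid_level` to "5" or "1" makes it easy to see how much of the write budget each RAID level gives back.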

We have determined the IOPS for the RAID volume, but what about throughput? Throughput is determined by the block size used. Assume that our RAID-6 volume uses a 4 KB block size. With a combined effective IOPS of 577 (533+44), we multiply the IOPS by the block size:

     Throughput = 577 IOPS x 4 KB
                = 2308 KB/sec

Therefore, when I/O is sustained in a sequential manner, the effective throughput is 2308 KB/sec.
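
The same multiplication, as a small Python sketch. The function name is an assumption for this example, and the MB/sec conversion assumes 1 MB = 1024 KB.

```python
def throughput_kb_per_sec(iops, block_size_kb):
    """Throughput is simply IOPS multiplied by the block size."""
    return iops * block_size_kb


kb_per_sec = throughput_kb_per_sec(iops=577, block_size_kb=4)
print(kb_per_sec)                      # 2308
print(round(kb_per_sec / 1024, 2))     # 2.25 (MB/sec)
```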

On the other hand, we are often told to add more spindles to the volume to increase the IOPS. This is true, to a point, where the maximum amount of IOPS that can be delivered will taper into a flatline, because the I/O channel to the RAID volume has been saturated. Therefore, it is best to know that adding more spindles does not always equate to higher IOPS.

Performance sizing for a database or an application is both a science and an art. Mathematically, we can prove things to a certain amount of accuracy and confidence, but each storage platform is very different in the way it handles RAID. Newer storage platforms have proprietary RAID implementations, so nowadays it matters less which RAID level is nominally best for the application. Vendors such as IBM with XIV have RAID-X, which is radical in both design and implementation. NetApp will almost always say RAID-DP is the best no matter what, because RAID-DP is all NetApp.

So there is no right or wrong in choosing the RAID-level for the application. But it is VERY important to know what the best practices are, and my advice to everyone is to do Proofs-of-Concept, and TEST, TEST, TEST! And ASK QUESTIONS!

SSDs coming into mainstream … be Ready!

There has been a slew of SSD news in the storage blogosphere with the big one from eBay.

eBay has just announced that it has 100TB of SSDs from Nimbus Data Systems. On top of that, OCZ, SanDisk and STEC, all major SSD manufacturers, have announced a whole lot of new products, with PCIe SSD cards leading the way. The most interesting thing is that the $/GB has gone down significantly, getting very close to the $/GB of spinning disks. This is indeed good news for the industry because SSDs deliver low latency, high IOPS, low power consumption and many other new benefits.

Side note: As I am beginning to understand more about SSDs, I found out that NAND flash SSDs have latency in the microseconds range, compared to spinning HDDs, which have latency in the milliseconds range. In addition, DRAM SSDs have latency in the range of nanoseconds, which is basically memory-type access. DRAM SSDs are, of course, more expensive.

SSDs are coming into the mainstream very soon, and this will, inadvertently, drive a new generation of applications and accelerate growth in knowledge acquisition. We are already seeing the decline of Fibre Channel disks and the rise of SAS and SATA disks, but SSDs in enterprise storage, as far as I am concerned, bring forth 2 new challenges which we, as professionals and users in the storage networking environment, must address.

These challenges can be simplified to

  1. Are we ready?
  2. Where is the new bottleneck?

To address the first challenge, we must understand the second challenge first.

In system architectures, we know of various performance bottlenecks that exist in the CPU, memory, bus, bridge, buffer, I/O devices and so on. In order to deliver the data to be processed, we have to view the data block/byte service request in its entirety.

When a user requests a file, this is a service request. The end objective is that the user is able to read and write the file he/she requested. The time taken from the beginning of the request to the end of it is known as the service time, of which latency is a big part. We assume that the file resides in a NAS system in the network.

The request for the file begins by going through the file system layer of the host the user is accessing, then through the user and kernel space, moving on through the device driver of the NIC card, through the TCP/IP stack (which has its own set of buffer overheads and so on), and passing the request through the physical wire. From there it moves on through the NAS system with its RAID system, file system and so on until it reaches the requested file. Note that I have shortened the entire process for simple explanation, but it shows that the service request passes through a whole lot of things in order to complete.

Bottlenecks exist everywhere within the service request path and are also subject to external factors related to that service request. For a long, long time, I/O has been the biggest bottleneck to the processing of the service request because it is almost always the slowest component in the entire scheme of things.

The introduction of SSDs will improve the I/O performance tremendously, into the micro- or even nanosecond range, putting it on almost equal performance terms with other components in the system architecture. The buses and the bridges in the computer systems could be the new locations where the bottleneck of a service request exists. Hence we have to use this understanding to change the modus operandi of the existing types of applications such as databases, email servers and file servers.
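
To picture the shift, here is a rough Python sketch that treats service time as the sum of the latencies along the request path. Every number below is a hypothetical placeholder chosen only to illustrate the idea, not a measurement; the only figure tied to this blog is the 7,300 microseconds for the HDD, which is the latency plus seek time from the RAID sizing post above. The stage names and the helper `bottleneck` are likewise assumptions.

```python
# Hypothetical per-stage latencies in microseconds; illustrative only, not measurements.
path_hdd = {
    "host file system and kernel": 30,
    "NIC driver and TCP/IP stack": 80,
    "physical wire": 50,
    "NAS RAID and file system": 150,
    "storage media": 7300,          # HDD: 2.8 ms latency + 4.5 ms seek
}

# Same path, but the media is now flash with an assumed 100 microsecond latency.
path_ssd = {**path_hdd, "storage media": 100}


def service_time_us(path):
    """Service time is the sum of the latency of every stage in the path."""
    return sum(path.values())


def bottleneck(path):
    """The bottleneck is the stage contributing the most latency."""
    return max(path, key=path.get)


for name, path in (("HDD", path_hdd), ("SSD", path_ssd)):
    print(name, service_time_us(path), "us, bottleneck:", bottleneck(path))
```

With these made-up numbers the media dominates the HDD path, while in the SSD path the slowest stage is somewhere else entirely, which is the whole point of the second challenge.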

The usual tried-and-tested best practices may have to be changed to adapt to the shift of the bottleneck.

So, we have to equip ourselves with what SSDs are doing and will do to the industry. We have to be ready and take advantage of this “quiet” period to learn and know more about SSD technology and what the experts are saying. I found a great website that introduces and speaks about SSDs in depth. It is called StorageSearch and it is what I consider the best treasure trove on the web right now for SSD information. It is run by a gentleman named Zsolt Kerekes. Go check it out.

Yup, we must get ready for when SSDs hit the mainstream, and ride the wave.

Can snapshots replace traditional backups?

Backup is a necessary evil. In IT, every operator, administrator, engineer, manager, and C-level executive knows that you have got to have backup. When it comes to the protection of data and information in a business, backup is the only way.

Backup has also become the bane of IT operations. Every product that is out there in the market is trying to cram as much production data to backup as possible just to fit into the backup window. We only have 24 hours in a day, so there is no way the backup window can be increased unless

  • You reduce the size of the primary data to be backed up – think compression, deduplication, archiving
  • You replicate the primary data to a secondary device and backup the secondary device – which is ironic because when you replicate, you are creating a copy of the primary data, which technically is a backup. So you are technically backing up a backup
  • You speed up the transfer of primary data to the backup device

Either way, IT operations is trying to overcome the challenges of the backup window. And the whole purpose of backup is to be cock-sure that data can be restored when it comes to recovery. It’s like insurance. You pay the premium so that you are able to use the insurance facility to recover in times of need. We have heard that analogy many times before.
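
The levers in the list above boil down to simple arithmetic: the time a backup takes is the amount of data divided by the transfer rate. Here is a rough Python sketch; the function name `backup_hours`, the 10 TB data set and the 200 MB/sec rate are hypothetical numbers picked only for illustration.

```python
def backup_hours(data_tb, throughput_mb_per_sec, reduction_ratio=1.0):
    """Hours needed to move data_tb of primary data to the backup device.

    reduction_ratio > 1 models shrinking the data first (compression, dedupe, archiving).
    """
    data_mb = data_tb * 1024 * 1024 / reduction_ratio
    return data_mb / throughput_mb_per_sec / 3600


print(round(backup_hours(10, 200), 1))                       # 14.6 hours as-is
print(round(backup_hours(10, 200, reduction_ratio=2.0), 1))  # 7.3 hours with 2:1 reduction
print(round(backup_hours(10, 400), 1))                       # 7.3 hours with a 2x faster transfer
```

Halving the data and doubling the transfer speed buy the same amount of time, which is why products attack the backup window from both directions.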

On the flip side of the coin, a snapshot is also a backup. Snapshots are point-in-time copies of the primary data and, many a time, snapshots are taken and then used as the source of a “true” backup to a secondary device, be it disk-based or tape-based. However, until the last couple of years, snapshots have suffered from the perception that they are a pseudo-backup.

Here is some food for thought …

WHAT IF we eliminate backing up data to a secondary device?

WHAT IF IT operations is ready to embrace snapshots as the true backup?

WHAT IF we rely on snapshots for backup and replicated snapshots for disaster recovery?

First of all, it will solve the perennial issues of backup to a “secondary device”. The operative word here is the “secondary device”, because that secondary device is usually external to the primary storage.

Tape subsystems and tape are constantly being ridiculed as the culprit of missing backup windows. Duplications after duplications of the same set of files in every backup set triggered the adoption of deduplication solutions from Data Domain, Avamar, PureDisk, ExaGrid, Quantum and so on. Networks are also blamed because network backup runs through the LAN. LANless backup will use another conduit, usually Fibre Channel, to transport data to the secondary device.

If we eliminate the “secondary device” and perform backup in the primary storage itself, then networks are no longer part of the backup. There is no need for deduplication because the data could already have been deduplicated and compressed in the primary storage.

Note that what I have suggested is to back up, compress and dedupe, AND also restore from the primary storage. There is no secondary storage device for backup, compression, dedupe and restore.

Wouldn’t that paint a better way of doing backup?

Snapshots will be the only mechanism for backup. Snapshots are quick, usually taking minutes and some taking seconds. Most snapshot implementations today are space efficient, consuming storage only for delta changes. The primary device will compress and dedupe, depending on the data’s characteristics.

For DR, snapshots are shipped to a remote storage of equal prowess at the DR site, where the snapshot can be rebuilt and be in a ready mode to become primary data when required. NetApp SnapVault is one example. ZFS snapshot replication is another.
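
To show why a snapshot consumes storage only for delta changes, here is a toy copy-on-write snapshot in Python. It is a sketch under simplifying assumptions: the `Volume` class and its methods are invented for this example, and it is not how any particular product implements snapshots.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = dict(enumerate(blocks))   # block number -> current data
        self.snapshots = []                     # each snapshot holds only preserved old blocks

    def snapshot(self):
        """Taking a snapshot copies nothing, which is why it is quick."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        """Before overwriting, preserve the old block in snapshots that lack it."""
        for snap in self.snapshots:
            snap.setdefault(block_no, self.blocks[block_no])
        self.blocks[block_no] = data

    def read_snapshot(self, snap_id, block_no):
        """Unchanged blocks are read from the live volume; changed ones from the snapshot."""
        return self.snapshots[snap_id].get(block_no, self.blocks[block_no])


vol = Volume(["a", "b", "c", "d"])
snap = vol.snapshot()
vol.write(1, "B")                               # only this block changes after the snapshot

print(vol.blocks[1], vol.read_snapshot(snap, 1))   # B b
print(len(vol.snapshots[snap]))                    # 1 block of delta stored
```

Four blocks are protected, yet the snapshot stores just the one block that changed.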

And when it comes to recovery, quick restores of primary data will be from snapshots. If the primary storage goes down, clients and host initiators can be rerouted quickly to the DR device for services to resume.

I believe that with the convergence of multi-core processing power, 10GbE networks, SSDs and very large capacity drives, we could be seeing a shift in the backup design model and possibly the entire IT landscape. Snapshots could very likely replace traditional backup in the near future, and the secondary device may be a thing of the past.