Blog Archives

Storage must go on a diet

Nowadays, the capacities of hard disk drives (HDDs) are really big. 3TB is out and 4TB is on the horizon. What’s next?

For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient, with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.

If I were the customer, why would I buy a storage array, with software licenses and other stuff that not only increase my cost of equipment acquisition and data management, but also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, stripe them with RAID-0 (not a good idea, but to save costs most customers would do just that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about his lack of storage experience.

And RAID isn’t really keeping up with the tremendous growth of HDD capacities either. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue to provide LUN or volume reliability and data availability because it takes too damn long to rebuild the volume after the failure of a disk.

Back in the days when HDDs were less than 500GB, RAID-5 would still hold up, but after passing the 1TB mark, RAID-6 became more prevalent. But now the 1TB has ballooned to 3TB, and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity, but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.

Experts have been speaking about parity declustering, but that’s something only a few vendors have right now. Panasas, founded by one of the forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Carnegie Mellon University’s Parallel Data Lab (PDL) presented a paper about parity declustering more than 10 years ago.

Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in the hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.

Here are a few ways you can consider to put your storage on a diet.

  • Compression
  • Thin Provisioning
  • Deduplication
  • Storage Tiering
  • Tapes and SSDs
To me, compression has not taken the storage world by storm. But then again, there aren’t many vendors that tout compression as a feature for storage optimization. Most of them prefer to push the darling of data reduction, data deduplication, as the main feature to save more space. Theoretically, data deduplication makes more sense when the data is inactive and has a high occurrence of duplicated data. That is why secondary storage such as backup deduplication targets like Data Domain, HP StoreOnce and Quantum DXi can publish 20:1 ratios, and over time, that ratio can get even higher.
NetApp has also been pushing their A-SIS data deduplication on primary storage. Yes, it helps with storage savings on primary, but when higher data transfer rates are needed and the time to access “manipulated” data (deduped or compressed) matters, compression is likely a better choice for primary, active data.
So who has compression? NetApp ONTAP 8.0.1 has compression now, and IBM’s Storwize V7000 started out as a compression device. Read about IBM Storwize in my blog here. Dell has Ocarina Networks, which was recently unleashed. I am a big fan of Ocarina Networks and I wrote about the technology in my previous blog. EMC, during the Celerra days of DART, had compression, but I don’t hear much about it in their VNX. Compression is there, believe me, buried under all the loads of EMC marketing.
Thin Provisioning is now a must-have, standard feature from all storage vendors. What is Thin Provisioning? The diagram below shows you:
In the past, storage systems weren’t so intelligent. You asked for 10TB, you were given 10TB, and that 10TB was “deducted” from the storage capacity. That led to wastage and storage inefficiencies. Today, Thin Provisioning will give you 10TB, but storage capacity is consumed only as it is being used. The capacity is not pre-allocated as in the past. Thin provisioning is a great diet pill for bloated storage projects.
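
To make the idea concrete, here is a minimal, hypothetical sketch (the class and method names are mine, not any vendor's API) of how a thin-provisioned volume only consumes backing capacity as extents are actually written:

    # Hypothetical illustration of thin provisioning -- not any vendor's actual implementation.
    class ThinVolume:
        def __init__(self, advertised_gb):
            self.advertised_gb = advertised_gb   # what the host sees, e.g. 10TB
            self.allocated = set()               # 1GB extents actually backed by physical disk

        def write(self, extent_id):
            # Physical capacity is consumed only when an extent is first written to.
            self.allocated.add(extent_id)

        @property
        def consumed_gb(self):
            return len(self.allocated)

    vol = ThinVolume(advertised_gb=10_000)   # the host is promised 10TB
    for extent in range(500):                # but only 500GB has actually been written
        vol.write(extent)
    print(vol.advertised_gb, "GB advertised,", vol.consumed_gb, "GB actually consumed")
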
Another up-and-coming feature is storage tiering. Storage tiering, when associated with storage optimization, should include hierarchical storage management (HSM) and tape-out as well. Storage optimization solutions should not be offered only within the storage array itself. Storage tiering within the storage array is available from most vendors – IBM Easy Tier, EMC FAST2, Dell Fluid Data Management and many others. But what about data being moved out of the storage array? What about reducing the capacity of the data online or near-line? Why not put it offline if there isn’t a need for it?
I term this as Active Archiving, something I learned while I was at EMC. Here’s a look at EMC’s style of Active Archiving:
Active Archiving promotes the concept of data archiving and is not unique to EMC. Almost all storage vendors, either natively or with 3rd-party products, can perform fairly efficient data archiving in one way or another. One piece of software that I like (and it is not unique!) is Quantum StorNext. Here’s a video of how Quantum StorNext helps reduce the fat of the storage.
With Quantum StorNext’s single-copy sharing feature across multiple disparate OSes, there are fewer duplicate files in storage as well.
Tapes have been getting a bad name in the past few years. Tape has been repositioned and repurposed as an archive medium rather than a backup medium. But tape is the greenest and most powerful storage diet pill around, and we should not discount tapes, because tapes are fighting back. Pretty soon you will be hearing about the Linear Tape File System (LTFS). In a nutshell, LTFS allows you to use the tape almost as if it were a hard disk. You can drag and drop files from your server to the tape, see the list of saved files using a standard operating system directory (no backup software catalog needed), and use point-and-click to restore. How cool is that!
And Solid State Drives (SSDs) make sense as well.
There are times when we need IOPS, and with spinning drives we have to set up many disk spindles to achieve the IOPS that we want. For example, consider the diagram below from the godfather of storage, Greg Schulz.
The set of 16 spinning HDDs on the left can only deliver 3,520 IOPS. The problem is, we have wasted a lot of disk space, as seen in the diagram below. This design, which most customers would be accustomed to, may look cheaper, but in actual fact it is NOT.
If the price of a Fibre Channel HDD is RM2,000, the total for 16 would be RM32,000. That is not inclusive of additional power, cooling, rack space and data management costs. Now assume an SSD costs 5 times more than the Fibre Channel HDD. SSDs are capable of delivering very high IOPS; here I am assuming a modest 5,000 IOPS per SSD. With just 2 SSDs (as the design on the right suggests), the total cost is only RM20,000, with greater performance, room to grow, and savings in data management, power and cooling.
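
Putting that arithmetic into a few lines (the prices and per-drive IOPS figures are the same assumptions used in the paragraph above, not vendor list prices):

    # Rough cost/IOPS comparison using the assumptions in the text above.
    hdd_iops, hdd_price = 220, 2_000       # one FC HDD: ~220 IOPS (3,520 / 16), RM2,000 (assumed)
    ssd_iops, ssd_price = 5_000, 10_000    # one SSD: a modest 5,000 IOPS, at 5x the HDD price
    hdd_count, ssd_count = 16, 2

    print("HDD design:", hdd_count * hdd_iops, "IOPS for RM", hdd_count * hdd_price)   # 3520 IOPS, RM32000
    print("SSD design:", ssd_count * ssd_iops, "IOPS for RM", ssd_count * ssd_price)   # 10000 IOPS, RM20000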
Folks, consider SSDs as part of your storage diet plan.
All these features are available, in whole or in part, as part of the storage technology offerings that are out there. With all that said, are you doing something about it? Get off your lazy bum, start managing your storage, and put your storage on a diet!!!

The recipe for storage performance modeling

Good morning, afternoon, evening, Ladies & Gentlemen, wherever you are.

Today, we are going to learn how to bake, errr … I mean, make a storage performance model. Before we begin, allow me to set the stage.

Don’t you just hate it when you are asked to do storage performance sizing and you don’t have a freaking idea how to get started? A typical techie would probably say, “Aiya, just use the capacity lah!”, and usually, they will proceed to size the storage according to capacity. In fact, sizing by capacity is the worst way to do storage performance modeling.

Bear in mind that storage is not a black box, although some people wish it were. It is not black magic when it comes to performance sizing, because things can be applied in a very scientific and logical manner.

SNIA (the Storage Networking Industry Association) has developed a storage performance modeling methodology (that’s quite a mouthful) and basically simplified it into these few key ingredients. This recipe is for storage performance modeling in general, and I am advising you guys out there to engage your storage vendors’ professional services. They will know their storage solutions best.

And I am going to say to you – don’t be cheap; engage professional services – to get to the experts out there. I was having a chat with a consultant just now at McDonald’s. I have known this friend of mine for about 6-7 years now: Sugen Sumoo, the Director of DBORA Consulting. They specialize in Oracle and database performance tuning and performance forecasting, which is something a typical DBA can’t do, because DBORA Consulting is the professional service that brings expertise and value to Oracle customers. Likewise, you have to engage your respective storage professional services as well.

In a cookbook or a cooking show, you are presented with the ingredients used, and in this recipe for storage performance modeling, the ingredients (in no particular order) are listed below, with a small sketch of how to record them right after the list:

  • Application block size
  • Read and Write ratio
  • Application access patterns
  • Working set size
  • IOPS or throughput
  • Demand intensity
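
If you like to keep these six ingredients in one place while you work through the rest of the recipe, a simple record like the hypothetical one below (my own field names, not a SNIA artifact) is enough:

    from dataclasses import dataclass

    @dataclass
    class WorkloadProfile:
        # The six "ingredients" for storage performance modeling.
        block_size_kb: int        # application block size
        read_pct: float           # read portion of the read/write ratio (0.0 - 1.0)
        access_pattern: str       # "random" or "sequential"
        working_set_gb: float     # hot data that must fit in memory/cache
        demand_iops: int          # required IOPS (or throughput, for large blocks)
        demand_intensity: int     # degree of parallel I/O streams hitting the array

    oltp = WorkloadProfile(8, 0.67, "random", 50, 5_000, 16)   # example values only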

Application Block Size

First of all, the storage is there to serve applications. We always have to look from the applications’ point of view, not the storage’s point of view. Different applications have different block sizes. Databases typically range from 8K-64K and backup applications usually deal with larger block sizes. Video applications can have 256K block sizes or higher. It all depends.

The best way is to find out from the DBA, email administrator or application developers. The unfortunate thing is that most so-called technical people or administrators in Malaysia don’t have a clue about the applications they manage. So, my advice to you storage professionals: do your research on the application and take the default value. These clueless fellas are likely to be running the defaults anyway.

Read and Write ratio

Applications behave differently at different times of the day, and at different times of the month (no, it’s not PMS). At the end of the financial or calendar year, there are some tasks that these applications perform as well. But on a typical day, there is a different weightage, or percentage, of read operations versus write operations.

Most OLTP (online transaction processing) applications tend to be read heavy and write light, but we need to find out the ratio. Typically it can be a 2:1 or a 60%:40% ratio, but it is best to speak to the application administrators about it. DSS (Decision Support System) and data warehousing applications could have much higher reads than writes, while a seismic-analysis application can have multiple writes during the analysis periods. It all depends.

To counter the “clueless” administrators, ask lots of questions. Find out the workflow of several key tasks and ask what those particular tasks do at different checkpoints of the application’s processing. If you are lazy (please don’t be lazy, because it degrades your value as a storage professional), use a rule of thumb.

Application access patterns

Applications behave differently in general. They can be sequential, like backup or video streaming. They can be random like emails, databases at certain times of the day, and so on. All these behavioral patterns affect how we design and size the disks in the storage.

Some RAID levels tend to work well with sequential access and others with random access. It is not difficult to find out about an application’s pattern, and if you read more about the different RAID levels in storage, you can easily identify the RAID levels suitable for each type of behavioral pattern.

Working set size

This variable is a bit more difficult to determine. The working set is the chunk of the application’s data that has to be loaded into a working area, usually memory and cache memory, to be used and abused by the application users.

Unless someone is well versed with the application, one would not be able to determine how much of it needs to be placed in memory and in cache memory. Typically, this can only be determined after the application has been running for some time.

The flexibility of having SSDs, especially the DRAM type of SSDs, is very useful to ensure that there is sufficient “working space” for these applications.

IOPS or Throughput

According to the SNIA model, for I/O sizes smaller than 64K, IOPS should be used as the yardstick for storage performance modeling. For anything larger, use throughput, for which MB/sec is the measurement unit.
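
As a quick sketch of that rule of thumb (the 64K threshold is from the SNIA guidance quoted above; the function itself is just mine for illustration):

    def sizing_metric(block_size_kb):
        # SNIA rule of thumb: size small-block workloads by IOPS, large-block ones by MB/sec.
        return "IOPS" if block_size_kb < 64 else "throughput (MB/sec)"

    print(sizing_metric(8))     # IOPS        -- e.g. a typical OLTP database
    print(sizing_metric(256))   # throughput  -- e.g. video streaming or backup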

The application guy would be able to tell you what kind of IOPS their application is expecting or what kind of throughput they want. Again, ask a lot of questions, because this will help you determine the type of disks and the kind of performance you give to the application guys.

If the application guy is clueless again, ask someone more senior or ask the vendor. If the vendor engineers cannot give you an answer, then they should not be working for the vendor.

Demand intensity

This part is usually overlooked when it comes to performance sizing. Demand intensity refers to how intense the I/O requests are. They could come from 1 channel or 1 part of the application, or they could come from several parts of the application in parallel. It is as if the storage is being ‘bombarded’ by applications, and this is the part that is hard to determine as well.

In some applications, the degree of intensity or parallelism can be tuned; to find out, ask the application administrator or developer. If not, ask the vendor. Also do a lot of research on the application’s architecture.

And one last thing. What I have learned is to add buffers to the storage performance model. Typically I would add about 10-20% extra, but you never know. As storage professionals, I would strongly encourage you to engage professional services, because it is worthwhile, especially in the early stages of sizing. It is usually a more expensive affair to size things after the applications have been installed and are running.
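
The buffering step itself is trivial to show (the 15% below is just one value within the 10-20% range I mentioned):

    def add_headroom(required_iops, buffer_pct=0.15):
        # Pad the modeled requirement by 10-20% so the design has room to breathe.
        return required_iops * (1 + buffer_pct)

    print(add_headroom(5_000))   # 5750.0 -- design for this, not the bare 5,000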

“Failure to plan is planning to fail”.  The recipe isn’t that difficult. Go figure it out.

Don’t just look at disk reliability!

I am sure that many of you in the storage networking industry can relate to this very well.

When 1 or 2 disk drives fail, the customer will usually press you for an answer, and usually this question will pop up: “How come the MTBF is 1.5 million hours but the drive(s) failed after a few months?” We also get asked, “How reliable are the disks?” and “How sure are you that the storage disks I buy will last?”

And for us in this line, we cannot deny that the customer should be better informed (or at least we get cheesed off by these questions). A few blogs ago, I took the easy way out and educated the customer about MTBF (Mean Time Between Failures). That is only a quarter of the story, because MTBF alone does not determine the reliability of the storage ecosystem, and the reliability of the storage ecosystem (which translates to data availability) is what the customer should be asking about, rather than venting their annoyance on you about 1 or 2 disk failures.

I also want to say a little about another disk reliability statistic called AFR. More about that later.

Let’s dig a little deeper into disk MTBF. Disk MTBF is a statistically calculated, pre-production measurement. The key word here is “PRE”, meaning that THIS IS NOT A FIELD-TESTED statistic! It is a statistical likelihood of how long a disk device will last.

One thing to note is how MTBF is derived. In fact, MTBF is established before the entire disk drive line goes into volume production. Typically, there is a process called a Reliability Demonstration Test (RDT). RDT involves putting about 1,000 or more drives into a testing chamber and running them very hard, at elevated temperatures, with 100% I/O for about 6-8 weeks. This is to simulate the harshest of operating environments, and inevitably, some disk drives will fail. From these failures, the MTBF is calculated.
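
The arithmetic behind that derivation is essentially the total powered-on device-hours divided by the number of failures observed (real manufacturers apply statistical confidence adjustments on top of this). A hypothetical example, with made-up numbers just to show the scale:

    # Hypothetical RDT-style MTBF estimate: total device-hours / observed failures.
    drives   = 1_000            # drives in the test chamber
    hours    = 8 * 7 * 24       # an 8-week test, powered on 24x7
    failures = 1                # drives that failed during the test

    mtbf_hours = (drives * hours) / failures
    print(mtbf_hours)           # 1,344,000 hours -- in the 1.2-2.0 million hour ballpark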

An enterprise hard disk drive’s MTBF will usually be between 1.2 million and 2.0 million hours, while consumer-grade drives usually have an MTBF of about 300,000-600,000 hours. Therefore, it is important to educate customers, because customers like to compare home office/SMB storage solutions with the enterprise storage solution you are about to propose to them.

One of the war stories I heard was from a high-definition video production house. They got a contract worth hundreds of thousands of Malaysian Ringgit from a satellite TV content provider. But being less “educated” (which could also be translated as being cheapo), they decided to store their valuable video content on Buffalo NAS storage. Video production environments can be harsh: the I/O stress on the disks is strenuous, and the Buffalo NAS disks crashed. They lost all their content (I don’t know what happened to their backup), were fined hundreds of thousands of Malaysian Ringgit and had their contract terminated on the spot. This is not to say that the Buffalo NAS is a poor product; they just got the wrong product for the job. You can’t expect to race in Formula 1 with an old jalopy, can you? You have to get the right solution for the job, even if it costs more.

So the moral of the story is – “Educate yourself and be prepared to invest if the dollar value of the data is worth more than whatever you think you might be saving.”

Over the years, MTBF (even though it is still very much in use today) has become less and less useful as a reliability measurement. So, what’s better? AFR!

AFR, or Annualized Failure Rate, has been in use for almost 10 years now, and Seagate, the hard disk manufacturer, uses the AFR value heavily. AFR is the percentage of the installed base of hard disk drives that failed and were returned to the factory in a given year. This is a more realistic figure because the statistics come from the field. The typical value for enterprise disk drives is usually between 0.7-1.0%, although a few years ago Google created a splash in the industry when they reported an AFR of 36%. For those who would like to read Google’s paper, click here.
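
MTBF and AFR are related: assuming a constant failure rate, AFR = 1 - e^(-8760/MTBF), which for large MTBF values is roughly the 8,760 hours in a year divided by the MTBF. A quick sketch:

    import math

    def afr_from_mtbf(mtbf_hours):
        # Constant-failure-rate approximation: AFR = 1 - exp(-8760 / MTBF).
        return 1 - math.exp(-8760 / mtbf_hours)

    print(f"{afr_from_mtbf(1_200_000):.2%}")   # ~0.73% for a 1.2M-hour enterprise drive
    print(f"{afr_from_mtbf(500_000):.2%}")     # ~1.74% for a 500K-hour consumer drive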

Therefore AFR is a more reliable measurement of disk reliability than MTBF.

But disk reliability is just a quarter of the story. We need to be out there educating customers about the reliability of the storage ecosystem rather than a specific component. Data availability is paramount, because components will fail throughout the lifecycle of the solution. That is why there are technologies like RAID, snapshots, backup, mirroring and so on to ensure that the data remains available for operations and the business to continue.

Ultimately, if the customer wants to throw the disk MTBF at you, he’s basically shooting at you with the wrong bullet. It’s time you storage networking professionals out there educated the customers.

Using simple MTBF to determine reliability to Finance

The other day, a prospect was requesting quotation after quotation from a friend of mine to make a so-called “apples-to-apples” comparison with another storage vendor. But it was difficult to have that sort of comparison because one guy would propose SAS, the other SATA, and so on. I was roped in by my friend to help. So in the end I asked this prospect which of these 3 criteria mattered to him most – Performance, Capacity or Reliability.

He gave me an answer, and the reliability criterion was leading his requirement. Then he asked me if I could help, in a “quick-and-dirty” manner, use the MTBF (Mean Time Between Failures) of the disks to convince his finance people about the question of reliability.

Well, most HDD vendors publish their MTBF as a measuring stick for the reliability of the disks they manufacture. MTBF is by no means accurate, but it is useful for defining HDD reliability in a crude manner. If you have seen the components that go into an HDD, you would be amazed at the tremendously stressed environment they operate in. The read/write head flies at a height (head gap) above the platter thinner than a human hair, and the servo-controlled technology maintains a constant, never-lagging 7,200/10,000/15,000 RPM day after day, month after month, year after year. And yet we seem to take the HDD for granted, rarely thinking how much technology goes into it at the nanoscale. That’s technology at its best – taking something so complex and making it so simple for all of us.

I found that the Seagate Constellation.2 Enterprise-class 3TB 7200 RPM disk MTBF is 1.2 million hours while the Seagate Cheetah 600GB 10,000 RPM disk MTBF is 1.5 million hours. So, the Cheetah is about 30% more reliable than the Constellation.2, right?

Wrong! There are other factors involved. In order to achieve 3TB usable, a RAID 1 set (average write performance, very good read performance) would require 2 units of 3TB 7,200 RPM disks. On the other hand, using 10,000 RPM disks, with the largest shipping capacity of 600GB, you would need 10 units of such HDDs. RAID-DP (this is NetApp, by the way) would give average write performance (better than RAID 1 in some cases) and very good read performance (for sequential access).

So, I broke down the above 2 examples to this prospect (to achieve 3TB usable)

  1. Seagate Constellation.2 3TB 7200 RPM HDD MTBF is 1.2 million hours x 2 units
  2. Seagate Cheetah 600GB 10,000 RPM HDD MTBF is 1.5 million hours x 10 units

By using a simple calculation of

    RF (Reliability Factor) = MTBF/#HDDs

the prospect will be able to determine which of the 2 HDD types above could be more reliable.

In case #1, the RF is 600,000 hours and in case #2, the RF is 150,000 hours. Suddenly you can see that the Constellation.2 HDDs, which have a lower MTBF, have a higher RF compared to the Cheetah HDDs. Quick and simple, isn’t it?
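
Here is the same quick-and-dirty calculation in a few lines (nothing more than the simple RF formula above):

    def reliability_factor(mtbf_hours, num_drives):
        # The simple "RF = MTBF / #HDDs" rule of thumb used above.
        return mtbf_hours / num_drives

    print(reliability_factor(1_200_000, 2))    # 600000.0 -- case #1, Constellation.2
    print(reliability_factor(1_500_000, 10))   # 150000.0 -- case #2, Cheetah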

Note that I did not bring SAS versus SATA into the mix because it doesn’t matter here. SAS and SATA are merely data channels that drive data in and out of the spinning HDDs. So, folks, don’t be fooled into thinking that a SAS drive is more reliable than a SATA drive. Sometimes they are just the same old spinning HDDs. In fact, the aforementioned Seagate Constellation.2 HDD (3TB, 7,200 RPM) comes with both SAS and SATA interfaces.

Of course, this is just one factor in the whole reliability universe. Other factors such as RAID level, checksums, CRC, and single or dual controllers also determine the reliability of the entire storage array.

In conclusion, we all know that MTBF alone does not determine the reliability of the solution the prospect is about to purchase. But this is one way you can help the finance people get an idea of reliability.

Copy-on-Write and SSDs – A better match than other file systems?

We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems are to ensure a few key features:

  • Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
  • Simple naming convention for files
  • Performance in storing and retrieving files – hence our write and read I/Os
  • Resilience in restoring full or part of a file when there are discrepancies

In file system performance design, one of the most important factors is locality. By locality, I mean that the data blocks of a particular file should be as close together as possible. Hence, most file system designs that originated from the Berkeley Fast File System (FFS) require the file system to seek to the data block to be modified in order to preserve locality, i.e. you try not to split up the contiguity of the data blocks. Seeking to the required data block takes time, but you are compensated with faster reads, because the read-ahead feature allows you to read extra blocks in anticipation that the nearby data blocks are related.

In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present, because the newly modified block is written somewhere else, not at the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminates the seek time and also skips the read-modify-write action on the original location of the data block. Therefore, writes are likely to be faster.
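
A toy sketch of the difference (this is not WAFL or ZFS internals, just the general idea): an in-place file system overwrites the block where it sits, while a COW file system writes the new version to a fresh location and then flips a pointer.

    # Toy model of copy-on-write: never overwrite in place, remap a pointer instead.
    storage   = {}      # physical block address -> data
    block_map = {}      # (file, logical block) -> physical block address
    next_free = 0

    def cow_write(file, logical_block, data):
        global next_free
        storage[next_free] = data                        # new data always lands in a fresh location
        block_map[(file, logical_block)] = next_free     # the old block is simply abandoned
        next_free += 1

    cow_write("db.log", 0, "v1")
    cow_write("db.log", 0, "v2")                 # the update goes to a new physical block
    print(block_map[("db.log", 0)], storage)     # 1 {0: 'v1', 1: 'v2'}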

However, the read portion will be slower, because when you want to read a file, the file system has to go around looking for the data blocks since it lacks locality. Therefore, as a COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make their file systems some of the better storage file systems in the market.

So, that’s Copy-on-Write file systems. But what about SSDs?

Solid State Drives (SSDs) make enemies of file systems that tend to prefer locality. Remember that some file systems prefer their data blocks to be contiguous? Well, SSDs employ “wear-leveling” and require writes to be spread out as much as possible across the SSD device to prolong its life and reduce “wear-and-tear”. That’s not good news, because the SSD has just told the file system, “I don’t like locality and I will spread out the data blocks“.

NAND Flash SSDs (the common ones we find in the market, as opposed to DRAM-based SSDs) are funny creatures. When you overwrite data on an SSD, you must ERASE first, then WRITE AGAIN to the SSD. This is the part that creates the wear-and-tear of the device. By ERASE first, WRITE AGAIN, I mean the following:

  • Writing 1 –> 0 (OK, no problem)
  • Writing 0 –> 1 (not OK, because NAND Flash can’t do that)

So, what does the SSD do? It ERASES the affected block, resetting all the bits in that block to 1s, and then programs some of them back to 0s. Crazy, isn’t it? The firmware in the SSD controller will also spread the erase-and-then-write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.
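
To see why the ERASE step is unavoidable, here is a tiny bit-level sketch (simplified; real NAND works on pages and erase blocks, not nibbles):

    # Simplified NAND rule: programming can only flip bits from 1 to 0;
    # going from 0 back to 1 requires erasing the whole block first.
    def can_program(current_bits, new_bits):
        # Programming is possible only if no bit needs to go from 0 to 1.
        return (current_bits & new_bits) == new_bits

    erased = 0b1111                        # an erased cell group reads as all 1s
    print(can_program(erased, 0b1010))     # True  -- only 1 -> 0 transitions needed
    print(can_program(0b1010, 0b1110))     # False -- one bit would need 0 -> 1, so erase first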

Since SSDs shun locality and avoid keeping data blocks nearby, and Copy-on-Write file systems already do this because it is their nature to write new data blocks somewhere else, the combination of a COW file system and SSDs seems like a very good fit. It even looks symbiotic, because it is a case of “I help you; you help me“.

From this perspective, the benefit of pairing COW file systems with SSDs extends beyond the resiliency of the SSD device to performance as well. Since the data blocks are spread out across different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Makes sense, doesn’t it?

I have not studied how other file systems behave with SSDs, but it is pretty clear that Copy-on-Write file systems work well with Solid State Drives. Have a good week ahead :-)!

What kind of IOPS and throughput do you get from RAID-5/6? – Part 2

In my previous blog entry, I mentioned the write penalty for RAID-5/6. This factor will figure heavily in the way we size the RAID-level for performance capacity planning.

It is difficult to ascertain what kind of IOPS and throughput are required for an application, especially a database, to run well with additional room to grow. A DBA or an application developer, I believe, would have adequate information on the number of users the application can support, both average and peak, the transactions per second (TPS), the block sizes required for logs, database files and so on.

But as we are all aware, most of the time this type of information is not readily available. So, coming from a storage angle, the storage administrator can advise the DBA or the application developer that the configured RAID group, volume or LUN is capable of delivering a certain number of IOPS and achieving a certain throughput in MB/sec. These numbers come off the box itself. Of course, other factors such as HBA speed, the FC/iSCSI configuration, network traffic and so on will affect the overall performance delivered to the application. But we can safely inform the DBA and/or the application developer that this is what the storage delivers out of the box.

The building blocks of all storage RAID groups/volumes/LUNs are pretty much your hard disk drives (HDDs) and/or Solid State Drives (SSDs). The manufacturers of these disks will usually publish the IOPS and throughput of individual drives, but if this information is not available, we can derive the IOPS of an individual HDD from its seek and latency times.

For example, if the HDD’s

average latency = 2.8 ms;          average read seek = 4.2 ms;              average write seek = 4.8 ms

then the IOPS can be calculated as

                                  1
         IOPS = ---------------------------------------
                (average latency) + (average seek time)

Therefore from the details above,

                    1
         IOPS = -------------------  = 136.986 IOPS
                (0.0028) + (0.0045)
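
In code, with the 4.5 ms figure being the average of the read (4.2 ms) and write (4.8 ms) seek times:

    avg_latency = 0.0028                   # 2.8 ms rotational latency
    avg_seek    = (0.0042 + 0.0048) / 2    # 4.5 ms, the mean of read and write seek times
    iops = 1 / (avg_latency + avg_seek)
    print(round(iops, 3))                  # 136.986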

That’s pretty simple, right? But of course, it is easier to just accept that a certain type of disk will have a range of IOPS as shown in the table below:

Disk Type    RPM       IOPS Range
SATA         5,400     50-75
SATA         7,200     75-100
SAS/FC       10,000    100-125
SAS/FC       15,000    175-200
SSD          N/A       5,000-10,000

The information in the table above is for reference only and by no means very accurate, but it is good enough for us to determine the IOPS of a RAID group/volume/LUN. Let’s look at the RAID write penalty again in the table below:

RAID-level      Number of I/Os for Reads    Number of I/Os for Writes    RAID Write Penalty
0               1                           1                            1
1 (1+0, 0+1)    1                           2                            2
5               1                           4                            4
6               1                           6                            6

Next, we need to know the ratio of Reads vs Writes for that particular database or application. I mentioned earlier that for OLTP-type applications, we usually take a 2:1 or 3:1 ratio in favour of Reads.

To make things simpler, let’s assume we create a RAID-6 volume of 6 data disks and 2 parity disks in a RAID-6 (6+2) configuration. The disks used are 7,200 RPM SATA disks, with each individual disk delivering 100 IOPS. Assume we are using a ratio of 2:1 in favour of Reads, which gives us 66.666% and 33.333% respectively for Reads and Writes.

Therefore, the combined IOPS of the 8 disks in the RAID-6 configuration is probably about 800 IOPS. However, because of the write penalty of RAID-6, the effective IOPS for the RAID-6 volume will be lower than that. Let’s do some calculation to see what happens:

1)  Read IOPS + Write IOPS = 800 IOPS

2)  (0.66666 x 800) + (0.33333 x 800) = 800 IOPS

3) Read IOPS will be 0.66666 x 800 = 533.328 IOPS

4) Write IOPS will be 0.33333 x 800 = 266.664 IOPS. However, since RAID-6 has a write penalty of 6, this number has to be divided by 6. 266.664/6 will be 44.444 IOPS for Writes

Therefore, what the RAID-6 volume is capable of is approximately 533 IOPS for Reads and 44 IOPS for Writes.

We have determined the IOPS for the RAID volume, but what about throughput? Throughput is determined by the block size used. Assume that our RAID-6 volume uses a 4K block size. With a combined effective IOPS of 577 (533+44), we multiply the IOPS by the block size:

     Throughput = 577 IOPS x 4-KB
                = 2308KB/sec

Therefore when I/O is sustained in a sequential manner, the effective throughput is 2308KB/sec.
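
The whole worked example fits into a few lines, so you can swap in your own disk counts, ratios and penalties (this follows the same simplified method used above, not any vendor's sizing tool):

    disks, iops_per_disk = 8, 100          # RAID-6 (6+2) of 7,200 RPM SATA disks
    read_pct, write_pct  = 2/3, 1/3        # 2:1 in favour of Reads
    write_penalty        = 6               # RAID-6
    block_kb             = 4

    raw_iops       = disks * iops_per_disk                    # 800
    read_iops      = read_pct * raw_iops                      # ~533
    write_iops     = (write_pct * raw_iops) / write_penalty   # ~44
    effective_iops = read_iops + write_iops                   # ~578
    throughput_kb  = effective_iops * block_kb                # ~2311 KB/sec

    # Matches the 533 / 44 / 577 / 2308 figures above, give or take rounding.
    print(round(read_iops), round(write_iops), round(effective_iops), round(throughput_kb))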

On the other hand, we are often told to add more spindles to the volume to increase IOPS. This is true only up to a point, where the maximum IOPS that can be delivered tapers into a flatline because the I/O channel to the RAID volume has been saturated. Therefore, it is best to know that adding more spindles does not always equate to higher IOPS.

Performance sizing for a database or an application is both a science and an art. Mathematically, we can prove things to a certain degree of accuracy and confidence, but each storage platform is very different in the way it handles RAID. Newer storage platforms have proprietary RAID such that, nowadays, it does not matter much what kind of RAID is best for the application. A vendor such as IBM, with the XIV, has RAID-X, which is radical in both design and implementation. NetApp will almost always say RAID-DP is the best no matter what, because RAID-DP is all NetApp does.

So there is no right or wrong choice of RAID level for the application. But it is VERY important to know what the best practices are, and my advice to everyone is to do Proofs-of-Concept, and TEST, TEST, TEST! And ASK QUESTIONS!

Don’t RAID-5/6 everything! – Part 1

It’s a beautiful Saturday morning … the sun is out, and the birds are chirping … and here I am, thinking about RAID-5/6. What’s wrong with me?

Anyway, have you ever wondered why almost all your volumes are in a RAID-5/6 configuration? Like an obedient child, the answer would probably be, “Oh, my vendor said it is good for me …”

In storage, the rule is: applications read, applications write. Different applications have different behaviors, but typically they fall under 2 categories:

  • Random access
  • Sequential access

The next question to ask is what the Read/Write ratio (or percentage) is within that random access behavior, and what the Read/Write ratio is within the sequential access behavior.

We usually pigeonhole transactional databases such as SQL Server and Oracle into OLTP-type characteristics, with random access being the dominant access method. Similarly, we lump email applications such as Exchange, Lotus and even SMTP into the same OLTP-type characteristics. We typically assume a 2:1 or 3:1 ratio for OLTP-type applications, read heavy with fewer writes. Data warehouse-type databases tend to be more sequential.

However, even within these OLTP applications, there are also sequential access behaviors, as the following table for a database shows:

Operation        Random or Sequential                  Read/Write Heavy                           Block Size
DB-Log           Random (Sequential in log recovery)   Write heavy unless doing log recovery      1KB – 64KB
DB-Data Files    Random                                Read/Write mix, dependent on load          4KB – 32KB
Batch insert     Sequential                            Write heavy                                8KB – 128KB
Index scan       Sequential                            Read heavy                                 8KB – 128KB

We will look into 4 RAID levels in this scenario and see how each applies to an OLTP-type environment. These RAID levels are RAID-0, RAID-1 (1+0 and 0+1 included), RAID-5 and RAID-6.

RAID-0 is the baseline, with 1 x Read and 1 x Write being processed as per normal.

RAID-1 requires 2 x Writes for every host write and 1 x Read for every host read, because the write operation is mirrored. The RAID write penalty is 2.

To avoid the cost of RAID-1, RAID-5 is almost always the RAID level of choice (unless you speak to those NetApp fellas). RAID-5 is a parity-based RAID, and a write requires 2 x Reads (1 to read the data block and 1 to read the parity block) AND 2 x Writes (1 to write the modified block and 1 to write the modified parity). Hence it has a RAID write penalty of 4.

RAID-6 was created to address the risk of RAID-5, because disk capacities are so freaking large now (3TB just came out). Rebuilding a multi-TB drive takes a long time, and the RAID-5 volume is at risk if a second disk failure occurs during the rebuild. Hence the double-parity RAID in RAID-6. But unfortunately, the RAID write penalty for RAID-6 is 6!

To summarize the RAID write penalties (a small sketch follows the table):

RAID-level      Number of I/Os for Reads    Number of I/Os for Writes    RAID Write Penalty
0               1                           1                            1
1 (1+0, 0+1)    1                           2                            2
5               1                           4                            4
6               1                           6                            6
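
To make the penalty tangible, here is a small sketch of how many back-end disk I/Os a burst of front-end writes turns into at each RAID level (using the penalties in the table above):

    WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-5": 4, "RAID-6": 6}

    def backend_write_ios(frontend_writes, raid_level):
        # Every host write costs 'penalty' physical disk I/Os at the back end.
        return frontend_writes * WRITE_PENALTY[raid_level]

    for level in WRITE_PENALTY:
        print(level, backend_write_ios(1_000, level), "disk I/Os for 1,000 host writes")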

So, it is well known that RAID-0 has good performance for reads and writes but absolutely no protection. RAID-1 is good for random reads and writes, but it is costly. RAID-5 is good for applications with a high ratio of reads versus writes (2:1 or 3:1, as mentioned), and RAID-6, errr … should be treated like RAID-5 with some additional write penalty.

With that in mind, a storage administrator must question why a particular RAID level was proposed for the database or any similar application.

I am going out to enjoy the Saturday now … and today, August 13th, is World Left-Handers Day. More about the RAID penalty and IOPS in my next entry.