Category Archives: RAID
Nowadays, the capacity of the hard disk drives (HDDs) are really big. 3TB is out and 4TB is in the horizon. What’s next?
For small-medium businesses in Malaysia, depending on their data requirements and applications, 3-10TB is pretty sufficient and with room to grow as well. Therefore, a 6TB requirement can be easily satisfied with 2 x 3TB HDDs.
If I were the customer, why would I buy a storage array, with the software licenses and other stuff that will not only increase my cost of equipment acquisition and data management, it will also increase the complexity of my IT infrastructure? I could just slot HDDs into my existing server, RAID it with RAID-0 (not a good idea but to save costs, most customers would do that) and I have a 6TB volume! It’s cheaper, easier to manage with Windows or Linux, and my system administrator doesn’t have to fuss about lack of storage experience.
And RAID isn’t really keeping up with the tremendous growth of HDD’s capacity as well. In fact, RAID is at risk. RAID (especially RAID 5/6) just cannot continue provide the LUN or volume reliability and data availability because it just takes too damn long to rebuild the volume after the failure of a disk.
Back in the days where HDDs were less than 500GB, RAID-5 would still hold up but after passing the 1TB mark, RAID-6 became more prevalent. But now, that 1TB has ballooned to 3TB and RAID-6 is on shaky ground. What’s next? RAID-7? ZFS has RAID-Z3, triple parity but come on, how many vendors have that? With triple parity or stronger RAID (is there one?), the price of the storage array is going to get too costly.
Experts have been speaking about parity-declustering, but that’s something that a few vendors have right now. Panasas, founded by one of forefathers of RAID, Garth Gibson, comes to mind. In fact, Garth Gibson and Mark Holland of Cargenie-Mellon University’s Parallel Data Lab (PDL) presented a paper about parity-declustering more than 10 years ago.
Let’s get back to our storage fatty. Yes, our storage is getting fat, obese, rotund or whatever you want to call it. And storage vendors have been pushing a concept in hope that storage administrators and customers can take advantage of it. It is called Storage Optimization or Storage Efficiency.
Here are a few ways you can consider to put your storage on a diet.
- Thin Provisioning
- Storage Tiering
- Tapes and SSDs
Good morning, afternoon, evening, Ladies & Gentlemen, wherever you are.
Today, we are going to learn how to bake, errr … I mean, make a storage performance model. Before we begin, allow me to set the stage.
Don’t you just hate it when you are asked to do storage performance sizing and you don’t have a freaking idea how to get started? A typical techie would probably say, “Aiya, just use the capacity lah!”, and usually, they will proceed to size the storage according to capacity. In fact, sizing by capacity is the worst way to do storage performance modeling.
Bear in mind that storage is not a black box, although some people wished it was. It is not black magic when it comes to performance sizing because things can be applied in a very scientific and logical manner.
SNIA (Storage Networking Industry Association) has made a storage performance modeling methodology (that’s quite a mouthful), and basically simplified it into these few key ingredients. This recipe is for storage performance modeling in general and I am advising you guys out there to engage your storage vendors professional services. They will know their storage solutions best.
And I am going to say to you – Don’t be cheap and not engage professional services – to get to the experts out there. I was having a chat with an consultant just now at McDonald’s. I have known this friend of mine for about 6-7 years now and his name is Sugen Sumoo, the Director of DBORA Consulting. They specialize in Oracle and database performance tuning and performance forecasting and it is something that a typical DBA can’t do, because DBORA Consulting is the Professional Service that brings expertise and value to Oracle customers. Likewise, you have to engage your respective storage professional services as well.
In a cook book or a cooking show, you are presented with the ingredients used and in this recipe for storage performance modeling, the ingredients (in no particular order) are:
- Application block size
- Read and Write ratio
- Application access patterns
- Working set size
- IOPS or throughput
- Demand intensity
Application Block Size
First of all, the storage is there to serve applications. We always have to look from the applications’ point of view, not storage’s point of view. Different applications have different block size. Databases typically range from 8K-64K and backup applications usually deal with larger block sizes. Video applications can have 256K block sizes or higher. It all depends.
The best way is to find out from the DBA, email administrator or application developers. The unfortunate thing is most so-called technical people or administrators in Malaysia doesn’t have a clue about the applications they manage. So, my advice to you storage professionals, do your research on the application and take the default value. These clueless fellas are likely to take the default.
Read and Write ratio
Applications behave differently at different times of the day, and at different times of the month (no, it’s not PMS). At the end of the financial year or calendar, there are some tasks that these applications do as well. But in a typical day, there are different weightage or percentage of read operations versus write operations.
Most OLTP (online transaction processing)-based applications tend to be read heavy and write light, but we need to find out the ratio. Typically, it can be a 2:1 ratio or 60%:40%, but it is best to speak to the application administrators about the ratio. DSS (Decision Support Systems) and data warehousing applications could have much higher reads than writes while a seismic-analysis applications can have multiple writes during the analysis periods. It all depends.
To counter the “clueless” administrators, ask lots of questions. Find out the workflow of several key tasks and ask what that particular tasks do at different checkpoints of the application’s processing. If you are lazy (please don’t be lazy, because it degrades your value as a storage professional), use a rule of thumb.
Application access patterns
Applications behave differently in general. They can be sequential, like backup or video streaming. They can be random like emails, databases at certain times of the day, and so on. All these behavioral patterns affect how we design and size the disks in the storage.
Some RAID levels tend to work well with sequential access and others, with random access. It is not difficult to find out about the applications’ pattern and if you read more about the different RAID-levels in storage, you can easily identify the type of RAID levels suitable for each type of behavioral patterns.
Working set size
This variable is a bit more difficult to determine. This means that a chunk of the application has to be loaded into a working area, usually memory and cache memory, to be used and abused by the application users.
Unless someone is well versed with the applications, one would not be able to determine how much of the applications would be placed in memory and in cache memory. Typically, this can only be determined after the application has been running for some time.
The flexibility of having SSDs, especially the DRAM-type of SSDs, are very useful to ensure that there is sufficient “working space” for these applications.
IOPS or Throughput
According to SNIA model, for I/O less than 64K, IOPS should be used as a yardstick to do storage performance modeling. Anything larger, use throughput, in which MB/sec is the measurement unit.
The application guy would be able to tell you what kind of IOPS their application is expecting or what kind of throughput they want. Again, ask a lot of questions, because this will help you determine the type of disks and the kind of performance you give to the application guys.
If the application guy is clueless again, ask someone more senior or ask the vendor. If the vendor engineers cannot give you an answer, then they should not be working for the vendor.
This part is usually overlooked when it comes to performance sizing. Demand intensity refers to how intense is the I/O requests. It could come from 1 channel or 1 part of the applications, or it could come from several parts of the applications in parallel. It is as if the storage is being ‘bombarded’ by applications and this is the part that is hard to determine as well.
In some applications, the degree of intensity or parallelism can be tuned and to find out, ask the application administrator or developer. If not, ask the vendor. Also do a lot of research on the application’s architecture.
And one last thing. What I have learned is to add buffers to the storage performance model. Typically I would add about 10-20% extra but you never know. As storage professionals, I would strongly encourage to engage professional services, because it is worthwhile, especially in the early stages of the sizing. It is usually a more expensive affair to size it after the applications have been installed and running.
“Failure to plan is planning to fail”. The recipe isn’t that difficult. Go figure it out.
This is the third and last blog entry of how do we get the ONTAP final capacity.
In my first blog, we ran through a gamut of explanations how disk rightsizing came about for NetApp’s ONTAP. And the importance of disk rightsizing is to give ONTAP a level set of disks, regardless of manufacturer, model, make, firmware versions and so on, and ONTAP is pretty damn sure that the disks that it gets will not mess up.
In my second blog, progressing from the disk rightsizing stage, was the RAID group sizing stage, where different RAID group size affected the number of disks used for data and for parity in an aggregate. An aggregate, for the uninformed, is the disks pool in which the flexible volume, FlexVol, is derived. In a simple picture below,
OK, the diagram’s in Japanese (I am feeling a bit cheeky today :P)!
But it does look a bit self explanatory with some help which I shall provide now. If you start from the bottom of the picture, 16 x 300GB disks are combined together to create a RAID Group. And there are 4 RAID Groups created – rg0, rg1, rg2 and rg3. These RAID groups make up the ONTAP data structure called an aggregate. From ONTAP version 7.3 onward, there were some minor changes of how ONTAP reports capacity but fundamentally, it did not change much from previous versions of ONTAP. And also note that ONTAP takes a 10% overhead of the aggregate for its own use.
With the aggregate, the logical structure called the FlexVol is created. FlexVol can be as small as several megabytes to as large as 100TB, incremental by any size on-the-fly. This logical structure also allow shrinking of the capacity of the volume online and on-the-fly as well. Eventually, the volumes created from the aggregate become the next-building blocks of NetApp NFS and CIFS volumes and also LUNs for iSCSI and Fibre Channel. Also note that, for a more effective organization of logical structures from the volumes, using qtree is highly recommended for files and ONTAP management reasons.
However, for both aggregate and the FlexVol volumes created from the aggregate, snapshot reserve is recommended. The aggregate takes a 5% overhead of the capacity for snapshot reserve, while for every FlexVol volume, a 20% snapshot reserve is applied. While both snapshot percentage are adjustable, it is recommended to keep them as best practice (except for FlexVol volumes assigned for LUNs, which could be adjusted to 0%)
Note: Even if the snapshot reserve is adjusted to 0%, there are still some other rule sets for these LUNs that will further reduce the capacity. When dealing with NetApp engineers or pre-sales, ask them about space reservations and how they do snapshots for fat LUNs and thin LUNs and their best practices in these situations. Believe me, if you don’t ask, you will be very surprised of the final usable capacity allocated to your applications)
In a nutshell, the dissection of capacity after the aggregate would look like the picture below:
We can easily quantify the overall usable in the little formula that I use for some time:
Rightsized Disks capacity x # Disks x 0.90 x 0.95 = Total Aggregate Usable Capacity
Then remember that each volume takes a 20% snapshot reserve overhead. That’s what you have got to play with when it comes to the final usable capacity.
Though the capacity is not 100% accurate because there are many variables in play but it gives the customer a way to manually calculate their potential final usable capacity.
Please note the following best practices and this is only applied to 1 data aggregate only. For more aggregates, the same formula has to be applied again.
- A RAID-DP, 3-disk rootvol0, for the root volume is set aside and is not accounted for in usable capacity
- A rule-of-thumb of 2-disks hot spares is applied for every 30 disks
- The default RAID Group size is used, depending on the type of disk profile used
- Snapshot reserves default of 5% for aggregate and 20% per FlexVol volumes are applied
- Snapshots for LUNs are subjected to space reservation, either full or fractional. Note that there are considerations of 2x + delta and 1x + delta (ask your NetApp engineer) for iSCSI and Fibre Channel LUNs, even though snapshot reserves are adjusted to 0% and snapshots are likely to be turned off.
It has been a tough week for me and that’s why I haven’t been writing much this week. So, right now, right after dinner, I am back on keyboard again, continuing where I have left off with NetApp’s usable capacity.
A blog and a half ago, I wrote about the journey of getting NetApp’s usable capacity and stopping up to the point of the disk capacity after rightsizing. We ended with the table below.
|Manufacturer Marketing Capacity||NetApp Rightsized Capacity|
* The size of 34.5GB was for the Fibre Channel Zone Checksum mechanism employed prior to ONTAP version 6.5 of 512 bytes per sector. After ONTAP 6.5, block checksum of 520 bytes per sector was employed for greater data integrity protection and resiliency.
At this stage, the next variable to consider is RAID group sizing. NetApp’s ONTAP employs 2 types of RAID level – RAID-4 and the default RAID-DP (a unique implementation of RAID-6, employing 2 dedicated disks as double parity).
Before all the physical hard disk drives (HDDs) are pooled into a logical construct called an aggregate (which is what ONTAP’s FlexVol is about), the HDDs are grouped into a RAID group. A RAID group is also a logical construct, in which it combines all HDDs into data or parity disks. The RAID group is the building block of the Aggregate.
So why a RAID group? Well, first of all, (although likely possible), it is not prudent to group a large number of HDDs into a single group with only 2 parity drives supporting the RAID. Even though one can maximize the allowable, aggregated capacity from the HDDs, the data reconstruction or data resilvering operation following a HDD failure (disks are supposed to fail once in a while, remember?) would very much slow the RAID operations to a trickle because of the large number of HDDs the operation has to address. Therefore, it is best to spread them out into multiple RAID groups with a recommended fixed number of HDDs per RAID group.
RAID group is important because it is used to balance a few considerations
- Performance in recovery if there is a disk reconstruction or resilvering
- Combined RAID performance and availability through a Mean Time Between Data Loss (MTBDL) formula
Different ONTAP versions (and also different disk types) have different number of HDDs to constitute a RAID group. For ONTAP 8.0.1, the table below are its recommendation.
So, given a large pool of HDDs, the NetApp storage administrator has to figure out the best layout and the optimal number of HDDs to get to the capacity he/she wants. And there is also a best practice to set aside 2 HDDs for a RAID-DP configuration with every 30 or so HDDs. Also, it is best practice to take the default recommended RAID group size most of the time.
I would presume that this is all getting very confusing, so let me show that with an example. Let’s use the common 2TB SATA HDD and let’s assume the customer has just bought a 100 HDDs FAS6000. From the table above, the default (and recommended) RAID group size is 14. The customer wants to have maximum usable capacity as well. In a step-by-step guide,
- Consider the hot sparing best practice. The customer wants to ensure that there will always be enough spares, so using the rule-of-thumb of 2 HDDs per 30 HDDs, 6 disks are set aside as hot spares. That leaves 94 HDDs from the initial 100 HDDs.
- There is a root volume, rootvol, and it is recommended to put this into an aggregate of its own so that it gets maximum performance and availability. To standardize, the storage administrator configures 3 HDDs as 1 RAID group to create the rootvol aggregate, aggr0. Even though the total capacity used by the rootvol is just a few hundred GBs, it is not recommended to place data into rootvol. Of course, this situation cannot be avoided in most of the FAS2000 series, where a smaller HDDs count are sold and implemented. With 3 HDDs used up as rootvol, the customer now has 91 HDDs.
- With 91 HDDs, and using the default RAID group size of 14, for the next aggregate of aggr1, the storage administrator can configure 6 x full RAID group of 14 HDDs (6 x 14 = 84) and 1 x partial RAID group of 7. (91/14 = 6 remainder 7). And 84 + 7 = 91 HDDs.
- RAID-DP requires 2 disks per RAID group to be used as parity disks. Since there are a total of 7 RAID groups from the 91 HDDs, 14 HDDs are parity disks, leaving 77 HDDs as data disks.
This is where the rightsized capacity comes back into play again. 77 x 2TB HDDs is really 77 x 1.69TB = 130.13TB from an initial of 100 x 2TB = 200TB.
If you intend to create more aggregates (in our example here, we have only 2 aggregates – aggr0 and aggr1), there will be more consideration for RAID group sizing and parity disks, further reducing the usable capacity.
This is just part 2 of our “Playing with NetApp Capacity” series. We have not arrived at the final usable capacity yet and I will further share that with you over the weekend.
After being in the storage networking industry for so long, I have seen most of the new storage solutions out there. Most of them don’t really differ much from what already out there, and it gets a little boring. But once in a while, a little gem is unearthed and my excitement bubbles up again.
Today, I was at the HP P4000 G2 SAN workshop and the LeftHand Networks SAN/iQ storage solution which HP acquired in 2008 left me with 3 words – Interesting, Innovative and Impressive – from a technology standpoint.
I must admit that this is a little gem that got past my radar and now it’s HP’s gain. I have heard about LeftHand Networks in the past, and at the same time, I was also looking at another storage solution called Intransa. Unfortunately, Intransa went on to differentiate themselves and today, they are focused more as a storage solution for videos and CCTVs, seldom surfacing with innovative technology. LeftHand Networks was and is different and I can understand why HP bought them, because the technology that they bring with them to HP is really cool!
Now rebranded and renamed as HP P4000 G2 SAN, the storage solution no longer sits on proprietary hardware. As part of HP’s Converged Infrastructure strategy, the SAN/iQ has been fully integrated into the HP Proliant x86 platform (I heard there’s a blade version as well), making it simple to procure and probably helps simplify operational resource planning and logistics as well. At the same, there is also a P4000 VSA (Virtual Storage Appliance) as well, which HP guys have been using for demo for several years now. There is a 60-day trial available at the HP P4000 VSA Download site, for organizations to have a try-and-buy and if they do, they can turn some of their old x86 platforms into a storage appliance by just adding more hard disk drives. That’s saves money too!
So, what’s cool, you say?
2 key technologies stands out
- Storage Clustering
- Network RAID
As I was well informed at the workshop today, the Storage Clustering technology is not exclusive to the P4000. In fact, Dell EqualLogic employs something similar as well. But it was something that impressed me and it is different from the traditional storage SANs that we usually see.
You see, in the traditional SAN setup, the LUNs or volumes are either loosely or tightly linked to 2 active/active storage processors/controllers. And the way most of the storage vendors do, when a customer runs out of capacity or performance or both, they would have to do a forklift upgrade of the controllers. This is something that is disruptive and also does not allow CPU, memory or I/O channels upgrade to the existing controller. Today, most storage vendors do not allow you to break open the storage processor chassis and change the CPU, add more RAM or add more I/O paths to support more disk drives or increase throughput. Mind you, this is something that I have been questioning for a long time but as the storage networking industry has it, you got to upgrade the entire storage processor or controller in order to get more power and capacity.
The P4000 (as well as the Dell EqualLogic) approaches this from another angle where instead of doing a forklift upgrade of the storage processor/controller, just add another node of the same CPU and RAM profile, and have the P4000 SAN/iQ software group the new node together with the existing node(s) to form a storage cluster group. As best practice, the Storage Cluster feature should have 16 nodes or less, but in one of the war stories shared, one customer in the US actually had 32 nodes in a Storage Cluster group, for storage capacity reasons.
As more nodes are added to the Storage Cluster group, the LUNs/volumes can be extended or spanned to the other nodes as long as they are physically connected in a Gigabit network and the entire LUN or volume is been seen as ONE irregardless of which physical nodes it may be sitting. Typically you will see this sort of thing of single “Global Namespace” concept at the file system level but this is the first time I have seen it implemented at the SAN level. (Ok, I have to admit that I am a little behind times with this technology)
Here’s a little diagram I dug up from LeftHand before it was acquired by HP which I hope will enlightened the readers about this Storage Cluster feature.
But the best is yet to come as the HP Solution Architect (Timothy Chua) mentioned that the Network RAID feature was uniquely LeftHand’s and way cooler. And I couldn’t agree more because this lighted me up like a spark plug!
Since Storage Clustering could span LUNs/volumes across nodes, it was only natural that the RAID capability be extended across nodes as well. RAID-10, RAID-5, RAID-6 could all be spanned across all nodes, spread the data blocks and its mirrored/parity data blocks across the nodes in the network. And the nodes does not have to at a single site. With Gigabit networks, the nodes can be separated into multiple sites as well, giving the entire solution quite a comprehensive campus-wide storage high availability. And since this is Network RAID, it gives an entirely new meaning to the word Disaster Recovery because this will eliminate the need for data replication. Primary data in a Network RAID-10 in Node 1/Site 2 could be mirrored in Node 2/Site 2, which can be further mirrored to Node 3/Site 3 and Node 4/Site 4 for a 4-way mirror. This is the P4000 Multi-site SAN solution.
The diagram below shows how Network RAID is implemented with VMware ESX.
And since replication is no longer a requirement, VMware’s SRM (Site Recovery Manager) is also not required as well.
It is no surprise that synchronous replication in the P4000 solution is equivalent to Network RAID. Though the concept of separating the storage controllers/nodes into multiple sites for true long-distance mirroring exists, they usually don’t exist at this level. NetApp has their Fabric and Stretch MetroCluster and EMC has their VPlex, but they usually are proposed at the higher end of the spectrum. Looks to me that HP P4000 is the only one that has this concept at the entry level iSCSI SAN level. Kudos!
They have an asynchronous replication as well for longer distance networks.
I did not stay for the demo today but I am already tickled pink about the HP P4000 technology. It had a good impression on me and I can’t wait to know more of how it works internally. Looking forward to a deeper dive of the P4000 and hope to stay for the demo next time.
The other day, a prospect was requesting quotations after quotations from a friend of mine to make so-called “apple-to-apple” comparison with another storage vendor. But it was difficult to have that sort of comparisons because one guy would propose SAS, and the other SATA and so on. I was roped in by my friend to help. So in the end I asked this prospect, which 3 of these criteria matters to him most – Performance, Capacity or Reliability.
He gave me an answer and the reliability criteria was leading his requirement. Then he asked me if I could help determine in a “quick-and-dirty manner” by using MTBF (Mean Time Between Failure) of the disks to convince his finance about the question of reliability.
Well, most HDD vendors published their MTBF as a measuring stick to determine the reliability of their manufactured disks. MTBF is by no means accurate but it is useful to define HDD reliability in a crude manner. If you have seen the components that goes into a HDD, you would be amazed that the HDD components go through a tremendously stressed environment. The Read/Write head operating at a flight height (head gap) between the platters thinner than a human hair and the servo-controlled technology maintains the constant, never-lagging 7200/10,000/15,000 RPM days-after-days, months-after-months, years-after-years. And it yet, we seem to take the HDD for granted, rarely thinking how much technology goes into it on a nanoscale. That’s technology at its best – bringing something so complex to make it so simple for all of us.
I found that the Seagate Constellation.2 Enterprise-class 3TB 7200 RPM disk MTBF is 1.2 million hours while the Seagate Cheetah 600GB 10,000 RPM disk MTBF is 1.5 million hours. So, the Cheetah is about 30% more reliable than the Constellation.2, right?
Wrong! There are other factors involved. In order to achieve 3TB usable, a RAID 1 (average write performance, very good read performance) would require 2 units of 3TB 7200 RPM disks. On the other hand, using a 10, 000 RPM disks, with the largest shipping capacity of 600GB, you would need 10 units of such HDDs. RAID-DP (this is NetApp by the way) would give average write performance (better than RAID 1 in some cases) and very good read performance (for sequential access).
So, I broke down the above 2 examples to this prospect (to achieve 3TB usable)
- Seagate Constellation.2 3TB 7200 RPM HDD MTBF is 1.2 million hours x 2 units
- Seagate Cheetah 600GB 10,000 RPM HDD MTBF is 1.5 million hours x 10 units
By using a simple calculation of
RF (Reliability Factor) = MTBF/#HDDs
the prospect will be able to determine which of the 2 HDD types above could be more reliable.
In case #1, RF is 600,000 hours and in case #2, the RF is 125,000 hours. Suddenly you can see that the Constellation.2 HDDs which has a lower MTBF has a higher RF compared to the Cheetah HDDs. Quick and simple, isn’t it?
Note that I did not use the SAS versus SATA technology into the mixture because they don’t matter. SAS and SATA are merely data channels that drives data in and out of the spinning HDDs. So, folks, don’t be fooled that a SAS drive is more reliable than a SATA drive. Sometimes, they are just the same old spinning HDDs. In fact, the mentioned Seagate Constellation.2 HDD (3TB, 7200 RPM) has both SAS and SATA interface.
Of course, this is just one factor in the whole Reliability universe. Other factors such as RAID-level, checksum, CRC, single or dual-controller also determines the reliability of the entire storage array.
In conclusion, we all know that the MTBF alone does not determine the reliability of the solution the prospect is about to purchase. But this is one way you can use to help the finance people to get the idea of reliability.
We have been taught that file systems are like folders, sub-folders and eventually files. The criteria in designing file systems is to ensure that there are few key features
- Ease of storing, retrieving and organizing files (sounds like a fridge, doesn’t it?)
- Simple naming convention for files
- Performance in storing and retrieving files – hence our write and read I/Os
- Resilience in restoring full or part of a file when there are discrepancies
In file systems performance design, one of the most important factors is locality. By locality, I mean that data blocks of a particular file should be as nearby as possible. Hence, in most file systems designs originated from the Berkeley Fast File System (BFFS), requires the file system to seek the data block to be modified to ensure locality, i.e. you try not to split up the contiguity of the data blocks. The seek time to find the require data block takes time, but you are compensate with faster reads because the read-ahead feature allows you to read extra blocks ahead in anticipation that the data blocks are related.
In Copy-on-Write file systems (also known as shadow-paging file systems), the seek portion is usually not present because the new modified block is written somewhere else, not the present location of the original block. This is the foundation of Copy-on-Write file systems such as NetApp’s WAFL and Oracle Solaris ZFS. Because the new data blocks are written somewhere else, the storing (write operation) portion is faster. It eliminated the seek time and it also skipped the read-modify-write action to the original location of the data block. Therefore, write is likely to be faster.
However, the read portion will be slower because if you want to read a file, the file system has to go around looking for the data blocks because it lacks the locality. Therefore, as the COW file system ages, it tends to have higher file system fragmentation. I wrote about this in my previous blog. It is a case of ENJOY-FIRST/SUFFER-LATER. I am not writing this to say that COW file systems are bad. Obviously, NetApp and Oracle have done enough homework to make the file systems one of the better storage file systems in the market.
So, that’s Copy-on-Write file systems. But what about SSDs?
Solid State Drives (SSDs) will make enemies with file systems that tend prefer locality. Remember that some file systems prefer its data blocks to be contiguous? Well, SSDs employ “wear-leveling” and required writes to be spread out as much as possible across the SSDs device to prolong the life of the SSD device to reduce “wear-and-tear”. That’s not good news because SSDs just told the file systems, “I don’t like locality and I will spread out the data blocks“.
NAND Flash SSDs (the common ones we find in the market and not DRAM-based SSDs) are funny creatures. When you write to SSDs, you must ERASE first, WRITE AGAIN to the SSDs. This is the part that is creating the wear-and tear of the device. When I mean ERASE first, WRITE AGAIN, I describe it below
- Writing 1 –> 0 (OK, no problem)
- Writing 0 –> 1 (not OK, because NAND Flash can’t do that)
So, what does the SSD do? It ERASES everything, writing the entire data blocks on the device to 1s, and then converting some of them to 0s. Crazy, isn’t it? The firmware in the SSDs controller will also spread out the erase-and-then write operations across the entire SSD device to avoid concentrating the operations on a small location or dataset. This is the “wear-leveling” we often hear about.
Since SSDs shun locality and avoid the data blocks to be nearby, and Copy-on-Write file systems are already doing this because its nature to write new data blocks somewhere else, the combination of both COW file system and SSDs seems like a very good fit. It even looks symbiotic because it is a case of “I help you; and you help me“.
From this perspective, the benefits of COW file systems and SSDs extends beyond resiliency of the SSD device but also in performance. Since the data blocks are spread out at different locations in the SSD device, the effect of parallelism will inadvertently help with COW’s performance. Make sense, doesn’t it?
I have not learned about other file systems and how they behave with SSDs, but it is pretty clear that Copy-on-Write file systems works well with Solid State Devices. Have a good week ahead :-)!
In my previous blog entry, I mentioned the write penalty for RAID-5/6. This factor will figure heavily in the way we size the RAID-level for performance capacity planning.
It is difficult to ascertain what kind of IOPS and throughput that are required for an application, especially a database, to run well with additional room to grow. From a DBA or an application developer, I believe they would have adequate information to tell what is the numbers of users that the application can support, both average and peak, transactions per second (TPS), block size required for logs, database files and so on.
But as we are all aware, most of the time, these types of information are not readily available. So, coming from a storage angle, the storage administrator can advise the DBA or the application developer that the configured RAID group or volume or LUN is capable of delivering a certain number of IOPS and is able to achieve a certain throughput MB/sec. These numbers will be off the box itself immediately. Of course, other factors such as HBA speed, the FC/iSCSI configurations, the network traffic and so on will affect the overall performance delivery to the application. But we can safely inform the DBA and/or the application developer that this is what the storage is delivering out of the box.
The building blocks of all storage RAID groups/volumes/LUNs are pretty much your hard disk drives (HDDs) and/or Solid State Drives (SSDs). The manufacturer of these disks will usually publish the IOPS and throughput of individual drives but if these information is not available, we can construct IOPS of an individual HDD from its seek and latency times.
For example, if the HDD’s
average latency = 2.8 ms; average read seek = 4.2 ms; average write seek = 4.8 ms
then the IOPS can be calculated as
1 IOPS = --------------------------------------- (average latency) + (average seek time)
Therefore from the details above,
1 IOPS = ------------------- = 136.986 IOPS (0.0028) + (0.0045)
That’s pretty simple, right? But of course, it is easier to just accept that a certain type of disk will have a range of IOPS as shown in the table below:
|Disk Type||RPM||IOPS Range|
The information from the table above is just for reference only and by no means a very accurate one but it is good enough for us to determine the IOPS of a RAID group/volume/LUN. Let’s look at the RAID write penalty again in the table below:
|RAID-level||Number of I/O Reads
||Number of I/O for Writes
||RAID Write Penalty|
|1 (1+0, 0+1)||1||2||2|
Next, we need to know what is the ratio of Reads vs Writes for that particular database or application. I mentioned earlier that in OLTP-type of applications, we usually take a 2:1 or 3:1 ratio in favour of Reads.
To make things simpler, let’s assume we create a RAID-6 volume of 6 data disks and 2 parity disks in a RAID-6 (6+2) configuration. The disks used are SATA disks of 7,200 RPM, with each individual disk of 100 IOPS. Assume we are using a ratio of 2:1 in favour of Reads, which gives us 66.666% and 33.333% respectively for Reads and Writes.
Therefore, the combined IOPS of the 8 disks in the RAID-6 configuration is probably about 800 IOPS. However, because of the write penalty of RAID-6, the effective IOPS for the RAID-6 volume will be lower than that. Let’s do some calculation to see what happens:
1) Read IOPS + Write IOPS = 800 IOPS
2) (0.66666 x 800) + (0.33333 x 800) = 800 IOPS
3) Read IOPS will be 0.66666 x 800 = 533.328 IOPS
4) Write IOPS will be 0.33333 x 800 = 266.664 IOPS. However, since RAID-6 has a write penalty of 6, this number has to be divided by 6. 266.664/6 will be 44.444 IOPS for Writes
Therefore, what the RAID-6 volume is capable of is approximately 533 IOPS for Reads and 44 IOPS for Writes.
We have determined IOPS for the RAID volume but what about throughput. Throughput is determined by the block size used. Assume that our RAID-6 volume uses a 4-K block size. With a combined effective IOPS of 577 (533+44), we multiply the IOPS with the block size
Throughput = 577 IOPS x 4-KB = 2308KB/sec
Therefore when I/O is sustained in a sequential manner, the effective throughput is 2308KB/sec.
On the other hand, we often were told to add more spindles to the volume to increase the IOPS. This is true, to a point, where the maximum amount of IOPS that can be delivered will taper into a flatline, because the I/O channel to the RAID volume has been saturated. Therefore, it is best to know that adding more spindles does not always equate to a higher IOPS.
Performance sizing for a database or an application is both a science and an art. Mathematically, we can prove things to a a certain amount of accuracy and confidence but each storage platform is very different in the way they handle RAID. Newer storage platforms have proprietary RAID that nowadays, it does not matter much what kind of RAID is best for the application. Vendors such as IBM XIV has RAID-X which both radical in design and implementation. NetApp will almost always say RAID-DP is the best no matter what, because RAID-DP is all NetApp.
So there is no right or wrong to choose the RAID-level for the application. But it is VERY important to know what are the best practice are and my advice is everyone is to do Proof-of-Concepts, and TEST, TEST, TEST! And ASK QUESTIONS!
It’s a beautiful Saturday morning … the sun is out, and the birds are chirping … and here I am, thinking about RAID-5/6. What’s wrong with me?
Anyway, have you ever wondered almost all your volumes are in a RAID-5/6 configuration? Like an obedient child, the answer would probably be “Oh, my vendor said it is good for me …”
In storage, the rule is applications-read, applications-write. And different applications have different behaviors but typically, they fall under 2 categories:
- Random access
- Sequential access
The next question to ask is how much Read/Writes ratio (or percentage) is in that Random Access behavior and how much of Read/Write ratio in Sequential Access behavior.
We usually pigeonhole transactional databases such as SQL Server, Oracle into OLTP-type characteristics with random access being the dominant access method. Similarly, email applications such as Exchange, Lotus and even SMTP into similar OLTP-type characteristics as well. We typically do a 2:1 or 3:1 ratio for OLTP-type applications with Read heavy and less of Writes. Data warehouse type of databases tend to be more sequential.
However, even within these OLTP applications, there are also sequential access behaviors as well, as the following table for a database shows:
|Operation||Random or Sequential||Read/Write Heavy||Block Size|
|DB-Log||Random (Sequential in log recovery)||Write Heavy unless you are doing log recovery||1KB – 64KB|
|DB-Data Files||Random||Read/Write mix dependent on load||4KB – 32KB|
|Batch insert||Sequential||Write Heavy||8KB – 128KB|
|Index scan||Sequential||Read Heavy||8KB – 128KB|
We will look into 4 RAID-levels in this scenario and see how each RAID-level applies to an OLTP-type of environment. These RAID levels are RAID-0, RAID-1 (1+0, 0+1 included), RAID-5 and RAID-6.
RAID-0 is the baseline, with 1 x Read and 1 x Write being processed as per normal.
In RAID-1, it would require 2 x Writes and 1 x Read, because the write operation is mirrored. The RAID penalty is 2.
To avoid the cost of RAID-1, RAID-5 is almost always the RAID level of choice (unless you speak to those NetApp fellas). RAID-5 is a parity-based RAID and require 2 x Read (1 to read the data block and 1 to read the parity block) AND 2 x Write (1 to write the modified block and 1 to write the modified parity). Hence it has a RAID penalty of 4.
RAID-6 was to address the risk of RAID-5 because disk capacity are so freaking large now (3TB just came out). To rebuild a large-TB drive would take longer time and the RAID-5 volume is at risk if a second disk failure occurs. Hence, double parity RAID in RAID-6. But unfortunately, the RAID penalty for RAID-6 is 6!
To summarize the RAID write penalty,
|RAID-level||Number of I/O Reads
||Number of I/O for Writes
||RAID Write Penalty|
|1 (1+0, 0+1)||1||2||2|
So, it is well known that RAID 0 has good performance for reads and writes but with absolutely no protection. RAID-1 would be good for random reads and writes but it is costly. RAID-5 is good for applications with a high ratio of sequential reads vs writes (2:1, 3:1 as mentioned), and RAID-6, errr … should be taken similarly as RAID-5 with some additional performance penalty.
With that in mind, a storage administrator must question why a particular RAID-level was proposed to the database or any like-applications.
I am going out to enjoy the Saturday now … and today, August 13th is the World’s Left-Handed Day. More about this RAID penalty and IOPS in my next entry.