Category Archives: Disks

Lightning about to strike

Watch out for February 6th, 2012 folks! The Lightning is about to strike!

Yes, it is likely that EMC will be announcing their server-based, 8-lane PCIe Flash memory card in the first week of February. The PCIe card was dubbed "Project Lightning" when it was first announced at EMC World in May last year. It represents EMC's first foray into products that sit on the server side, giving the impression that EMC could be entering the server business. I blogged about this way back in September last year. As explained by the EMC folks, they are not going into the server business but rather "extending" their performance tiering into the server space. Think of it as an umbilical cord that taps the server's CPU processing power to give a maximum performance boost to the EMC storage.

The card will sport solid state storage from the LSI WarpDrive and comes in 100/200/300GB capacities. Here's a picture of what the Lightning card looks like:

The SSD is SLC (Single Level Cell) and is capable of delivering 150,000 random read IOPS on 4K blocks and 190,000 random write IOPS. It can squeeze out 1.4GB/sec in read throughput. While it is not on par with the performance of Fusion-IO, it can definitely do well by leveraging EMC's huge customer base. Furthermore, PCIe-based Flash memory cards such as Fusion-IO will not be able to take advantage of the bridge that links the server and the storage, leaving them confined to the server's resources. The advantage is definitely EMC's when you explore the possibilities.

Here’s a view of a slide from Virtual Geek summarizing the Project Lightning:

The Lightning card is aimed at customers who demand the highest performance, even higher than Tier 0. It will be integrated with EMC's FAST (Fully Automated Storage Tiering) technology and will be available for the VNX and VMAX platforms.

So watch out folks, because Lightning is about to strike soon!

Not all SSDs are the same

Happy Lunar New Year! The Chinese around the world have just ushered in the Year of the Water Dragon yesterday. To all my friends and family, and readers of my blog, I wish you a prosperous and auspicious Chinese New Year!

Over the holidays, I have been keeping up with the progress of Solid State Drives (SSDs). I am sure many of us are mesmerized by SSDs, and the storage vendors are touting the best that SSDs have to offer. But let me tell you one thing: you are probably getting the least of what the best SSDs have to offer. You might be puzzled why I say this.

Let me share a common sales pitch. Most (if not all) storage vendors will tout performance (usually IOPS) as the greatest benefit of SSDs. The performance numbers have to be compared to something, and that something is your regular spinning Hard Disk Drives (HDDs). A single SSD can churn out at least 5,000 IOPS, while the fastest 15,000 RPM HDDs churn out about 200 IOPS (depending on the HDD vendor). Therefore, even the slowest SSDs can be roughly 25x faster than the fastest HDDs when measured in IOPS.
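
The arithmetic behind that claim is trivial; the figures below are just the round numbers quoted above:

```python
slow_ssd_iops = 5000      # an entry-level SSD on small random I/O
fast_hdd_iops = 200       # a 15,000 RPM HDD, give or take, depending on vendor

print(f"SSD advantage: ~{slow_ssd_iops / fast_hdd_iops:.0f}x")   # ~25x
```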

But the intent of this blogger is to share more about SSDs with you. There's more to know, because SSDs are not all built the same. There are write-biased SSDs and read-biased SSDs; there are SLC (single-level cell) and MLC (multi-level cell) SSDs; and so on. How do you differentiate them if Vendor A touts their SSDs and Vendor B touts their SSDs as well? You are not comparing SSDs and HDDs anymore. How do you know what questions to ask when they show you their performance statistics?

SNIA has recently released a methodology called "Solid State Storage (SSS) Performance Testing Specifications (PTS)" that helps customers evaluate and compare SSD performance from a vendor-neutral perspective. There is also a whitepaper related to the SSS PTS. This is very important, because we have to continue to educate the community about what is right and what is wrong.

In a recent webcast, the presenters from the SNIA SSS TWG (Technical Working Group) mentioned a few questions that I think we, as vendors and customers, should think about when faced with an SSD sales pitch. I thought I'd share them with you.

  • Was the performance testing done at the SSD device level or at the file system level?
  • Was the SSD pre-conditioned before the testing? If so, how?
  • Were the performance results taken at a steady state?
  • How much data was written during the testing?
  • Where was the data written to?
  • What data pattern was tested?
  • What was the test platform used to test the SSDs?
  • What hardware or software package(s) were used for the testing?
  • Were the HBA bandwidth, queue depth and other parameters sufficient to test the SSDs?
  • What type of NAND Flash was used?
  • What is the target workload?
  • What was the percentage weight of the mix of Reads and Writes?
  • Are there warranty life design issues?

I thought these questions were very relevant to understanding SSD performance. I also got to know that SSDs behave differently throughout the life stages of the device. From a performance point of view, there are 3 distinct performance life stages:

  • Fresh out of the box (FOB)
  • Transition
  • Steady State

As you can see from the graph below, an SSD fresh out of the box (FOB) displays impressive performance numbers. Over a period of time (the graph shows minutes), it transitions into a mezzanine stage of lower IOPS, and finally it normalizes into the Steady State. The Steady State is the desirable test range that gives the most accurate IOPS numbers. Therefore, it is important that your storage vendor's performance numbers are taken during this life stage.
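
To make the idea concrete, here is a minimal sketch of a steady-state check in the spirit of the SNIA PTS. My reading of the spec is that, within a window of around five test rounds, the data excursion should stay within 20% of the window average and the best-fit slope within 10% of it; the function, thresholds and numbers below are illustrative, not the official test code:

```python
def is_steady_state(iops_window, max_excursion=0.20, max_slope=0.10):
    """Rough steady-state check over a window of per-round IOPS measurements."""
    n = len(iops_window)
    avg = sum(iops_window) / n

    # 1. Data excursion: every round stays close to the window average
    if max(iops_window) - min(iops_window) > max_excursion * avg:
        return False

    # 2. Slope excursion: the least-squares trend across the window stays flat
    x_mean = (n - 1) / 2
    slope = sum((x - x_mean) * (y - avg) for x, y in enumerate(iops_window)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    return abs(slope * (n - 1)) <= max_slope * avg

# Example: the last five rounds of a 4K random-write run
print(is_steady_state([41200, 40800, 41050, 40950, 41100]))   # True
```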

Another consideration when interpreting SSD performance numbers is the type of test used. The test could be done at the file system level or at the device level. As shown in the diagram below, the test numbers could be taken at many different points along the stack of the data path.

Performance numbers for cached data would be impressive, but they are not accurate. File system-level numbers are less useful because the data travels through several layers, masking the true performance capability of the SSDs. Therefore, SNIA's PTS is based on a synthetic, device-level test to achieve consistency and more accurate IOPS numbers.

There are many other factors used to determine the most relevant performance numbers. The SNIA PTS has 4 main test suites that address different aspects of an SSD's performance. They are:

  • Write Saturation test
  • Latency test
  • IOPS test
  • Throughput test

The SSS PTS would be able to reveal which is the better SSD. Here's a sample report on latency.

Once again, it is important to be informed and not to take vendors' numbers verbatim. As the SSD market continues to grow, the responsibility lies on both sides of the fence – the vendor and the customer.

SSDs rising in the flood crisis

The Thailand floods last year spelled disaster for the storage industry. We have already seen several of the big boys, the likes of HP, EMC and NetApp, announcing price increases because of the floods.

NetApp’s announcement is here; EMC is here; and HP is here, if you want to read about it. Below is a nice and courteous EMC letter to their customers.

 

But the Chinese word for "crisis" (below) also spells opportunity; opportunity for Solid State Drives (SSDs), that is.

 

For those of us close to the ground, the market for spinning hard disk drives (HDDs) has certainly been challenging for the past few months, especially for smaller system providers like us. Without the leveraging power of the bigger boys, we practically had to beg to buy HDDs, not to mention that the price has nearly doubled.

Before the Thailand flood crisis, the price of a 2TB HDD was 0.325 Malaysian ringgit per GB; that's about 33 cents. Today, the price is about 55 cents per GB. In comparison, at least from my experience, the price per GB of SSDs has come down from $5.83 to $4.99.

I know some of you might pooh-pooh the price difference between a 2TB SATA/SAS drive and a 120GB SSD, partly because the SSD seems so expensive. But when you do the math, SSDs are likely to be 50x faster (at the worst average) and 200x faster (at the best average) for applications requiring IOPS. This could mean that transactional applications complete, on average, 100x faster, with better response times and lower latency. This will have a domino effect on other related applications, making the entire service request perform and complete faster. When we put a price on the transactional hours, for example $10 per hour of work, we can see the cost savings that come from using SSDs in the storage.
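
Here is that math as a rough sketch, with purely hypothetical device prices (in one nominal currency unit) and IOPS figures; the point is the per-GB versus per-IOPS comparison, not the exact numbers:

```python
# Illustrative figures only -- plug in your own quotes
hdd = {"capacity_gb": 2000, "price": 1100, "iops": 200}      # 2TB 7,200 RPM HDD
ssd = {"capacity_gb": 120,  "price": 600,  "iops": 20000}    # 120GB SSD

for name, d in (("HDD", hdd), ("SSD", ssd)):
    print(f"{name}: {d['price'] / d['capacity_gb']:.2f} per GB, "
          f"{d['price'] / d['iops']:.4f} per IOPS")
# HDD: 0.55 per GB, 5.5000 per IOPS
# SSD: 5.00 per GB, 0.0300 per IOPS
```

Per GB the HDD still wins easily; per IOPS the SSD is orders of magnitude cheaper, and that is where the savings on transactional workloads come from.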

Interestingly, a friend of mine asked me about the prominence of all-SSD storage systems. I have written about all-SSD systems in the past, and also did a high-level overview of Pure Storage some time back. A very interesting fact I recalled was that these systems have massive amounts of IOPS. Having plenty of IOPS helps because you can do away with Automated Storage Tiering (AST): you don't have to tier your data, and you don't have to pay for such a feature.

Yes, all-SSD pure-play storage systems are gaining prominence, and it's time to take notice. Nimbus beat NetApp and HP 3PAR last year to win eBay with an all-SSD storage solution, and there are other players such as Violin Memory Systems, Pure Storage, SolidFire and, of course, Texas Memory Systems (aka RAMSAN). They are attracting big names into their management teams and getting VC dollars, of course.

In the aftermath of the Thailand floods, it will probably take 6 months or more for production to return to its pre-crisis capacity, and SSDs can take this window of opportunity to surge ahead. And if flooding is going to be an annual thing for Thailand (God bless Thailand), the HDD market is going to have a perennial problem, and SSDs are going to rise even faster.

 

Apple chomps Anobit

A few days ago, Apple paid USD$500 million to buy an Israeli startup, Anobit, a maker of flash storage technology.

Obviously, one of the reasons Apple did so is to move up a notch to differentiate itself from the competition and position itself as a premier technology innovator. It won the MP3 war with its iPod, but in the smartphone, tablet and notebook space, Apple is being challenged strongly.

Today, flash storage technology is prevalent, and the demand to pack more capacity into a small real estate of flash will eventually lead to reliability issues. The most common type of NAND flash storage is MLC (multi-level cell), versus the more expensive type called SLC (single-level cell).

Physically, the internal build of MLC and SLC is exactly the same, except that in SLC one cell contains 1 bit of data, while in MLC 2 or more bits occupy one cell. That is the only difference in the physical structure of NAND flash. However, as you can see from the diagram below, SLC has advantages over MLC.

NAND Flash uses electrical voltage to program a cell, and it is always a challenge to store bits of data in a very, very small cell. If you apply too little voltage, the bit in the cell does not register, resulting in something unreadable or an error. If you apply too much voltage, the adjacent cells are disturbed, resulting in errors in the flash. Voltage leakage is not uncommon.

The demand to pack more and more data (i.e. more bits) into one cell geometry results in greater unreliability. Though the reliability of NAND Flash storage is predictable (we roughly know when it will fail), we will eventually reach a point where the reliability of MLC will no longer be desirable if we continue the trend of packing in more and more capacity.

That's where Anobit comes in. Anobit has designed and implemented architectural changes to the way NAND Flash storage is used. In layman's terms, the technology comes in 2 stages:

  1. Error reduction – by understanding what causes flash impairments, such as cross-coupling, read disturbs, data retention impairments, program disturbs and endurance impairments
  2. Error Correction and Signal Processing – advanced ECC (error-correcting code), plus the patented (and other patents pending) Memory Signal Processing (TM) to improve the reliability and performance of the NAND Flash storage, as shown in the diagram below:

In a nutshell, Anobit’s new and innovative approach will result in

  • More reliable MLCs
  • Better performing MLCs
  • Cheaper NAND Flash technology

This will extend NAND Flash into greater innovations in flash storage technology in the near future. Whatever Apple will do with Anobit's technology is anybody's guess, but one thing is certain: it is going to propel Apple to new heights.

Betcha don’t encrypt your disks

At the Internet Alliance event this morning, someone from Computerworld gave me a copy of their latest issue. The headline was “Security Incidents Soar”, with the details of the half-year review by CyberSecurity Malaysia.

Typically, the usual incident list revolves around spam, intrusions, fraud, viruses and so on. However, storage always seems to be missing. As I see it, storage security doesn't sit well with the security guys. In fact, storage is never the sexy thing, and it is usually the IPS, IDS, anti-virus and firewall that get the highlights. So, when we talk about storage security, there is so little to talk about. In fact, in my almost 20 years of experience, storage security was only brought up ONCE!

In security, the most valuable asset is data, and no matter where the data goes, it always lands on... STORAGE! That is why storage security could be one of the most overlooked pieces in security. Fortunately, SNIA already has this covered. In SNIA's Solid State Storage Initiative (SSSI), one aspect that was worked on was Self Encrypted Drives (SED).

SED is not new. As early as 2007, Seagate already marketed encrypted hard disk drives. In 2009, Seagate introduced enterprise-level encrypted hard disk drives. And not surprisingly, other manufacturers followed. Today, Hitachi, Toshiba, Samsung, and Western Digital have encrypted hard disk drives.

But there were prohibitive factors that dampened the adoption of self-encrypted drives. First of all, there was the cost; SEDs were expensive a few years ago. There was (and still is) a lack of knowledge about the differences between hardware-based Self Encrypted Drives (SED) and software-based encryption. And as SEDs were manufactured, some had proprietary implementations that did not do their part to promote the adoption of SEDs.

As data travels from one infrastructure to another, data encryption can be implemented at different points. As the diagram below shows,

encryption can be put in place at the software level, at the OS level, at the HBA, or in the network itself. It can also happen at the switch (network or fabric), at the storage array controller, or at the hard disk level.

EMC's multipathing software, PowerPath, has an encryption facility to ensure that data is encrypted on its way from the HBA to the EMC CLARiiON storage controllers.

The "bump-in-the-wire" appliance is a bridge device that applies encryption to the data before it reaches the storage. Recall that NetApp had a FIPS 140-certified product called Decru DataFort, which basically encrypted NAS and SAN traffic en route to the NetApp FAS storage array.

And according to SNIA SSSI member Tom Coughlin, SED makes more sense than software-based security. How does SED work?

First of all, SED works with 2 main keys:

  • Authentication Key (AK)
  • Drive Encryption Key (DEK)

The DEK is the most important component, because it is a symmetric key that encrypts and decrypts the data on the HDDs or SSDs. This DEK is not for any Tom (sorry, Tom), Dick and Harry. In order to gain access to the DEK, one has to be authenticated, and the authentication is completed by having the right Authentication Key (AK). Usually the AK is based on 128/256-bit AES or DES, and the DEK is of a higher bit range. The diagram below shows the AK and DEK in action:
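
To illustrate the AK/DEK relationship in software terms, here is a conceptual sketch. A real SED does all of this inside the drive firmware or SoC with AES, the DEK never leaves the drive, and every name and passphrase below is made up for illustration:

```python
import base64, hashlib
from cryptography.fernet import Fernet   # pip install cryptography

def ak_from_passphrase(passphrase: bytes) -> Fernet:
    """Authentication Key (AK): derived from the credential the host presents."""
    return Fernet(base64.urlsafe_b64encode(hashlib.sha256(passphrase).digest()))

# At "manufacturing" time: generate the Drive Encryption Key (DEK) and keep it
# only in wrapped (encrypted-by-AK) form.
dek = Fernet.generate_key()
wrapped_dek = ak_from_passphrase(b"correct horse battery").encrypt(dek)

# Normal I/O: authenticate, unwrap the DEK, then encrypt/decrypt user data with it.
unwrapped_dek = ak_from_passphrase(b"correct horse battery").decrypt(wrapped_dek)
cipher = Fernet(unwrapped_dek)
stored_sector = cipher.encrypt(b"customer record #42")
assert cipher.decrypt(stored_sector) == b"customer record #42"

# Crypto erase (more on this below): throw the old DEK away and generate a new
# one -- everything encrypted under the old DEK instantly becomes unreadable.
dek = Fernet.generate_key()
```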

Because SED occurs at the drive level, it is significantly simpler to implement, and at a lower cost as well. For software-based encryption, one has to set up some form of security architecture; IPSec comes to mind. This is not only more complex, but also more costly to implement. Since it is software, the degree of security compromise is higher, meaning the security model is less secure when compared to SED. The DEK of the SED does not leave the array, and if the DEK is implemented within the disk enclosure or in the security module of an SoC (System-on-Chip), this makes it even more secure than software-based encryption. Also, the DEK is kept away from the CPU and memory, removing these components as a potential attack vector that could compromise the data on the disk drives.

Furthermore, software-based encryption takes up CPU cycles, which slows down overall performance. In the Tom Coughlin study, based on both SSDs and HDDs, SED outperforms software-based encryption every time. Here's a table from that study:

Another security concern is data erasure. According to an old IBM study, about 90% of retired HDDs still contain readable data. That means the data erasure techniques used are either not implemented properly or simply not good enough. For us in the storage industry, an effective but time-consuming technique is to overwrite the entire disk with 1s before reusing it. But hackers have ways to "undelete" those bits and make the data readable again.

SED provides crypto erasure that is both effective and very quick. Since the Drive Encryption Key (DEK) is used to encrypt and decrypt the data, the DEK can be changed and renewed in a split second, making the contents of the disk drive unreadable. The diagram below shows how crypto erasure works:

Data security is already at its highest alert, and SEDs are going to be a key component in the IT infrastructure. Open and common standards are coming together, thanks to the efforts of many bodies, including SNIA. At the same time, product certifications are coming up and, more importantly, the price of SEDs has come down to a level where it is almost on par with normal, non-encrypting drives.

Hackers and data thieves are getting smarter all the time, and yet the security of the most important place where the data rests is the least considered. SNIA and other bodies hope to create more awareness and drive greater adoption of self encrypted drives. We hope you will help spread the word too. Betcha thinking twice about encrypting the data on your disk drives now.

The recipe for storage performance modeling

Good morning, afternoon, evening, Ladies & Gentlemen, wherever you are.

Today, we are going to learn how to bake, errr … I mean, make a storage performance model. Before we begin, allow me to set the stage.

Don’t you just hate it when you are asked to do storage performance sizing and you don’t have a freaking idea how to get started? A typical techie would probably say, “Aiya, just use the capacity lah!”, and usually, they will proceed to size the storage according to capacity. In fact, sizing by capacity is the worst way to do storage performance modeling.

Bear in mind that storage is not a black box, although some people wished it was. It is not black magic when it comes to performance sizing because things can be applied in a very scientific and logical manner.

SNIA (Storage Networking Industry Association) has a storage performance modeling methodology (that's quite a mouthful), which basically boils down to these few key ingredients. This recipe is for storage performance modeling in general, and I am advising you guys out there to engage your storage vendors' professional services. They will know their storage solutions best.

And I am going to say to you – don't be cheap and skip professional services – get to the experts out there. I was having a chat with a consultant just now at McDonald's. I have known this friend of mine for about 6-7 years now; his name is Sugen Sumoo, the Director of DBORA Consulting. They specialize in Oracle and database performance tuning and performance forecasting, something a typical DBA can't do, because DBORA Consulting is a professional service that brings expertise and value to Oracle customers. Likewise, you have to engage your respective storage professional services as well.

In a cook book or a cooking show, you are presented with the ingredients used and in this recipe for storage performance modeling, the ingredients (in no particular order) are:

  • Application block size
  • Read and Write ratio
  • Application access patterns
  • Working set size
  • IOPS or throughput
  • Demand intensity

Application Block Size

First of all, the storage is there to serve applications. We always have to look from the applications' point of view, not storage's point of view. Different applications have different block sizes. Databases typically range from 8K to 64K, backup applications usually deal with larger block sizes, and video applications can have 256K block sizes or higher. It all depends.

The best way is to find out from the DBA, email administrator or application developers. The unfortunate thing is that most so-called technical people or administrators in Malaysia don't have a clue about the applications they manage. So, my advice to you storage professionals: do your research on the application and assume the default values, because these clueless fellas are likely to have left the defaults in place.

Read and Write ratio

Applications behave differently at different times of the day, and at different times of the month (no, it's not PMS). At the end of the financial or calendar year, there are additional tasks that these applications perform as well. But on a typical day, there is a different weightage, or percentage, of read operations versus write operations.

Most OLTP (online transaction processing)-based applications tend to be read-heavy and write-light, but we need to find out the ratio. Typically, it can be a 2:1 ratio or 60%:40%, but it is best to speak to the application administrators about it. DSS (Decision Support Systems) and data warehousing applications could have much higher reads than writes, while seismic-analysis applications can have heavy writes during the analysis periods. It all depends.

To counter the "clueless" administrators, ask lots of questions. Find out the workflow of several key tasks and ask what those particular tasks do at different checkpoints of the application's processing. If you are lazy (please don't be lazy, because it degrades your value as a storage professional), use a rule of thumb.

Application access patterns

Applications behave differently in general. They can be sequential, like backup or video streaming. They can be random like emails, databases at certain times of the day, and so on. All these behavioral patterns affect how we design and size the disks in the storage.

Some RAID levels tend to work well with sequential access and others with random access. It is not difficult to find out an application's pattern, and if you read more about the different RAID levels in storage, you can easily identify the type of RAID level suitable for each type of behavioral pattern.

Working set size

This variable is a bit more difficult to determine. The working set is the chunk of the application's data that has to be loaded into a working area, usually memory and cache memory, to be used and abused by the application users.

Unless someone is well versed with the application, one would not be able to determine how much of the application's data would be placed in memory and in cache memory. Typically, this can only be determined after the application has been running for some time.

The flexibility of having SSDs, especially the DRAM-type of SSDs, is very useful in ensuring that there is sufficient "working space" for these applications.

IOPS or Throughput

According to the SNIA model, for I/O sizes smaller than 64K, IOPS should be used as the yardstick for storage performance modeling. For anything larger, use throughput, where MB/sec is the measurement unit.

The application guy would be able to tell you what kind of IOPS their application is expecting or what kind of throughput they want. Again, ask a lot of questions, because this will help you determine the type of disks and the kind of performance you give to the application guys.

If the application guy is clueless again, ask someone more senior or ask the vendor. If the vendor engineers cannot give you an answer, then they should not be working for the vendor.

Demand intensity

This part is usually overlooked when it comes to performance sizing. Demand intensity refers to how intense the I/O requests are. They could come from 1 channel or 1 part of the application, or from several parts of the application in parallel. It is as if the storage is being 'bombarded' by applications, and this is the part that is hard to determine as well.

In some applications, the degree of intensity or parallelism can be tuned and to find out, ask the application administrator or developer. If not, ask the vendor. Also do a lot of research on the application’s architecture.

And one last thing: what I have learned is to add buffers to the storage performance model. Typically I would add about 10-20% extra, but you never know. As storage professionals, I would strongly encourage you to engage professional services, because it is worthwhile, especially in the early stages of sizing. It is usually a more expensive affair to re-size after the applications have been installed and are running.
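
To pull the ingredients together, here is a back-of-envelope sizing sketch. The RAID write penalties and per-disk IOPS figures below are common rules of thumb, not any vendor's official numbers, and the 20% buffer is the headroom I mentioned above:

```python
import math

RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}     # backend I/Os per frontend write
DISK_IOPS = {"15K_HDD": 180, "10K_HDD": 140, "7.2K_HDD": 80, "SSD": 5000}   # rough per-device IOPS

def disks_needed(frontend_iops, read_ratio, raid="RAID5", disk="15K_HDD", buffer=0.20):
    reads = frontend_iops * read_ratio
    writes = frontend_iops * (1 - read_ratio)
    backend_iops = reads + writes * RAID_WRITE_PENALTY[raid]   # add parity/mirror overhead
    backend_iops *= (1 + buffer)                               # add the sizing headroom
    return math.ceil(backend_iops / DISK_IOPS[disk])

# Example: an OLTP application asking for 5,000 IOPS at a 60:40 read/write mix
print(disks_needed(5000, 0.6, raid="RAID5", disk="15K_HDD"))   # -> 74 disks
```

This only covers the IOPS ingredient; block size, access pattern, working set and demand intensity will still push the final design one way or another, which is exactly why the professional services engagement is worth it.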

“Failure to plan is planning to fail”.  The recipe isn’t that difficult. Go figure it out.

3TB Seagate – a performance sloth

I can't get home. I am stuck here at the coffee shop waiting out the traffic jam after the heavy downpour an hour ago.

It has been an interesting week for me, which began last week when we were testing the new Seagate 3TB Constellation ES.2 hard disk drives. It doesn't matter whether they were SAS or SATA, because the disks are 7,200 RPM and basically built the same; SAS or SATA is merely the conduit to the disks and has no bearing on the issue at hand.

Here's an account of the testing done by my team. My team tested the drives meticulously, using every trick in the book to milk performance from the Seagate drives. In the end, it wasn't performance we got; where this type of drive is concerned, what we got from Seagate were duds.

How did the tests go?

We were using a Unix operating system to test sequential writes on different partitions of the disks, each with a sizable number of GB per partition. In one test, we used 100GB per partition. With each partition, we were testing from the outer cylinders to the inner cylinders, and as the storage gurus will tell you, the outer rings run at a faster speed than the inner rings.

We thought it could be the file system we were using, so we switched the sequential writes to raw disks. We tweaked the OS parameters and tried various combinations of block sizes and so on. And what we discovered was a big surprise.

The throughput we got from the sequential writes was horrible. It started out almost 25% lower in MB/sec than a 2TB Western Digital RE4 disk, and as the test moved inward, the throughput in the inner rings went down to single-digit MB/sec. According to reliable sources, the slowest published figures from Seagate were in the high 60s of MB/sec, but what we got was closer to 20+ MB/sec. The Western Digital RE4 gave consistent throughput numbers throughout the test. We couldn't believe it!
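
For anyone who wants to reproduce this kind of sweep, here is a rough sketch of the test idea. The device path is a placeholder, the script will destroy whatever disk you point it at, and a proper harness would use O_DIRECT or a tool like fio or dd to keep the page cache out of the numbers; this is illustrative, not our actual test harness:

```python
import os, time

DEVICE = "/dev/sdX"            # hypothetical scratch disk -- data on it WILL be destroyed
BLOCK = 1 << 20                # 1 MiB per write
SPAN = 100 * (1 << 30)         # 100 GiB per "partition"
buf = os.urandom(BLOCK)

fd = os.open(DEVICE, os.O_WRONLY)
try:
    for span in range(4):                          # sample a few regions, outer to inner
        os.lseek(fd, span * SPAN, os.SEEK_SET)
        start, written = time.time(), 0
        while written < SPAN:
            written += os.write(fd, buf)           # sequential raw writes
        os.fsync(fd)
        mbps = written / (1 << 20) / (time.time() - start)
        print(f"span {span} (offset {span * SPAN >> 30} GiB): {mbps:.1f} MB/sec")
finally:
    os.close(fd)
```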

We scoured the forums looking for similar issues, but we did not find much about this. It could be a firmware bug. We are in the midst of opening an escalation channel to Seagate to seek an explanation. I would like to share what we have discovered, and the issue can be easily reproduced. For customers who have purchased storage arrays with 2 or 3TB Seagate Constellation ES.2 drives, please take note. We were disappointed with the disks, but thanks to my team for the diligent approach that resulted in this discovery.

Does all SSDs make sense?

I have been receiving a lot of email updates from Texas Memory Systems for many months now. I am a subscriber to their updates, and Texas Memory Systems is the granddaddy of flash and DRAM-based storage systems. They are not cheap, but they are blazingly fast.

Lately, more and more vendors have been coming out with all-SSD storage arrays. Startups such as Pure Storage, Violin Memory and Nimbus Data Systems have been pushing the envelope selling all-SSD storage arrays. A few days ago, EMC also announced their all-SSD storage array. As quoted, the new EMC VNX5500-F utilizes 2.5-inch, single-level cell (SLC) NAND flash drives to deliver 10 times the performance of the hard-drive-based VNX arrays. And that is important, because EMC has just become one of the earliest big gorillas to jump on the bandwagon.

But does it make sense? Can one justify investing in an all-SSD storage array?

At this point, especially in this part of the world, I predict that not many IT managers are willing to put their heads on the chopping block and invest in an all-SSD storage array. They would become guinea pigs in a very expensive exercise, and the state of the economy is not helping. Therefore, automated storage tiering (AST) might stick better than an all-SSD storage array. The cautious and prudent approach is less risky, as I have mentioned in a past blog.

I wrote about Pure Storage in a previous blog and the notion that SSDs will offer plenty of IOPS and throughput. If the performance gain translates into higher productivity and getting the job done quicker, then I am all for SSDs. In fact, given the extra performance numbers, the case for SSDs becomes easier to make.

There is no denying the fact that the industry is moving towards SSDs, and it makes sense. That day will come in the near future, but not right now for customers in this part of the world.

Playing with NetApp … final usable capacity

This is the third and last blog entry on how we get to the final usable capacity in ONTAP.

In my first blog, we ran through a gamut of explanations of how disk rightsizing came about for NetApp's ONTAP. The importance of disk rightsizing is to give ONTAP a level set of disks, regardless of manufacturer, model, make, firmware version and so on, so that ONTAP can be pretty damn sure that the disks it gets will not mess things up.

In my second blog, we progressed from the disk rightsizing stage to the RAID group sizing stage, where different RAID group sizes affect the number of disks used for data and for parity in an aggregate. An aggregate, for the uninformed, is the disk pool from which the flexible volume, FlexVol, is derived. In the simple picture below,

OK, the diagram’s in Japanese (I am feeling a bit cheeky today :P)!

But it does look a bit self-explanatory with some help, which I shall provide now. If you start from the bottom of the picture, 16 x 300GB disks are combined to create a RAID group, and there are 4 RAID groups created – rg0, rg1, rg2 and rg3. These RAID groups make up the ONTAP data structure called an aggregate. From ONTAP version 7.3 onward, there were some minor changes to how ONTAP reports capacity, but fundamentally it did not change much from previous versions of ONTAP. Also note that ONTAP takes a 10% overhead of the aggregate for its own use.

From the aggregate, the logical structure called the FlexVol is created. A FlexVol can be as small as several megabytes or as large as 100TB, and can be grown by any increment on the fly. This logical structure also allows shrinking the capacity of the volume online and on the fly as well. Eventually, the volumes created from the aggregate become the building blocks of NetApp NFS and CIFS volumes, and also of LUNs for iSCSI and Fibre Channel. Also note that, for a more effective organization of the logical structures within the volumes, using qtrees is highly recommended, for file organization and ONTAP management reasons.

However, for both the aggregate and the FlexVol volumes created from it, a snapshot reserve is recommended. The aggregate takes a 5% overhead of the capacity for its snapshot reserve, while every FlexVol volume gets a 20% snapshot reserve. While both snapshot percentages are adjustable, it is recommended to keep them at the best-practice defaults (except for FlexVol volumes assigned to LUNs, which can be adjusted to 0%).

(Note: Even if the snapshot reserve is adjusted to 0%, there are still some other rule sets for these LUNs that will further reduce the capacity. When dealing with NetApp engineers or pre-sales, ask them about space reservations, how they do snapshots for fat LUNs and thin LUNs, and their best practices in these situations. Believe me, if you don't ask, you will be very surprised by the final usable capacity allocated to your applications.)

In a nutshell, the dissection of capacity after the aggregate would look like the picture below:

We can easily quantify the overall usable capacity with a little formula that I have used for some time:

Rightsized Disks capacity x # Disks x 0.90 x 0.95 = Total Aggregate Usable Capacity

Then remember that each volume takes a 20% snapshot reserve overhead. That’s what you have got to play with when it comes to the final usable capacity.

Though the result is not 100% accurate, because there are many variables in play, it gives the customer a way to manually calculate their potential final usable capacity.
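
Here's that manual calculation as a small helper. The 28-data-disk example is mine and purely illustrative; the reserves are the defaults discussed above:

```python
def netapp_usable_gb(rightsized_gb, n_data_disks):
    """Back-of-envelope only: 10% WAFL/aggregate overhead, 5% aggregate snapshot
    reserve, then the default 20% FlexVol snapshot reserve. Real numbers also
    depend on parity disks, spares and the root volume -- see the notes below."""
    aggregate_usable = rightsized_gb * n_data_disks * 0.90 * 0.95
    return aggregate_usable * 0.80      # what is left for data in the FlexVols

# Hypothetical example: 28 data disks of 2TB drives, rightsized to 1.69TB (~1731GB) each
print(f"{netapp_usable_gb(1731, 28):,.0f} GB usable")   # -> 33,152 GB usable
```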

Please note the following best practices, which apply to 1 data aggregate only. For more aggregates, the same formula has to be applied to each one:

  1. A RAID-DP, 3-disk root volume (vol0) is set aside and is not accounted for in the usable capacity
  2. A rule of thumb of 2 hot spares for every 30 disks is applied
  3. The default RAID group size is used, depending on the type of disk profile used
  4. Default snapshot reserves of 5% for the aggregate and 20% per FlexVol volume are applied
  5. Snapshots for LUNs are subject to space reservation, either full or fractional. Note that there are considerations of 2x + delta and 1x + delta (ask your NetApp engineer) for iSCSI and Fibre Channel LUNs, even when snapshot reserves are adjusted to 0% and snapshots are likely to be turned off.

Another note to remember: do not use any of those capacity calculators that are given out. These calculators are designed to give the advantage to NetApp, not necessarily to the customer. Therefore, it is best to calculate these things by hand.

Regardless of what the customer ends up with as the overall final usable capacity, it is important to understand the NetApp philosophy of doing things. While we have perhaps gone overboard explaining the usable capacity and the nitty-gritty that comes with it, all these things are done for a reason: to ensure simplicity and ease of navigating data management in the storage networking world. Other NetApp solutions such as SnapMirror and SnapVault, and also the SnapManager suite of products, rely heavily on this.

And the intangible benefits of NetApp and ONTAP have definitely moved NetApp forward since its early years, into what NetApp is today: a formidable storage juggernaut.

Playing with NetApp … (Capacity) BR

Much has been said about usable disk storage capacity, and unfortunately many of us take the marketing capacity number given by the manufacturer verbatim. For example, 1TB does not really equate to 1TB in usable terms, and that is something you engineers out there should be informing your customers about.

NetApp, ever since the beginning, has been subjected to scrutiny from customers and competitors alike about its usable capacity, and I intend to correct this misconception. The key to this misconception is understanding the capacity before rightsizing (BR) and after rightsizing (AR).

(Note: Rightsizing in the NetApp world is well documented and widely accepted, with differing views. It is part of how WAFL uses the disks, but one has to be aware that not many other storage vendors publish their rightsizing process, if any.)

Before Rightsizing (BR)

First of all, we have to know that there are 2 systems when it comes to unit prefixes. These 2 systems can be summarized as:

  • Base-10 (decimal) – fit for human understanding
  • Base-2 (binary) – fit for computer understanding

So, according to the International System of Units, the SI prefixes for Base-10 are:

Text    Factor    Unit
kilo    10^3      1,000
mega    10^6      1,000,000
giga    10^9      1,000,000,000
tera    10^12     1,000,000,000,000

In the computer context, where the binary Base-2 system is relevant, the corresponding prefixes for Base-2 are:

Text         Factor    Unit
kilo-byte    2^10      1,024
mega-byte    2^20      1,048,576
giga-byte    2^30      1,073,741,824
tera-byte    2^40      1,099,511,627,776

And we must know that the storage capacity is in Base-2 rather than in Base-10. Computers are not humans.

With that in mind, the next issue is the disk manufacturers. We should have an axe to grind with them for misrepresenting the actual capacity. When they say their HDD is 1TB, they are using the Base-10 system, i.e. 1TB = 1,000,000,000,000 bytes. THIS IS WRONG!

Let’s see how that 1TB works out to be in Gigabytes in the Base-2 system:

1,000,000,000,000 / 1,073,741,824 = 931.3225746154785 Gigabytes

Note: 2^30 = 1,073,741,824

That 1TB, when rounded, works out to only about 931GB! So the disk manufacturers aren't exactly giving you what they advertised.
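
A quick way to check this math yourself:

```python
marketing_tb = 1_000_000_000_000     # what the label says: 10**12 bytes
print(marketing_tb / 2**30)          # 931.3225746154785 -- gigabytes in Base-2 terms
print(marketing_tb / 2**40)          # 0.9094947017729282 -- terabytes in Base-2 terms
```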

Thirdly, and the most important factor in the BR (Before Rightsizing) phase, is how WAFL handles the actual capacity before the disk is presented to WAFL/ONTAP operations. Note that this is all done before the logical structures of aggregates, volumes and LUNs are created.

In this third point, WAFL formats the actual disks (just like NTFS formats new disks), and this reduces the usable capacity even further. As a starting point, WAFL uses a 4K (4,096 bytes) block size.

For Fibre Channel disks, WAFL formats them with 520 bytes per sector. Therefore, for each block, 8 sectors (520 x 8 = 4,160 bytes) make up 1 x 4K block, with the remaining 64 bytes (4,160 - 4,096 = 64 bytes) used for the checksum of that 4K block. This additional 64-byte per-block checksum is not displayed by WAFL or ONTAP and is not accounted for in the usable capacity.

SATA/SAS disks are formatted with 512 bytes per sector, and each block consumes 9 sectors (9 x 512 = 4,608 bytes). 8 sectors are used for WAFL's 4K block (4,096/512 = 8 sectors), while the remaining sector (the 9th) of 512 bytes is used partially for the 64-byte checksum. Again, the leftover 448 bytes (512 - 64 = 448 bytes) are not displayed and are not part of the usable capacity of WAFL and ONTAP.
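
A quick check of that sector arithmetic, and the raw overhead it implies:

```python
BLOCK = 4096                                     # WAFL block size in bytes

fc_sectors, fc_sector_size = 8, 520              # Fibre Channel formatting
print(fc_sectors * fc_sector_size - BLOCK)       # 64 bytes of checksum per block
print(64 / (fc_sectors * fc_sector_size))        # ~1.5% raw overhead per block

sata_sectors, sata_sector_size = 9, 512          # SATA/SAS formatting
print(sata_sectors * sata_sector_size - BLOCK)   # 512 bytes: 64 checksum + 448 unused
print(512 / (sata_sectors * sata_sector_size))   # ~11.1% raw overhead per block
```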

WAFL also compensates for the ever-so-slight irregularities between hard disk drives, even though they are labelled with similar marketing capacities. That is to say, 1TB from Seagate and 1TB from Hitachi will differ in actual capacity. In fact, a 1TB Seagate HDD with firmware 1.0a (for ease of clarification) and a 1TB Seagate HDD with firmware 1.0b (note the 'a' and 'b') could differ in actual capacity even when both ship with a 1.0TB marketing capacity label.

So, with all these things in mind, WAFL does what it needs to do – Right Size – to ensure that nothing gets screwed up when WAFL uses the HDDs in its aggregates and volumes. All for the right reason – data integrity – but often criticized for this "wrongdoing". Think of WAFL as your vigilante superhero, wanted by the law for doing good for the people.

In the end, what you are likely to get Before Rightsizing (BR) from NetApp for each particular disk capacity would be:

Manufacturer Marketing Capacity    NetApp Rightsized Capacity    Percentage Difference
36GB                               34.0/34.5GB*                  5%
72GB                               68GB                          5.55%
144GB                              136GB                         5.55%
300GB                              272GB                         9.33%
600GB                              560GB                         6.66%
1TB                                847GB                         11.3%
2TB                                1.69TB                        15.5%
3TB                                2.48TB                        17.3%

* The 34.5GB size was for the Fibre Channel zoned checksum mechanism of 512 bytes per sector employed prior to ONTAP version 6.5. From ONTAP 6.5 onward, a block checksum of 520 bytes per sector was employed for greater data integrity protection and resiliency.

The table shows the percentage of "lost" capacity, and to the uninformed this could look significant. But since the percentage is relative to the manufacturer's inflated marketing capacity, it is highly misleading. Therefore, competitors should not use these figures as FUD, and NetApp should use them as a way to properly inform their customers.

You have been informed about NetApp capacity before Right Sizing.

I will follow on another day with what happens next after Right Sizing and the final actual usable capacity to the users and operations. This will be called After Rightsizing (AR). Till then, I am going out for an appointment.