Have We Really Outgrown RAID?
The industry has been asking this question for a few years now, and it’s becoming a real and present concern. RAID protection schemes have been used for a long time to protect data, either by mirroring the same data across redundant drives or by calculating parity across a collection of drives such that upon the failure of any drive (or drives), the lost data can be reconstructed from what remains on the surviving drives.
At smaller disk capacities, RAID is a helpful and suitable solution. A RAID 5 protection scheme allows any one disk in the set to fail without any data being lost. The failed disk is replaced, the RAID controller repopulates the new drive by reading data and parity from the remaining drives and recomputing what was lost, and after some time the disk group is healthy again.
When a RAID rebuild occurs, all data from the disk group must be read in order to calculate what data should be written to the replacement disk. With the disk capacities used during RAID’s prime time, two things were true:
- The drives read and wrote fast enough that a RAID rebuild could churn through all the data and have the new disk rebuilt in a short amount of time – usually a few hours.
- The amount of data to be read was small relative to the error rate of the drives, meaning that the risk of an Unrecoverable Read Error (URE) during the rebuild was relatively low.
As storage technology has advanced to the point where modern arrays might use disks with capacities anywhere from 2 to 10 terabytes, there are some concerns with RAID that didn’t exist at smaller sizes.
Rebuild Times Are Prohibitively Long
Although hard drive capacity has increased recently, throughput has not kept pace. This means that the time it takes to rebuild a disk grows linearly with the size of the disk. As shown in Figure 1, the theoretical time it would take to write all of an 8 TB drive at 115 MB/s of write throughput is about 20 hours. However, that’s just writing from one drive to another. Using RAID, there’s a parity calculation that introduces overhead, and more importantly, in most cases the disk group is still in use during the rebuild. That means that the surviving disks will be serving production I/O at the same time they’re trying to complete the rebuild, causing a performance hit for both production and the rebuild. In the chart, I took a wild guess and assumed that parity + continuing to serve production I/O would cause a 2x overhead. It could be more or less depending on the system. What you can see is that for an 8 TB drive, the rebuild is now expected to take around 40 hours at best, and it stretches to multiple days if the overhead is any higher.
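For anyone who wants to play with the numbers, here is a minimal sketch of that arithmetic. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and treats the overhead as a simple multiplier, which is my wild guess from above and not a measured value. The function name `rebuild_hours` is just something I made up for illustration.

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float, overhead: float = 1.0) -> float:
    """Estimate hours needed to write a full replacement drive.

    Assumes decimal units and a constant sustained write throughput.
    `overhead` is a hypothetical multiplier covering parity calculation
    and competing production I/O.
    """
    capacity_bytes = capacity_tb * 1e12
    bytes_per_second = throughput_mb_s * 1e6
    seconds = capacity_bytes / bytes_per_second * overhead
    return seconds / 3600


if __name__ == "__main__":
    # Straight drive-to-drive write, then the same with the guessed 2x overhead.
    print(f"8 TB at 115 MB/s: {rebuild_hours(8, 115):.1f} hours")
    print(f"with 2x overhead: {rebuild_hours(8, 115, overhead=2.0):.1f} hours")
```

The straight write comes out a little over 19 hours, which I round to 20, and the 2x case lands just under 39 hours. Double the overhead again and you are past three days.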
The Rebuild May Fail
Some folks have been saying that from a statistical standpoint, RAID is not protecting your data nearly as well as most people think it is. Hard drives aren’t perfect, and at some point they may encounter a failure known as an Unrecoverable Read Error (URE). There are physical limits that mean the probability of a drive having this issue hasn’t decreased much recently. For consumer drives, the stated probability of any given bit read returning this error is about 0.000000000001%. We usually look at these figures with scientific notation instead, because it’s easier to read; consumer drives commonly have a manufacturer’s stated error rate of 1.0E-14 (or 10^-14), meaning one unrecoverable error per 10^14 bits read. Enterprise drives are generally an order of magnitude better with at least 1.0E-15 (10^-15). What does this mean with respect to RAID?
During a rebuild, all of the data on the surviving disks in the group needs to be read in order to reconstruct the contents of the failed disk. This is more data than is normally read and gives many more opportunities to run into the rare but definitely possible URE. As capacity increases, the number of bits in the array grows. Since you have to read every bit on the surviving disks, there are more opportunities to encounter a URE when rebuilding a larger volume.
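To put those exponents in more familiar terms, here is a small sketch that converts a per-bit error rate into the amount of data you would expect to read, on average, per URE. It assumes the stated rate can be treated as an independent per-bit probability and uses decimal terabytes; the function name `terabytes_per_expected_ure` is hypothetical.

```python
def terabytes_per_expected_ure(error_rate_per_bit: float) -> float:
    """Average data read per unrecoverable read error, in decimal TB.

    Assumes the stated error rate is an independent per-bit probability,
    so one error is expected every 1 / error_rate bits.
    """
    bits_per_error = 1.0 / error_rate_per_bit
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 1e12


if __name__ == "__main__":
    for label, rate in (("consumer", 1e-14), ("enterprise", 1e-15)):
        print(f"{label}: ~{terabytes_per_expected_ure(rate):.1f} TB per expected URE")
```

Under that reading, a 1.0E-14 drive is rated for roughly one URE per 12.5 TB read, and a 1.0E-15 drive for roughly one per 125 TB.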
Figure 2 uses the numbers from HGST’s presentation at Storage Field Day 11. According to their calculations, this chart shows the probability of encountering an Unrecoverable Read Error while reading all the data from a volume to do a rebuild. What you’ll see is that according to this math, encountering a URE during a drive rebuild of a volume using large disks is quite likely. It’s nearly certain, in fact.
For RAID 5, this is being calculated using the following formula:
1 - (1 - [Manufacturer’s Stated Error Rate]) ^ ([Disk Size in GB] * ([Disks in Array] - 1) * 8 * 10^9)
Simplified: 1 - (1 - [Error Rate]) ^ [Bits Read During the Rebuild]
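If you would rather run the RAID 5 formula above than read it, here is a minimal Python sketch. It reads the 8(10^9) factor as a GB-to-bits conversion, so disk size is taken in GB, and it treats every bit read as an independent trial. Both are my reading of the formula, not something HGST spelled out. The 8-disk group in the example is a hypothetical size I picked for illustration, and `ure_probability` is a made-up name.

```python
import math


def ure_probability(disk_size_gb: float, disks_in_array: int, error_rate: float) -> float:
    """Probability of at least one URE while reading every surviving disk.

    Implements 1 - (1 - error_rate) ** bits_read for a RAID 5 rebuild,
    where bits_read covers the (disks_in_array - 1) surviving disks.
    Assumes each bit read is an independent trial.
    """
    bits_read = disk_size_gb * (disks_in_array - 1) * 8e9
    # log1p/expm1 avoid the precision lost when 1 - 1e-14 is rounded to a double.
    return -math.expm1(bits_read * math.log1p(-error_rate))


if __name__ == "__main__":
    # Hypothetical 8-disk RAID 5 group of 2 TB (2000 GB) drives.
    for rate in (1e-14, 1e-15):
        p = ure_probability(2000, 8, rate)
        print(f"error rate {rate:.0e}: {p:.1%} chance of a URE during rebuild")
```

For that hypothetical group, the stated consumer rate of 1.0E-14 gives roughly a two-in-three chance of hitting a URE, while 1.0E-15 drops it to roughly one in ten. The order of magnitude in the error rate matters a great deal.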
Now this certainly paints a scary picture for many enterprises using RAID today. A 2 TB drive is not all that uncommon, especially in “nearline tier” kind of systems. If this math is accurate, there’s a lot of data in the world that’s very unsafe right now. But…I feel a bit of dissonance as I try to digest this.
These Statistics Seem Odd
What should be happening according to these statistics doesn’t jibe with what I’ve seen and experienced in the real world. I’ve never talked to someone who told me “Yeah, more than half of my RAID 5 rebuilds result in data loss.” The formula seems to make some pretty pessimistic assumptions that may make the situation seem more dire than it is. Although the concept of a failure being possible is fair in general, I feel that the presentation of this possibility (especially by folks peddling object storage) could best be described as “alarmist.”
- I don’t believe the error rate of a drive is a constant in reality. It seems logical that the drive is going to be more reliable on Day 1 than on Day 700. This isn’t accounted for in the model, if that’s the case. But I could be wrong.
- Shouldn’t whatever is consuming these raw disks be doing some kind of background data scrubbing to proactively repair and prevent UREs? Don’t modern drives also do some of that internally, even? It seems this way of looking at it assumes none of that.
- Perhaps most importantly: these calculations use the stated error rate of the drives from manufacturers. I believe this number to be a “CYA” figure, meaning it’s stated as a worst case measurement. I have heard anecdotal reports from cloud providers of the experienced error rate being orders of magnitude lower in reality (1.0E-16 or even 1.0E-17). That would change the real world results significantly compared to the theoretical statistics using the safe numbers. And this explains the dissonance I feel between the stats and the real world.
What Actually Happens?
Now, paranoid storage guys are absolutely correct to use the worst case scenario for making decisions. If the #1 rule of storage is to never lose data, then one has to assume the worst case scenario. But, for the sake of settling the conflict between the numbers and my reality, let’s look for a minute at what might be more likely if we look at a slightly more generous error rate.
Let’s say that while the manufacturer’s worst case scenario is that drives encounter a URE with a probability of 1.0E-14, a customer’s real-world experience is that UREs occur at a rate closer to 1.0E-15. That’s more than possible; I’d even call it likely. What does that mean for RAID 5 and 6? As you can see in Figure 3, RAID 5 looks a lot better, and RAID 6 affords pretty reliable protection. This feels a lot more like what I’ve seen and experienced, which makes me feel better.
I’m not saying that the trend of drives getting bigger isn’t causing us to have to rethink things. It is. We are indeed in an era of change in the way we deal with data. The capacity of today’s drives is forcing systems to use other erasure coding schemes that offer more protection against failures. And the durability issue aside, the rebuild time issue still exists and is as bad as it looks. Considering that the performance of the whole group is degraded during rebuild, multi-day rebuilds are just unacceptable.
But with that being said, I’d also like to caution us to avoid being too alarmist about the current state of things in order to sell more stuff. Yes, we need to be thinking about how we move forward. But no, just because Mr. Customer is running an array full of 1 TB drives in some RAID 5 sets right now does not mean he is guaranteed data loss. Context is very important here: something not considered in these statistics is the scale of the data center as a whole. If you consider the reliability of RAID 6 in Figure 3 in the context of a medium sized business that requires 500 TB of capacity, the probability of data loss over a handful of RAID sets may still be within their risk tolerance. That’s precisely the point I’m trying to make here. However, in the case of cloud providers and enterprises operating on the order of petabytes, the outlook for RAID 5/6 is rather bleak. A 5% likelihood of failure on rebuild is rather high across the whole data center if you’ve got enough disk groups that you’ve almost always got one rebuilding at any point in time.
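Here is a rough sketch of why scale changes the picture. It takes a per-rebuild failure probability (the 5% above) and asks how likely it is that at least one rebuild fails across some number of rebuilds, assuming the rebuilds are independent. The rebuild counts are hypothetical, chosen only to contrast a small shop with a large one, and `prob_any_rebuild_fails` is a name I invented.

```python
def prob_any_rebuild_fails(per_rebuild_failure: float, rebuilds: int) -> float:
    """Probability that at least one of `rebuilds` independent rebuilds fails."""
    return 1.0 - (1.0 - per_rebuild_failure) ** rebuilds


if __name__ == "__main__":
    # Hypothetical rebuild counts over some period: a handful vs. a large fleet.
    for rebuilds in (1, 5, 14, 50):
        p = prob_any_rebuild_fails(0.05, rebuilds)
        print(f"{rebuilds:>3} rebuilds: {p:.0%} chance at least one fails")
```

With a handful of rebuilds the odds stay modest, but somewhere around 14 rebuilds you cross a coin flip, and by 50 a failed rebuild is close to a sure thing. That is the difference between Mr. Customer and a cloud provider.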
Lastly – and this one is for my fellow SFD11 delegate Curtis (Mr. Backup) – don’t forget that just because you’re protecting against failures with something like RAID, you’re not off the hook with regard to backups. A solid backup will save you from a failure during an array rebuild if things get really ugly. And it’s worth noting that while backing up to tape is the opposite of trendy and cool, LTO-7 tapes can have error rates more like 1.0E-18, which no hard drive today can come close to. As much as I hate to say it, Iron Mountain is still your friend!
So to answer the question I asked in the title: have we really outgrown RAID? Yes. We absolutely have in many cases. The situation just might not be quite as dire as you’re being told.