ZFS Optimal Array Size

So there I was looking at a pile of eight drives and an empty storage array, and wondering how to cofigure it for best performance under ZFS. “Everyone knows” the formula right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say your data drives should be a power of two (i.e. 2,4,8,16) plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery:ZFS; although it didn’t originate there I’ll call it the Lucas rule anyway

I have my own rule – redundancy should be two drives or 30%. Why? Well drives in an array have a really nasty habit of failing two at a time. It’s not sods law, it’s a real phenomenon caused by the stress of re-silvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2, 6+2. From there on I go to raidz3 with 7+3, 8+3, 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.

However, If you merge my rule with the Lucas rule the only valid sizes are 2+2 and 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice” I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now the trouble with IT “best practice” guides is they’re written by technicians based on observation, not OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations you won’t know when the reason has become irrelevant. Unfortuantely, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – lets just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error correcting data happens – in very simple terms. Suppose you have a sequence of numbers such as:

Please generate and paste your ad code here. If left empty, the ad location will be highlighted on your blog pages with a reminder to enter your code. Mid-Post

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit when added to the remaining four gives you a total ending in zero. For example, supposing ‘3’ when missing. Add up the remainder and you get 17. You need 3 more to get to a zero, so the missing number must be 3.

Digression over. ZFS calculates a block of error correction data for the blocks of data it’s just written and adds this as the last block in the sequence. If If ZFS blocks and sectors were the same size, this would be fine writing another sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4k sectors instead of the traditional 512b. So, suppose you had 2k ZFS blocks on a 4k sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read it, overwrite half, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow Lucas’ rule, and make sure your data drives are a power of two.

Except…

This may have been true once, but now we have variable ZFS blocks sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And, now we have lx4 compression, the physical block sizes are variable anyway.

For those not in the know about this, lz4 compression is a no-brainer. It wont’ compress stuff it can’t, and its fast. Most files will compress to at least 2:1, often more – which means when you read a block only half the data needs to travel down the bus to get in memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that the block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but this is more-than eclipsed by the improved transfer rate.

So in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I I do commend my rule about the number of redundant drives. This is based on a hardware issue, and update to the software is going to fix this any time soon.