Detecting failing ZFS drives

FreeBSD relies on a separate daemon being run to detect failing drives in a ZFS pool, rather than having the kernel handle it, but I’m not convinced it even works.

I’ve been investigating what the current ZFS on FreeBSD 14.2 does with failing drives. It’s a bit worrying. I posted on the FreeBSD mailing list on 17th Feb 2025 in the hope that someone would know something, and there was some discussion (“me too”), but we drew a blank.

The problem is that ZFS doesn’t “fault” a drive until it’s taken offline by the OS. So if you’ve got a flaky drive you have to wait for FreeBSD to disconnect it, and only then will ZFS notice. At least that’s how it works out of the box (but read on).

In the past I’ve tested ZFS’s robustness simply by pulling drives, which guaranteed the OS would fail them, but a few troubling events led me to do a proper investigation. I acquired a collection of flaky drives (data centre discards) and set them up to fail so I could watch. ZFS will wait a very long time for a SAS drive to complete an operation, in circumstances where the drive is clearly on its last legs. If the operation fails and retries, FreeBSD logs a CAM error but ZFS doesn’t fault the drive. You can have a SAS drive rattling and groaning away, but FreeBSD patiently waits for it to complete by relocating the block or retrying, and ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error? Either way, ZFS says the drive is “ONLINE” and carries on using it.

The only clue, other than the console log, is that operations can start to take a long time. The tenacity of SAS drives means it can take several minutes, although SATA tends to fail more quickly. You can have a SAS drive taking a minute for each operation, and all you know is that things are going very, very slowly. ZFS does keep error statistics for vdevs, cascading up the chain, but what it does with them and when it logs them isn’t entirely clear.
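
You can at least inspect the raw counters yourself. The per-vdev read/write/checksum error counts show up in zpool status, and newer OpenZFS releases have a -s flag that adds a slow-I/O column (check zpool-status(8) on your system before relying on it):

zpool status -s zroot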

If you use a stethoscope on the drive (one of my favourite tricks) it’s obvious it’s not happy but FreeBSD won’t offline it until it catches fire. In fact I suspect it would need to explode before it noticed.

zfsd

However, there is an answer! Nine years ago a handy little daemon called zfsd appeared, from Justin Gibbs and Alan Somers. It provides some of the functionality of Solaris’ Service Management Facility (SMF), in particular the fault management daemon, fmd. Quite how closely it follows it I’m not certain, but the general idea is the same: both watch to see if the hardware is failing and act accordingly. Recent Linux ZFS has a daemon called zfs-zed, but that works a little differently (more later).

On FreeBSD, zfsd listens to devctl/devd (and possibly CAM) and collects data on drive errors (it calls this a case file). I say “possibly” because it’s not exactly well documented and appears to have remained pretty much unchanged since it appeared in FreeBSD 11. As a result, I’ve been examining the source code, which is in C++ and has been influenced by “Design Patterns” – not a recipe for clear understanding.

Anyway, zfsd definitely listens to devctl events (the kind of stuff that ends up in the console log) and acts accordingly. For example, if a vdev generates more than eight delayed I/O events in a minute it will mark it as faulted and activate a hot spare if there is one. If there are more than 50 I/O errors a minute it will do the same. 50 checksum errors a minute will degrade a vdev. All of this can be found in the man page.
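
One thing worth saying up front: zfsd isn’t running by default, so none of this happens until you enable it. On a stock system that’s just:

sysrc zfsd_enable=YES
service zfsd start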

What’s not so clear is how, or whether, the code actually operates as advertised. It certainly calls something in response to events, in zfsd_event.cc: likely-looking functions such as zpool_vdev_detach(), which are part of libzfs. Trying to find a man page for these functions is more problematic, and a search of the OpenZFS documentation also draws a blank. I’ve heard it’s not documented because it’s an “unstable interface”. Great.

What I have been able to follow is that it does listen to devctl/devd events, matches those events to pools/vdevs, and leaves it to the CaseFile logic (a C++ class) to invoke likely-looking functions starting with zpool_, which, judging by the headers, live in libzfs.

Now in my experience of a failing drive, one delayed operation is one too many – two is a sure sign of an imminent apocalypse. I’m not clear how zfsd handles this, because a slow I/O is not a failure and won’t generate a “device detached” event directly, and zfsd can only see what comes through the kernel event channel (devctl). So I took a look in the kernel ZFS module (vdev_disk.c and zio.c). ZFS detects something slow internally (zio has a timeout, based I think on zfs_deadman_synctime_ms) and will log it, but as long as it doesn’t actually generate an error, no event is sent to devctl (and therefore zfsd won’t see it). I hope I’ve got this wrong – I’ve seen several versions of the source code, but I’m concentrating on the one in the 14.2-RELEASE base system. In other words, I don’t see it calling ZfsEvent::Process() with this stuff.
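
If you want to see for yourself what does get emitted, you can watch the ZFS event stream while a dodgy drive is being exercised. The class names I mention (ereport.fs.zfs.delay and friends) are from memory, so treat them as illustrative:

zpool events -v     # recent events with their full detail
zpool events -f     # follow the stream live and look for delay ereports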

However, there is logic for handling long operations and error counts in case_file.cc. There are even tunable values, exposed as zpool properties (there is no zfsd config file):

Property     Description                                                    Default
io_n         Number of I/O errors to trigger fault.                         50
io_t         Time window (seconds) for io_n count.                          60
slow_io_n    Number of delayed/slow I/O events to trigger fault.            8
slow_io_t    Time window (seconds) for slow_io_n count.                     60
checksum_n   Number of checksum errors to mark DEGRADED (not full fault).   50
checksum_t   Time window (seconds) for checksum_n count.                    60

These defaults are hard-wired into a header file (case_file.h – DEFAULT_ZFS_DEGRADE_IO_COUNT etc.), and documented in the vdevprops(7) and zfsd(8) man pages – inconsistently.

You can try to read the current values using the command:

zpool get io_n,io_t,slow_io_n,slow_io_t,checksum_n,checksum_t zroot all-vdevs

The “zpool get” command (which is not the same as “zfs get”) is documented in man zpool-get, and I have to say it can be a bit confusing. The format of the line above is a list of properties, followed by the zpool name, followed by either a particular vdev or the special value “all-vdevs”. It’s worth running this to find out what the possible vdevs are, as they may not be what you think!

Chances are they’ll all be set to “default”, and I believe the table above has the correct default values, but I can’t be sure. The output for a simple mirrored system should look like this:

NAME      PROPERTY    VALUE  SOURCE
root-0    io_n        -      default
root-0    io_t        -      default
root-0    slow_io_n   -      default
root-0    slow_io_t   -      default
root-0    checksum_n  -      default
root-0    checksum_t  -      default
mirror-0  io_n        -      default
mirror-0  io_t        -      default
mirror-0  slow_io_n   -      default
mirror-0  slow_io_t   -      default
mirror-0  checksum_n  -      default
mirror-0  checksum_t  -      default
ada0p3    io_n        -      default
ada0p3    io_t        -      default
ada0p3    slow_io_n   -      default
ada0p3    slow_io_t   -      default
ada0p3    checksum_n  -      default
ada0p3    checksum_t  -      default
ada1p3    io_n        -      default
ada1p3    io_t        -      default
ada1p3    slow_io_n   -      default
ada1p3    slow_io_t   -      default
ada1p3    checksum_n  -      default
ada1p3    checksum_t  -      default

You can set individual values with commands like:

zpool set checksum_n=3 zroot root-0
zpool set slow_io_n=3 zroot mirror-0
zpool set io_n=3 zroot ada0p3

Unfortunately the documentation is a bit hazy on the effects of setting these values in different places. Do values on leaf vdevs (e.g. ada1p3) take precedence over values set further up (e.g. mirror-0)? I’m also not sure whether the root-0 error count can take the whole pool offline, though I suspect it should. In other words, each level keeps its own error count, so if one drive is acting up, can it take a whole vdev or pool offline? The other explanation is that the values cascade down to the leaf vdev (the drive) when it doesn’t have a particular value set – not a chance I’d take if the host is in a data centre a long way off!

What’s worse, I can’t find out which of these values is actually used. Properties aren’t inherited but I’d have assumed zfsd would walk back up the tree from the disk to find the first set value (be that immediately or at the root vdev). I can find no such code, so which one do you set?

And you probably do want to set them, as these defaults don’t match my real-world experience of drive failures. I believe Linux has defaults of 10 errors in 10 minutes, which seems a better choice. If a drive is doing that, it’s usually not long for this world; expecting 50 errors in a single minute, when operations are taking 30 seconds to return while the drive tries its hardest, isn’t going to cut it.
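
If you want to experiment, here’s the sort of thing I’d try – a loop tightening the thresholds on every vdev in the pool. The numbers are my own prejudice rather than anything from the man pages, and I’m assuming zpool get’s scripted output (-H -o name) works for vdev properties the same way it does for pool properties:

for vd in $(zpool get -H -o name io_n zroot all-vdevs); do
    zpool set io_n=5 zroot "$vd"
    zpool set io_t=600 zroot "$vd"
    zpool set slow_io_n=2 zroot "$vd"
    zpool set slow_io_t=600 zroot "$vd"
done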

I’m also a tad suspicious that all these values are “default” – i.e. not set. This means zfsd falls back to the hard-wired values – values that can only be changed by recompiling. And I have no idea what, other than zfsd, might be using them, or what counts as a “default” for them. I would have expected the values to be set on root-0 (i.e. the pool) when the pool is created and inherited by vdevs unless specifically set. In other words, I smell a rat.

Linux?

I mentioned Linux doesn’t have zfsd, but I believe the kernel modules (zfs.ko etc.) send events to zed, which in turn runs executables or scripts to do the hot-spare swapping and so on. If the kernel detects a device failure, it marks the vdev as DEGRADED or FAULTED itself. That’s to say it’s the kernel module, not a daemon, doing the job of picking up on failed drives. Illumos had a similar system, and I assume Solaris still does.
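
For what it’s worth, on a Linux box the moving parts are the zed daemon, its zed.rc configuration and a directory of “zedlets” – scripts run per event class – usually under /etc/zfs/zed.d, though the exact paths and unit name vary by distribution:

systemctl status zfs-zed    # the daemon itself
ls /etc/zfs/zed.d           # zed.rc plus the zedlet scripts it runs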

How do you clear a zpool property?

As a bonus, here’s something you won’t find documented anywhere – how do you set a zpool property back to its default value? You might be thinking:

zpool inherit io_n zroot ada0p3

Well inherit works with zfs, doesn’t it? No such luck.

zpool set io_n=default zroot ada0p3

Nope! Nor does =0 or just =

The way that works is:

zpool set io_n=none zroot ada0p3

Jails on FreeBSD are easy without ezjail

I’ve never got the point of ezjail for creating jailed environments (like Solaris Zones) on FreeBSD. It’s easier to do most things manually, especially since the definitions moved from rc.conf to their own file, jail.conf. (My biggest problem is remembering whether it’s called “jail” or “jails”!)

jail.conf allows macros, has various macros predefined, and you can set defaults outside of a particular jail definition. If you’re using it as a split-out from rc.conf, you’re missing out.

Here’s an example:

# Set sensible defaults for all jails
path /jail/$name;
exec.start = "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.clean;
mount.devfs;
mount.procfs;
host.hostname $name.my.domain.uk;
# Define our jails
tom { ip4.addr = 192.168.0.2 ; }
dick { ip4.addr = 192.168.0.3 ; }
harry { ip4.addr = 192.168.0.4 ; }
mary { ip4.addr = 192.168.0.5 ; }
alice { ip4.addr = 192.168.0.6 ; }
nagios { ip4.addr = 192.168.0.7 ; allow.raw_sockets = 1 ; }
jane { ip4.addr = 192.168.0.8 ; }
test { ip4.addr = 192.168.0.9 ; }
foo { ip4.addr = 192.168.0.10 ; }
bar { ip4.addr = 192.168.0.11 ; }

So what I’ve done here is set sensible default values. Actually, these are probably mostly what you want anyway, but as I’m only doing it once, re-defining them explicitly is good documentation.

Next I define the jails I want, overriding any defaults that are unique to the jail. Now here’s one twist – the $name macro expands to the name of the jail being defined. Thus, inside the definition of the jail I’ve called tom, it sets the hostname to tom.my.domain.uk. I use this expansion to define the path to the jail too.

If you want to take it further and you have your names in DNS (which I usually do), you can set ip4.addr using the generated hostname, leaving each individual jail definition as { ; }!

I’ve set the IPv4 address explicitly, as I use a local VLAN for jails, mapping ports from external IP addresses as and when required.

Note the definition for the nagios jail; it has the extra allow.raw_sockets = 1 setting. Only nagios needs it.
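
To have them come up at boot you still use rc.conf; something like this, with the jail names you want started listed in jail_list:

sysrc jail_enable=YES
sysrc jail_list="tom dick harry nagios"
service jail start tom      # start (or stop) one jail by hand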

ZFS and FreeBSD Jails

The other good wheeze that’s become available since the rise of jails is ZFS. Datasets are the best way to do jails.

First off, create your dataset, z/jail. (I use z as the name of my default zpool – why use anything longer, as you’ll be typing it a lot?)

Next create your “master” jail dataset: zfs create z/jail/master

Now set it up as a vanilla jail, as per the handbook (make installworld into it). Then leave it alone, other than creating a snapshot called “fresh” or similar.
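
In case that’s too terse, the whole master-image dance is roughly this, assuming z/jail is mounted at /jail to match the jail.conf above and that you’ve already built world in /usr/src:

zfs create -o mountpoint=/jail z/jail
zfs create z/jail/master
make -C /usr/src installworld DESTDIR=/jail/master
make -C /usr/src distribution DESTDIR=/jail/master
zfs snapshot z/jail/master@fresh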

When you want a new jail for something, use the following:

zfs clone z/jail/master@fresh z/jail/alice

And you have a new jail, instantly, called alice – just add an entry as above in jail.conf, and edit its rc.conf to configure the network. And what’s even better, alice doesn’t take up any extra space! Not until you start making changes, anyway.

The biggest change you’re likely to make to alice is building ports. So create another dataset for that: z/jail/alice/usr/ports. Then download the ports tree, build and install your stuff, and when you’re done, zfs destroy z/jail/alice/usr/ports. The only space your jail takes up is the changes from the base system needed by your application. Obviously, if you use Python in almost every jail, create a master version with Python and clone that for maximum benefit.
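
A sketch of that, though rather than creating the intermediate usr dataset the name above implies, I find it simpler to give the ports dataset an explicit mountpoint (the git URL is the standard FreeBSD ports mirror; adjust to taste):

zfs create -o mountpoint=/jail/alice/usr/ports z/jail/alice-ports
git clone https://git.FreeBSD.org/ports.git /jail/alice/usr/ports
# ... build and install what alice needs ...
zfs destroy z/jail/alice-ports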

ZFS Optimal Array Size

So there I was, looking at a pile of eight drives and an empty storage array, and wondering how to configure it for best performance under ZFS. “Everyone knows” the formula, right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say, your data drives should be a power of two (i.e. 2, 4, 8, 16) plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery: ZFS; although it didn’t originate there, I’ll call it the Lucas rule anyway.

I have my own rule – redundancy should be two drives or 30%. Why? Well, drives in an array have a really nasty habit of failing two at a time. It’s not sod’s law, it’s a real phenomenon caused by the stress of resilvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2 and 6+2. From there on I go to raidz3 with 7+3, 8+3 and 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.

However, if you merge my rule with the Lucas rule, the only valid sizes are 2+2, 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice” I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now, the trouble with IT “best practice” guides is that they’re written by technicians based on observation, not by OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations, you won’t know when that reason has become irrelevant. Unfortunately, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well, there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – let’s just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error correcting data happens – in very simple terms. Suppose you have a sequence of numbers such as:

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit, when added to the remaining four, gives you a total ending in zero. For example, suppose the ‘3’ went missing. Add up the remainder and you get 17. You need 3 more to get to a zero, so the missing number must be 3.
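
If you prefer it in code, here’s the same toy scheme in a few lines of shell (purely illustrative – real raidz parity is XOR and Reed-Solomon-style maths over the data, not decimal digits):

# compute the check digit so the total ends in zero
set -- 5 8 2 3
sum=0; for n in "$@"; do sum=$((sum + n)); done
echo "parity digit: $(( (10 - sum % 10) % 10 ))"                  # prints 2
# recover a lost digit (the 3) the same way
echo "missing digit: $(( (10 - (5 + 8 + 2 + 2) % 10) % 10 ))"     # prints 3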

Digression over. ZFS calculates a block of error-correction data for the blocks of data it’s just written and adds this as the last block in the sequence. If ZFS blocks and sectors were the same size, this would be fine – writing another sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4K sectors instead of the traditional 512 bytes. So, suppose you had 2K ZFS blocks on a 4K-sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read the sector, overwrite half of it, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow the Lucas rule and keep your data drives a power of two.

Except…

This may have been true once, but now we have variable ZFS block sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And now we have lz4 compression, the physical block sizes are variable anyway.

For those not in the know, lz4 compression is a no-brainer. It won’t try to compress stuff it can’t, and it’s fast. Most files will compress to at least 2:1, often more – which means that when you read a block, only half the data needs to travel down the bus to get into memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but it’s more than eclipsed by the improved transfer rate.
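
If you want to see where your own pool stands, the relevant knobs are all visible: ashift is the sector-size exponent ZFS chose (12 meaning 4K), recordsize is the maximum block size for a dataset, and compressratio shows what lz4 is actually achieving:

zpool get ashift zroot
zfs get recordsize,compression,compressratio zroot
zfs set compression=lz4 zroot       # if it isn't on already; applies to new writes only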

So, in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I do commend my rule about the number of redundant drives. That one is based on a hardware issue, and no software update is going to fix it any time soon.