FreeBSD relies on a separate daemon being run to detect failing drives in a ZFS pool, rather than the kernel handling it, and I’m not convinced it even works.

I’ve been investigating what the current ZFS on FreeBSD 14.2 does with failing drives. It’s a bit worrying. I posted on the FreeBSD mailing list on 17th Feb 2025 in the hope that someone would know something, and there was some discussion (“me too”), but we drew a blank.
The problem is that ZFS doesn’t “fault” a drive until it’s taken offline by the OS. So if you’ve got a flaky drive you have to wait for FreeBSD to disconnect it, and only then will ZFS notice. At least that’s how it works out of the box (but read on).
In the past I’ve tested ZFS’s robustness simply by pulling drives, which guaranteed the OS would fail them, but a few troubling events led me to do a proper investigation. I acquired a collection of flaky drives (data centre discards) and set them up to fail so I could watch. ZFS will wait a very long time for a SAS drive to complete an operation, in circumstances where the drive is clearly on its last legs. If the operation fails and is retried, FreeBSD logs a CAM error but ZFS doesn’t fail the drive. You can have a SAS drive rattling and groaning away, and FreeBSD patiently waits for it to complete, through block relocation or multiple retries, while ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error? Either way, ZFS says the drive is “ONLINE” and carries on using it.
The only clue, other than the console log, is that operations can start to take a long time. The tenacity of SAS drives means it can take several minutes, although SATA tends to fail more quickly. You can have a SAS drive taking a minute for each operation and all you know about it is that things are going very, very slowly. ZFS does keep error statistics for vdevs, cascading up the chain, but what it does with them and when it logs them isn’t entirely clear.
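If you want numbers rather than guesswork, recent OpenZFS does at least expose slow I/O counters per leaf vdev. Something like the following should show them (check zpool-status(8) and zpool-iostat(8) for the exact flags on your version):
zpool status -s zroot
zpool iostat -v -l zroot 5
The first adds a SLOW column to the usual status output; the second shows per-vdev average latencies, refreshed every five seconds. Neither will fault anything for you, but at least you can watch the drive struggling.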
If you use a stethoscope on the drive (one of my favourite tricks) it’s obvious it’s not happy but FreeBSD won’t offline it until it catches fire. In fact I suspect it would need to explode before it noticed.
zfsd
However, there is an answer! Nine years ago a handy little daemon called zfsd appeared, courtesy of Justin Gibbs and Alan Somers. It provides some of the functionality of Solaris’ Fault Management Architecture, in particular the fault management daemon, fmd. Quite how closely it follows it I’m not certain, but the general idea is the same: both look to see if the hardware is failing and act accordingly. In recent Linux ZFS there’s a daemon called zfs-zed, but that works a little differently (more later).
On FreeBSD, zfsd listens to devctl/devd (and possibly CAM) and collects data on drive errors (it calls this a case file). I say “possibly” because it’s not exactly well documented, and it appears to have remained pretty much unchanged since it appeared in FreeBSD 11. As a result, I’ve been examining the source code, which is in C++ and has been influenced by “Design Patterns” – not a recipe for clear understanding.
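Incidentally, zfsd isn’t enabled on a stock install as far as I can tell, so if you want it watching your pools at all you have to turn it on in the usual way:
sysrc zfsd_enable="YES"
service zfsd start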
Anyway, zfsd definitely listens to devctl events (the kind of stuff that ends up in the console log) and acts accordingly. For example, if a vdev generates more than eight delayed I/O events in a minute it will mark it as faulted and activate a hot spare if there is one. If there are more than 50 I/O errors in a minute it will do the same, and 50 checksum errors in a minute will degrade a vdev. All of this can be found in the man page.
What’s not so clear is how, or whether, the code actually operates as advertised. It certainly calls something in response to events, in zfsd_event.cc: likely-looking functions such as zpool_vdev_detach(), which are part of libzfs. Finding a man page for these functions is more problematic, and a search of the OpenZFS documentation also draws a blank. I’ve heard they’re not documented because they’re an “unstable interface”. Great.
What I have been able to follow through is that it does listen to devctl/devd events, matches those events to pools/vdevs, and leaves it to the CaseFile logic (a C++ class) to invoke likely-looking functions starting with zpool_, which, judging by the headers, are found in libzfs.
Now in my experience of a failing drive, one delayed operation is one too many – two is a sure sign of an imminent apocalypse. I’m not clear how zfsd handles this, because a slow I/O is not a failure and won’t generate a “device detached” event directly, and zfsd can only see what comes through the kernel event channel (devctl). So I took a look in the kernel ZFS module (vdev_disk.c and zio.c). ZFS detects a slow operation internally (a zio has a timeout, based I think on zfs_deadman_synctime_ms) and will log it, but as long as it doesn’t actually generate an error, no event is sent to devctl (and therefore zfsd won’t see it). I hope I’ve got this wrong; I’ve seen several versions of the source code but I’m concentrating on the one in the 14.2-RELEASE base system. In other words, I don’t see it calling sEvent::Process() with this stuff.
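If you want to see what the kernel side considers “slow”, the relevant tunables are exposed as sysctls, although the exact names have moved around between OpenZFS versions. Rather than quote them from memory, just list them:
sysctl vfs.zfs | grep -E 'deadman|slow_io'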
However, there is logic for handling long operations and error counts in case_file.cc. There are even tunable values, set as zpool properties (there is no zfsd config file):
| Property | Description | Default |
|---|---|---|
| io_n | Number of I/O errors to trigger a fault. | 50 |
| io_t | Time window (seconds) for the io_n count. | 60 |
| slow_io_n | Number of delayed/slow I/O events to trigger a fault. | 8 |
| slow_io_t | Time window (seconds) for the slow_io_n count. | 60 |
| checksum_n | Number of checksum errors to mark a vdev DEGRADED (not a full fault). | 50 |
| checksum_t | Time window (seconds) for the checksum_n count. | 60 |
These defaults are hard-wired into a header file (case_file.h – DEFAULT_ZFS_DEGRADE_IO_COUNT etc.), and documented in the vdevprops(7) and zfsd(8) man pages – inconsistently.
You can try to read the current values using the command:
zpool get io_n,io_t,slow_io_n,slow_io_t,checksum_n,checksum_t zroot all-vdevs
The “zpool get” command (not to be confused with “zfs get”) is documented in zpool-get(8), and I have to say it can be a bit confusing. The format of the line above is a list of properties, followed by the pool name, followed by either a particular vdev or the special value “all-vdevs”. It’s worth running this to find out what the possible vdev names are, as they may not be what you think!
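A quicker way to see the vdev layout, if not the root vdev itself, is the tree view:
zpool list -v zroot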
Chances are they’ll all be set to “default”, and I believe the table above has the correct default values but I can’t be sure. Your output for a simple mirror system should look like this:
NAME PROPERTY VALUE SOURCE
root-0 io_n - default
root-0 io_t - default
root-0 slow_io_n - default
root-0 slow_io_t - default
root-0 checksum_n - default
root-0 checksum_t - default
mirror-0 io_n - default
mirror-0 io_t - default
mirror-0 slow_io_n - default
mirror-0 slow_io_t - default
mirror-0 checksum_n - default
mirror-0 checksum_t - default
ada0p3 io_n - default
ada0p3 io_t - default
ada0p3 slow_io_n - default
ada0p3 slow_io_t - default
ada0p3 checksum_n - default
ada0p3 checksum_t - default
ada1p3 io_n - default
ada1p3 io_t - default
ada1p3 slow_io_n - default
ada1p3 slow_io_t - default
ada1p3 checksum_n - default
ada1p3 checksum_t - default
You can set individual values with commands like:
zpool set checksum_n=3 zroot root-0
zpool set slow_io_n=3 zroot mirror-0
zpool set io_n=3 zroot ada0p3
Unfortunately the documentation is hazy on the effect of setting these values in different places. Do values on leaf vdevs (e.g. ada1p3) take precedence over values set further up (e.g. mirror-0)? I’m also not sure whether the root-0 error count can take the whole pool offline, though I suspect it should. In other words, does each level keep its own error count, and if one drive is acting up, can it take a whole vdev or pool offline? The other possibility is that the values always cascade down to the leaf vdev (the drive) if it doesn’t have a particular value set – not a chance I’d take if the host is in a data centre a long way off!
What’s worse, I can’t find out which of these values is actually used. Properties aren’t inherited, but I’d have assumed zfsd would walk back up the tree from the disk to find the first value that is set (be that immediately or at the root vdev). I can find no such code, so which one do you set?
And you probably do want to set them, as the defaults don’t match my real-world experience of drive failures. I believe Linux has defaults of 10 errors in 10 minutes, which seems a better choice. If a drive is doing that, it’s usually not long for this world, but expecting 50 errors in a minute, when operations are taking 30 seconds to return while the drive tries its hardest, isn’t going to cut it.
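Since I can’t tell which level wins, my inclination would be to set the same, more forgiving figures at every level and remove the ambiguity. Something along these lines, using the vdev names from the example output above and my guess at saner numbers (10 errors in 10 minutes), none of which is official advice:
for vd in root-0 mirror-0 ada0p3 ada1p3; do
    zpool set io_n=10 zroot $vd
    zpool set io_t=600 zroot $vd
    zpool set slow_io_n=10 zroot $vd
    zpool set slow_io_t=600 zroot $vd
done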
I’m also a tad suspicious that all these values are “default” – i.e. not set. This causes zfsd to use the hard-wired values – values that can only be changed by recompiling. And I have no idea what might be using them other than zfsd, or what counts as a “default” for them. I would have expected the values to be set on root-0 (i.e. the pool) when the pool is created and inherited by vdevs unless specifically overridden. In other words, I smell a rat.
Linux?
I mentioned Linux doesn’t have zfsd, but I believe the kernel modules (zfs.ko etc.) send events to zed, which in turn runs executables or scripts to do the hot-spare swapping and so on. If the kernel detects a device failure, it marks the vdev as DEGRADED or FAULTED itself. That’s to say it’s the kernel module, not a daemon, doing the job of picking up on failed drives. Illumos had a similar system, and I assume Solaris still does.
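If you’re on Linux and want to check that path is actually in place, zed is just a service, and the scripts it runs (the “zedlets”) live under /etc/zfs/zed.d on the distributions I’ve looked at:
systemctl enable --now zfs-zed
systemctl status zfs-zed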
How do you clear a zpool property?
As a bonus, here’s something you won’t find documented anywhere: how do you set a zpool property back to its default value? You might be thinking:
zpool inherit io_n zroot ada0p3
Well inherit works with zfs, doesn’t it? No such luck.
zpool set io_n=default zroot ada0p3
Nope! Nor does =0 or just =
The way that works is:
zpool set io_n=none zroot ada0p3
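You can confirm it took by reading the property back:
zpool get io_n zroot ada0p3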

