Some people seem to think that disabling network pings (ICMP echo requests to be exact) is a great security enhancement. If attackers can’t ping something they won’t know it’s there. It’s called Security through Obscurity and only a fool would live in this paradise.
But supposing you have something on your network that disables pings and you, as the administrator, want to know if it’s up? My favourite method is to send an ARP packet to the IP address in question, and you’ll get a response.
ARP is how you translate an IP address into a MAC address to get the Ethernet packet to the right host. If you want to send an Ethernet packet to 1.2.3.4 you put out an ARP request “Hi, if you’re 1.2.3.4 please send your MAC address to my MAC address”. If a device doesn’t respond to this then it can’t be on an Ethernet network with an IP address at all.
You can quickly write a program to do this in ‘C’, but you can also do it using a shell script, and here’s a proof of concept.
You run this with a single argument (hostname or IP address) and it will print out whether it is down or up.
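Here’s a minimal sketch along those lines (assuming FreeBSD’s ping, where -t is a timeout in seconds; the exact wording of your arp output may differ, hence the note below):

#!/bin/sh
[ -z "$1" ] && { echo "usage: $0 host"; exit 1; }
# arp -d $1 > /dev/null 2>&1
ping -c 1 -t 1 $1 > /dev/null 2>&1
arp $1 2>/dev/null | grep -q "expires in" && { echo "$1 is up"; exit 0; }
echo "$1 is down"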
The first line is simply the shell needed to run the script.
Line 2 bails out if you forget to add an argument.
Line 3, which is commented out, deletes the host from the ARP cache if it’s already there. This probably isn’t necessary in reality, and you need to be root user to do it. IP address mappings are typically deleted after 20 minutes, but as we’re about to initiate a connection in line 4 it’ll be refreshed anyway.
Line 4 sends a ping to the host. We don’t care if it replies. The timeout is set to the minimum 1 second, which means there’s a one second delay if it doesn’t reply. Other ways of tricking the host into replying exist, but every system has ping, so ping it is here.
Line 5 will print <hostname> is up if there is a valid ARP cache entry, which can be determined by the presence of “expires in” in the output. Adjust as necessary.
The last line, if still running, prints <hostname> is down. Obviously.
This only works across Ethernet – you can’t get an ARP resolution on a different network (i.e. once the traffic has got through a router). But if you’re on your organisation’s LAN and looking to see if an IoT device is offline, lost or stolen then this is a quick way to poll it and check.
FreeBSD relies on a separate daemon to be running to detect failing drives in a zfs pool rather than the kernel handling it, but I’m not convinced it even works.
I’ve been investigating what the current ZFS on FreeBSD 14.2 does with failing drives. It’s a bit worrying. I posted on the FreeBSD mailing list on 17th Feb 2025 in the hope that someone would know something, and there was some discussion (“me too”), but we drew a blank.
The problem is that ZFS doesn’t “fault” a drive until it’s taken offline by the OS. So if you’ve got a flaky drive you have to wait for FreeBSD to disconnect it, and only then ZFS will notice. At least that’s how it works out of the box (but read on).
In the past I’ve tested ZFS’s robustness simply by pulling drives, which guaranteed the OS would fail them, but a few troubling events led me to do a proper investigation. I acquired a collection of flaky drives (data centre discards) and set a system up to fail so I could watch. ZFS will wait a very long time for a SAS drive to complete an operation, in circumstances where the drive is clearly on its last legs. If the operation fails and retries, FreeBSD logs a CAM error but ZFS doesn’t fail the drive. You can have a SAS drive rattling and groaning away, but FreeBSD patiently waits while it relocates blocks or attempts multiple retries, and ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error? Either way, ZFS says the drive is “ONLINE” and carries on using it while your system grinds to a standstill.
The only clue, other than the console log, is that operations can start to take a long time. The tenacity of SAS drives means it can take several minutes to complete an iop, although SATA tends to fail more quickly. You can have a SAS drive taking a minute for each operation and all you know about it is that things are going very, very slowly. ZFS does keep error statistics for vdevs, cascading up the chain, but what it does with them and when it logs them isn’t entirely clear.
If you use a stethoscope on the drive (one of my favourite tricks) it’s obvious it’s not happy but FreeBSD won’t offline it until it catches fire. In fact I suspect it would need to explode before it noticed.
zfsd
However, there is an answer! Nine years ago saw the release into base of a handy little daemon called zfsd from Justin Gibbs and Alan Somers. This provides some of the functionality of Solaris’ Service Management Facility (SMF), in particular the fault management daemon, fmd. Quite how closely it follows it I’m not certain, but the general idea is the same. Both look to see if the hardware is failing and act accordingly. In the recent Linux ZFS there’s a daemon called zfs-zed but that works a little differently (more later).
On FreeBSD, zfsd listens to devctl/devd (and possibly CAM) and will collect data on drive errors (it calls this a case file). I say “possibly” because it’s not exactly well documented and appears to have remained pretty much unchanged since it appeared in FreeBSD 11. As a result, I’ve been examining the source code, which is in C++ and has been influenced by “Design Patterns” – not a recipe for clear understanding.
Anyway, zfsd definitely listens to the devctl events (the kind of stuff that ends up in the console log) and takes action if there’s a problem. For example, if a vdev generates more than eight delayed I/O events in a minute it will mark it as faulted and activate a hot spare if there is one. If there are more than 50 I/O errors a minute it will do the same. 50 checksum errors a minute will degrade a vdev. All of this can be found in the man page.
What’s not so clear is how or whether the code actually operates as advertised. It certainly calls something in response to events in zfsd_event.cc: likely-looking functions such as zpool_vdev_detach(), which are part of libzfs. Trying to find the man page for these functions is more problematic, and a search of the OpenZFS documentation also draws a blank. I’ve heard it’s not documented because it’s an “unstable interface”. Great.
What I have been able to follow through is that it does listen to devctl/devd events, it matches those events to pools/vdevs, and it leaves it to the CaseFile (C++ class) logic to invoke likely-looking functions starting with “zpool_”, which are found in libzfs judging by the headers.
Now in my experience of a failing drive, one delayed operation is one too many – two is a sure sign of an imminent apocalypse. I’m not clear how zfsd handles this, because a slow I/O is not a failure and won’t generate a “device detached” event directly; and zfsd can only see what comes through the kernel event channel (devctl). So I took a look in the kernel ZFS module (vdev_disk.c and zio.c). ZFS detects something slow internally (zio has a timeout, based I think on zfs_deadman_synctime_ms) and will log it, but as long as it doesn’t actually generate an error, no event is sent to devctl (and therefore zfsd won’t see it). I hope I’ve got this wrong, and I’ve seen several versions of the source code, but I’m concentrating on the one in the 14.2-RELEASE base system. In other words, I don’t see it calling sEvent::Process() with this stuff.
However, there is logic for handling long operations and error counts in case_file.cc. There are even tunable values stored as zpool properties (there is no “zfsd config file”):
Property     Description                                                     Default
io_n         Number of I/O errors to trigger a fault                         50
io_t         Time window (seconds) for the io_n count                        60
slow_io_n    Number of delayed/slow I/O events to trigger a fault            8
slow_io_t    Time window (seconds) for the slow_io_n count                   60
checksum_n   Number of checksum errors to mark DEGRADED (not a full fault)   50
checksum_t   Time window (seconds) for the checksum_n count                  60
These defaults are hard wired into a header file (case_file.h – DEFAULT_ZFS_DEGRADE_IO_COUNT etc), and documented in the vdevprops(7) and zfsd man pages – inconsistently.
You can try to read the current values using the command:
zpool get io_n,io_t,slow_io_n,slow_io_t,checksum_n,checksum_t zroot all-vdevs
The command for “zpool get”, which is not the same as “zfs get”, is documented in man zpool-get, and I have to say it can be a bit confusing. The format of the line above includes a list of properties followed by the zpool name followed either by a particular vdev or the special value “all-vdevs”. It’s worth running this to find out what the possible vdevs are, as it may not be what you think!
Chances are they’ll all be set to “default”, and I believe the table above has the correct default values (cribbed from the source code) but I can’t be sure. Your output for a simple mirror system should look like this:
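Something like this – a hedged illustration rather than genuine output, as the vdev names will depend on your layout:

NAME      PROPERTY    VALUE     SOURCE
root-0    io_n        default   default
root-0    io_t        default   default
mirror-0  io_n        default   default
mirror-0  io_t        default   default
ada0p3    io_n        default   default
ada1p3    io_n        default   default
(and so on for the remaining properties and vdevs)

You can set values at different levels of the vdev tree, for example: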
zpool set checksum_n=3 zroot root-0
zpool set slow_io_n=3 zroot mirror-0
zpool set io_n=3 zroot ada0p3
Unfortunately the documentation is a bit hazy on the effects of setting these values in different places. Do values on leaf vdevs (e.g. ada1p3) take precedence over values set further up (e.g. mirror-0)? What I’m not sure of is whether the root-0 error count can take the whole pool offline, but I suspect it should. In other words, each level keeps its own error count, so if one drive is acting up can it take a whole vdev or pool offline? The other explanation is that the values always cascade down to the leaf vdev (drive) if it doesn’t have a particular value set – not a chance I’d take if the host is in a data centre a long way off!
What’s worse, I can’t find out which of these values is actually used. Properties aren’t inherited but I’d have assumed zfsd would walk back up the tree from the disk to find the first set value (be that immediately or at the root vdev). I can find no such code, so which one do you set?
And you probably do want to tune these parameters, as these values don’t match my real-world experience of drive failures. I believe that Linux has defaults of 10 errors in 10 minutes, which seems a better choice. If a drive is doing that, it’s usually not long for this world, but expecting 50 errors in a minute when operations are taking 30 seconds to return while the drive tries its hardest isn’t going to cut it.
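For example, something closer to 10 errors in 10 minutes might look like this – hedged, because as noted above it isn’t entirely clear which level of the tree zfsd honours, so you may want to set it at more than one:

zpool set io_n=10 zroot mirror-0
zpool set io_t=600 zroot mirror-0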
I’m also a tad suspicious that all these values are “default” – i.e. not set. This triggers zfsd to use the hard-wired values – values that can only be changed by recompiling. And I have no idea what might be using the values stored as vdev properties other than zfsd, and what counts as a “default” for them. I would have expected the values to be set on root-0 (i.e. the zpool) when the zpool is created, and inherited by vdevs unless specifically set. In other words, I smell a rat.
Linux?
I mentioned Linux doesn’t have zfsd, but I believe the kernel modules (zfs.ko etc) send events to its own zed (part of OpenZFS that FreeBSD doesn’t use), which in turn runs executables or scripts to do the hot-spare swapping and so on. If the kernel detects a device failure, it marks the vdev as DEGRADED or FAULTED. That’s to say it’s the kernel module, not a daemon, doing the job of picking up on failed drives. Illumos had a similar system, and I assume Solaris still does.
How do you clear a zpool property?
As an aside, here’s something you won’t find documented anywhere – how do you set a zpool property back to its default value? You might be thinking:
zpool inherit io_n zroot ada0p3
Well inherit works with zfs, doesn’t it? No such luck.
zpool set io_n=default zroot ada0p3
Nope! Nor does =0 or just =
The way that works is:
zpool set io_n=none zroot ada0p3
Update: 19-Nov-25
I’m still suspicious of this so I asked Allan Jude and Michael Lucas during a Klara webinar. Apparently they don’t have a problem with zfsd. (They had a few days’ pre-warning of the question.) I’ve added some trace stuff to it and I’m watching it closely. I never did figure out how it could detect slow operations – if anyone can enlighten me as to how the kernel communicates this to zfsd other than a “timeout, give up” event, I’d really appreciate it.
Update 01/12/2025
A couple of weeks ago Allan Jude very kindly alerted me to an update to zfsd that does indeed tackle long operations, which it didn’t before (I knew I smelled a rat). Unfortunately circumstances have prevented me from taking a proper look yet, but an updated zfsd is in the works.
So you have a single-drive FreeBSD ZFS machine and you want to add a second drive to mirror the first, because it turns out it’s now important. Yes, it’s possible to do this after installation, even if you’re booting off ZFS.
Let’s assume your first drive is ada0, and the FreeBSD installer set it up as a “stripe on one drive” using GPT partitioning. You called the existing zpool “zroot” as you have no imagination whatsoever. In other words everything is the default. The new disk is probably going to be ada1 – plug it in and look on the console or /var/log/messages to be sure. As long as it’s the same size or larger than the first, you’re good to go. (Use diskinfo -v if you’re not sure.)
FreeBSD sets up boot partitions and swap on the existing drive, and you’ll probably want to do this on the new one, if for no other reason than if ada0 fails it can boot off ada1.
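The commands look something like this (a sketch, assuming the new disk is ada1; gptzfsboot is the ZFS-aware boot code):

gpart destroy -F ada1
gpart backup ada0 | gpart restore ada1
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1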
This gets rid of any old partition table that might be there, copies the existing one from ada0 (which will include the boot and swap partitions as well as the ZFS one).
The third line installs a protective MBR on the disk to avoid non-FreeBSD utilities doing bad things and then adds the ZFS boot code.
If there’s a problem, zero the disk using dd and try again. Make sure you zap the correct drive, of course.
Once you’ve got the partition and boot set up, all you need to do is attach it to the zpool. This is where people get confused as if you do it wrong you may end up with a second vdev rather than a mirror. Note that the ZFS pool is on the third partition on each drive – i.e. adaxp3.
The trick is to specify both the existing and new drives:
zpool attach zroot ada0p3 ada1p3
Run zpool status and you’ll see it (re)silvering the new drive. No interruptions, no reboot.
  pool: zroot
 state: ONLINE
  scan: resilvered 677M in 00:00:18 with 0 errors on Sat Apr 5 16:13:16 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
This only took 18 seconds to resilver as in this case it’s just a system disk, and ZFS doesn’t bother copying unnecessary blocks.
If you want to remove it and go back to a single drive the command is:
zpool detach zroot ada1p3
Add another to create a three-way mirror. Go a little crazy!
I’ve written about the virtues of Geom Mirror (gmirror) in the past. Geom Mirror was probably the best way of implementing redundant storage from FreeBSD 5.3 (2004) until ZFS was introduced in FreeBSD 7.0 in 2008. Even then, ZFS was heavyweight, and Geom Mirror remained the tested and more practical option for many years afterwards.

The Geom system also has a RAID3 driver. RAID3 is weird. It’s the one using a separate parity drive. It works, but it wasn’t popular. If you had a big FreeBSD system and wanted an array it was probably better to use an LSI host bus adapter and have that manage it with mptutil. But for small servers, especially remotely managed ones, Geom Mirror was the best. I’m still running it on a few twin-drive servers, and will probably continue to for some time to come.
The traditional Unix File System (UFS2) actually has a couple of advantages over ZFS. Firstly it has much lower resource requirements. Secondly, and this is a big one, it has in-place updates. This is a big deal with random access files, such as databases or VM hard disks, as the Copy-on-Write system ZFS uses fragments the disk like crazy. To maintain performance on a massively fragmented file system, ZFS requires a huge amount of cache RAM.
What you need for random access read/write files are in-place updates. Database engines handle transaction groups themselves to ensure that the data structure’s integrity is maintained. ZFS does this at the file level instead of application level, which isn’t really good enough as the application knows what is and what isn’t required. There’s no harm in ZFS doing it too, but it’s a waste. And the file fragmentation is a high price to pay.
So, for database type applications, UFS2 still rules. There’s nothing wrong with having a hybrid system with both UFS and ZFS, even on the same disk. Just mount the UFS /var onto the ZFS tree.
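For instance, an /etc/fstab entry along these lines would put a UFS /var under an otherwise-ZFS root (ada0p4 is a hypothetical partition name – use whichever partition holds your UFS filesystem):

# Device        Mountpoint  FStype  Options  Dump  Pass#
/dev/ada0p4     /var        ufs     rw       2     2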
But back to the twin drive system: the FreeBSD installer doesn’t have this as an option. So here’s a handy dandy script wot I rote to do it for you. Boot off a USB stick or whatever and run it.
Script to install FreeBSD on gmirror
Use as much or as little as you like.
At the beginning of the script I define the two drives I will be using. Obviously change these! If the disks are not blank it might not work. The script tries to destroy the old partition data but you may need to do more if you have it set up with something unusual.
Be careful – it will delete everything on both drives without asking!
Read the comments in the script. I have set it up to use an 8g UFS partition, but if you leave out the “-s 8g” the final partition will use all the space, which is probably what you want. For debugging I kept it small.
I have put everything on a single UFS partition. If you want separate / /usr /var then you need to modify it to what you need and create a mirror for each (and run newfs for each). The only thing to note is that I’ve created a swap partition on each drive that is NOT mirrored, and configured the system to use both.
I have not set up everything on the new system, but it will boot and you can configure other stuff as you need by hand. I like to connect to the network and have an admin user so I can work on a remote terminal straight away, so I have created an “admin” user with password “password” and enabled the ssh daemon. As you probably know, FreeBSD names its Ethernet adapters after the driver, and you don’t know what you’ll have, so I just have it try DHCP on every likely interface. Edit the rc.conf file how you need it once it’s running.
If base.txz and kernel.txz are already in the current directory, fine; as written, the script tries to download them.
And finally, I call my mirrors m0, m1, m2 and so on. Some people like to use gm0. It really doesn’t matter what you call them.
#!/bin/sh
# Install FreeBSD on two new disks set up as a gmirror
# FJL 2025
# Edit stuff in here as needed. At present it downloads
# FreeBSD 14.2-RELEASE; set D0 and D1 below to the disks
# you want to use.

# Fetch the OS files if needed (and as appropriate)
fetch https://download.freebsd.org/ftp/releases/amd64/14.2-RELEASE/kernel.txz
fetch https://download.freebsd.org/ftp/releases/amd64/14.2-RELEASE/base.txz

# Disks to use for a mirror. All will be destroyed! Edit these. The -xxxx
# is there to save you if you don't
D0=/dev/da1-xxxxx
D1=/dev/da2-xxxxx

# User name and password to set up initial user.
ADMIN=admin
ADMINPASS=password

# Make sure the geom mirror module is loaded.
kldload geom_mirror

# Set up the first drive
echo Clearing $D0
gpart destroy -F $D0
dd if=/dev/zero of=$D0 bs=1m count=10

# Then create p1 (boot), p2 (swap) and p3 (ufs)
# Note the size of the UFS partition is set to 8g. If you delete
# the -s 8g it will use the rest of the disk by default. For testing
# it's better to have something small so newfs finishes quick.
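# (Sketch: the gpart commands for the partitions described above.
# The swap size is an assumption - adjust to taste.)
gpart create -s gpt $D0
gpart add -t freebsd-boot -s 512k $D0
gpart add -t freebsd-swap -s 2g -a 1m $D0
gpart add -t freebsd-ufs -s 8g -a 1m $D0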
echo Installing boot code on $D0
# -b installs protective MBR, -i the Bootloader.
# Assumes partition 1 is freebsd-boot created above.
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 $D0

# Set up second drive
echo Clearing $D1
gpart destroy -F $D1
dd if=/dev/zero of=$D1 bs=1m count=10

# Copy partition data to second drive and put on boot code
gpart backup $D0 | gpart restore $D1
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 $D1

# Mirror partition 3 on both drives
gmirror label -v m0 ${D0}p3 ${D1}p3

echo Creating file system
newfs -U /dev/mirror/m0
mkdir -p /mnt/freebsdsys
mount /dev/mirror/m0 /mnt/freebsdsys

echo Decompressing Kernel
tar -x -C /mnt/freebsdsys -f kernel.txz
echo Decompressing Base system
tar -x -C /mnt/freebsdsys -f base.txz

# Tell the loader where to mount the root system from
echo 'geom_mirror_load="YES"' > /mnt/freebsdsys/boot/loader.conf
echo 'vfs.root.mountfrom="ufs:/dev/mirror/m0"' \
    >> /mnt/freebsdsys/boot/loader.conf

# Set up fstab so it all mounts.
echo $D0'p2 none swap sw 0 0' > /mnt/freebsdsys/etc/fstab
echo $D1'p2 none swap sw 0 0' >> /mnt/freebsdsys/etc/fstab
echo '/dev/mirror/m0 / ufs rw 1 1' >> /mnt/freebsdsys/etc/fstab

# Enable sshd and make ethernet interfaces DHCP configured
echo 'sshd_enable="YES"' >/mnt/freebsdsys/etc/rc.conf
for int in em0 igb0 re0 bge0 alc0 fxp0 xl0 ue0 cxgbe0 bnxt0 mlx0
do
    echo 'ifconfig_'$int'="DHCP"' >>/mnt/freebsdsys/etc/rc.conf
done
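# Create the admin user defined above on the new system and set its password.
# (A sketch: pw runs inside a chroot and reads the password from stdin via -h 0.)
echo "$ADMINPASS" | chroot /mnt/freebsdsys pw useradd "$ADMIN" -m -G wheel -h 0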
If you know all about DHCP, feel free to skip this bit.
In the Unix world the network administrator assigns every host (networked computer) the stuff it needs to operate on the network – its name and IP address. Other hosts can find it by looking its name up on the DNS server (or hosts list before DNS was invented) and start talking.
The host knew its name and IP address because they were set in a configuration file, along with other network stuff like gateway routers and DNS servers.
Microsoft didn’t use IP networking for a long time, using NetBEUI and other protocols to dispense with a network administrator and configure stuff automatically over Ethernet (mainly). Or was that NetBIOS or WINS or ??? Anyway, the usual bugger’s muddle. When Microsoft finally realised the Internet was Important, Windoze machines also worked with Unix networking (IP, DNS and other good things). They stuck with versions of their own crazy file sharing system, but that’s another story.
Meanwhile, it was realised that editing a configuration file on every host was a bit of a problem, especially if you had to edit it everywhere if you changed anything network-ish. And Dynamic Host Configuration Protocol (DHCP) was invented in the early 1990s. This combined the best of both worlds – automatic configuration with a network administrator in charge.
DHCP operates using a DHCP server. When a host boots it can get its network stuff from the DHCP server before it knows anything about the IP network. It effectively does this using an Ethernet (layer 2) broadcast packet, but the details are complicated and not relevant here.
The DHCP server sees this request for details and sends the host back its settings. These could be the next free IP address from a pool, together with other important information like the subnet, gateway, local DNS and domain name. The host says “thank you very much” and configures itself as a fine upstanding and proper member of the domain. Don’t confuse domain with Microsoft Domain stuff, BTW. They used the name wrong. This is the DNS-type domain.
Manual allocation
I said in the bit you skipped reading that the DHCP server could send the client the next free IP address from a pool. But you can also send the precise details you want the host configured with. This means you can keep your network configuration in one file on the DHCP server rather than in startup files on every host, see how everything is set up, and make small or large changes with a text editor. Almost. You’ll also need to edit the files on your DNS server to make the name-to-IP-address translation work. Having both servers on the same machine makes sense.
How does the DHCP server know who’s asking, and therefore which configuration to send? Easy: it goes by the Ethernet MAC address.
Assuming you know how to configure DNS, here’s how you do it.
dhcpd
You’ll need the DHCP Demon, “dhcpd” from the Internet Software Consortium. Compile it or install the package as necessary. It has a configuration file called dhcpd.conf (usually in /usr/local/etc or /etc) which is where you set everything up. It comes with examples, but you’re looking at something like this.
Let’s assume your organisation is called flubnutz.com and the DHCP server is on the LAN in the London office – i.e. london.flubnutz.com. The hosts on the LAN belong to tom, dick and harry, you’ve got a printer called “printer” and a router called “gateway”, and the local IP addresses are 192.168.3.x with a 255.255.255.0 subnet mask.
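Something along these lines – a hedged sketch rather than a drop-in config; the gateway address (192.168.3.254) and the lease values are assumptions based on the description that follows:

default-lease-time 43200;
max-lease-time 86400;

option domain-name "london.flubnutz.com";
option domain-name-servers 192.168.3.219;

subnet 192.168.3.0 netmask 255.255.255.0 {
    option broadcast-address 192.168.3.255;
    option routers 192.168.3.254;
    range 192.168.3.100 192.168.3.163;
}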
The lease times are how long you’re going to allow a host to hold on to a dynamic address from the pool. If it held it forever, you’d eventually run out. “default-lease-time” is how long the address lasts (in seconds) if the client doesn’t ask for anything specific, and “max-lease-time” applies when the client does ask but is being greedy. For testing purposes setting these to 60 seconds is not unreasonable. The values above represent 12 and 24 hours.
Next come some options. These are fields sent in the DHCP reply. The stuff you can set on the client. There are a lot of them – see “man dhcp-options” on any Unix-compatible system. Here I want everything on the LAN to know it’s part of london.flubnutz.com, and the DNS server is at 192.168.3.219. Every host asking the DHCP server gets these options set for them.
The next definition is a subnet. Any IP address in that subnet gets those options set – in this case the broadcast address and gateway router. These could have been universal options, but for the sake of an example I put them inside the { and }.
Note there’s also a “range” statement in the subnet definition. This is the range of dynamically allocated IP addresses – in this case there are 64, between .100 and .163, there to cope with people’s smartphones and visitors from head office with their swanky laptops. The range doesn’t have to cover the complete subnet, but it can’t be larger.
And that’s pretty much it for the main part. This just leaves the manual definitions which take the form of host statements that look like this:
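For example (the MAC and IP address here are made up – use your own):

host tom {
    hardware ethernet 00:11:22:33:44:55;
    fixed-address 192.168.3.20;
    option host-name "tom";
}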
The DHCP server recognises each host by its MAC address, specified in each block. Other forms of hardware address are possible, but it’s probably going to be a MAC on Ethernet. The fixed address is the one that will be assigned. The subnet definition at the top will be used for the subnet mask, and the other options will be taken from the global options.
If you want something special for one host, just add the option to its definition. For example, if you wanted the printer to use a different gateway router, just add “option routers 192.168.1.254;” and it’d take precedence.
The host statement needs a name or IP address, but we’re not using it for anything here – in fact it can be anything you like in this instance. Unfortunately it’s not the hostname that’s sent; we have to specify it in option host-name, and if you want an FQDN you’ll have to spell one out. It doesn’t get combined with the domain-name option automatically. I think this is a fault of the client, and I haven’t quite figured out why yet.
dhclient
On the host you need to run dhclient to request the address from the DHCP server. This has a configuration file, /etc/dhclient.conf, which is probably empty as the defaults are normally good enough. However, they don’t cover setting the host name. You’ll need to add a single line:
request host-name;
And that’s it. How you use it will vary from system to system, but on BSD you use “dhclient re0”, where re0 is the name of the Ethernet interface, and it does the rest. To make this automatic in FreeBSD add this to rc.conf:
ifconfig_re0="DHCP"
Make sure you don’t specify the hostname in rc.conf or it will take precedence, and it will normally have been added by the installer.
Why set the hostname using DHCP?
You might think it’s more useful for the hostname to be fixed on the actual hardware host, and most times it is. However, if you’re pulling disks from one machine to put them in another, you may or may not want the hostname and IP address to transfer. If you do, set them in the config file. If you want DHCP to configure things correctly even if you’ve swapped system disks around, configure things on the DHCP server. If you’re cloning system disks for a large number of servers in a cluster, DHCP is your best friend. Guess what I’m working on?
Everyone knows that you can replace the drives in a ZFS vdev with larger ones one at a time, and when the last one is inserted it automagically uses the extra space, right? But who’s actually done this? It does actually work, kind of.

However, small-scale ZFS users are booting from ZFS, and have been since FreeBSD 10. Simply swapping out the drives with larger ones isn’t going to work. It can’t work. You’ve got boot code, swap partitions and other stuff to complicate it. But it can be made to work, and here’s how.

The first thing you need to consider is that ZFS is a volume manager, and normally when you create an array (RAIDZ or mirror) it expects to manage the whole disk. When you’re creating a boot environment you need bootstraps to actually boot from it. FreeBSD can do this, and does by default since FreeBSD 10 was released in 2014. The installer handles the tricky stuff about partitioning the disks up and making sure it’ll still boot when one drive is missing.
If you look at the partition table on one of the disks in the array you’ll see something like this:
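(A hedged illustration for a 3TB disk rather than genuine output – your numbers will differ:)

=>        40  5860533088  da0  GPT  (2.7T)
          40        1024    1  freebsd-boot  (512K)
        1064         984       - free -  (492K)
        2048     4194304    2  freebsd-swap  (2.0G)
     4196352  5856335872    3  freebsd-zfs  (2.7T)
  5860532224         904       - free -  (452K)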
We’re using the modern GPT partitioning scheme. You may as well – go with the flow (but see articles about gmirror). This is a so-called 3TB SATA disk, but it’s really 2.7TB as manufacturers don’t know what a TB really is (2^40 bytes). FreeBSD does know what a TB, GB, MB and KB is in binary, so the numbers you see here won’t always match.

The disk starts with 40 sectors of GPT partition table, followed by the partitions themselves.

The first partition is 512K long and contains the freebsd-boot code. 512K is a lot of boot code, but ZFS is a complicated filing system so it needs quite a lot to be able to read it before the OS kernel is loaded.

The second partition is freebsd-swap. This is just a block of disk space the kernel can use for paging. By labelling it freebsd-swap, FreeBSD can find it and use it. On an array, each drive has a bit of paging space so the load is shared across all of them. It doesn’t have to be this way, but it’s how the FreeBSD installer does it. If you have an SLOG drive it might make sense to put all the swap on that.

The third partition is actually used for ZFS, and is the bulk of the disk.

You might be wondering what the “- free -” space is all about. For performance reasons it’s good practice to align partitions to a particular grain size, in this case apparently 1MB. I won’t go into it here, suffice to say that the FreeBSD installer knows what it’s doing, and has left the appropriate gaps.
As I said, ZFS expects to have a whole disk to play with, so normally you’d create an array with something like this:

zpool create mypool raidz1 da0 da1 da2 da3
This creates a RAIDZ1 called mypool out of four drives. But ZFS will also work with geoms (partitions). With the partition scheme shown above the creation command would be something like this (using partition 3 on each drive):
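zpool create mypool raidz1 da0p3 da1p3 da2p3 da3p3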
ZFS would use partition 3 on all four drives and leave the boot code and swap area alone. And this is effectively what the installer does. da#p2 would be used for swap, and da#p1 would be the boot code – replicated, but available on any drive that was still working that the BIOS could find.
So, if we’re going to swap out our small drives with larger ones we’re going to have to sort out the extra complications from being bootable. Fortunately it’s not too hard. But before we start, if you want the pool to expand automatically you need to set an option:
zpool set autoexpand=on zroot
However, you can also expand it manually when you online the new drive using the -e option.
From here I’m going to assume a few things. We have a RAIDZ set up across four drives: da0, da1, da2 and da3. The new drives are larger, and blank (no partition table). Sometimes you can get into trouble if they have the wrong stuff in the partition table, so blanking them is best, and if you blank the whole drive you’ll have some confidence it’s a good one. It’s also worth mentioning at some point that you can’t shrink the pool by using smaller drives, so I’ll mention it now. You can only go bigger.
You’ll also have to turn the swap off, as we’ll be pulling swap drives. However, if you’re not using any swap space you should get away with it. Run swapctl -l to see what’s being used, and use swapoff to turn off swapping on any drive we’re about to pull. Also, back up everything to tape or something before messing with any of this, right?
Ready to go? Starting with da0…
zpool offline zroot da0p3
Pull da0 and put the new drive in. It’s worth checking the console to make sure the drive you’ve pulled really is da0, and the new drive is also identified as da0. If you pull the wrong drive, put it back and use “zpool online zroot da0p3” to bring it back. The one you actually pulled will be offline.
We could hand-partition it, but it’s easier to simply copy the partition table from one of the other drives:

gpart backup da1 | gpart restore da0
This will copy the wrong partition table over, as all the extra space will be left at the end of the disk. We can fix this:

gpart resize -i 3 da0
When you don’t specify a new size with -s, this will change the third partition to take up all remaining space. There’s no need to leave an alignment gap at the end, but if you want to do the arithmetic you can. Specify the size as the remaining size/2048 to get the number of 512 byte sectors with 1Mb granularity. The only point I can see for doing this is if you’re going to add another partition afterwards and align it, but you’re not.
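From there the remaining steps are to put the boot code on the new drive and tell ZFS to rebuild onto it – a hedged sketch, using the same layout as above:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
zpool replace zroot da0p3

Wait for the resilver to finish, then repeat for da1, da2 and da3 in turn. Once the last one completes, the pool grows by itself if autoexpand is on; otherwise “zpool online -e zroot da0p3” (and so on for each drive) expands it by hand.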
I don’t know about you, but most of my ZFS arrays are large, using SAS drives connected via SAS HBAs to expanders that know which disk is where. I also have multiple redundancy in the zpool and hot spares, so I don’t need to pay a visit just to replace a failed disk. And if I do, I can get the enclosure to flash an LED over the drive I’m interested in replacing.

Except at home. At home I’ve got what a lot of people probably have: a small box with a four-drive cage running RAIDZ1 (3+1). And it’s SATA, because it really is a redundant array of independent drives. I do, of course, keep a cold spare I can swap in. Always make sure you have at least one spare drive of the right dimensions for the RAIDZ group, and know where to find it.

And to make it even more fun, you’re booting from the array itself.
After many years I started getting an intermittent CAM error, which isn’t good news. Either one of the drives was loose, or it was failing. And there’s no SAS infrastructure to help. If you’re in a similar position you’ve come to the right place.
WARNING. The examples in this article assume ada1 is the drive that’s failed. Don’t blindly copy/paste into a live system without changing this as appropriate
To change a failed or failing drive:
Find the drive
Remove the old drive
Configure the new drive
Tell the RAIDZ to use it
Finding the failed drive

First, identify your failing drive. The console message will probably tell you which one. ZFS won’t, unless it’s failed to the extent it’s been offlined. “zpool status” may tell you everything’s okay, but the console may be telling you:
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Retrying command, 0 more tries remain
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 70 20 40 f6 00 00 01 00 00
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
Feb 2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Error 5, Retries exhausted
So this tells you that ada1 is in a bad way. But which is ada1? It might be the second one across in your enclosure, or it might not. You’ll need to do some hunting to identify it positively.
Luckily most disks have their serial number printed on the label, and this is available to the host. So finding the serial number for ada1 and matching it to the disk label is the best way – if you’ve only got four drives to check, anyway.
I know of at least five ways to get a disk serial number in FreeBSD, and I’ll list them all in case one stops working:
dmesg
Just grep for the drive name you’re interested in (ada1). This is probably a good idea as it may give you more information about the failure. If FreeBSD can get the serial number it will display it as it enumerates the drive.
camcontrol identify ada1
This gives you more information than you ever wanted about a particular drive. This does include the serial number.
geom disk list
This will print out information on all the geoms (i.e. drives), including the serial number as “ident”.
diskinfo -s /dev/ada1
This simply prints the ident for the drive in question. You can specify multiple arguments so diskinfo -s /dev/ada? works (up to ten drives).
smartctl -i /dev/ada1
Smartctl was originally a utility for managing SATA drives, but later versions have been updated to read information from SAS drives too, and you should probably install it. It’s part of Smartmontools, and it gives you the ATA information for a drive, including error rates, current temperature and suchlike – stuff that camcontrol can’t.
Whichever method works for you, once you’ve got your serial number you can identify the drive. Except, of course, if your drive is completely fubared. In that case get the serial numbers of the drives that aren’t and identify it by elimination. Also worth a mention is gpart list, which will produce a lot of information about all or a specific drive’s logical layout, but not the serial number. It may offer some clues.
Saving the partition table from the failed drive

In readiness for replacing it, save its partition table if you can:

gpart backup ada1 > gpart.ada1
If you can’t read it, just save one from a different drive in the vdev set – they should be identical, right? Once the replacement drive is installed you can also copy a good drive’s table straight across with gpart backup ada2 | gpart restore ada1.
Swapping the bad drive out

Next, pull your faulty drive and replace it with a new one. You might want to turn the power off, although it’s not necessary. However, it’s probably safer to reboot as we’re messing with the boot array.

Try zpool status, and you’ll see something like this:
  pool: zr
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  scan: scrub in progress since Sun Feb 2 17:42:38 2025
        356G scanned at 219M/s, 175G issued at 108M/s, 10.4T total
        0 repaired, 1.65% done, no estimated completion time
config:

        NAME                   STATE     READ WRITE CKSUM
        zr                     DEGRADED     0     0     0
          raidz1-0             DEGRADED     0     0     0
            ada0p3             ONLINE       0     0     0
            16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3
            ada2p3             ONLINE       0     0     0
            ada3p3             ONLINE       0     0     0
It’s complaining because it can’t find the drive with the identity 16639665213947936. ZFS doesn’t care where the drives in a vdev are plugged in, only that they exist somewhere. Device ada1 is ignored – as far as ZFS is concerned it’s just some random disk it isn’t interested in.
Setting up the replacement drive

So let’s get things ready to insert the new drive in the RAIDZ. First restore its partition table:

gpart restore /dev/ada1 < gpart.ada1
If you see “gpart: geom ‘ada1’: File exists”, just run “gpart destroy -F ada1”. Without the -F it may say the drive is in use, which we know it isn’t.
Next, if you’ve got a scrub going on, stop it with “zpool scrub -s zr”.
As a sanity check, run “gpart show” and you should see four identical drives.
Boot sector and insertion
Now this is a boot-from-ZFS situation, common on a home server but not a big one, so the guides from Solaris won’t tell you about this step. To make sure the system boots you need to have the boot code on every drive (ideally). Do this with:
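gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1

(That’s the usual BIOS/GPT recipe with freebsd-boot in partition 1, as elsewhere in these examples; a UEFI-only setup would want the efi partition dealt with instead.) Then tell ZFS to rebuild onto the new drive:

zpool replace zr ada1p3

zpool status should now show the resilver under way: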
  pool: zr
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Feb 2 18:44:58 2025
        1.83T scanned at 429M/s, 1.60T issued at 375M/s, 10.4T total
        392G resilvered, 15.45% done, 0 days 06:48:34 to go
config:

        NAME                     STATE     READ WRITE CKSUM
        zr                       DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            ada0p3               ONLINE       0     0     0
            replacing-1          UNAVAIL      0     0     0
              16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3             ONLINE       0     0     0
            ada2p3               ONLINE       0     0     0
            ada3p3               ONLINE       0     0     0

errors: No known data errors
It’ll chug along in the background resilvering the whole thing. You can carry on using the system, but its performance may be degraded until it’s done. Take a look at the console to make sure there are no CAM errors indicating that the problem wasn’t the drive at all, and go to bed. If you reboot or have a power cut while it’s rebuilding it will start from scratch, so try to avoid both!
In the morning, zpool status will return to this, and all will be well in the world. But don’t forget to order another cold spare so you’re ready when it happens again.
  pool: zr
 state: ONLINE
  scan: resilvered 2.47T in 0 days 11:52:45 with 0 errors on Mon Feb 3 06:37:43 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zr          ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

errors: No known data errors
As a final tip, if you use diskinfo -v adaX it will tell you the type of drive and other information, which is really handy if you’re ordering another cold spare.
# diskinfo -v ada2
ada2
        512             # sectorsize
        3000592982016   # mediasize in bytes (2.7T)
        5860533168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        5814021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        HGST HUS724030ALE640    # Disk descr. <- Drive to order on eBay!
        PK2234P9GWJ42Y  # Disk ident.
        No              # TRIM/UNMAP support
        7200            # Rotation rate in RPM
        Not_Zoned       # Zone Mode
FreeBSD may be the hottest operating system available, but having hot hardware isn’t so good. Modern drives and CPUs can report their temperature, but it’s not easy to see it.
I’ve produced an example script that will report the temperature of whatever it can, with the idea that it can form the basis of whatever you really need to do. I have a multi-server monitoring system using the same methods.
Getting the CPU core temperature is a simple sysctl variable, but it only appears if you have the coretemp.ko module loaded or compiled into the kernel. But coretemp.ko only works for Intel processors; for AMD you need amdtemp.ko instead.
The script tries to determine the CPU type the best way I know how, and loads the appropriate module if necessary. If it loads the module, it unloads it at the end to leave things as it found them. You can omit the unload so the module stays loaded after the first run, or put it permanently in loader.conf if you prefer. But this is only an example, and I prefer to keep my scripts independent of host configuration.
Next you need a way to get the temperature from all your drives. This isn’t built in to FreeBSD, but you can use the excellent smartmontools (https://www.smartmontools.org) by Bruce Allen and Christian Franke.

This was originally intended to access the SMART reporting on ATA drives, but will now extract information from SAS units too. To get smartmontools, and the “smartctl” utility in particular, you can build from ports with:

cd /usr/ports/sysutils/smartmontools
make install clean
You can also install it as a binary package with: “pkg install smartmontools”
The script tries to enumerate the drives and CPUs on the system. ATA drives are in /dev/ and begin “ada”, SCSI and USB drives begin “da”. The trick is to figure out which devices are drives and which are partitions or slices within a drive – I don’t have a perfect method.
SCSI drives that aren’t disks (i.e. tape) start “sa”, and return some really weird stuff when smartctl queries them. I’ve tested the script with standard LTO tape drives, but you’ll probably need to tweak it for other things. (Do let me know).
Figuring out the CPUs is tricky, as discrete CPUs and cores within a single chip appear the same. The script simply goes on the cores, which results in the same temperature being reported for each.
You can override any of the enumeration by simply assigning the appropriate devices to a list, but where’s the fun? Seriously, this example shows how you can enumerate devices and it’s useful when you’re monitoring disparate hosts using the same script.
Finally, there are three loops that read the temperature for each device type into “temp” and then print it. Do whatever you want – call “shutdown -p now” if you think something’s too hot; autodial the fire brigade or, as I do, send yourself an email.
The Script
#!/bin/sh
# FreeBSD Temperature monitoring example script
# (c) FJL 2024 frank@fjl.co.uk
# Please feel free to use this as an example for a few techniques
# including enumerating devices on a host and extracting temperature
# information.

# Full path to utilities in case run with no PATH set
GREP=/usr/bin/grep
SMARTCTL=/usr/local/sbin/smartctl
CUT=/usr/bin/cut
SYSCTL=/sbin/sysctl

# Load the AMD CPU monitoring driver if necessary
if [ ! $($SYSCTL -n dev.cpu.0.temperature 2>/dev/null) ]
then
        # Let's try to find out if we have Intel or
        # AMD processor and select correct kernel module
        if $SYSCTL -n hw.model | $GREP AMD >/dev/null
        then
                tempmodule=amdtemp
        else
                tempmodule=coretemp
        fi
        # Load the CPU temp kernel module
        kldload $tempmodule
        # Set command to unload it when we're done (optional)
        unload="kldunload $tempmodule"
fi

# Enumerate SATA, USB and SAS disks - everything
# in /dev/ starting da or ada
disks=$(find /dev -depth 1 -type c \( -name da[0-9+] -o -name ada[0-9+] \) | cut -c 6- | sort)

# Enumerate other SCSI devices, starting in sa.
# Normally tape drives. May need tweaking!
scsis=$(find /dev -depth 1 -type c \( -name sa[0-9+] \) | cut -c 6- | sort)
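# Enumerate the CPUs - one entry per core via hw.ncpu.
# (A sketch: this assignment is assumed, since $cpus is used below.)
cpus=$(seq 0 $(($($SYSCTL -n hw.ncpu) - 1)))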
# Print all the disks
for disk in $disks
do
        temp=$($SMARTCTL -a /dev/$disk | $GREP Temperature_Celsius | $CUT -w -f 10)
        echo "$disk: ${temp}C"
done

# Print all the SCSI devices (e.g. tapes)
# NB. This will probably take a lot of fiddling as SCSI units return all sorts to smartctl
# Note the -T verypermissive. see man smartctl for details.
for scsi in $scsis
do
        temp=$($SMARTCTL -a -T verypermissive /dev/$scsi | $GREP "Current Drive Temperature" | $CUT -w -f 4)
        echo "$scsi: ${temp}C"
done

# Print all the CPUs
for cpu in $cpus
do
        temp=$($SYSCTL -n dev.cpu.$cpu.temperature | $CUT -f 1 -d .)
        echo "CPU$cpu: ${temp}C"
done

# Unload the CPU temp kernel module if we loaded it (optional)
$unload
There are two mysterious things on ZFS that cause a lot of confusion: the ZIL and the SLOG. This article is about what they are and why you should care, or not care, about them. But I’ll come to them later. Instead I’ll start with POSIX, and what it says about writing stuff to disk files.
When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.

With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won’t get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn’t guarantee it’s written, but that’s the expectation.
You’d want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.
Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).

To get around this, ZFS maintains a ZIL – ZFS Intent Log.

In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it’s not a physical thing; it’s a concept or list. It contains transaction groups that need to be completed in order.
The ZIL can be physically stored in three possible places…
In-Memory (Volatile Storage):
This is the default location. Initially, all write operations are buffered in RAM. This is where they are held before being committed to persistent storage. This kind of ZIL is volatile because it’s not backed by any permanent storage until written to disk.
Volatility doesn’t matter, because ZFS guarantees consistency with transaction groups (TXGs). If the power goes off and the in-RAM ZIL is lost, the transactions are never applied; but the file system is in a consistent state.
In-Pool (Persistent Storage):
Without a dedicated log device (the default), the ZIL entries are written to the main storage pool in transaction groups. This happens for both synchronous and asynchronous writes but is more critical for synchronous writes to ensure data integrity in case of system crashes or power failures. All transactions must take place in order, so they all need to be committed to non-volatile storage before a synchronous write can return.
SLOG (Separate Intent Log Device):
For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically low-latency, high-speed storage like a short-stroked Raptor, enterprise SSD or NVRAM. ZFS writes the log entries there before they’re committed to the pool’s main storage.
By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction is stored in non-volatile storage before the write returns, and if you’re doing a lot of synchronous writes then storing them on a high-speed SLOG device helps. But only if the SLOG device is substantially faster than an array of standard drives. And it only matters if you do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still.
I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.
If you still think you need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricey, but it doesn’t need to be very large. You can add it to a zpool with:

zpool add poolname log /dev/daX

Where daX is the drive name, obviously.
As I mentioned, the SLOG doesn’t need to be large at all. It only has to cope with five seconds of writes, as that’s the maximum amount of time data is “allowed” to reside there. If you’re using NFS over 10Gbit Ethernet the throughput isn’t going to be above 1.25GB a second. Assuming that’s flat-out synchronous writes, multiplying it by five seconds gives less than 8GB. Any more would be unused.
If you’ve got a really critical system you can add a mirrored pair of SLOG drives to a pool thus:

zpool add poolname log mirror /dev/daX /dev/daY
You can also remove them with something like:

zpool remove poolname /dev/daY
This may be useful if adding an SLOG doesn’t give you the performance boost you were hoping for. It’s very niche!
I’ve had a problem with mysql failing to start with error 35 in the log file:
InnoDB: Unable to lock ./ibdata1, error: 35
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.
What to do? Google and you get a lot of Linux people saying that the answer is to reboot the box. Hmm. Well you don’t have to.
What causes the error is mysqld crashing, usually when system resources are exhausted. Rebooting will, indeed, unlock ibdata1 but so will killing the process that locks it. Yet the server isn’t running, so how can this be? Well actually part of it is – just not the part the service manager sees.
Run “ps -auxww | grep mysql” and you’ll find a few mysqld processes still hanging around. Send them a kill, wait for it to work and then restart. Obviously you can only do this and expect it to work if you’ve sorted out the resource problem.
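A hedged sketch of that sequence on FreeBSD (the PID is illustrative, and the rc script name, mysql-server for the port, depends on how MySQL was installed):

ps -auxww | grep mysql          # spot the leftover mysqld processes
kill 12345                      # send each one a TERM (illustrative PID) and wait for it to exit
service mysql-server restart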