Configuring host names using DHCP

Background

If you know all about DHCP, feel free to skip this bit.

In the Unix world the network administrator assigns every host (networked computer) the stuff it needs to operate on the network – its name and IP address. Other hosts can find it by looking its name up on the DNS server (or the hosts file, before DNS was invented) and start talking.

The host knew its name and IP address because they were set in a configuration file, along with other network settings like gateway routers and DNS servers.

Microsoft didn’t use IP networking for a long time, using NetBEUI and other protocols to dispense with a network administrator and configure stuff automatically over Ethernet (mainly). Or was that NetBIOS or WINS or ??? Anyway, the usual bugger’s muddle. When Microsoft finally realised the Internet was Important, Windoze machines also worked with Unix networking (IP, DNS and other good things). They stuck with versions of their own crazy file sharing system, but that’s another story.

Meanwhile, it was realised that editing a configuration file on every host was a bit of a problem, especially if you had to edit it everywhere if you changed anything network-ish. And Dynamic Host Configuration Protocol (DHCP) was invented in the early 1990s. This combined the best of both worlds – automatic configuration with a network administrator in charge.

DHCP operates using a DHCP server. When a host boots it can get its network settings from the DHCP server before it knows anything about the IP network. It effectively does this using an Ethernet (layer 2) broadcast packet, but the details are complicated and not relevant here.

The DHCP server sees this request for details and sends the host back its settings. These could be the next free IP address from a pool, together with other important information like the subnet mask, gateway, local DNS server and domain name. The host says “thank you very much” and configures itself as a fine upstanding and proper member of the domain. Don’t confuse this domain with the Microsoft Domain stuff, BTW. They used the name wrongly. This is the DNS-type domain.

Manual allocation

I said in the bit you skipped reading that the DHCP server could send the client the next free IP address from a pool. But you can also send the precise details you want the host configured with. This means you can keep your network configuration in one file on the DHCP server rather than in startup files on every host, see how everything is set up and make small or large changes with a text editor. Almost. You’ll also need to edit the files on your DNS server to make the name-to-IP-address translation work. Having both servers on the same machine makes sense.

How does the DHCP server know who’s asking, and therefore which configuration to send? Easy: it goes by the Ethernet MAC address.

Assuming you know how to configure DNS, here’s how you do it.

dhcpd

You’ll need the DHCP daemon, “dhcpd”, from the Internet Software Consortium. Compile it or install the package as necessary. It has a configuration file called dhcpd.conf (usually in /usr/local/etc or /etc) which is where you set everything up. It comes with examples, but you’re looking at something like this.

Let’s assume your organisation is called flubnutz.com and the DHCP server is on the LAN in the London office – i.e. london.flubnutz.com. The hosts on the LAN belong to tom, dick and harry, you’ve got a printer called “printer” and a router called “gateway”, and the local IP addresses are 192.168.3.x with a 255.255.255.0 subnet mask.

dhcpd.conf will start something like this

default-lease-time 43200;
max-lease-time 86400;

option domain-name "london.flubnutz.com";
option domain-name-servers 192.168.3.219;

subnet 192.168.3.0 netmask 255.255.255.0 {
    range 192.168.3.100 192.168.3.163;
    option broadcast-address 192.168.3.255;
    option routers 192.168.3.2;
}

The lease times are how long you’re going to allow a host to hold on to a dynamic address from the pool. If it held it forever, you’d eventually run out. “default-lease-time” is how long the address lasts (in seconds) if the client doesn’t ask for anything specific, and “max-lease-time” for when the client does ask but is being greedy. For testing purposes setting these to 60 seconds is not unreasonable. The values above represent 12 and 24 hours.
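
As a quick sanity check on those lease values, the conversion from seconds to hours is trivial shell arithmetic (numbers taken from the dhcpd.conf above):

```shell
# Lease times from the example dhcpd.conf, converted from seconds to hours
echo $((43200 / 3600))   # default-lease-time: prints 12
echo $((86400 / 3600))   # max-lease-time: prints 24
```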


Next come some options. These are fields sent in the DHCP reply – the stuff you can set on the client. There are a lot of them – see “man dhcp-options” on any Unix-compatible system.
Here I want everything on the LAN to know it’s part of london.flubnutz.com, and that the DNS server is at 192.168.3.219. Every host asking the DHCP server gets these options set.

The next definition is a subnet. Any IP address in that subnet gets those options set – in this case the broadcast address and gateway router. These could have been universal options, but for the sake of an example I put them inside the { and }.

Note there’s also a “range” statement in the subnet definition. This is the range of dynamically allocated IP addresses – in this case there are 64, between 100 and 163, and they’re there to cope with people’s smartphones and when people turn up from head office with their swanky laptops. The range doesn’t have to cover the complete subnet, but it can’t be larger.
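
If you want to double-check the size of a pool like this, it’s just inclusive-range arithmetic:

```shell
# Number of dynamic addresses in the range 192.168.3.100 - 192.168.3.163
echo $((163 - 100 + 1))   # prints 64
```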

And that’s pretty much it for the main part. This just leaves the manual definitions which take the form of host statements that look like this:

host tom {
    hardware ethernet c4:34:6b:21:94:10;
    option host-name "tom.london.flubnutz.com";
    fixed-address 192.168.3.165;
}
host dick {
    hardware ethernet 3c:4a:92:77:af:4e;
    option host-name "dick.london.flubnutz.com";
    fixed-address 192.168.3.166;
}
host printer {
    hardware ethernet 2c:76:8a:ad:71:ff;
    option host-name "printer.london.flubnutz.com";
    fixed-address 192.168.3.200;
}

And so on…

The DHCP server recognises each host by its MAC address, specified in each block. Other forms of hardware address are possible, but it’s probably going to be a MAC on Ethernet. The fixed address is the one that will be assigned. The subnet definition at the top will be used for the subnet mask, and the other options will be taken from the global options.

If you want something special for one host, just add the option to its definition. For example, if you wanted the printer to use a different gateway router, just add an “option routers 192.168.3.254;” and it’d take precedence.
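
A per-host override might look like this (a sketch following the syntax above; the alternative gateway address is made up for the example):

```
host printer {
    hardware ethernet 2c:76:8a:ad:71:ff;
    option host-name "printer.london.flubnutz.com";
    fixed-address 192.168.3.200;
    option routers 192.168.3.254;   # overrides the subnet's gateway for this host only
}
```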

The host statement needs a name or IP address but we’re not using it for anything here. In fact it can be anything you like in this instance. Unfortunately it’s not the hostname that’s sent; we have to specify it in option host-name, and if you want an FQDN you’ll have to specify one. It doesn’t append the domain-name option automatically. I think this is a fault of the client, and I haven’t quite figured out why yet.

dhclient

On the host you need to run dhclient to request the address from the DHCP server. This has a configuration file, /etc/dhclient.conf. It’s probably empty, as the defaults are normally good enough. However, the defaults don’t include setting the host name. You’ll need to add a single line:

request host-name;
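
One caveat worth knowing: a request statement in dhclient.conf replaces the default request list rather than adding to it, so if you find other settings stop arriving, list everything you want explicitly. Something like this (option names as per dhclient.conf(5)):

```
request subnet-mask, broadcast-address, routers,
        domain-name, domain-name-servers, host-name;
```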

And that’s it. How you use it will vary from system to system, but on BSD you run “dhclient re0”, where re0 is the name of the Ethernet interface, and it does the rest. To make this automatic in FreeBSD add this to rc.conf:

ifconfig_re0="DHCP"

Make sure you don’t specify the hostname in rc.conf or it will take precedence, and it will normally have been added by the installer.

Why set the hostname using DHCP?

You might think it more useful for the hostname to be fixed on the actual hardware host, and most times it is. However, if you’re pulling disks from one machine to put them in another you may or may not want the hostname and IP address to transfer. If you do, set them in the config file. If you want DHCP to configure things correctly even if you’ve swapped system disks around, configure things on the DHCP server. If you’re cloning system disks for a large number of servers in a cluster, DHCP is your best friend. Guess what I’m working on?

ZFS In-place disk size upgrade

Everyone knows that you can replace the drives in a ZFS vdev with larger ones one at a time, and when the last one is inserted it automagically uses the extra space, right?

But who’s actually done this? It does actually work, kind of.

However, small scale ZFS users are booting from ZFS, and have been since FreeBSD 10. Simply swapping out the drives with larger ones isn’t going to work. It can’t work. You’ve got boot code, swap files and other stuff to complicate it. But it can be made to work, and here’s how.

The first thing you need to consider is that ZFS is a volume manager, and normally when you create an array (RAIDZ or mirror) it expects to manage the whole disk. When you’re creating a boot environment you need bootstraps to actually boot from it. FreeBSD can do this, and does by default since FreeBSD 10 was released in 2014. The installer handles the tricky stuff about partitioning the disks up and making sure it’ll still boot when one drive is missing.

If you look at the partition table on one of the disks in the array you’ll see something like this:

=>          40  5860533088  ada0  GPT  (2.7T)
            40        1024     1  freebsd-boot  (512K)
          1064         984        - free -  (492K)
          2048     4194304     2  freebsd-swap  (2.0G)
       4196352  5856335872     3  freebsd-zfs  (2.7T)
    5860532224         904        - free -  (452K)

So what’s going on here?

We’re using the modern GPT partitioning scheme. You may as well – go with the flow (but see articles about gmirror). This is a so-called 3Tb SATA disk, but it’s really 2.7Tb because manufacturers don’t use the binary definition of a Tb (2^40 bytes). FreeBSD does know what a Tb, Gb, Mb and Kb is in binary, so the numbers you see here won’t always match.
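
You can reproduce the 2.7 figure with integer arithmetic (a quick sketch using this drive’s actual byte count, which appears in the diskinfo output later):

```shell
# 3,000,592,982,016 bytes expressed in tenths of a binary Tb (2^40 = 1099511627776 bytes)
echo $((3000592982016 * 10 / 1099511627776))   # prints 27, i.e. 2.7
```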

The disk starts with 40 sectors of GPT partition table, followed by the partitions themselves.

The first partition is 512K long and contains the freebsd-boot code. 512K is a lot of boot code, but ZFS is a complicated filing system so it needs quite a lot to be able to read it before the OS kernel is loaded.

The second partition is freebsd-swap. This is just a block of disk space the kernel can use for paging. By labelling it freebsd-swap, FreeBSD can find it and use it. On an array, each drive has a bit of paging space so the load is shared across all of them. It doesn’t have to be this way, but it’s how the FreeBSD installer does it. If you have an SLOG drive it might make sense to put all the swap on that.
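
The sizes in the gpart listing are easy to verify; for instance the 4194304-sector swap partition really is the 2.0G shown:

```shell
# 4194304 sectors x 512 bytes each, divided by 2^30 to get Gb (binary)
echo $((4194304 * 512 / 1073741824))   # prints 2
```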

The third partition is actually used for ZFS, and is the bulk of the disk.

You might be wondering what the “- free -” space is all about. For performance reasons its good practice to align partitions to a particular grain size, in this case it appears to be 1Mb. I won’t go into it here, suffice to say that the FreeBSD installer knows what it’s doing, and has left the appropriate gaps.
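
You can check the alignment from the listing above: a start sector is 1Mb-aligned if it’s a multiple of 2048 (2048 x 512-byte sectors = 1Mb), and both the swap and ZFS partitions pass:

```shell
echo $((2048 % 2048))      # freebsd-swap start sector: prints 0, so aligned
echo $((4196352 % 2048))   # freebsd-zfs start sector: prints 0, so aligned
```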

As I said, ZFS expects to have a whole disk to play with, so normally you’d create an array with something like this:

zpool create mypool raidz1 da0 da1 da2 da3

This creates a RAIDZ1 called mypool out of four drives. But ZFS will also work with geoms (partitions). With the partition scheme shown above the creation command would be:

zpool create mypool raidz1 da0p3 da1p3 da2p3 da3p3

ZFS would use partition 3 on all four drives and leave the boot code and swap area alone. And this is effectively what the installer does. da#p2 would be used for swap, and da#p1 would be the boot code – replicated but available on any drive that was still working that the BIOS could find.

So, if we’re going to swap out our small drives with larger ones we’re going to have to sort out the extra complications from being bootable. Fortunately it’s not too hard. But before we start, if you want the pool to expand automatically you need to set an option:

zpool set autoexpand=on zroot

However, you can also expand it manually when you online the new drive using the -e option.

From here I’m going to assume a few things. We have a RAIDZ set up across four drives: da0, da1, da2 and da3. The new drives are larger, and blank (no partition table). Sometimes you can get into trouble if they have the wrong stuff in the partition table, so blanking them is best, and if you blank the whole drive you’ll have some confidence it’s a good one. It’s also worth mentioning at some point that you can’t shrink the pool by using smaller drives, so I’ll mention it now. You can only go bigger.

You’ll also have to turn the swap off, as we’ll be pulling swap drives. However, if you’re not using any swap space you should get away with it. Run swapctl -l to see what’s being used, and use swapoff to turn off swapping on any drive we’re about to pull. Also, back up everything to tape or something before messing with any of this, right?

Ready to go? Starting with da0…

zpool offline zroot da0p3

Pull da0 and put the new drive in. It’s worth checking the console to make sure the drive you’ve pulled really is da0, and the new drive is also identified as da0. If you pull the wrong drive, put it back and use “zpool online zroot da0p3” to bring it back. The one you actually pulled will be offline.

We could hand partition it, but it’s easier to simply copy the partition table from one of the other drives:

gpart backup da1 | gpart restore da0

This will copy over a partition table that’s wrong for the new disk, as all the extra space will be left unused at the end. We can fix this:

gpart resize -i 3 da0

When you don’t specify a new size with -s, this will change the third partition to take up all remaining space. There’s no need to leave an alignment gap at the end, but if you want to do the arithmetic you can: round the remaining sector count down to a multiple of 2048 to keep 1Mb granularity (2048 x 512-byte sectors). The only point I can see for doing this is if you’re going to add another partition afterwards and align it, but you’re not.
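
If you do want the arithmetic, the installer’s own numbers from the gpart listing earlier serve as a worked example:

```shell
start=4196352                          # first sector of the freebsd-zfs partition
usable=$((40 + 5860533088))            # first usable sector + usable sector count, from gpart show
remaining=$((usable - start))
aligned=$((remaining / 2048 * 2048))   # round down to 1Mb (2048-sector) granularity
echo $aligned                          # prints 5856335872, matching partition 3 above
```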

Next we’ll add the boot code:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

And finally put it in the array

zpool replace zroot da0p3

Run zpool status and watch as the array is rebuilt. This may take several hours, or even days.

Once the re-silvering is complete and the array looks good we can do the same with the next drive:

zpool offline zroot da1p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da1
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da1
zpool replace zroot da1p3

Wait for resilvering to finish

zpool offline zroot da2p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da2
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da2
zpool replace zroot da2p3

Wait for resilvering to finish

zpool offline zroot da3p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da3
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da3
zpool replace zroot da3p3

Wait for resilvering to finish. Your pool is now expanded!

If you didn’t have autoexpand enabled you’ll need to expand each drive manually with “zpool offline zroot da#p3” followed by “zpool online -e zroot da#p3”.

FreeBSD ZFS RAIDZ failed disk replacement

(for the rest of us)

I don’t know about you, but most of my ZFS arrays are large, using SAS drives connected via SAS HBAs to expanders that know which disk is where. I also have multiple redundancy in the zpool and hot spares, so I don’t need to pay a visit just to replace a failed disk. And if I do, I can get the enclosure to flash an LED over the drive I’m interested in replacing.

Except at home. At home I’ve got what a lot of people probably have: a small box with a four-drive cage running RAIDZ1 (3+1). And it’s SATA, because it really is a redundant array of independent drives. I do, of course, keep a cold spare I can swap in. Always make sure you have at least one spare drive of the dimensions used in the RAIDZ group, and know where to find it.

And to make it even more fun, you’re booting from the array itself.

After many years I started getting an intermittent CAM error, which isn’t good news. Either one of the drives was loose, or it was failing. And there’s no SAS infrastructure to help. If you’re in a similar position you’ve come to the right place.

WARNING. The examples in this article assume ada1 is the drive that’s failed. Don’t blindly copy/paste into a live system without changing this as appropriate.

To change a failed or failing drive:

  • Find the drive
  • Remove the old drive
  • Configure the new drive
  • Tell the RAIDZ to use it

Finding the failed drive

First, identify your failing drive. The console message will probably tell you which one. ZFS won’t, unless it’s failed to the extent it’s been offlined. “zpool status” may tell you everything’s okay, but the console may be telling you:

Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Retrying command, 0 more tries remain
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 70 20 40 f6 00 00 01 00 00
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Error 5, Retries exhausted

So this tells you that ada1 is in a bad way. But which is ada1? It might be the second one across in your enclosure, or it might not. You’ll need to do some hunting to identify it positively.

Luckily most disks have their serial number printed on the label, and this is available to the host. So finding the serial number for ada1 and matching it to the disk label is the best way – if you’ve only got four drives to check, anyway.

I know of at least five ways to get a disk serial number in FreeBSD, and I’ll list them all in case one stops working:

dmesg

Just grep for the drive name you’re interested in (ada1). This is probably a good idea as it may give you more information about the failure. If FreeBSD can get the serial number it will display it as it enumerates the drive.

geom disk list

This will print out information on all the geoms (i.e. drives), including the serial number as “ident”

camcontrol identify ada1

This gives you more information than you ever wanted about a particular drive. This does include the serial number.

diskinfo -s /dev/ada1

This simply prints the ident for the drive in question. You can specify multiple arguments so diskinfo -s /dev/ada? works (up to ten drives).

smartctl -i /dev/ada1

Smartctl is a utility for managing SATA drives (not SAS!), and you should probably install it. It’s part of Smartmontools, and it gives you the ATA information for the drive, including error rates, current temperature and suchlike.

Whichever method works for you, once you’ve got your serial number you can identify the drive. Except, of course, if your drive is completely fubared. In that case get the serial numbers of the drives that aren’t and identify it by elimination.

Saving the partition table from the failed drive.

In readiness for replacing it, save its partition table if you can:

gpart backup ada1 > gpart.ada1

If you can’t read it, just save another one from a different drive in the vdev set – they should be identical, right?

Swapping the bad drive out.

Next, pull your faulty drive and replace it with a new one. You might want to turn the power off, although it’s not necessary. However, it’s probably safer to reboot as we’re messing with the boot array.

Try zpool status, and you’ll see something like this:

  pool: zr
 state: DEGRADED
status: One or more devices could not be opened.
        Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
  scan: scrub in progress since Sun Feb  2 17:42:38 2025
        356G scanned at 219M/s, 175G issued at 108M/s, 10.4T total
        0 repaired, 1.65% done, no estimated completion time
config:

        NAME                   STATE     READ WRITE CKSUM
        zr                     DEGRADED     0     0     0
          raidz1-0             DEGRADED     0     0     0
            ada0p3             ONLINE       0     0     0
            16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3
            ada2p3             ONLINE       0     0     0
            ada3p3             ONLINE       0     0     0

It’s complaining because it can’t find the drive with the identity 16639665213947936. ZFS doesn’t care where the drives in a vdev are plugged in, only that they exist somewhere. Device ada1 is ignored – it’s just a slot with some random disk in it that ZFS isn’t interested in.

Setting up the replacement drive

So let’s get things ready to insert the new drive in the RAIDZ.

First restore its partition table:

gpart restore /dev/ada1 < gpart.ada1

If you see “gpart: geom ‘ada1’: File exists”, just run “gpart destroy -F ada1”. Without the -F it may say the drive is in use, which we know it isn’t.

Next, if you’ve got a scrub going on, stop it with “zpool scrub -s zr”

As a sanity check, run “gpart show” and you should see four identical drives.

Boot sector and insertion

Now this is a boot-from-ZFS situation, common on a home server but not on a big one. The guides from Solaris won’t tell you about this step. To make sure the system boots you need to have the boot code on every drive (ideally). Do this with:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1

And Finally… tell ZFS to insert the new drive:

zpool replace zr ada1p3

Run “zpool status” and you’ll see it working:

  pool: zr
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Feb  2 18:44:58 2025
        1.83T scanned at 429M/s, 1.60T issued at 375M/s, 10.4T total
        392G resilvered, 15.45% done, 0 days 06:48:34 to go
config:

        NAME                     STATE     READ WRITE CKSUM
        zr                       DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            ada0p3               ONLINE       0     0     0
            replacing-1          UNAVAIL      0     0     0
              16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3             ONLINE       0     0     0
            ada2p3               ONLINE       0     0     0
            ada3p3               ONLINE       0     0     0

errors: No known data errors


It’ll chug along in the background re-silvering the whole thing. You can carry on using the system, but its performance may be degraded until it’s done. Take a look at the console to make sure there are no CAM errors indicating that the problem wasn’t the drive at all, and go to bed.
If you reboot or have a power cut while it’s rebuilding it will start from scratch, so try to avoid both!

In the morning, zpool status will return to this, and all will be well in the world. But don’t forget to order another cold spare so you’re ready when it happens again.

  pool: zr
 state: ONLINE
  scan: resilvered 2.47T in 0 days 11:52:45 with 0 errors on Mon Feb  3 06:37:43 2025
config:

        NAME          STATE     READ WRITE CKSUM
        zr            ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada0p3    ONLINE       0     0     0
            ada1p3    ONLINE       0     0     0
            ada2p3    ONLINE       0     0     0
            ada3p3    ONLINE       0     0     0

errors: No known data errors

As a final tip, if you use diskinfo -v adaX it will tell you the type of drive and other information, which is really handy if you’re ordering another cold spare.

# diskinfo -v ada2
ada2
        512             # sectorsize
        3000592982016   # mediasize in bytes (2.7T)
        5860533168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        5814021         # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        HGST HUS724030ALE640    # Disk descr. <- Drive to order on eBay!
        PK2234P9GWJ42Y  # Disk ident.
        No              # TRIM/UNMAP support
        7200            # Rotation rate in RPM
        Not_Zoned       # Zone Mode


How hot is your FreeBSD machine?

FreeBSD may be the hottest operating system available, but having hot hardware isn’t so good. Modern drives and CPUs can report their temperature, but it’s not easy to see it.

I’ve produced an example script that will report the temperature of whatever it can, with the idea that it can form the basis of whatever you really need to do. I have a multi-server monitoring system using the same methods.

Getting the CPU core temperature is a simple sysctl variable, but it only appears if you have the coretemp.ko module loaded or compiled into the kernel. But coretemp.ko only works for Intel processors; for AMD you need amdtemp.ko instead.

The script tries to determine the CPU type the best way I know how, and loads the appropriate module if necessary. If it loads the module, it unloads it at the end to leave things as it found them. You can omit the unload so the module stays loaded after the first run, or put it permanently in loader.conf if you prefer. But this is only an example, and I prefer to keep my scripts independent of host configuration.

Next you need a way to get the temperature from all your drives. This isn’t built in to FreeBSD but you can use the excellent smartmontools https://www.smartmontools.org by Bruce Allen and Christian Franke.

This was originally intended to access the SMART reporting on ATA drives, but will now extract information from SAS units too. To get smartmontools, and the “smartctl” utility in particular, you can build from ports with:

cd /usr/ports/sysutils/smartmontools
make install clean

You can also install it as a binary package with: “pkg install smartmontools”

The script tries to enumerate the drives and CPUs on the system. ATA drives are in /dev/ and begin “ada”, SCSI and USB drives begin “da”. The trick is to figure out which devices are drives and which are partitions or slices within a drive – I don’t have a perfect method.

SCSI drives that aren’t disks (i.e. tape) start “sa”, and return some really weird stuff when smartctl queries them. I’ve tested the script with standard LTO tape drives, but you’ll probably need to tweak it for other things. (Do let me know).

Figuring out the CPUs is tricky, as discrete CPUs and cores within a single chip appear the same. The script simply goes on the cores, which results in the same temperature being reported for each.

You can override any of the enumeration by simply assigning the appropriate devices to a list, but where’s the fun? Seriously, this example shows how you can enumerate devices and it’s useful when you’re monitoring disparate hosts using the same script.

Finally, there are three loops that read the temperature for each device type into “temp” and then print it. Do whatever you want – call “shutdown -p now” if you think something’s too hot, autodial the fire brigade or, as I do, send yourself an email.

The Script

#!/bin/sh
# FreeBSD Temperature monitoring example script
# (c) FJL 2024 frank@fjl.co.uk
# Please feel free to use this as an example for a few techniques
# including enumerating devices on a host and extracting temperature
# information.

# Full path to utilities in case run with no PATH set
GREP=/usr/bin/grep
SMARTCTL=/usr/local/sbin/smartctl
CUT=/usr/bin/cut
SYSCTL=/sbin/sysctl

# Load the CPU temperature monitoring driver if necessary
if [ -z "$($SYSCTL -n dev.cpu.0.temperature 2>/dev/null)" ]
then
    # Let's try to find out if we have an Intel or
    # AMD processor and select the correct kernel module
    if $SYSCTL -n hw.model | $GREP AMD >/dev/null
    then
        tempmodule=amdtemp
    else
        tempmodule=coretemp
    fi
    # Load the CPU temp kernel module
    kldload $tempmodule
    # Set command to unload it when we're done (optional)
    unload="kldunload $tempmodule"
fi

# Enumerate SATA, USB and SAS disks - everything
# in /dev/ starting da or ada
disks=$(find /dev -depth 1 -type c \( -name 'da[0-9]' -o -name 'ada[0-9]' \) | $CUT -c 6- | sort)

# Enumerate other SCSI devices, starting in sa.
# Normally tape drives. May need tweaking!
scsis=$(find /dev -depth 1 -type c -name 'sa[0-9]' | $CUT -c 6- | sort)

# Enumerate the CPUs
cpus="$(seq 0 $(expr $($SYSCTL -n hw.ncpu) - 1))"

# Print all the disks
for disk in $disks
do
    temp=$($SMARTCTL -a /dev/$disk | $GREP Temperature_Celsius | $CUT -w -f 10)
    echo "$disk: ${temp}C"
done

# Print all the SCSI devices (e.g. tapes)
# NB. This will probably take a lot of fiddling as SCSI units return all sorts to smartctl
# Note the -T verypermissive. See man smartctl for details.
for scsi in $scsis
do
    temp=$($SMARTCTL -a -T verypermissive /dev/$scsi | $GREP "Current Drive Temperature" | $CUT -w -f 4)
    echo "$scsi: ${temp}C"
done

# Print all the CPUs
for cpu in $cpus
do
    temp=$($SYSCTL -n dev.cpu.$cpu.temperature | $CUT -f 1 -d .)
    echo "CPU$cpu: ${temp}C"
done

# Unload the CPU temp kernel module if we loaded it (optional)
$unload

Example Output

ada0: 35C
ada1: 39C
ada2: 39C
ada3: 38C
da0: 24C
da1: 25C
sa0: 33C
CPU0: 48C
CPU1: 48C

Why people obsess about the ZFS SLOG, but shouldn’t

There are two mysterious things in ZFS that cause a lot of confusion: the ZIL and the SLOG. This article is about what they are and why you should, or shouldn’t, care about them. But I’ll come to them later. Instead I’ll start with POSIX, and what it says about writing stuff to disk files.

When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.

With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won’t get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn’t guarantee it’s written, but that’s the expectation.

You’d want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.

Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).

To get around this, ZFS maintains a ZIL – ZFS Intent Log.

In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it’s not a physical thing; it’s a concept or list. It contains transaction groups that need to be completed in order.

The ZIL can be physically stored in three possible places…

In-Memory (Volatile Storage):

This is the default location. Initially, all write operations are buffered in RAM; this is where they are held before being committed to persistent storage. This kind of ZIL is volatile because it’s not backed by any permanent storage until written to disk.

Volatility doesn’t matter, because ZFS guarantees consistency with transaction groups (TXGs). If the power goes off and the in-RAM ZIL is lost, the transactions are never applied, but the file system is still in a consistent state.

In-Pool (Persistent Storage):

Without a dedicated log device, the ZIL entries are written to the main storage pool in transaction groups. This happens for both synchronous and asynchronous writes but is more critical for synchronous writes, to ensure data integrity in case of system crashes or power failures.

SLOG (Separate Intent Log Device):

For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically low-latency, high-speed storage like a short-stroked Raptor, enterprise SSD or NVRAM. ZFS writes the log entries there before they’re committed to the pool’s main storage.

By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction be stored in non-volatile storage before the write returns. If you’re doing a lot of synchronous writes then storing them on a high-speed SLOG device helps – but only if the SLOG device is substantially faster than an array of standard drives, and only if you really do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still.

I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.

If you still think you need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricey, but it doesn’t need to be very large. You can add it to a zpool with:

zpool add poolname log /dev/daX

Where daX is the drive name, obviously.

As I mentioned, the SLOG doesn’t need to be large at all. It only has to cope with five seconds of writes, as that’s the maximum amount of time data is “allowed” to reside there. If you’re using NFS over 10Gbit Ethernet the throughput isn’t going to be above 1.25GB a second. Assuming that’s flat-out synchronous writes, multiplying by five seconds gives less than 8GB. Any more would be unused.
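The sizing arithmetic above can be sketched in a couple of lines of shell. The line rate is an assumption for flat-out 10Gbit Ethernet, and the five seconds is the window mentioned above:

```shell
#!/bin/sh
# Back-of-envelope SLOG sizing: wire speed (bits/s) divided by 8 gives
# bytes/s; multiply by the ~5 second window data may sit in the log.
LINE_RATE_BITS=10000000000   # assumed: 10Gbit Ethernet, flat out
TXG_WINDOW=5                 # seconds
SLOG_BYTES=$((LINE_RATE_BITS / 8 * TXG_WINDOW))
echo "Worst-case SLOG usage: ${SLOG_BYTES} bytes"
```

That comes out at 6.25GB, comfortably inside an 8GB bound.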

If you’ve got a really critical system you can add mirrored SLOG drives to a pool thus:

zpool add poolname log mirror /dev/daX /dev/daY

You can also remove them with something like:

zpool remove poolname /dev/daY

This may be useful if adding an SLOG doesn’t give you the performance boost you were hoping for. It’s very niche!
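Before deciding either way, it’s worth checking whether the log vdev is actually absorbing writes. Something like the following should show it (poolname is a placeholder; syntax as I understand current zpool):

```
# Show the pool layout, including any log vdevs
zpool status poolname

# Per-vdev activity refreshed every 5 seconds; watch the log
# device's write column while your synchronous workload runs
zpool iostat -v poolname 5
```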

FreeBSD 14 ZFS warning

Update 27-Nov-23
Additional information has appeared on the FreeBSD mailing list:
https://lists.freebsd.org/archives/freebsd-stable/2023-November/001726.html

The problem can be reproduced regardless of the block cloning settings, and on FreeBSD 13 as well as 14. It’s possible block cloning simply increased the likelihood of hitting it. There’s no word yet about FreeBSD 12, but that used FreeBSD’s own ZFS implementation, so there’s a chance it’s unaffected.

In the post by Ed Maste, a suggested partial workaround is to set the tunable vfs.zfs.dmu_offset_next_sync to zero, a suggestion that has been on the forums since Saturday. This is a result of this bug:
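If you want to apply the workaround, it amounts to setting the sysctl, and adding the same line to /etc/sysctl.conf if you want it to survive a reboot:

```
# Apply immediately
sysctl vfs.zfs.dmu_offset_next_sync=0

# Persist across reboots: add this line to /etc/sysctl.conf
vfs.zfs.dmu_offset_next_sync=0
```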

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308

There’s a discussion of the issue going on here:

https://forums.freebsd.org/threads/freebsd-sysctl-vfs-zfs-dmu_offset_next_sync-and-openzfs-zfs-issue-15526-errata-notice-freebsd-bug-275308.91136/

I can’t say I’m convinced about any of this.


FreeBSD 14, which was released a couple of days ago, includes OpenZFS 2.2. There’s a lot of suspicion amongst Gentoo Linux users that this has a rather nasty bug in it related to block cloning.

Although this feature is disabled by default, people might be tempted to turn it on. Don’t. Apparently it can lead to lost data.
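You can check the state of the feature flag on a pool. I believe the property is named feature@block_cloning in OpenZFS 2.2 (poolname is a placeholder), and as I understand it a state of disabled or enabled, as opposed to active, means no cloned blocks actually exist on the pool:

```
zpool get feature@block_cloning poolname
```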

OpenZFS 2.2.0 was only promoted to stable on 13th October, and in hindsight adding it to a FreeBSD release so soon may seem precipitous. Although there’s a 2.2.1 release you should now be using instead, it simply disables block cloning by default rather than fixing the likely bug (and to reiterate, the default is off on FreeBSD 14 anyway).

Earlier releases of OpenZFS (2.1.x or earlier) are unaffected as they don’t support block cloning anyway.

Personally I’ll be steering clear of 2.2 until this has been properly resolved. I haven’t seen conclusive proof as to what’s causing the corruption, although it looks highly suspect. Neither have I seen or heard of it affecting the FreeBSD implementation, but it’s not worth the risk.

Having got the warning out of the way, you may be wondering what block cloning is. Firstly, it’s not dataset cloning. That’s been working fine for years, and for some applications it’s just what’s needed.

Block cloning applies to files, not datasets, and it’s pretty neat – or will be. Basically, when you copy a file ZFS doesn’t actually copy the data blocks – it just creates a new file in the directory structure but it points to the existing blocks. They’re shared between the source and destination files. Each block has a reference count in the on-disk Block Reference Table (BRT), and only when a block in the new file changes does a copy-on-write occur; the new block is linked to the new file and the reference count in the BRT is decremented. In familiar Unix fashion, when the reference count for a block gets to zero it joins the free pool.

This isn’t completely automatic – it must be allowed when the copy is made. For example, the cp utility will request cloning by default. This is done using the copy_file_range() system call with the appropriate runes; simply copying a file with open(), read(), write() and close() won’t trigger it.

As of BSDCan 2023 there was talk about making it work with zvols, but this was to come later; cloned blocks in files can, however, exist between datasets as long as they’re using the same encryption (including keys).

One tricky problem here is how it works with the ZIL – for example what’s stopping a block pointer from disappearing from the log? There was a lot to go wrong, and it looks like it has.

Release notes for 2.2.1 may be found here.
https://github.com/openzfs/zfs/releases/tag/zfs-2.2.1

Using ddrescue to recover data from a USB flash drive

If you’re in the data recovery, forensics or just storage maintenance business (including as an amateur) you probably already know about ddrescue. Released about twenty years ago by Antonio Diaz Diaz, it was a big improvement over Kurt Garloff’s original dd_rescue from 1999. Both copy disk images (which are just files in Unix), trying to extract as much data as possible when the drive itself has faults.

If you’re using Windows rather than Unix/Linux then you probably want to get someone else to recover your data. This article assumes FreeBSD.

The advantage of using either of these over dd or cp is that they expect to find bad blocks in a device and can retry or skip over them. dd will normally abort when it hits a read error unless told to ignore it (conv=noerror), and cp will just stop. ddrescue is particularly good at retrying failed blocks and reducing the block size to recover every last readable scrap – and it treats mechanical drives that are on their last legs as gently as possible.

If you’re new to it, the manual for ddrescue can be found here. https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

However, for most use cases the command is simple. Assuming the device you want to copy is /dev/da1 and you’re calling it thumbdrive the command would be:

ddrescue /dev/da1 thumbdrive.img thumbdrive.map

The device data would be stored in thumbdrive.img, with ongoing state information stored in thumbdrive.map. This state information is important, as it allows ddrescue to pick up where it left off.

However, ddrescue was written before USB flash drives (pen drives, thumb drives or whatever) were common. That’s not to say it doesn’t work on them, but they have a few foibles of their own. It still works well enough that I haven’t modified the ddrescue code to cope; instead I use a bit of shell script to do the necessary.

USB flash drives seem to fail in a different way to Winchester disks. If a block of flash memory can’t be read it’s going to produce a read error – fair enough. But these drives have complex management software running on them that attempts to make flash memory look like a disk drive, and it isn’t always that great in failure mode. In fact I’ve found plenty of examples where they hit a fault and crash rather than returning an error, meaning you have to turn them off and on again to get anything going (i.e. unplug them and plug them back in).

So it doesn’t matter how clever ddrescue is – if it hits a bad block and the USB drive controller crashes, it’s going to be waiting forever for a response and you’ll just have to come and reset everything manually before resuming. One of the great features of ddrescue is that it can be stopped and restarted at any time, so continuing after this happens is “built in”.

In reality you’re going to end up unplugging your USB flash drive many times during recovery. But fortunately, it is possible to turn a USB device off and on again without unplugging it using software. Most USB hardware has software control over its power output, and it’s particularly easy on operating systems like FreeBSD to do this from within a shell script. But first you have to figure out what’s where in the device map – specifically which device represents your USB drive in /dev and which USB device it is on the system. Unfortunately I can’t find a way of determining it automatically, even on FreeBSD. Here’s how you do it manually; if you’re using a version of Linux it’ll be similar.

When you plug a USB storage device into the system it will appear as /dev/da0 for the first one; /dev/da1 for the second and so on. You can read/write to this device like a file. Normally you’d mount it so you can read the files stored on it, but for data recovery this isn’t necessary.

So how do you know which /dev/daX is your media? The easy way to tell is that it’ll appear on the console when you first plug it in. If you don’t have access to the console, it’ll be in /var/log/messages. You’ll see something like this:

Jun 10 17:54:24 datarec kernel: umass0 on uhub5
kernel: umass0: <vendor 0x13fe USB DISK 3.0, class 0/0, rev 2.10/1.00, addr 2> on usbus1
kernel: umass0 on uhub5
kernel: umass0: on usbus1
kernel: umass0: SCSI over Bulk-Only; quirks = 0x8100
kernel: umass0:7:0: Attached to scbus7
kernel: da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
kernel: da0: < USB DISK 3.0 PMAP> Removable Direct Access SPC-4 SCSI device
kernel: da0: Serial Number 070B7126D1170F34
kernel: da0: 40.000MB/s transfers
kernel: da0: 59088MB (121012224 512 byte sectors)
kernel: da0: quirks=0x3
kernel: da0: Write Protected

So this is telling us that it’s da0 (i.e. /dev/da0).

The hardware identification is “<vendor 0x13fe USB DISK 3.0, class 0/0, rev 2.10/1.00, addr 2> on usbus1” which means it’s on USB bus 1, address 2.

You can confirm this using the usbconfig utility with no arguments:

ugen5.1:  at usbus5, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
...snip...
ugen1.1: at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
ugen1.2: at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (300mA)

There it is again, last line.

usbconfig has lots of useful commands, but the ones we’re interested in are power_off and power_on. No prizes for guessing what they do. However, unless you specify a target it’ll switch every USB device on the system off – including your keyboard, probably.

There are two ways of specifying the target, but I’m using the -d method. We’re after device 1.2 so the target is -d 1.2

Try it and make sure you can turn your USB device off and on again. You’ll have to wait for it to come back online, of course.

There are ways of doing this on Linux by installing extra utilities such as hub-ctrl. You may also be able to do it by writing stuff to /sys/bus/usb/devices/usb#/power/level – see the manual that came with your favourite Linux distro.

The next thing we need is an option for ddrescue so that it actually times out if the memory stick crashes. The default is to wait forever. The --timeout=25 or -T 25 option (depending on your taste in options) sees to that, making it exit if it hasn’t managed a successful read for 25 seconds. This isn’t entirely what we’re after, as a failed read would actually indicate that the drive hadn’t crashed. Unfortunately there’s no such tweak for ddrescue, but failed reads tend to come back quickly, so you’d expect a good read within a reasonable time anyway.

So as an example of putting it all into action, here’s a script for recovering a memory stick called duracell (because it’s made by Duracell) on USB bus 1 address 2.

#!/bin/sh
# Keep running ddrescue until it completes, power-cycling the
# USB device (bus 1, address 2) whenever it gives up.
until ddrescue -T 25 -u /dev/da0 duracell.img duracell.map
do
        echo "ddrescue exited with status $?"
        usbconfig -d 1.2 power_off
        sleep 5
        usbconfig -d 1.2 power_on
        sleep 15
        echo "Restarting"
done

A few notes on the above. Firstly, ddrescue’s return code isn’t defined. However, it appears to do what one might expect so the above loop will drop out if it ever completes. I’ve set the timeout for time since last good read to 25 seconds, which seems about right. Turning off the power for 5 seconds and then waiting for 15 seconds for the system to recognise it may be a bit long – tune as required. I’m also using the -u option to tell ddrescue to only go forward through the drive as it’s easier to read the status when it’s always incrementing. Going backwards and forwards makes sense with mechanical drives, but not flash memory.

Aficionados of ddrescue might want to consider disabling scraping and/or trimming (probably trimming) but I’ve seen it recover data with both enabled. Data recovery is an art, so tweak away as you see fit – I wanted to keep this example simple.

Now this system isn’t perfect. I’m repurposing ddrescue, which does a fine job on mechanical drives, to recover data from a very different animal. I may well write a special version for USB flash drives, but this method does actually work quite well. Let me know how you get on.

Proper Case in a shell script

How do you force a string into proper case in a Unix shell script? (That is to say, capitalise the first letter and make the rest lower case.) Bash 4 has a special feature for doing it, but I’d avoid using it because, well, I want to be Unix/POSIX compatible.

It’s actually very easy once you’ve realised tr won’t do it all for you. The tr utility has no concept of where it is in the input stream, but combining tr with cut works a treat.

I came across this problem when I was writing a few lines to automatically create directory layouts for interpreted languages (in this case the Laminas framework). Languages of this type like capitalisation of class names, but other names have to be lower case.

Before I get started, a note about expressing character ranges in tr. Unfortunately different systems have done it in different ways. The following examples assume BSD Unix (and POSIX). Unix System V required ranges to be in square brackets – e.g. A-Z becomes “[A-Z]”. And the quotes are absolutely necessary to stop the shell globbing once you’ve introduced the square brackets!
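To illustrate the difference (and the quoting), both of these produce the same result on a BSD or GNU tr, because in the System V form the brackets simply map to themselves:

```shell
# BSD/POSIX style range
bsd=$(echo HELLO | tr A-Z a-z)

# System V style range; the quotes stop the shell globbing the brackets
sysv=$(echo HELLO | tr "[A-Z]" "[a-z]")

echo "$bsd $sysv"   # prints "hello hello"
```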

Also, if you’re using a strange character set, consider using “[:lower:]” and “[:upper:]” instead of A-Z if your version of tr supports it (most do). It’s more compatible with foreign character sets, although I’d argue it’s not so easy on the eye!

Anyway, these examples use A-Z to specify ASCII characters 0x41 to 0x5A – adjust to suit your tr if your Unix is really old.

To convert a string ($1) into lower case, use this:

lower=$(echo $1 | tr A-Z a-z)

To convert it into upper case, use the reverse:

upper=$(echo $1 | tr a-z A-Z)

To capitalise the first letter and force the rest to lower case, split using cut and force the first character to be upper and the rest lower:

proper=$(echo $1 | cut -c 1 | tr a-z A-Z)$(echo $1 | cut -c 2- | tr A-Z a-z)

A safer version would be:

proper=$(echo $1 | cut -c 1 | tr "[:lower:]" "[:upper:]")$(echo $1 | cut -c 2- | tr "[:upper:]" "[:lower:]")

This is tested on FreeBSD in /bin/sh, but should work on all BSD and bash-based Linux systems using international character sets.
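Wrapped up as a reusable function it looks like this (POSIX sh; the name proper is my own, and printf is used instead of echo to be safe with odd strings):

```shell
#!/bin/sh
# Proper-case a single word: first character upper, the rest lower.
proper() {
    first=$(printf '%s' "$1" | cut -c 1 | tr "[:lower:]" "[:upper:]")
    rest=$(printf '%s' "$1" | cut -c 2- | tr "[:upper:]" "[:lower:]")
    printf '%s%s\n' "$first" "$rest"
}

proper "lAMINAS"   # prints "Laminas"
```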

You could, if you wanted to, use sed to split up a multi-word string and change each word to proper case, but I’ll leave that as an exercise to the reader.

Reply-To: gmail spam and Spamassassin

Over the last few months I’ve noticed a huge increase in spam with a “Reply-To:” field set to a gmail address. What the miscreants are doing is hijacking a legitimate mail server (usually a Microsoft one) and pumping out spam advertising a service of some kind. These missives only work if the mark is able to reply, and as even a Microsoft server will be locked down sooner or later, they’d never get a reply sent to the original sender address.

The reason for sending this way is, of course, spam from a legitimate mail server isn’t going to be blacklisted or blocked. SPF and other flags will be good. So these spams are likely to land in inboxes, and a few marks will reply based on the law of numbers.

To get the reply they’re using the email “Reply-To:” field, which will direct the reply to an alternative address – one which Google is happy to supply them for nothing.

The obvious way of detecting this would be to examine the Reply-To: field, and if it’s gmail whereas the original sender isn’t, flag it as highly suspect.

I was about to write a Spamassassin rule to do just this, when I discovered there is one already – and it’s always been there. The original idea came from Henrik Krohns in 2009, but its time has now definitely arrived. However, in a default install it’s not enabled – and for a good reason (see later). The rule you want is FREEMAIL_FORGED_REPLYTO, and it’s found in 20_freemail.cf.

Enabling FREEMAIL_FORGED_REPLYTO in Spamassassin

If you check 20_freemail.cf you’ll see the rules require Mail::SpamAssassin::Plugin::FreeMail. The FreeMail.pm plugin is part of the standard install, but it’s very likely disabled. To enable this (or any other plugin), edit the init.pre file in /usr/local/etc/mail/spamassassin/. Just add the following to the end of the file:

# Freemail checks
#
loadplugin Mail::SpamAssassin::Plugin::FreeMail FreeMail.pm

You’ll then need to add a list of what you consider to be freemail accounts in your local.cf (/usr/local/etc/mail/spamassassin/local.cf). As an example:

freemail_domains aol.* gmail.* gmail.*.* outlook.com hotmail.* hotmail.*.*

Note the use of ‘*’ as a wildcard. ‘?’ matches a single character, but neither matches a ‘.’. It’s not a regex! There’s also a local.cf setting “freemail_whitelist”, and other things documented in FreeMail.pm.

Then restart spamd (FreeBSD: service spamd restart) and you’re away. Except…

The problem with this Rule

If you look at 20_freemail.cf you’ll see the weighting is very low (currently 0.1). If this is such a good rule, why so little? The fact is that there’s a lot of spam appearing in this form, and it’s the best heuristic for detecting it, but it’s also going to lead to false positives in some cases.

Consider those silly “contact forms” beloved by PHP Web Developers. They send an email from a web server but with a “faked” reply address to the person filling in the form. This becomes indistinguishable from the heuristic used to spot the spammers.

If you know this is going to happen you can, of course, add an exception. You can even have the web site use a local submission port and send it to a local mailbox without filtering. But in a commercial hosting environment this gets a bit complicated – you don’t know what Web Developers are doing. (How could you? They often don’t.)

If you have control over your users, it’s probably safe to up the weighting. I’d say 3.0 is a good starting point. But it may be safer to leave it at 0.1 and examine the results for what would have been false positives.
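Upping the weighting is a one-line addition to local.cf, using Spamassassin’s standard score directive (3.0 being the starting point suggested above):

```
# /usr/local/etc/mail/spamassassin/local.cf
score FREEMAIL_FORGED_REPLYTO 3.0
```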

Minecraft server in a FreeBSD Jail

You may have no interest in the game Minecraft, but that won’t stop people asking you to set up a server. Having read about how to do this on various forums and Minecraft fan sites (e.g. this one) I came to the conclusion that no one knew how to do it on current FreeBSD. So here is how you do it, jailed or otherwise.

First off, there isn’t a pre-compiled package. The best way to install it is from the ports, where it exists as /usr/ports/games/minecraft-server

Be warned – this one’s a monster! Run “make config-recursive” first, or it’ll keep stopping to ask for options all the way through. Then run “make install”. It’s going to take quite some time.

The first configuration option screen asks if you want to make it as a service or stand-alone. I picked “service”, which sets up the start-up scripts for you but doesn’t actually tell you it’s done it. It does, however, stop it trying to run in graphics mode on your data centre server so I’m not complaining too much.

The good news is that this all works perfectly in a jail, so while it’s compiling (it could be hours) you can set up the required routing, assuming you’re using an internal network between jails – in this case 192.168.2.0/24. Using pf this will look something like:

externalip="123.123.123.123"
minecraft="192.168.2.3"
extinterface="fx0"

scrub in all
nat pass on $extinterface from 192.168.2.0/24 to any -> $externalip
rdr pass on $extinterface proto tcp from any to $externalip port 25565 -> $minecraft
rdr pass on $extinterface proto udp from any to $externalip port {19132,19133,25565} -> $minecraft

And that’s it. You’re basically forwarding one TCP port and three UDP ports. If you’re not using a jail, you obviously don’t need to forward anything. For instructions on setting up jails properly, see here, and for networking jails see elsewhere on this blog.

One thing that’s very important – this is written in Java, so as part of the build you’ll end up with OpenJDK. This requires some special file systems to be mounted – and if you’re using a jail this will have to be in the host’s fstab, not the jail’s!

# Needed for OpenJDK
fdesc /dev/fd fdescfs rw 0 0
proc /proc procfs rw 0 0

If you’re using a jail, make sure the jail definition includes the following, or Java still won’t see them:

mount.devfs;
mount.procfs;

Once you’ve finished building you might be tempted to follow some of the erroneous instructions in forums and try to run “minecraft-server”. It won’t exist!

To create the basic configuration files run “service minecraft onestart”. This will create the configuration files for you in /usr/local/etc/minecraft-server. It will also create a file called eula.txt. You need to edit this, changing “eula=false” to “eula=true”.

You can make the minecraft service run on startup with the usual minecraft_enable="YES" in /etc/rc.conf.
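If you’d rather not hand-edit rc.conf, FreeBSD’s sysrc utility will do it for you:

```
sysrc minecraft_enable="YES"
service minecraft start
```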

And that’s really it. There are plenty of fan guides on tweaking the server settings to your requirements, and they should apply to any installation.

This assumes you’re handy with FreeBSD and understand jails and networking; if you’re not so handy then please leave a comment or contact me. Everyone has to start somewhere, and it’s hard to know what level to pitch instructions like this. Blame me for assuming too much!