ZFS – Frank Leonhardt's Blog

ZFS or UFS?

I started writing the last post as a discussion of ZFS and UFS and it ended up an explainer about how UFS was viable with gmirror. You need to read it to understand the issues if you want redundant storage. But in simple terms, as to which is better, ZFS is. Except when UFS has the advantage.

UFS had a huge problem. If the music stopped (the kernel crashed or the power was cut) the file system was in a huge mess as it wasn’t updated in the right order as it went along. This file system was also know as FS or FFS (Fast File System) but it was more or less the same thing, and is now history. UFS2 came along (and JFS2 on AIX), which had journaling so that if the music stopped it could probably catch up with itself when the power came back. We’re really talking about UFS2 here, which is pretty solid.

Then along comes ZFS, which combines a next generation volume manager and next generation file system in one. In terms of features and redundancy it’s way ahead. Some key advantages are built and very powerful RAID, Copy-on-Write for referential integrity following a problem, snapshots, compression, scalability – the list is long. If you want any of these good features you probably want ZFS. But there are two instances where you want UFS2.

Cost

The first problem with ZFS is that all this good stuff comes at a cost. It’s not a huge cost by modern standards – I’ve always reckoned an extra 2Gb of RAM for the cache and suchlike covers the resource and performance issues . But on a very small system, 2Gb of RAM is significant.

The second problem is more nuanced. Copy on Write. Basically, to get the referential integrity and snapshots, if you change the contents of a block within a file in ZFS it doesn’t overwrite the block, writes a new block in free space. If the old block isn’t needed as part of a snapshot it will be marked as free space afterwards. This means that if there’s a failure while the block is half written, no problem – the old block is there and the write never happened. Reboot and you’re at the last consistent state no more than five seconds before some idiot dug up the power cable.

Holy CoW!

So Copy-on-Write makes sense in many ways, but as you can imagine, if you’re changing small bits of a large random access file, that file is going to end seriously fragmented. And there’s no way to defragment it. This is exactly what a database engine does to its files. Database engines enforce their own referential integrity using synchronous writes, so they’re going to be consistent anyway – but if you’re insisting all transactions in a group are written in order, synchronously, and the underlying file system is spattering blocks all over the disk before returning you’ve got a double whammy – fragmentation and slow write performance. You can put a lot of cache in to try and hide the problem, but you can’t cache a write if the database insists it won’t proceed until it’s actually stored on disk.

In this one use case, UFS2 is a clear winner. It also doesn’t degrade so badly as the disk becomes full. (The ZFS answer is that if the zpool is approaching 80% capacity, add more disks).

Best of Both

There is absolutely nothing stopping you having ZFS and UFS2 on the same system – on the same drives even. Just create a partition for your database, format it using makefs and mount it on the ZFS tree wherever it’s needed. You probably want it mirrored, so use gmirror. You won’t be able to snapshot it, or otherwise back it up while it’s running, but you can dump it to a ZFS dataset and have that replicated along with all the others.

You can also boot of UFS2 and create a zpool on additional drives or partitions if you prefer, mounting them on the UFS tree. Before FreeBSD 10 had full support for booting direct of ZFS this was the normal way of using it. The advantages of having the OS on ZFS (easy backup, snapshot and restore) mean it’s probably preferable to use it for the root.

16-August-258-October-25

Swap Disks in the 21st Century

Although some of this is BSD specific, the principles apply to any Unix or Linux.

When you install your Unix like OS across several disks, either with a mirror or RAID system (particularly ZFS RAIDZ) you’ll be asked if you want to set up a swap partition, and if you want it mirrored.

The default (for FreeBSD) is to add a swap partition on every disk and not mirror it. This is actually the most efficient configuration apart from having dedicated swap drives, but is also a spectacularly bad idea. More on this later.

What is a swapfile/drive anyway?

The name is a hangover from early swapping multi tasking systems. Only a few programs could fit in main memory, so when their time allocation ran out they were swapped with others on a disk until it was their turn again.

These days we have “virtual memory”, where a Memory Management Unit (MMU) fixed it so blocks of memory known as pages are stored on disk when not in use and automatically loaded when needed again. This is much more effective than swapping out entire programs but needs MMU hardware, which was once complex, slow and expensive.

What an MMU does is remap the CPU address space so the running process believes it has a certain amount of RAM starting at address zero and going up as high as needed. It “thinks” it has the complete processor and all the RAM. However, the operating system is lying to the process, as not all the RAM the process believes it has is mapped to actual RAM. If the process tries to access an address that’s not mapped to anything the OS kernel takes over with a hardware interrupt called a “page fault”. The kernel than brings that page of RAM in from where it’s stored on the swap disk, gets the MMU to map it to the process address space, and restarts the process where it left off. The process doesn’t even know this has happened. If the kernel notices that some areas of RAM aren’t being used by the process it copies them to the swap disk and uses the physical RAM for some other purpose – until the next page fault.

So the swap partition should really be called the paging partition now, and Microsoft actually got the name right on Windows. But we still call it the swap partition.

What you need to remember is that parts of a running programs memory may be in the swap partition instead of RAM at any time, and that includes parts of the operating system.

Strategies

There are several ideas for swap partitions in the 2020s.

No swap partition

Given RAM is so cheap, you can decide not to bother with one, and this is a reasonable approach. Virtual memory is slow, and if you can, get RAM instead. It can still pay to have one though, as some pages of memory are rarely, if ever, used again once created. Parts of a large program that aren’t actually used, and so on. The OS can recognise this and page them out, using the RAM for something useful.

You may also encounter a situation where the physical RAM runs out, which will mean no further programs can be run and those already running won’t be able to allocate any more. This leads to two problems: Firstly “Developers” don’t often program for running out of memory and their software doesn’t handle the situation gracefully. Secondly, if the program your need to run is you login shell you’ll be locked out of your server.

For these reasons I find it better to have a swap partition, but install enough RAM that it’s barely used. As a rule of thumb, I go for having the same swap space as there is physical RAM.

Dedicated Swap Drive(s)

This is the classic gold standard. Use a small fast drive (and expensive), preferably short stroked, so your virtual memory goes as fast as possible. If you’re really using VM this is probably the way to go, and having multiple dedicated drives spreads the load and increases performance.

Swap partition on single drive

If you’ve got a single drive system, just create a swap partition. It’s what most installers do.

Use a swap file

You don’t need a drive or even a partition. Unix treats devices and files the same, so you can create a normal file and use that.

truncate -s 16G /var/swapfile
swapon /var/swapfile

You can swap on any number of files or drives, and use “swapoff” to stop using a particular one.

Unless you’re going for maximum performance, this has a lot going for it. You can allocate larger or smaller swap files as required and easily reconfigure a running system. Also, if your file system is redundant, your swap system is too.

Multiple swap partitions

This is what the FreeBSD installer will offer by default if you set up a ZFS mirror or RAIDZ. It spreads the load across all drives. The only problem is that the whole point of a redundant drive system is that it will keep going after a hardware failure. With a bit of swap space on every drive, the system will fail if any of the drives fails, even if the filing system carries on. Any process with RAM paged out to swap gets knocked out, including the operating system. It’s like pulling out RAM chips and hoping it’s not going to crash. SO DON’T DO IT.

If you are going to use a partition on a data drive, just use one. On an eight drive system the chances of a failure on one of eight drives is eight times higher than one one specific unit, so you reduce the probability of failure considerably by putting all your eggs in one basket. Counterintuitive? Consider that if one basket falls on a distributed swap, they all do anyway.

Mirrored swap drives/partitions

This is sensible. The FreeBSD installer will do this if you ask it, using geom mirror. I’ve explained gmirror in posts passem, and there is absolutely no problem mixing it with ZFS (although you might want to read earlier posts to avoid complications with GPT). But the installer will do it automatically, so just flip the option. It’s faster than a swap file, although this will only matter if your job mix actually uses virtual memory regularly. If you have enough RAM, it shouldn’t.

You might think that mirroring swap drives is slower – and to an extent it is. Everything has to be written twice, and the page-out operation will only complete when both drives have been updated. However, on a page-in the throughput is doubled, given the mirror can read either drive to satisfy the request. The chances are there will be about the same, or slightly more page-ins so it’s not the huge performance hit it might seem at first glance.

Summary

Method	Pros	Cons
No swap	Simple Fastest	Wastes RAM Can lead to serious problems if you run out of RAM
Dedicated Swap Drive(s)	Simple Optimal performance	Each drive is a single point of failure for the whole system
Multiple Swap Partitions	Improved performance Lower cost than dedicated	Each drive is a single point of failure for the whole system
Single swap partition (multi-drive system)	Simple Lower probability of single point of failure occurring.	Reduced performance Still has single point of failure
Mirrored drives or partitions	No single point of failure for the whole system	Reduced performance
Swap file	Flexible even on live system Redundancy the same as drive array	Reduced performance

Quick summary of different swap/paging device strategies.

Conclusion

Having swap paritions on multiple drives increases your risk of a fault taking down a server that would otherwise keep running. Either use mirrored swap partitions/drives, or use a swap file on redundant storage. The choice depends on the amount of virtual memory you use in normal circumstances.

8-June-2519-November-25

Detecting failing ZFS drives

FreeBSD relies on a separate daemon to be running to detect failing drives in a zfs pool rather than the kernel handling it, but I’m not convinced it even works.

I’ve been investigating what the current ZFS on FreeBSD 14.2 does with failing drives. It’s a bit worrying. I posted on the FreeBSD mailing list on 17th Feb 2025 in the hope that some would know something, and there as some discussion (“me too”), but we drew a blank.

The problem is that ZFS doesn’t “fault” a drive until it’s taken offline by the OS. So if you’ve got a flaky drive you have to wait for FreeBSD to disconnect it, and only then ZFS will notice. At least that’s how it works out of the box (but read on).

In the past I’ve tested ZFS’s robustness simply by pulling drives, which guaranteed the OS would fail it, but a few troubling events led me to do a proper investigation. I acquired collection of flaky drives (data centre discards) that are unreliable, and set it up to fail so I could watch. ZFS will wait a very long time for a SAS drive to complete an operation, in circumstances when the drive is clearly on its last legs. If the operation fails and retries, FreeBSD logs a CAM error but ZFS doesn’t fail the drive. You can have a SAS drive rattling and groaning away, but FreeBSD patiently waits for it to complete by relocating the block or multiple retries and ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error? Either way, ZFS says the drive is “ONLINE” and carries on using it.

The only clue, other than the console log, is that operations can start to take a long time. The tenacity of SAS drives means it can take several minutes, although SATA tends to fail more quickly. You can have a SAS drive taking a minute for each operation and all you know about it is things are going very, very slowly. It does keep error statistics for vdevs, cascading up the chain, but what it does with them and when it logs them isn’t entirely clear.

If you use a stethoscope on the drive (one of my favourite tricks) it’s obvious it’s not happy but FreeBSD won’t offline it until it catches fire. In fact I suspect it would need to explode before it noticed.

zfsd

However, there is an answer! Nine years ago saw a handy little daemon called zfsd from Justin Gibbs and Alan Somers. This provides some of the funcallity of Solaris’ Service Management Facility (SMF), in particular the fault management daemon, fmd. Quite how closely it follows it I’m not certain, but the general idea is the same. Both look to see if the hardware is failing and act accordingly. In the recent Linux ZFS there’s a daemon called zfs-zed but that works a little differently (more later).

On FreeBSD, zfsd listens to devctl/devd (and possibly CAM) and will collect data on drive errors (it calls this a case file). I say “possibly” because it’s not exactly well documented and appears to have remained pretty much unchanged since it appeared in FreeBSD 11. As a result, I’ve been examining the source code, which is in C++ and has been influenced by “Design Patterns” – not a recipe for clear understanding.

Anyway, zfsd definitely listens to the devctl events (the kind of stuff that ends up in the console log) and acts accordingly. For example, if a vdev generates more than eight delayed I/O events in a minute it will mark it as faulted and activate a hot spare if there is one. If there are more than 50 I/O errors a minute it will do the same. 50 checksum error a minute will degrade a vdev. All of this can be found in the man page.

What’s not so clear is how or whether the code actually operates as advertised. It certainly calls something in response to events, in zfsd_event.cc: likely looking functions such as zpool_vdev_detach(), which are part of libzfs. Trying to find the man page for these functions is more problematic, and a search of the OpenZFS documentation also draws a blank. I’ve heard it not documented because it’s an “unstable interface”. Great.

What I have been able to follow through is that it does listen to devctl/devd events, it matches those events to pools/vdev and leaves it to CaseFile (C++ Class) logic to invoke likely looking functions starting with zpool_ which are found libzfs judging by the headers.

Now in my experience of a failing drive, one delayed operation is one too many – two is a sure sign of an imminent apocalypse. I’m not clear how zfsd handles this, because a slow I/O is not a failure and won’t generate a “device detached” event directly; and zfsd can only see what comes through the kernel event channel (devctl). So I took a look in the kernel ZFS module (vdev_disk.c and zio.c). ZFS detects something slow internally (zio has a timeout based, I think on zfs_deadman_synctim_ms) and it will log this but as long as it doesn’t actually generate an error, no event will be sent to devctl (and therefore zfsd won’t see it). I hope I’ve got this wrong, and I’ve seen several versions of the source code but I’m concentrating on the one in the 14.2-RELEASE base system. In other words, I don’t see it calling sEvent::Process() with this stuff.

However, there is logic for handling long operations and error counts and in case_file.cc. There are even tunable values as zpool properties (there is no zfsd config file)

Property	Description	Default
`io_n`	Number of I/O errors to trigger fault.	50
`io_t`	Time window (seconds) for `io_n` count.	60
`slow_io_n`	Number of delayed/slow I/O events to trigger fault.	8
`slow_io_t`	Time window (seconds) for `slow_io_n` count.	60
`checksum_n`	Number of checksum errors to mark DEGRADED (not full fault).	50
`checksum_t`	Time window (seconds) for `checksum_n` count.	60

These defaults are hard wired into a header file (case_file.h – DEFAULT_ZFS_DEGRADE_IO_COUNT etc), and documented in the vdevprops(7) and zfsd man pages – inconsistently.

You can try to read the current values using the command:

zpool get io_n,io_t,slow_io_n,slow_io_t,checksum_n,checksum_t zroot all-vdevs

The command for “zpool get”, which is not “zfs get” is documented in man zpool-get, and I have to say it can be a bit confusing. The format of the line above includes a list of properties followed by the zpool name followed either by a particular vdev or the special value “all-vdevs”. It’s worth running this to find out what the possible vdevs are, as it may not be what you think!

Chances are they’ll all be set to “default”, and I believe the table above has the correct default values but I can’t be sure. Your output for a simple mirror system should look like this:

NAME    PROPERTY    VALUE      SOURCE
root-0  io_n        -          default
root-0  io_t        -          default
root-0  slow_io_n   -          default
root-0  slow_io_t   -          default
root-0  checksum_n  -          default
root-0  checksum_t  -          default
mirror-0  io_n        -          default
mirror-0  io_t        -          default
mirror-0  slow_io_n   -          default
mirror-0  slow_io_t   -          default
mirror-0  checksum_n  -          default
mirror-0  checksum_t  -          default
ada0p3  io_n        -          default
ada0p3  io_t        -          default
ada0p3  slow_io_n   -          default
ada0p3  slow_io_t   -          default
ada0p3  checksum_n  -          default
ada0p3  checksum_t  -          default
ada1p3  io_n        -          default
ada1p3  io_t        -          default
ada1p3  slow_io_n   -          default
ada1p3  slow_io_t   -          default
ada1p3  checksum_n  -          default
ada1p3  checksum_t  -          default

You can set individual values with commands like:

zpool set checksum_n=3 zroot root-0
zpool set slow_io_n=3 zroot mirror-0
zpool set io_n=3 zroot ada0p3

Unfortunately the documentation is a bit hazy on the effects of setting these values in different places. Do values on leaf vdevs (e.g. ada1p3) take precedence over values set further up (e.g. mirror-0)? What I’m not sure of is whether the root-0 error count can take the whole pool offline, but I suspect it should. In other words, each level keeps its own error count and if one drive is acting up it, can it take a whole vdev or pool offline? The other explanation is that the values always cascade down to the leaf vnode (drive) if it doesn’t have a particular value set – not a chance I’d take if the host is in a data centre a long way off!

What’s worse, I can’t find out which of these values is actually used. Properties aren’t inherited but I’d have assumed zfsd would walk back up the tree from the disk to find the first set value (be that immediately or at the root vdev). I can find no such code, so which one do you set?

And you probably do want to set them, as these values don’t match my real-world experience of drive failures. I believe that Linux has defaults of 10 errors in 10 minutes, which seems a better choice. If a drive is doing that, it’s usually not long for this world but expecting 50 errors in a minute when operations are taking 30 seconds to return while the drive tries its hardest isn’t going to cut it.

I’m also a tad suspicious that they’re all these values are “default” – i.e. not set. This triggers zfsd to use the hard-wired values – values that can only be changed by recompiling. And I have no idea what might be using them other than zfsd and what counts as a “default” for them. I would have expected the values to be set on root-0 (i.e. the pool) when the pool is created and inherited by vdevs unless specifically set. In other words, I smell a rat.

Linux?

I mentioned Linux doesn’t have zfsd, but I believe the kernel modules zfs.ko etc send events to zed, which in turns runs executables or scrips to do the hot-spare swapping and so on. If the kernel detects a device failure, or mark the vdev as DEGRADED or FAULTED. That’s to say it’s a kernel module, not a daemon doing the job of picking up on failed drives. Ilumonos had a similar system, and I assume Solaris still does.

How do you clear a zpool property?

As a bonus, here’s something you won’t find documented anywhere – how do you set a zpool property back to it’s default value? You might be thinking:

zpool inherit io_n zroot ada0p3

Well inherit works with zfs, doesn’t it? No such luck.

zpool io_n=default zroot ada0p3

Nope! Nor does =0 or just =

The way which works is:

zpool io_n=none zroot ada0p3

Update: 19-Nov-25

I’m still suspicious of this so I asked Allan Jude and Michael Lucas on a webinar. Apparently they don’t have a problem with it. (They had a few days pre-warning of the question). I’ve added some trace stuff to it and I’m watching it closely. I never did figure out how it could detect slow operations – if anyone can enlighten me as to how the kernel communicates this to zfsd other than a “timeout, give up” event, I’d really appreciate it.

5-April-255-April-25

Add mirror to single ZFS disk

So you have FreeBSD a single drive ZFS machine and you want to add a second drive to mirror the first because it turns out it’s now important. Yes, it’s possible to do this after installation, even if you’re booting off ZFS.

Let’s assume your first drive is ada0, and it’s had the FreeBSD installer set it up a a “stripe on one drive” using GPT partition. You called the existing zpool “zroot” as you have no imagination whatsoever. In other words everything is the default. The new disk is probably going to be ada1 – plug it in and look on the console or /var/messages to be sure. As long as it’s the same size or larger than the first, you’re good to go. (Use diskinfo -v if you’re not sure).

FreeBSD sets up boot partitions and swap on the existing drive, and you’ll probably want to do this on the new one, if for no other reason than if ada0 fails it can boot off ada1.

gpart destroy -F ada1
gpart backup ada0 | gpart restore ada1
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1

This gets rid of any old partition table that might be there, copies the existing one from ada0 (which will include the boot and swap partitions as well as the ZFS one).

The third line installs a protective MBR on the disk to avoid non-FreeBSD utilities doing bad things and then adds the ZFS boot code.

If there’s a problem, zero the disk using dd and try again. Make sure you zap the correct drive, of course.

dd if=/dev/zero of=/dev/ada1 bs=32m status=progress

Once you’ve got the partition and boot set up, all you need to do is attach it to the zpool. This is where people get confused as if you do it wrong you may end up with a second vdev rather than a mirror. Note that the ZFS pool is on the third partition on each drive – i.e. adaxp3.

The trick is to specify both the existing and new drives:

zpool attach zroot ada0p3 ada1p3

Run zpool status and you’ll see it (re)silvering the new drive. No interruptions, no reboot.

pool: zroot
  state: ONLINE
   scan: resilvered 677M in 00:00:18 with 0 errors on Sat Apr  5 16:13:16 2025
 config:
     NAME        STATE     READ WRITE CKSUM
     zroot       ONLINE       0     0     0
       mirror-0  ONLINE       0     0     0
         ada0p3  ONLINE       0     0     0
         ada1p3  ONLINE       0     0     0

This only took 18 seconds to resilver as in this case it’s just a system diskm and ZFS doesn’t bother copying unnecessary blocks.

If you want to remove it and go back to a single drive the command is:

zpool detach zroot ada1p3

Add another to create a three-way mirror. Go a little crazy!

28-February-2528-February-25

ZFS In-place disk size upgrade

Everyone knows that you can replace the drives in a ZFS vdev with larger ones one a time, and when the last one is inserted it automagically uses the extra space, right?

But who’s actually done this? It does actually work, kind of.

However, small scale ZFS users are booting from ZFS, and have been since FreeBSD 10. Simply swapping out the drives with larger ones isn’t going to work. It can’t work. You’ve got boot code, swap files and other stuff to complicate it. But it can be made to work, and here’s how.

The first thing you need to consider is that ZFS is a volume manager, and normally when you create an array (RAIDZ or mirror) it expects to manage the whole disk. When you’re creating a boot environment you need bootstraps to actually boot from it. FreeBSD can do this, and does by default since FreeBSD 10 was released in 2014. The installer handles the tricky stuff about partitioning the disks up and making sure it’ll still boot when one drive is missing.

If you look at the partition table on one of the disks in the array you’ll see something like this:

=>        40  5860533088  ada0  GPT  (2.7T)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  5856335872     3  freebsd-zfs  (2.7T)
  5860532224         904        - free -  (452K)

So what’s going on here?

We’re using the modern GPT partitioning scheme. You may as well – go with the flow (but see articles about gmirror). This is a so-called 3Tb SATA disk, but it’s really 2.7Tb as manufacturers don’t know what a Tb really is (2^40 bytes). FreeBSD does know what a Tb, Gb, Mb and Kb is in binary so the numbers you see here won’t always match.

The disk starts with 40 sectors of GPT partition table, followed by the partitions themselves.

The first partition is 512K long and contains the freebsd-boot code. 512K is a lot of boot code, but ZFS is a complicated filing system so it needs quite a lot to be able to read it before the OS kernel is loaded.

The second partition is freebsd-swap. This is just a block of disk space the kernel can use for paging. By labelling it freebsd-swap, FreeBSD can can find it and use it. On an array, each drive has a bit of paging space so the load is shared across all of them. It doesn’t have to be this way, but it’s how the FreeBSD installer does it. If you have an SLOG drive it might make sense to put all the swap on that.

The third partition is actually used for ZFS, and is the bulk of the disk.

You might be wondering what the “- free -” space is all about. For performance reasons its good practice to align partitions to a particular grain size, in this case it appears to be 1Mb. I won’t go into it here, suffice to say that the FreeBSD installer knows what it’s doing, and has left the appropriate gaps.

As I said, ZFS expects to have a whole disk to play with, so normally you’d create an array with something like this:

zpool create mypool
raidz1 da0 da1 da2 da3

This creates a RAIDZ1 called mypool out of four drives. But ZFS will also work with geoms (partitions). With the partition scheme show above the creation command would be:

zpool create mypool
raidz1 da0p3 da1p3 da2p3 da3p3

ZFS would use partition 3 on all four drives and leave the boot code and swap area alone. And this is effectively what the installer does. da#p2 would be used for swap, and da#p1 would be the boot code – replicated but available on any drive that was still working that the BIOS could find.

So, if we’re going to swap out our small drives with larger ones we’re going to have to sort out the extra complications from being bootable. Fortunately it’s not too hard. But before we start, if you want the pool to expand automatically you need to set an option:

zpool set autoexpand=on zroot

However, you can also expand it manually when you online the new drive using the -e option.

From here I’m going to assume a few things. We have a RAIDZ set up across four drives: da0, da1, da2 and da3. The new drives are larger, and blank (no partition). Sometimes you can get into trouble if they have the wrong stuff in the partition table, so blanking them is best, and if you blank the whole drive you’ll have some confidence it’s a good one. It’s also worth mentioning at some point that you can’t shrink the pool by using smaller drives, so I’ll mention in now. You can only go bigger.

You’ll also have to turn the swap off, as we’ll be pulling swap drives. However, if you’re not using any swap space you should get away with it. Run swapctl -l to see what’s being used, and use swapoff to turn off swapping on any drive we’re about to pull. Also, back up everything to tape or something before messing with any of this, right?

Ready to go? Starting with da0…

zpool offline zroot da0p3

Pull da0 and put the new drive in. It’s worth checking the console to make sure the drive you’ve pulled really is da0, and the new drive is also identified as da0. If you pull the wrong drive, put it back and used “zpool online zroot da0” to put it back. The one you actually pulled will be offline.

We could hand partition it, but it’s easier to simply copy the partition table from one of the other drives:

gpart backup da1 | gpart restore da0

This will copy the wrong partition table over, as all the extra space will be left at the end of the disk. We can fix this:

gpart resize -i 3 da0

When you don’t specify a new size with -s, this will change the third partition to take up all remaining space. There’s no need to leave an alignment gap at the end, but if you want to do the arithmetic you can. Specify the size as the remaining size/2048 to get the number of 512 byte sectors with 1Mb granularity. The only point I can see for doing this is if you’re going to add another partition afterwards and align it, but you’re not.

Next we’ll add the boot code:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

And finally put it in the array

zpool replace zroot da0p3

Run zpool status and watch as the array is rebuilt. This may take several hours, or even days.

Once the re-silvering is complete and the array looks good we can do the same with the next drive:

zpool offline zroot da1p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da1
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da1
zpool replace zroot da1p3

Wait for resilvering to finish

zpool offline zroot da2p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da2
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da2
zpool replace zroot da2p3

Wait for resilvering to finish

zpool offline zroot da3p3

Swap the old and new disk and wait for it to come online.

gpart backup da0 | gpart restore da3
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da3
zpool replace zroot da3p3

Wait for resilvering to finish. Your pool is now expanded!

If you didn’t have autoexpand enabled you’ll need to manually expand them “zpool offline da#” followed by “zpool online -e da#”.

2-February-257-September-25

FreeBSD ZFS RAIDZ failed disk replacement

(for the rest of us)

I don’t know about you but most of my ZFS arrays are large, using SAS drives connected via SAS HBAs to expanders that know which disk is where. I also have multiple redundancy in the zpool and hot spares, so I don’t need pay a visit just to replace a failed disk. And if I do I can get the enclosure to flash an LED over the drive I’m interested in replacing.

Except at home. At home I’ve got what a lot of people probably have. A small box with a four-drive cage running RAIDZ1 (3+1). And it’s SATA, because it really is a redundant array of independent drives. I do, of course, keep a cold spare I can swap out. Always make sure you you have at least one spare drive of the dimensions of a RAIDZ group, and know where you find it.

And to make it even more fun, you’re booting from the array itself.

After many years I started getting an intermittent CAM error, which isn’t good news. Either one of the drives was loose, or it was failing. And there’s no SAS infrastructure to help. If you’re in a similar position you’ve come to the right place.

WARNING. The examples in this article assume ada1 is the drive that’s failed. Don’t blindly copy/paste into a live system without changing this as appropriate

To change a failed or failing drive:

Find the drive
Remove the old drive
Configure the new drive
Tell the RAIDZ to use it

Finding the failed drive

First, identify your failing drive. The console message will probably tell you which one. ZFS won’t, unless it’s failed to the extent it’s been offlined. “zpool status” may tell you everything’s okay, but the console may be telling you:

Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Retrying command, 0 more tries remain
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 70 20 40 f6 00 00 01 00 00
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): CAM status: ATA Status Error
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): RES: 41 40 b0 71 20 00 f6 00 00 00 01
 Feb  2 17:56:23 zfs1 kernel: (ada1:ahcich1:0:0:0): Error 5, Retries exhausted

So this tells you that ada1 is in a bad way. But which is ada1? It might be the second one across in your enclosure, or it might not. You’ll need to do some hunting to identify it positively.

Luckily most disks have their serial number printed on the label, and this is available to the host. So finding the serial number for ada1 and matching it to the disk label is the best way – if you’ve only got four drives to check, anyway.

I know of at least five ways to get a disk serial number in FreeBSD, and I’ll list them all in case one stops working:

dmesg

Just grep for the drive name you’re interested in (ada1). This is probably a good idea as it may give you more information about the failure. If FreeBSD can get the serial number it will display it as it enumerates the drive.

camcontrol identify ada1

This gives you more information that you ever wanted about a particular drive. This does include the serial number.

geom disk list

This will print out information on all the geoms (i.e. drives), including the serial number as “ident”

diskinfo -s /dev/ada1

This simply prints the ident for the drive in question. You can specify multiple arguments so diskinfo -s /dev/ada? works (up to ten drives).

smartctl -i /dev/ada1

Smartctl was a utility for managing SATA drives, but later versions have been updated to read information from SAS drives too, and you should probably install it. It’s part of Smartmontools as it gives you the ATA information for drive, including errors rates, current temperature and suchlike – stuff that camcontrol can’t.

Whichever method works for you, once you’ve got your serial number you can identify the drive. Except, of course, if your drive is completely fubared. In that case get the serial numbers of the drives that aren’t and identify it by elimination. Also worth a mention is gpart list, which will produce a lot of information about all or a specific drive’s logical layout, but not the serial number. It may offer some clues.

Saving the partition table from the failed drive.

In readiness for replacing it, if you can save its partition table:

gpart backup ada1 >
gpart.ada1

If you can’t read it, just save another one from a different drive in the vdev set – they should be identical, right? If the drive is already mounted you can copy it using gpart backup ada1 | gpart restore ada2

Swapping the bad drive out.

Next, pull your faulty drive and replace it with a new one. You might want to turn the power off, although it’s not necessary. However, it’s probably safer to reboot as we’re messing with the boot array.

Try zpool status, and you’ll see something like this:

  pool: zrpool: zr
  state: DEGRADED
 status: One or more devices could not be opened. 
         Sufficient replicas exist for
         the pool to continue functioning in a degraded state.
 action: Attach the missing device and online it using 'zpool online'.
   scan: scrub in progress since Sun Feb  2 17:42:38 2025
         356G scanned at 219/s, 175G issued at 108/s, 10.4T total
         0 repaired, 1.65% done, no estimated completion time
 config:
     NAME                   STATE     READ WRITE CKSUM
     zr                     DEGRADED     0     0     0
       raidz1-0             DEGRADED     0     0     0
         ada0p3             ONLINE       0     0     0 
         16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3
         ada2p3             ONLINE       0     0     0
         ada3p3             ONLINE       0     0     0

It’s complaining because it can’t find the drive with the identity 16639665213947936. ZFS doesn’t care where the drives in a vdev are plugged in, only that they exist somewhere. Device ada1 is ignored – it’s just got some random disk in that zfs isn’t interested in.

Setting up the replacement drive

So let’s get things ready to insert the new drive in the RAIDZ.

First restore its partition table:

gpart restore
/dev/ada1 < gpart.ada1

If you see “gpart: geom ‘ada1’: File exists”, just run “gpart destroy -F ada1”. Without the -F it may say the drive is in use, which we know it isn’t.

Next, if you’ve got a scrub going on, stop it with “zpool scrub -s zr”

As a sanity check, run “gpart show” and you should see four identical drives.

Boot sector and insertion

Now this is a boot from ZFS situation, common on a home server but not a big one. The guides from Solaris won’t tell you about this step. To make sure the system boots you need to have the boot sector on every drive (ideally). Do this with:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1

And Finally… tell ZFS to insert the new drive:

 zpool replace ada1p3

Run “zpool status” and you’ll see it working:

pool: zr
  state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
   scan: resilver in progress since Sun Feb  2 18:44:58 2025
         1.83T scanned at 429M/s, 1.60T issued at 375M/s, 10.4T total
         392G resilvered, 15.45% done, 0 days 06:48:34 to go
 config:
     NAME                     STATE     READ WRITE CKSUM
     zr                       DEGRADED     0     0     0
       raidz1-0               DEGRADED     0     0     0
         ada0p3               ONLINE       0     0     0
         replacing-1          UNAVAIL      0     0     0
           16639665213947936  UNAVAIL      0     0     0  was /dev/ada1p3/old
           ada1p3             ONLINE       0     0     0
         ada2p3               ONLINE       0     0     0
         ada3p3               ONLINE       0     0     0
 errors: No known data errors

It’ll chug along in the background re-silvering the whole thing. You can carry on using the system, but it’s performance may be degraded until it’s done. Take a look at the console to make sure there are no CAM errors indicating that the problem wasn’t the drive at all, and go to bed.
If you reboot or have a power cut while its rebuilding it will start from scratch, so try to avoid both!

In the morning, zpool status will return to this, and all will be well in the world. But don’t forget to order another cold spare so you’re ready when it happens again.

pool: zr
  state: ONLINE
   scan: resilvered 2.47T in 0 days 11:52:45 with 0 errors on Mon Feb  3 06:37:43 2025
 config:
     NAME        STATE     READ WRITE CKSUM
     zr          ONLINE       0     0     0
       raidz1-0  ONLINE       0     0     0
         ada0p3  ONLINE       0     0     0
         ada1p3  ONLINE       0     0     0
         ada2p3  ONLINE       0     0     0
         ada3p3  ONLINE       0     0     0
 errors: No known data errors

As a final tip, if you use diskinfo -v adaX it will tell you the type of drive and other information, which is really handy if you’re ordering another cold spare.

# diskinfo -v adaa
 ada2
         512             # sectorsize
         3000592982016   # mediasize in bytes (2.7T)
         5860533168      # mediasize in sectors
         4096            # stripesize
         0               # stripeoffset
         5814021         # Cylinders according to firmware.
         16              # Heads according to firmware.
         63              # Sectors according to firmware.
         HGST HUS724030ALE640    # Disk descr. <- Drive to order on eBay!
         PK2234P9GWJ42Y  # Disk ident.
         No              # TRIM/UNMAP support
         7200            # Rotation rate in RPM
         Not_Zoned       # Zone Mode

15-June-247-November-25

Why people obsess about the ZFS SLOG, but shouldn’t

There are two mysteries things on ZFS that cause a lot of confusion: The ZIL and the SLOG. This article is about what they are and why you should care, or not care about them. But I’ll come to them later. Instead I’ll start with POSIX, and what it says about writing stuff to disk files.

When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.

With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won’t get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn’t guarantee it’s written, but that’s the expectation.

You’d want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.

Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).

To get around this, ZFS maintains a ZIL – ZFS Intent Log.

In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it’s not a physical thing; it’s a concept or list. It contains transaction groups that need to be completed in order.

The ZIL can be physically stored in three possible places…

In-Memory (Volatile Storage):

This is the default location. Initially, all write operations are buffered in RAM. This is where they are held before being committed to persistent storage. This kind of ZIL is volatile because it’s not backed by any permanent storage until written to disk.

Volatility doesn’t matter, because ZFS guarantees consistency with transaction groups (TXGs). The power goes off and the in-RAM ZIL is lost, the transactions are never applied; but the file system is in a consistent state.

In-Pool (Persistent Storage):

Without a dedicated log device (the default), the ZIL entries are written to the main storage pool in transaction groups . This happens for both synchronous and asynchronous writes but is more critical for synchronous writes to ensure data integrity in case of system crashes or power failures. All transactions must take place in order, so they all need to be committed to non-volatile storage before a synchronous write can return.

SLOG (Separate Intent Log Device):

For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically a low-latency, high-speed storage like a short-stroked Rapter, enterprise SSD or NVRAM. ZFS writes the log entries before they’re committed to the pool’s main storage.

By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction is stored in non-volatile storage before the write returns, and if you’re doing a lot of synchronous writes then storing them on a high-speed SLOG device helps. But only if the SLOG device is substantially faster than an array of standard drives. And it only matters if you do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still

I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.

If you still think yo need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricy, but it doesn’t need to be very large. You can add it to a zpool with:

zpool add poolname
log /dev/daX

Where daX is the drive name, obviously.

As I mentioned, the SLOG doesn’t need to be large at all. It only has to cope with five seconds of writes, as that’s the maximum amount of time data is “allowed” to reside there. If you’re using NFS over 10Gbit Ethernet the throughput isn’t going to be above 1.25Gb a seconds. Assuming that’s flat-out synchronous writes, multiplying that by five seconds is less than 8Gb. Any more would be unused.

If you’ve got a really critical system you can add mirrored SLOG drives to a pool thus:

zpool add poolname
log /dev/daX /dev/daY

You can also remove them with something like:

zpool remove
poolname log /dev/daY

This may be useful if adding an SLOG doesn’t give you the performance boost you were hoping for. It’s very niche!

23-November-2327-November-23

FreeBSD 14 ZFS warning

Update 27-Nov-23
Additional information has appeared on the FreeBSD mailing list:
https://lists.freebsd.org/archives/freebsd-stable/2023-November/001726.html

The problem can be reproduced regardless of the block cloning settings, and on FreeBSD 13 as well as 14. It’s possible block cloning simply increased the likelihood of hitting it. There’s no word yet about FreeBSD 12, but this FreeBSD’s own ZFS implementation so there’s a chance it’s good

In the post by Ed Maste, a suggested partial workaround is to set the tunable vfs.zfs.dmu_offset_next_sync to zero, which has been on the forums since Saturday. This is a result of this bug:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308

There’s a discussion of the issue going on here:

https://forums.freebsd.org/threads/freebsd-sysctl-vfs-zfs-dmu_offset_next_sync-and-openzfs-zfs-issue-15526-errata-notice-freebsd-bug-275308.91136/

I can’t say I’m convinced about any of this.

–
FreeBSD 14, which was released a couple of days ago, includes OpenZFS 2.2. There’s a lot of suspicion amongst Gentoo Linux users that this has a rather nasty bug in it related to block cloning.

Although this feature is disabled by default, people might be tempted to turn it on. Don’t. Apparently it can lead to lost data.

OpenZFS 2.2.0 was only promoted to stable on 13th October, and in hindsight adding it to a FreeBSD release so soon may seem precipitous. Although there’s a 2.2.1 release you should now be using instead it simply disables it by default rather than fixing the likely bug (and to reiterate, the default is off on FreeBSD 14).

Earlier releases of OpenZFS (2.1.x or earlier) are unaffected as they don’t support block cloning anyway.

Personally I’ll be steering clear of 2.2 until this has been properly resolved. I haven’t seen conclusive proof as to what’s causing the corruption, although it looks highly suspect. Neither have I seen or heard of it affecting the FreeBSD implementation, but it’s not worth the risk.

Having got the warning out of the way, you may be wondering what block cloning is. Firstly, it’s not dataset cloning. That’s been working fine for years, and for some applications it’s just what’s needed.

Block cloning applies to files, not datasets, and it’s pretty neat – or will be. Basically, when you copy a file ZFS doesn’t actually copy the data blocks – it just creates a new file in the directory structure but it points to the existing blocks. They’re shared between the source and destination files. Each block has a reference count in the on-disk Block Reference Table (BRT), and only when a block in the new file changes does a copy-on-write occur; the new block is linked to the new file and the reference count in the BRT is decremented. In familiar Unix fashion, when the reference count for a block gets to zero it joins the free pool.

This isn’t completely automatic – it must be allowed when the copy is made. For example, the cp utility will request clone files by default. This is done using the copy_file_range() system call with the appropriate runes; simply copying a file with open(), read(), write() and close() won’t be affected.

As of BSDCAN 2023, there was talk about making it work with zvols but this was to come later, although clone blocks in files can exist between datasets as long as they’re using the same encryption (including keys).

One tricky problem here is how it works with the ZIL – for example what’s stopping a block pointer from disappearing from the log? There was a lot to go wrong, and it looks like it has.

Release notes for 2.2.1 may be found here.
https://github.com/openzfs/zfs/releases/tag/zfs-2.2.1

11-October-1914-February-22

Jails on FreeBSD are easy without ezjail

I’ve never got the point of ezjail for creating jailed environments (like Solaris Zones) on FreeBSD. It’s easier to do most things manually, and especially since the definitions were removed from rc.conf to their own file, jail.conf. (My biggest problem is remembering whether it’s called “jail” or “jails”!)

jail.conf allows macros, has various macros predefined, and you can set defaults outside of a particular jail definition. If you’re using it as a split-out from rc.conf, you’re missing out.

Here’s an example:

# Set sensible defaults for all jails
path /jail/$name;
exec.start  = "/bin/sh /etc/rc";
exec.stop  = "/bin/sh /etc/rc.shutdown";
exec.clean;
mount.devfs;
mount.procfs;
host.hostname $name.my.domain.uk;

# Define our jails
tom { ip4.addr = 192.168.0.2 ; }
dick { ip4.addr = 192.168.0.3 ;  }
harry { ip4.addr = 192.168.0.4 ;  }
mary { ip4.addr = 192.168.0.5 ;  }
alice { ip4.addr = 192.168.0.6 ;  }
nagios { ip4.addr = 192.168.0.7 ; allow.raw_sockets = 1 ; }
jane { ip4.addr = 192.168.0.8 ; }
test { ip4.addr = 192.168.0.9 ; }
foo { ip4.addr = 192.168.0.10 ; }
bar { ip4.addr = 192.168.0.11 ; }

So what I’ve done here is set sensible default values. Actually, these are probably mostly set what you want anyway, but as I’m only doing it once, re-defining them explicitly is good documentation.

Next I define the jails I want, over-riding any defaults that are unique to the jail. Now here’s one twist – the $name macro inside the {} is the name of the jail being defined. Thus, inside the definition of the jail I’ve called tom, it defines hostname=tom.my.domain.uk. I use this expansion to define the path to the jail too.

If you want to take it further, if you have your name in DNS (which I usually do) you can set ip.addr= using the generated hostname, leaving each individual jail definition as { ;} !

I’ve set the ipv4 address explicitly, as I use a local vlan for jails, mapping ports as required from external IP addresses if an when required.

Note the definition for the nagios jail; it has the extra allow.raw_sockets = 1 setting. Only nagios needs it.

ZFS and FreeBSD Jails.

The other good wheeze that’s become available since the rise of jails is ZFS. Datasets are the best way to do jails.

First off, create your dataset z/jail. (I use z from my default zpool – why use anything longer, as you’ll be typing it a lot?)

Next create your “master” jail dataset: zfs create z/jail/master

Now set it up as a vanilla jail, as per the handbook (make install into it). Then leave it alone (other than creating a snapshot called “fresh” or similar).

When you want a new jail for something, use the following:

zfs clone z/jail/master@fresh z/jail/alice

And you have a new jail, instantly, called alice – just add an entry as above in jail.conf, and edit rc.conf to configure its networ. And what’s even better, alice doesn’t take up any extra space! Not until you start making changes, anyway.

The biggest change you’re likely to make to alice is building ports. So create another dataset for that: z/jail/alice/usr/ports. Then download the ports tree, build and install your stuff, and when you’re done, zfs destroy
z/jail/alice/usr/ports. The only space your jail takes up are the changes from the base system used by your application. Obviously, if you use python in almost every jail, create a master version with python and clone that for maximum benefit.

31-October-178-January-18

Del FS12-NV7 and other 2U server (e.g. C6100) disk system hacking

(Photographs to follow)

A while back I reviewed the Dell FS12-NV7 – a 2U rack server being sold cheap by all and sundry. It’s a powerful box, even by modern standards, but one of its big drawbacks is the disk system it comes with. But it needn’t be.

There are two viable solutions, depending on what you want to do. You can make use of the SAS backplane, using SAS and/or SATA drives, or you can go for fewer SATA drives and free up one or more PCIe slots as Plan B. You probably have an FS12 because it looks good for building a drive array (or even FreeNAS) so I’ll deal with Plan A first.

Like most Dell servers, this comes with a Dell PERC RAID SAS controller – a PERC6/i to be precise. This ‘I’ means it has internal connectors; the /E is the same but its sockets are external.

The PERC connects to a twelve-slot backplane forming a drive array at the front of the box. More on the backplane later; it’s the PERCs you need to worry about.

The PERC6 is actually an LSI Megaraid 1078 card, which is just the thing you need if you’re running an operating system like Windows that doesn’t support a volume manager, striping and other grown-up stuff. Or if your OS does have these features, but you just don’t trust it. If you are running such an OS you may as well stick to the PERC6, and good luck to you. If you’re using BSD (including FreeNAS), Solaris or a Linux distribution that handles disk arrays, read on. The PERC6 is a solution to a problem you probably don’t have, but in all other respects its a turkey. You really want a straightforward HBA (Host Bus Adapter) that allows your clever operating system to talk directly with the drives.

Any SAS card based on the 1078 (such as the PERC6) is likely to have problems with drives larger than 2Tb. I’m not completely sure why, but I suspect it only applies to SATA. Unfortunately I don’t have any very large SAS drives to test this theory. A 2Tb limit isn’t really such a problem when you’re talking about a high performance array, as lots of small drives are a better option anyway. But it does matter if you’re building a very large datastore and don’t mind slower access and very significant resilvering times when you replace a drive. And for large datastores, very large SATA drives save you a whole lot of cash. The best capacity/cost ratio is for 5Gb SATA drives

Some Dell PERCs can be re-flashed with LSI firmware and used as a normal HBA. Unfortunately the PERC6 isn’t one of them. I believe the PERC6/R can be, but those I’ve seen in a FS12 are just a bit too old. So the first thing you’ll need to do is dump them in the recycling or try and sell them on eBay.

There are actually two PERC6 cards in most machine, and they each support eight SAS channels through two SFF-8484 connectors on each card. Given there are twelve drives slots, one of the PERCs is only half used. Sometimes they have a cable going off to a battery located near the fans. This is used in a desperate attempt to keep the data in the card’s cache safe in order to avoid write holes corrupting NTFS during a power failure, although the data on the on-drive caches won’t be so lucky. If you’re using a file system like that, make sure you have a UPS for the whole lot.

But we’re going to put the PERCs out of our misery and replace them with some nice new LSI HBAs that will do our operating system’s bidding and let it talk to the drives as it knows best. But which to pick? First we need to know what we’re connecting.

Moving to the front of the case there are twelve metal drive slots with a backplane behind. Dell makes machines with either backplanes or expanders. A backplane has a 1:1 SAS channel to drive connection; an expander takes one SAS channel and multiplexes it to (usually) four drives. You could always swap the blackplane with an expander, but I like the 1:1 nature of a backplane. It’s faster, especially if you’re configured as an array. And besides, we don’t want to spend more money than we need to, otherwise we wouldn’t be hot-rodding a cheap 2U server in the first place – expanders are expensive. Bizarrely, HBAs are cheap in comparison. So we need twelve channels of SAS that will connect to the sockets on the backplane.

The HBA you will probably want to go with is an LSI, as these have great OS support. Other cards are available, but check that the drivers are also available. The obvious choice for SAS aficionados is the LSI 9211-8i, which has eight internal channels. This is based on an LSI 2000 series chip, the 2008, which is the de-facto standard. There’s also four-channel -4i version, so you could get your twelve channels using one of each – but the price difference is small these days, so you might as well go for two -8i cards. If you want cheaper there are 1068-based equivalent cards, and these work just fine at about half the price. They probably won’t work with larger disks, only operate at 3Gb and the original SAS standard. However, the 2000 series is only about £25 extra and gives you more options for the future. A good investment. Conversely, the latest 3000 series cards can do some extra stuff (particularly to do with active cables) but I can’t see any great advantage in paying megabucks for one unless you’re going really high-end – in which case the NV12 isn’t the box for you anyway. And you’d need some very fast drives and a faster backplane to see any speed advantage. And probably a new motherboard….

Whether the 6Gb SAS2 of the 9211-8i is any use on the backplane, which was designed for 3Gb, I don’t know. If it matters that much to you you probably need to spend a lot more money. A drive array with a direct 3Gb to each drive is going to shift fast enough for most purposes.

Once you have removed the PERCs and plugged in your modern-ish 9211 HBAs, your next problem is going to be the cable. Both the PERCs and the backplane have SFF-8484 multi-lane connectors, which you might not recognise. SAS is a point-to-point system, the same as SATA, and a multi-lane cable is simply four single cables in a bundle with one plug. (Newer versions of SAS have more). SFF-8484 multi-lane connectors are somewhat rare, (but unfortunately this doesn’t make them valuable if you were hoping to flog them on eBay). The world switched quickly to the SFF-8087 for multi-lane SAS. The signals are electrically the same, but the connector is not.

So there are two snags with this backplane. Firstly it’s designed to work with PERC controllers; secondly it has the old SFF-8484 connectors on the back, and any SAS cables you find are likely to have SFF-8087.

First things first – there is actually a jumper on the backplane to tell it whether it’s talking to a PERC or a standard LSI HBA. All you need to do is find it and change it. Fortunately there are very few jumpers to choose from (i.e. two), and you know the link is already in the wrong place. So try them one at a time until it works. The one you want may be labelled J15, but I wouldn’t like to say this was the same on every variant.

Second problem: the cable. You can get cables with an SFF-8087 on one end and an SFF-8484 on the other. These should work. But they’re usually rather expensive. If you want to make your own, it’s a PITA but at least you have the connectors already (assuming you didn’t bin the ones on the PERC cables).

I don’t know what committee designed SAS cable connectors, but ease of construction wasn’t foremost in their collective minds. You’re basically soldering twisted pair to a tiny PCB. This is mechanically rubbish, of course, as the slightest force on the cable will lift the track. Therefore its usual to cover the whole joint in solidified gunk (technical term) to protect it. Rewiring SAS connectors is definitely not easy.

I’ve tried various ways of soldering to them, none of which were satisfactory or rewarding. One method is to clip the all bare wires you wish to solder using something like a bulldog clip so they’re at lined up horizontally and then press then adjust the clamp so they’re gently pressed to the tracks on the board, making final adjustments with a strong magnifying glass and a fine tweezers. You can then either solder them with a fine temperature-controlled iron, or have pre-coated the pads with solder paste and flash across it with an SMD rework station. I’d love to know how they’re actually manufactured – using a precision jig I assume.

The “easy” way is to avoid soldering the connectors at all; simply cut existing cables in half and join one to the other. I’ve used prototyping matrix board for this. Strip and twist the conductors, push them through a hole and solder. This keeps things compact but manageable. We’re dealing with twisted pair here, so maintain the twists as close as possible to the board – it actually works quite well.

However, I’ve now found a reasonably-priced source of the appropriate cable so I don’t do this any more. Contact me if you need some in the UK.

So all that remains is to plug your HBAs to the backplane, shove in some drives and you’re away. If you’re at this stage, it “just works”. The access lights for all the drives do their thing as they should. The only mystery is how you can get the ident LED to come on; this may be controlled by the PERC when it detects a failure using the so-called sideband channel, or it may be operated by the electronics on the backplane. It’s workings are, I’m afraid, something of a mystery still – it’s got too much electronics on board to be a completely passive backplane.

Plan B: SATA

If you plan to use only SATA drives, especially if you don’t intend using more than six, it makes little sense to bother with SAS at all. The Gigabyte motherboard comes with half a dozen perfectly good 3Gb SATA channels, and if you need more you can always put another controller in a PCIe slot, or even USB. The advantages are lower cost and you get to free up two PCIe slots for more interesting things.

The down-side is that you can’t use the SAS backplane, but you can still use the mounting bays.

Removing the backplane looks tricky, but it really isn’t when you look a bit closer. Take out the fans first (held in place by rubber blocks), undo a couple of screws and it just lifts and slides out. You can then slot and lock in the drives and connect the SATA connectors directly to the back of the drives. You could even slide them out again without opening the case, as long as the cable was long enough and you manually detached the cable it when it was withdrawn. And let’s face it – drives are likely to last for years so even with half a dozen it’s not that great a hardship to open the case occasionally.

Next comes power. The PSU has a special connector for the backplane and two standard SATA power plugs. You could split these three ways using an adapter, but if you have a lot of drives you might want to re-wire the cables going to the backplane plug. It can definitely power twelve drives.

And that’s almost all there is to it. Unfortunately the main fans are connected to the backplane, which you’ve just removed. You can power them from an adapter on the drive power cables, but there are unused fan connectors on the motherboard. I’m doing a bit more research on cooling options, but this approach has promising possibilities for noise reduction.