Jails on FreeBSD are easy without ezjail

I’ve never got the point of ezjail for creating jailed environments (like Solaris Zones) on FreeBSD. It’s easier to do most things manually, and especially since the definitions were removed from rc.conf to their own file, jail.conf. (My biggest problem is remembering whether it’s called “jail” or “jails”!)

jail.conf allows macros, has various macros predefined, and you can set defaults outside of a particular jail definition. If you’re using it as a split-out from rc.conf, you’re missing out.

Here’s an example:

# Set sensible defaults for all jails
path /jail/$name;
exec.start = "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.clean;
mount.devfs;
mount.procfs;
host.hostname $name.my.domain.uk;
# Define our jails
tom { ip4.addr = 192.168.0.2 ; }
dick { ip4.addr = 192.168.0.3 ; }
harry { ip4.addr = 192.168.0.4 ; }
mary { ip4.addr = 192.168.0.5 ; }
alice { ip4.addr = 192.168.0.6 ; }
nagios { ip4.addr = 192.168.0.7 ; allow.raw_sockets = 1 ; }
jane { ip4.addr = 192.168.0.8 ; }
test { ip4.addr = 192.168.0.9 ; }
foo { ip4.addr = 192.168.0.10 ; }
bar { ip4.addr = 192.168.0.11 ; }

So what I’ve done here is set sensible default values. Actually, these are probably mostly set what you want anyway, but as I’m only doing it once, re-defining them explicitly is good documentation.

Next I define the jails I want, over-riding any defaults that are unique to the jail. Now here’s one twist – the $name macro inside the {} is the name of the jail being defined. Thus, inside the definition of the jail I’ve called tom, it defines hostname=tom.my.domain.uk. I use this expansion to define the path to the jail too.

If you want to take it further, if you have your name in DNS (which I usually do) you can set ip.addr= using the generated hostname, leaving each individual jail definition as { ;} !

I’ve set the ipv4 address explicitly, as I use a local vlan for jails, mapping ports as required from external IP addresses if an when required.

Please generate and paste your ad code here. If left empty, the ad location will be highlighted on your blog pages with a reminder to enter your code. Mid-Post

Note the definition for the nagios jail; it has the extra allow.raw_sockets = 1 setting. Only nagios needs it.

ZFS and FreeBSD Jails.

The other good wheeze that’s become available since the rise of jails is ZFS. Datasets are the best way to do jails.

First off, create your dataset z/jail. (I use z from my default zpool – why use anything longer, as you’ll be typing it a lot?)

Next create your “master” jail dataset: zfs create z/jail/master

Now set it up as a vanilla jail, as per the handbook (make install into it). Then leave it alone (other than creating a snapshot called “fresh” or similar).

When you want a new jail for something, use the following:

zfs clone z/jail/master@fresh z/jail/alice

And you have a new jail, instantly, called alice – just add an entry as above in jail.conf, and edit rc.conf to configure its networ. And what’s even better, alice doesn’t take up any extra space! Not until you start making changes, anyway.

The biggest change you’re likely to make to alice is building ports. So create another dataset for that: z/jail/alice/usr/ports. Then download the ports tree, build and install your stuff, and when you’re done, zfs destroy
z/jail/alice/usr/ports. The only space your jail takes up are the changes from the base system used by your application. Obviously, if you use python in almost every jail, create a master version with python and clone that for maximum benefit.

ZFS Optimal Array Size

So there I was looking at a pile of eight drives and an empty storage array, and wondering how to cofigure it for best performance under ZFS. “Everyone knows” the formula right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say your data drives should be a power of two (i.e. 2,4,8,16) plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery:ZFS; although it didn’t originate there I’ll call it the Lucas rule anyway

I have my own rule – redundancy should be two drives or 30%. Why? Well drives in an array have a really nasty habit of failing two at a time. It’s not sods law, it’s a real phenomenon caused by the stress of re-silvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2, 6+2. From there on I go to raidz3 with 7+3, 8+3, 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.

However, If you merge my rule with the Lucas rule the only valid sizes are 2+2 and 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice” I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now the trouble with IT “best practice” guides is they’re written by technicians based on observation, not OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations you won’t know when the reason has become irrelevant. Unfortuantely, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – lets just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error correcting data happens – in very simple terms. Suppose you have a sequence of numbers such as:

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit when added to the remaining four gives you a total ending in zero. For example, supposing ‘3’ when missing. Add up the remainder and you get 17. You need 3 more to get to a zero, so the missing number must be 3.

Digression over. ZFS calculates a block of error correction data for the blocks of data it’s just written and adds this as the last block in the sequence. If If ZFS blocks and sectors were the same size, this would be fine writing another sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4k sectors instead of the traditional 512b. So, suppose you had 2k ZFS blocks on a 4k sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read it, overwrite half, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow Lucas’ rule, and make sure your data drives are a power of two.

Except…

This may have been true once, but now we have variable ZFS blocks sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And, now we have lz4 compression, the physical block sizes are variable anyway.

For those not in the know about this, lz4 compression is a no-brainer. It wont’ compress stuff it can’t, and its fast. Most files will compress to at least 2:1, often more – which means when you read a block only half the data needs to travel down the bus to get in memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that the block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but this is more-than eclipsed by the improved transfer rate.

So in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I do commend my rule about the number of redundant drives. This is based on a hardware issue, and no update to the software is going to fix this any time soon.