Linux swap files

Last year I wrote a piece about swap file strategies in general (Unix/Linux), but on further investigation I have discovered Linux has a twist in the way it handles swap devices. On Unix there has never been any metadata written to a swap device – it’s just a sequence of blocks, and that’s what I’ve always told people. It turns out that with modern Linux, there is.

Before Linux will use a file or block device for swapping you need to use the mkswap command on it (there’s no Unix equivalent). The man page doesn’t exactly explain what it does as the low level, and I was curious. Unix hasn’t needed metadata on swap (or page) files from 1971 to the present day, so what is Linux doing differently?

Dumping a zeroed swap device before and after running mkswap on it revealed the following had been added:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  01 00 00 00 ff ff 03 00  00 00 00 00 ad 40 e5 4b  |.............@.K|
00000410  87 1a 46 d2 b0 36 ee 60  6d 40 08 d1 00 00 00 00  |..F..6.`m@......|
00000420  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000ff0  00 00 00 00 00 00 53 57  41 50 53 50 41 43 45 32  |......SWAPSPACE2|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
40000000

So it’s definitely doing something. As you can see the first 1K remains unchanged, which is presumably to avoid messing up partition tables and suchlike. There’s some binary data at 0x400 (1K if you don’t speak hex) and then right on the 4K boundary there’s what appears to be a signature, “SWAPSPACE2”.

Time to go digging into the source code (not so easy with a Linux, as it’s not normally present). It turns out that this is mapped by a union of two structures, and confirmed my suspisions:

union swap_header {
    struct {
        char reserved[PAGE_SIZE - 10];
        char magic[10];
    } magic;
    struct {
        char bootbits[1024];
        __u32 version;
        __u32 last_page;
        __u32 nr_badpages;
        __u32 padding[117];
        __u32 badpages[1];
    } info;
};

If you’re not clear with ‘C’ structures, this is basically saying the header is PAGE_SIZE long, with the last ten bytes being used as a signature (called “magic” – the SWAPSPACE2 in the dump). PAGE_SIZE is, logically, the size of the page the kernel is using for virtual memory, so the header occupies exactly one slot. It’s found in arch/x86/include/asm/page.h, and is the result of a shift calculation based on another tunable, PAGE_SHIFT and results in 4K on most systems but might be adjusted to suit the MMU on non-Intel platforms.

The second structure in the swap_header, which is overlaid on the first because it’s a union, skips the first 1K, calling it “bootbits”, which is fair enough. There might be a bootstrap loader or partition table there. In fact there may well be if the whole drive is being used for swap. It’s not necessary to put a header on a raw block device but I suspect some Linux tools like to see one.

Then comes the version number, which appears to always be ‘1’, and the address of the last page in the device in case the kernel doesn’t know. Finally there’s the size of bad page table. After padding it up to 0x600 there’s a bad page table, four bytes per entry, listing blocks on the page device that can’t be used (the apparent one entry array is a ‘C’ pointer trick). Since the introduction of IDE/SCSI drives around 1990, the disk drives themselves handle bad block mapping so this is only really relevant to ST506 type interfaces, which indicates it must have been in the design of Linux since the very beginning – and even then it would have been there for legacy drives.

There’s another twist I found in the code – the first sixteen bytes of the padding (0x40C to 0x41B) may contain a UUID (in binary, not ASCII) and 0x41C to 0x42B might contain a volume label if you set one using the -L option on mkswap (which is the source I looked at to find out what was going on).

If you try to run swapon on a device or file that doesn’t have this header it will bork. I had a look a the kernel, but I don’t think it cares – however I dare say some userland utilities do (including the aforementioned swapon).

Does this matter? Well it means you’re being short-changed by 4K of swap space on each device, but that’s hardly a big deal. The advantage is that utilities can detect and refuse to swap on a drive or partition you didn’t intend for swapping. The important thing is to remember to put the header on if you’re using Linux instead of Unix.

Large swap files on FreeBSD die with mystery “Killed” – howto add lots of swap space

Adding extra swap space to FreeBSD is easy, right? Just find a spare block storage device and run swapon with its name as an argument. I’ll put a step-by-step on how you actually do this at the end of the post in case this is news to you.

However, I’ve just found a very interesting gotcha, which could bite anyone running a 64-bit kernel and 8Gb+ of RAM.

From here we’re getting into the FreeBSD kernel – if you just want to know how to set up a lot of swap space, skip to the end…

I’ve been running a program to process a very large XML file into a large binary file – distilling 100Gb of XML into 1Gb of binary. This is the excuse for needing 16Gb of working storage (please excuse my 1970’s computer science terminology, but it’s a lot more precise than the modern “memory” and it makes a difference here).

I was using 2Gb of core and 8Gb of swap space, but this was too little so I added an extra 32Gb of swap file. Problem sorted? Well top and vmstat both reported 40Gb of swap space available so it looked good. However, on running the code it bombed out at random, with an enigmatic message “Killed” on the user console. Putting trace lines in the code narrowed it down to a random point while traversing a large array of pointers to pointers to the 15Gb heap, about an hour into the run. It looked for all the world like pointer corruption causing a Segmentation Fault or Bus Error, but if the process had a got that kind of signal it should have done a core dump, and it wasn’t happening. The output suggested a SIGKILL. But it wasn’t me sending it, and there were no other users logged in. Even a stack space error, which might have happened as qsort() was involved, was ruled out as the cause – and the kernel would have sent an ABORT, not a KILL in this case.

I finally tracked it down to a rather interesting “undocumented” feature. Within the kernel there is a structure called swblock in the John Dyson/Matthew Dillon VM handler, and a pointer called “swap” points to a chain of these these structures. Its size is limited by the value of kern.maxswzone, which you can tweak in /boot/loader.conf. The default (AMD64 8.2-Release) allows for about 14Gb of swap space, but because it’s a radix tree you’ll probably get a headache if you try to work it out directly. However, if you increase the swap space beyond this it’ll report as being there, but when when you try to use the excess, crunch!

Although this variable is tunable, it’s also hard-limited in include/param.h to 32M entries; each entry can manage 16 pages (if I’ve understood the code correctly). If you want to see exactly what’s happening, look at vm/swap_pager.c.

The hard limit to the size number of swblock entries is set as VM_SWZONE_SIZE_MAX in include/param.h. I have no idea why, and I haven’t yet tried messing with it as I have no need.

So, what was happening to my process? Well it was being killed by vm_pageout_oom() in vm/vm_pageout.c. This gets called when swap space OR the swblock space is exhausted, either in vm_pageout.c or swap_pager.c. In some circumstances it prints “swap zone exhausted, increase kern.maxswzone\n” beforehand, but not always. It’s effect is to find the largest running non-system process on the system and shoot it using killproc().

Mystery solved.

So, here’s how to set up 32Gb of USABLE swap space.

First, find your swap device. You can have as many as you want. This is either a disk slice available in /dev or, if you want to swap to a file, you need ramdisk to do the mapping. You can have as many swap devices as you like and FreeBSD will balance their use.

If you can’t easily add another drive, you’re best option is to add an extra swap file in the form of a ram disk on the existing filing system. You’ll need to be the root user for this.

To create a ram disk you’ll need a file to back it. The easy way to create one is using “dd”:

dd if=/dev/zero of=/var/swap0 bs=1G count=32

This creates a 32G file in /var filled with nulls – adjust as required.

It’s probably a good idea to make this file inaccessible to anyone other than root:

chmod 0600 /var/swap0

Next, create your temporary file-backed RAM disk:

mdconfig -a -t vnode -f /var/swap0 -u 0

This will create a device called /dev/md with a unit number specified by -u; in this case md0. The final step is to tell the system about it:

swapon /dev/md0

If you wish, you can make this permanent by adding the following to /etc/rc.conf:

swapfile="/var/swap0"

Now here’s the trick – if your total swap space is greater than 14Gb (as of FreeBSD 8.2) you’ll need to increase the value of kern.maxswzone in /boot/loader.conf. To check the current value use:

sysctl kern.maxswzone

The default output is:

kern.maxswzone: 33554432

That’s 0x2000000 32M. For 32Gb of VM I’m pretty sure you’d be okay with 0x5000000 (in round numbers), which translates to 83886080, so add this line to /boot/loader.conf (create the file if it doesn’t exist) and reboot.

kern.maxswzone="83886080"