ZFS is not always the answer. Bring back gmirror!

The ZFS bandwaggon has momentum, but ZFS isn’t for everyone. UFS2 has a number of killer advantages in some applications.

ZFS is great if you want to store a very large number of normal files safely. It’s copy-on-write (COW) is a major advantage for backup, archiving and general data safety, and datasets allow you to fine-tune almost any way you can think of. However, in a few circumstances, UFS2 is better. In particular, large random-access files do badly with COW.

Unlike traditional systems, a block in a file isn’t overwritten in place, it always ends up at a different location. If a file started off contiguous it’ll pretty soon be fragmented to hell and performance will go off a cliff. Obvious victims will be databases and VM hard disk images. You can tune for these, but to get acceptable performance you need to throw money and resources to bring ZFS up to the same level. Basically you need huge RAM caches, possibly an SLOG, and never let your pool get more than 50% full. If you’re unlucky enough to end up at 80% full ZFS turns off speed optimisations to devote more RAM to caching as things are going to get very bad fragmentation-wise.

If these costs are a problem, stuck with UFS. And for redundancy, there is still good old GEOM Mirror (gmirror). Unfortunately the documentation of this now-poor relation has lagged a bit, and what once worked as standard, doesn’t. So here are some tips.

The most common use of gmirror (with me anyway) is a twin-drive host. Basically I don’t want things to fail when a hard disk dies, so I add a second redundant drive. Such hosts (often 1U servers) don’t have space for more than two drives anyway – and it pays to keep things simple.

Setting up a gmirror is really simple. You create one using the “gmirror label” command. There is no “gmirror create” command; it really is called “label”, and it writes the necessary metadata label so that mirror will recognise it (“gmirror destroy” is present and does exactly what you might expect).

So something like:

gmirror label gm0 ada1 ada2

will create a device called /dev/mirror/gm0 and it’ll contain ada1’s contents mirrored on to ada2 (once it’s copied it all in the background). Just use /dev/mirror/gm0 as any other GEOM (i.e. disk). Instead of calling it gm0 I could have called it gm1, system, data, flubnutz or anything else that made sense, but gm0 is a handy reminder that it’s the first geom mirror on the system and it’s shorter to type.

The eagle eyed might have noticed I used ada1 and ada2 above. You’ve booted off ada0, right? So what happens if you try mirroring yourself with “gmirror label gm0 ada0 ada1“? Well this used to work, but in my experience it doesn’t any more. And on a twin-drive system, this is exactly what you want to do. But it is still possible, read on…

How to set up a twin-drive host booting from a geom mirror

First off, before you do anything (even installing FreeBSD) you need to set up your disks. Since the IBM XT, hard disks have been partitioned using an MBR (Master Boot Record) at the start. This is really old, naff, clunky and Microsoft. Those in the know have been using the far superior GPT system for ages, and it’s pretty cross-platform now. However, it doesn’t play nice with gmirror, so we’re going to use MBR instead. Trust me on this.

For the curious, know that GPT keeps a copy of the partition table at the beginning and end of the disk, but MBR only has one, stored at the front. gmirror keeps its metadata at the end of the disk, well away from the MBR but unfortunately in exactly the same spot as the spare GPT. You can hack the gmirror code so it doesn’t do this, or frig around with mirroring geoms rather than whole disks and somehow get it to boot, but my advice is to stick to MBR partitioning or BSDlabels, which is an extension. There’s not a lot of point in ever mounting your BSD boot drive on a non-BSD system, so you’re not losing much whatever you choose.

Speaking of metadata, both GPT and gmirror can get confused if they find any old tables or labels on a “new” disk. GPT will find old backup partition tables and try to restore them for you, and gmirror will recognise old drives as containing precious data and dig its heels in when you try to overwrite it. Both gpart and gmirror have commands to erase their metadata, but I prefer to use dd to overwrite the whole disk with zeros anyway before re-use. This checks that the disk is actually good, which is nice to know up-front. You could just erase the start and end if you were in a hurry and wanted to calculate the offsets.

Please generate and paste your ad code here. If left empty, the ad location will be highlighted on your blog pages with a reminder to enter your code. Mid-Post

The next thing you’ll need to do is load the geom_mirror kernel module. Either recompile the kernel with it added, or if this fills you with horror,  just add ‘load_geom_mirror=”yes”‘ to /boot/loader.conf. This does bring it in early enough in the process to let you boot from it. The loader will boot from one drive or the other and then switch to mirror mode when it’s done.

So, at this point, you’ve set up FreeBSD as you like on one drive (ada0), selecting BSDlabels or MBR as the partition method and UFS as the file system. You’ve set it to load the geom_mirror module in loader.conf.  You’re now looking at a root prompt on the console, and I’m assuming your drives are ada0 and ada1, and you want to call your mirror gm0.

Try this:

gmirror label gm0 ada0

Did it work? Well it used to once, but now you’ll probably get an error message saying it could not write metadata to ada0. If (when) this happens I know of one answer, which I found after trying everything else. Don’t be tempted to try everything else yourself (such as seeing if it works with ada1). Anything you do will either fail if you’re lucky, or make things worse. So just reboot, and select single-user mode from the loader menu.

Once you’re at the prompt, type the command again, and this time it should say that gm0 is created. My advice is to now reboot rather than getting clever.

When you do reboot it will fail to mount the root partition and stop, asking for help to find it. Don’t panic. We know where it’s gone. Mount it with “ufs:/dev/mirror/gm0s1a” or whatever slice you had it on if you’ve tried to be clever. Forgot to make a note? Don’t worry, somewhere on the boot long visible on the screen it actually tell you the name of the partition it couldn’t find.

After this you should be “in”. And to avoid this inconvenience next time you boot you’ll need to tweak /etc/fstab using an editor of your choice, although real computer nerds only use vi. What you need to do is replace all references to the actual drive with the gm0 version. Therefore /dev/ada0s1a should be edited to read /dev/mirror/gm0s1a. On a current default install, which no longer partitions the drive, this will only apply the root mount point and the swap file.

Save this, reboot (to test) and you should be looking good. Now all that remains is to add the second drive (ada1 in the example) with the line:

gmirror insert gm0 ada1

You can see the effect by running:

gmirror status

Unless your drive is very small, gm0 will be DEGRADED and it will say something about being rebuilt. The precise wording has changed over time. Rebuilding takes hours, not seconds so leave it. Did I mention it’s a good idea to do this when the system isn’t busy?

ZFS Optimal Array Size

So there I was looking at a pile of eight drives and an empty storage array, and wondering how to cofigure it for best performance under ZFS. “Everyone knows” the formula right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say your data drives should be a power of two (i.e. 2,4,8,16) plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery:ZFS; although it didn’t originate there I’ll call it the Lucas rule anyway

I have my own rule – redundancy should be two drives or 30%. Why? Well drives in an array have a really nasty habit of failing two at a time. It’s not sods law, it’s a real phenomenon caused by the stress of re-silvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2, 6+2. From there on I go to raidz3 with 7+3, 8+3, 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.

However, If you merge my rule with the Lucas rule the only valid sizes are 2+2 and 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice” I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now the trouble with IT “best practice” guides is they’re written by technicians based on observation, not OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations you won’t know when the reason has become irrelevant. Unfortuantely, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – lets just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error correcting data happens – in very simple terms. Suppose you have a sequence of numbers such as:

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit when added to the remaining four gives you a total ending in zero. For example, supposing ‘3’ when missing. Add up the remainder and you get 17. You need 3 more to get to a zero, so the missing number must be 3.

Digression over. ZFS calculates a block of error correction data for the blocks of data it’s just written and adds this as the last block in the sequence. If If ZFS blocks and sectors were the same size, this would be fine writing another sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4k sectors instead of the traditional 512b. So, suppose you had 2k ZFS blocks on a 4k sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read it, overwrite half, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow Lucas’ rule, and make sure your data drives are a power of two.

Except…

This may have been true once, but now we have variable ZFS blocks sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And, now we have lz4 compression, the physical block sizes are variable anyway.

For those not in the know about this, lz4 compression is a no-brainer. It wont’ compress stuff it can’t, and its fast. Most files will compress to at least 2:1, often more – which means when you read a block only half the data needs to travel down the bus to get in memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that the block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but this is more-than eclipsed by the improved transfer rate.

So in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I do commend my rule about the number of redundant drives. This is based on a hardware issue, and no update to the software is going to fix this any time soon.

ESXi, NFS, ZFS and vfs.nfsd.async

So there I was, reading the source code to FreeBSD’s nfsd (as you do), trying to figure out why ESXi’s performance was so bad when used with an NFS datastore in a ZFS dataset. Actually, I had some idea. There’s a lot out there on the interweb about whether it’s safe to tweak it to ignore requests to flush the write cache using the sysctl tunable vfs.zfs.cache_flush_disable. (For what it’s worth, I’d say that if your drives are on a UPS it’s fine).

But why does ESXis suck so badly in this respect with NFS connected datastores? What is this excessive cache flushing all about? I decided to install it on an HP Microserver and get to some serious debugging.

Okay, here is how ZFS writes work. When you write something it doesn’t actually write, it puts it in the ZIL. This is an Intent Log – i.e. writes intended to happen.  Not exactly a write cache, but it has the same effect, and because of the way ZFS works it’s perfectly safe for avoiding data corruption. If a transaction is waiting in the ZIL when the music stops, the transaction is lots but the disk isn’t trashed. (NB. It’s also possible to put a ZIL on a log drive rather than RAM – I won’t discuss this here).

This should speed things up, right? Normally it does, but not when NFS is being abused. Let me explain. NFS has a transaction commit instruction. The client can tell NFS to flush everything in a transaction to permanent storage and not return until it’s finished. Sometimes you really need this, like if you’re updating the super-block in a database structure. Most of the time you don’t.

Enter ESXi running brain-dead Windows guest machines. How does it know when they’re writing something it isn’t a super-block? It doesn’t. So its solution (as far as I can tell) is to send NFS a commit after every single write and hang around waiting until it’s done it. There’s no point in having the ZIL at all, as it needs to be flushed every time. Putting the ZIL on disk is even worse, as you get an extra write/read for each transaction. I’ve seen people trying to put fast SSDs on the system to try and overcome this – best of luck with that.

As you move further down the chain, FreeBSD, being POSIX compliant whenever possible, will pass on the request for a synchronous write all the way to the disk. Send a block to a SATA or SAS drive and it will initially be cached, right? The write will then complete and the data actually written in the background while the rest of the system zips along. Except that it then issues a SATA or SAS “flush cache” command and waits until everything in its cache has been committed.

In tests this paranoid behaviour lead to running at 20% throughput or less.

Now, if you’re backing an emulated Windows disk you’re always at risk of data corruption, because FAT and NTFS are corruptable. And, dare I say it, crash rather too often. Let’s face it, if you’re worried about stuff like that you wouldn’t be running Windows – never mind as a VM, So lets be sensible about it.

So why was I reading the nfsd code? Well the obvious answer to this performance problem would be to simply ignore NTFS commit commands coming from the client. This is better than killing off all synchronous writes using the tunable vfs.zfs.cache_flush_disable because ZFS itself might be updating its uberblock and have a valid reason for doing it.

My plan was to hack the code – I’ve seen this done elsewhere. But wanting to do things properly I thought I should make it a system tunable. So I took a look at where the synchronous writes were happening – vdev_disk.c and vdev_geom.c (depending on whether you were hitting the raw drive or the GEOM). Lo and behold there was a global called nfs_sync that was compared along with the SYNC flag, and if either were true the sync request was ignored.  So where did nfs_async come from? Digging further back it comes from nfs_nfsdserv.c , where it’s set by a system tuneable – vfs.nfsd.async. Now that’s an interesting name! Follow the stable auto variable in nfsrvd_write() and the nfs_async global if you want to see what I’m on about.

A quick Google for vfs.nfsd.async revealed – nothing. I seem to have found another useful tunable that’s yet to be documented. although it’s been in the source since at least 10.0. So I’ll get on to documenting after I’ve done a few more tests.

But if you’re having Windows/NFS problems, especially with ESXi, try setting  vfs.nfsd.async instead of crudely disabling cache flushing with vfs.zfs.cache_flush_disable. Let me know how you get on.

Incidentally, you can disable synchronous writes to a dataset using the “sync=disabled” ZFS option. It helps, but not much. I’m still digging to find out why.
Or you could just use Virtualbox instead.

 

FreeBSD, ZFS and Denial of Service

I’ve been using ZFS since FreeBSD 8, and it has it’s uses. It’s pretty be wonderful and all that, but I was actually pretty happy with UFS, and switching to ZFS isn’t a no-brainer.

So what’s the up-side to ZFS? Well you get more error checking and correction and it’s great for managing huge filing systems. You can snapshot and roll back, and do lots of other wonderful stuff with datasets and rive arrays. And it’s more “auto” when it comes to allocating disk space. But call me old fashioned if you like; I don’t like “auto” if I can avoid it.

Penguinistas might not “get” this next bit, but on a UNIX system you didn’t normally have One Big Disk. Instead you had several, and even if you only had one, you’d partition the slice it up so it looked like several. And then, of course, you’d mount disks or partitions on to the root filing system wherever you wanted them to appear.

For reliability, you could also create mirrors and striped RAIDs, put a FS on them and mount them wherever you wanted. And demount them, and mount them somewhere else, and so on.

ZFS does all this good stuff, but automatically, and often as One Big Disk. A good thing? Well… if you must. But there are a few points you might want to consider before diving in.

First off, I like to know where and on which disk my data actually resides. I’m really uneasy with ZFS deciding for me. If ZFS loses it, I want to know where to find it. I also like having a FS on each drive or partition, so I can pull the drive out and mount it wherever I want to get data off – or move it from machine to machine. It’s my data, I’ll do what I want to with it, dammit! You can do this virtually with ZFS datasets, but you can’t unplug a dataset and hold it in your hand. Datasets, of course, are fluid rather than fixed in size, so you don’t need to guess how much space to allocate.

Secondly, with UFS I get to decide what hardware is used for each kind of file. Parts of the FS that are rarely used can be put on slow, cheap, huge disks. The database goes on a velociraptor or better, and the swap partitions – well! Okay, you can use multiple zpools for difference performance situations but then you’re using it like UFS.

Thirdly, there’s a price for all this ZFS wonderfulness. Apart from the software overhead, the Copy-on-Write business needs a lot of RAM to maintain good performance. Fragmentation no the physical drive is guaranteed. If you’re running software (e.g. a database) that uses random access files and lots of transaction, UFS with its in-place modification wins out. A DBMS will take care of its own consistency and storage optimisation, and it has the edge as it knows what the data represents at the application level.

But what of the Denial of Service problem in the headline? Okay, it’s been a bit of a ramble, but this is something you must consider.

There are always management issues with One Big Disk. Linux users seem oblivious to this, but this doesn’t mean putting everything on a big partition is a great plan – even if you’re using a single disk in practice.

With the old way of having multiple partitions, each with an FS, mounted on the directory tree, when an FS on a partition/drive filled up, it is was full. You couldn’t create more files on it. You either have to delete unwanted stuff, or you can mount a bigger drive in its place. With One Big Disk, when it’s full, it’s also full. The difference is that you can’t write any data anywhere on the entire FS. And this is where DoS comes it.

Take, for example, /var/log. Any UNIX admin with a bit of sense will have this in its own partition. If some script kiddie then did something that caused a lot of log file activity, eventually you’d run out of space in /var/log. But the rest of the system would still be alive. With UFS the default installation process created partitions with sensible sizes. Using the One Big Disk principle, ZFS satisfies the requests of any disk-eating process until there isn’t a single byte left anywhere, and then rolls over saying the zpool is full. Or it would say it if there was a monitor connected to the server in a data centre miles away, and there was someone there to look at it.

With ZFS you can set a limit to the size on a dataset-by-dataset basis and prevent this sort of thing from happening. But it doesn’t happen by default, so set your quotas manually if you’re plonking the OS, and in particular /var on it.

Okay, this might sound a bit anti-ZFS, and I’ve yet to have a disaster with a ZFS system that’s required me to move drives around, so I don’t really know how possible it is when the chips are down. And ZFS has is a nice unified way of doing stuff, rather than fiddling around with geom and the FS separately. But after a couple of years with FreeBSD 10, where it became practical to boot from ZFS, shouldn’t I be feeling a bit more enthusiastic about it?

Having a ZFS pool attached as a data store rather than as a boot device is, of course, a different story. That’s when you see the benefits. But it does also eat resources, so I want the benefits to be worth it for the particular application. For the time being I’m putting the OS on UFS, usually with a data partition for databases to thrash, and userland putting simple files on ZFS – best of both worlds.

UbuntuBSD – lovechild of Linux and FreeBSD

It’s no secret that Linux users with good taste have viewed the FreeBSD kernel with envious eyes for many years. A while back Debian distributions started having the FreeBSD kernel as an option instead of the Linux one. (Yes, you read that correctly). But now things seem to have been turned up a notch with UbuntuBSD.

It seems a group of penguinistas regard the Ubuntu world’s adoption of systemd as a step too far, and forked. And rather than keeping with Linux, they’ve opted to dump the whole kernel and bolt the Ubuntu front-end on to FreeBSD instead, getting kernel technology like ZFS and jails but “…keeping the familiarity of Ubuntu”.

Where could this be going? We already have PC-BSD for a “shrink wrapped” graphical desktop environment. Is anyone actually using it? I’m not. I’m sure we’ve all downloaded it out of curiosity, but if I want a Windows PC I’ll have a Windows PC. With BSD I’m more than happy with a command line, thank you very much.

UbuntuBSD could be different. Linux users actually use the graphical desktop, and most can’t cope with a command line. If they were to switch to FreeBSD instead, UbuntuBSD would make a lot of sense.

Although it’s only been around a month, in early beta form, its Sourceforge page is showing a lot of downloads. If I wanted to run a graphical desktop on top of FreeBSD, UbuntuBSD would make a lot of sense over PC-BSD, because I get the vibes that Ubuntu has desktop applications more together.

The project has just launched its own web site too, at www.ubuntubsd.org.

So does this spell the end of PC-BSD, Ubuntu Linux, Windows 10 or none of the above? It’s surely a strong vote against systemd.

FreeBSD 10.3 hangs on upgrade – beware!

There seems to be a bit of a problem with upgrades to FreeBSD 10.3-RELEASE. Basically, shutdown -r is hanging, requiring you to manually reset the machine (turn it off and on again). This is annoying unless the machine in question happens to be at a data centre on a different continent, in which “annoying” really doesn’t cut it.

This was a known issue with 10.3-STABLE., but it appears to have made it in to -RELEASE too.

I suggest not using freebsd-update. Basically if you follow the official instructions you may need someone on hand to reboot the old fashioned way.

FreeBSD Device Driver Memory Allocation

Yesterday someone asked me how to allocate memory in a FreeBSD device driver. Although not quite as simple as a user-space malloc(), it’s relatively simple – but could I remember the name/parameter order? Not confidently, so I suggested RTFM.

A quick look at the manual doesn’t actually cover it very well. Basically there are special versions of malloc()/free() and they’re have exactly the same names, except the parameters are different. For example, malloc() has two extra parameters; one is the memory type (used for kernal statistics purposes), and one is a flags field, with options whether you’re prepared to wait, or is this a critical situation and using the reserve pool is okay.

For details, see “man 9 malloc”. The ‘9’ is important, as otherwise you’ll get the user-land version in libc. (Incidentally, a read through the libc code should put you off algorithms making wanton use dynamic memory allocation if you weren’t already).

Now what the FreeBSD documentation doesn’t tell you (and something for my to-do list) is how to actually make use of this in a device driver. I had to go back to code I’d written ten years ago to remind me, as I’m just as guilty of copying and tweaking my standard code many times over without really remembering what it does.

But before you go worrying about allocating dynamic memory in a device driver, consider that there’s no reason why you can’t just use static memory – just allocate in BSS in the normal way. Okay, this won’t suit every eventuality but on on most of my simple drivers, which have been to mess with custom hardware for a single process, it’s not actually a problem.

Okay, so you still want to use dynamic? Well to get the kernel versions instead of the the libc ones you need to include instead. As I mentioned above, for some reason using the same names must have seemed like a good idea at the time, but the parameters are different.

The other thing you should be aware of is when about allocating kernel memory you are talking about non-paged. Don’t go crazy.

There is also a memory allocation tracker and statistics dumper available in the libc version (see /etc/malloc.conf), which will help you out if you’ve messed up memory allocation. Don’t expect any such help with the kernel. However, if you compile the kernel with the INVARIANTS option set it will scrub freed memory with 0xdeadc0de, which is handy if you find yourself using unallocated or free kernel RAM. Actually, this is a pretty good idea if you’re writing KLDs anyway, as it stops and does a core dump at the first sign you’ve screwed up any kernel structures.

The documentation in “man 9 malloc” should be enough to cope with the extra parameters; basically the malloc_type. Note that the first parameter to the MALLOC_DEFINE macro is actually a name you make up! By convention it’s in the form M_XXXXX, in upper case.

Also note that when you’re freeing memory it’s not normally zeroed. Therefore someone else using kernel memory might be able to allocate it and read what your driver wrote. Okay, bug deal – if the bad guys are installing kernel modules it’s game anyway. But… consider the bad guys cause a kernel panic and get a core dump.

 

FreeBSD sysarch kernel panic vulnerability

A bug has been found and fixed in the FreeBSD kernel that would allow someone with malicious intent to crash a running system. It’d be difficult to achieve unless the attacker had console access. However it’s been patched for all supported systems. See here for all the details (which I won’t repeat).

The problem was found by Core Security, and they have provided an excellent write-up here.

But if you want it in plain English:

The sysarch() system call is used to get/set processor-specific stuff. You’re not supposed to call it directly; you’re supposed to call a processor-specific library if you want to do things like that, but you still can call it if you want to. On processors that support memory segments, such as i386,  there is a Local Descriptor Table (LDT) to manage them if you want to mess with specific stuff like that. However, for security reasons, you can only modify the LDT using the sysarch() call, which checks what you’re trying to do and prevents applications from doing anything crazy.

Unfortunately the AMD64 implementation of the code gets the checking wrong. If you use a signed integer it’s always going to be less than another unsigned value, and when it compares the two parameters to make sure that one is less than the other it passes when it shouldn’t, and the rogue parameter causes it to go funky-deux and overwrite a shed load of stuff.

This is in all in:

/sys/amd64/amd64/sys_machdep.c

in the function:

int amd64_set_ldt(td, uap, descs)

The FreeBSD advisory contains a patch for all “supported” versions; but what if you’re using an older one? Using the information from Core it’s easy enough to patch. But what else is affected?

To save you the trouble, I’ve looked back at earlier versions. The problem code definitely exists in the AMD64 versions for 8.x, but isn’t present in any 7.x, as far as I can tell. The system call simply doesn’t exist. On i386 versions, I can’t see any obvious problem with the code.

How worried should we be? If someone breaks in to a system with shell access, they will be able to crash it. However, I think it’s very unlikely that any service is written in such a way that malicious data could cause the necessary parameters to be sent to sysarch() call. In fact, on checking the ports collection, it’s not exactly used all over the place. You’re highly unlikely to be running any application that even makes the call.

How to stop Samba users deleting their home directory and email

Samba Carnival Helsinki summer 2009
Samba Carnival (the real Samba logo is sooo boring)

UNIX permissions can send you around the twist sometimes. You can set them up to do anything, not. Here’s a good case in point…

Imagine you have Samba set up to provide users with a home directory. This is a useful feature; if you log in to the server with the name “fred” you (and only you) will see a network share called “fred”, which contains the files in your UNIX/Linux home directory. This is great for knowledgeable computer types, but is it such a great idea for normal lusers? If you’re running IMAP email it’s going to expose your mail directory, .forward and a load of other files that Windoze users might delete on a whim, and really screw things up.

Is there a Samba option to share home directories but to leave certain subdirectories alone? No. Can you just change the ownership and permissions of the critical files to  root and deny write access? No! (Because mail systems require such files to be owned by their user for security reasons). Can you use permission bits or even an ACL? Possibly, but you’ll go insane trying.

A bit of lateral thinking is called for here. Let’s start with the standard section in smb.conf for creating automatic shares for home directories:

[homes]
    comment = Home Directories
    browseable = no
    writable = yes

The “homes” section is special – the name “homes” is reserved to make it so. Basically it auto-creates a share with a name matching the user when someone logs in, so that they can get to their home directory.

First off, you could make it non-writable (i.e. set writable = no). Not much use to use luser, but it does the job of stopping them deleting anything. If read-only access is good enough, it’s an option.

The next idea, if you want it to be useful, is to use the directive “hide dot files” in the definition. This basically returns files beginning in a ‘.’ as “hidden” to Windoze users, hiding the UNIX user configuration files and other stuff you don’t want deleted. Unfortunately the “mail” directory, containing all your loverly IMAP folders is still available for wonton destruction, but you can hide this too by renaming it .mail. All you then need to do is tell your mail server to use the new name. For example, in dovecot.conf, uncomment and edit the line thus:

mail_location = mbox:~/.mail/:INBOX=/var/mail/%u

(Note the ‘.’ added at the front of ~/mail/)

You then have to rename each of the user’s “mail” folders to “.mail”, restart dovecot and the job is done.

Except when you have lusers who have turned on the “Show Hidden Files” option in Windoze, of course. A surprising number seem to think this is a good idea. You could decide that hidden files allows advanced users control of their mail and configuration, and anyone messing with a hidden file can presumably be trusted to know what you’re doing. You could even mess with Windoze policies to stop them doing this (ha!). Or you may take the view that all lusers and dangerous and if there is a way to mess things up, they’ll find it and do it. In this case, here’s Plan B.

The trick is to know that the default path to shares in [homes] is ‘~’, but you can actually override this! For example:

[homes]
    path = /usr/data/flubnutz
    ...

This  maps users’ home directories in a single directory called ‘flubnutz’. This is not that useful, and I haven’t even bothered to try it myself. When it becomes interesting is when you can add a macro to the path name. %S is a good one to use because it’s the name as the user who has logged in (the service name). %u, likewise. You can then do stuff like:

[homes]
     path = /usr/samba-files/%S
     ....

This stores the user’s home directory files in a completely different location, in a directory matching their name. If you prefer to keep the user’s account files together (like a sensible UNIX admin) you can use:

[homes]
     comment = Home Directories
     path = /usr/home/%S/samba-files
     browseable = no
     writable = yes<

As you can imagine, this stores their Windows home directory files in a sub-directory to their home directory; one which they can’t escape from. You have to create “~/samba-files” and give them ownership of it for this to work. If you don’t want to use the explicit path, %h/samba-files should do instead.

I’ve written a few scripts to create directories and set permissions, which I might add to this if anyone expresses an interest.