ESXi, NFS, ZFS and vfs.nfsd.async

So there I was, reading the source code to FreeBSD’s nfsd (as you do), trying to figure out why ESXi’s performance was so bad when used with an NFS datastore in a ZFS dataset. Actually, I had some idea. There’s a lot out there on the interweb about whether it’s safe to tweak it to ignore requests to flush the write cache using the sysctl tunable vfs.zfs.cache_flush_disable. (For what it’s worth, I’d say that if your drives are on a UPS it’s fine).

But why does ESXis suck so badly in this respect with NFS connected datastores? What is this excessive cache flushing all about? I decided to install it on an HP Microserver and get to some serious debugging.

Okay, here is how ZFS writes work. When you write something it doesn’t actually write, it puts it in the ZIL. This is an Intent Log – i.e. writes intended to happen.  Not exactly a write cache, but it has the same effect, and because of the way ZFS works it’s perfectly safe for avoiding data corruption. If a transaction is waiting in the ZIL when the music stops, the transaction is lots but the disk isn’t trashed. (NB. It’s also possible to put a ZIL on a log drive rather than RAM – I won’t discuss this here).

This should speed things up, right? Normally it does, but not when NFS is being abused. Let me explain. NFS has a transaction commit instruction. The client can tell NFS to flush everything in a transaction to permanent storage and not return until it’s finished. Sometimes you really need this, like if you’re updating the super-block in a database structure. Most of the time you don’t.

Enter ESXi running brain-dead Windows guest machines. How does it know when they’re writing something it isn’t a super-block? It doesn’t. So its solution (as far as I can tell) is to send NFS a commit after every single write and hang around waiting until it’s done it. There’s no point in having the ZIL at all, as it needs to be flushed every time. Putting the ZIL on disk is even worse, as you get an extra write/read for each transaction. I’ve seen people trying to put fast SSDs on the system to try and overcome this – best of luck with that.

As you move further down the chain, FreeBSD, being POSIX compliant whenever possible, will pass on the request for a synchronous write all the way to the disk. Send a block to a SATA or SAS drive and it will initially be cached, right? The write will then complete and the data actually written in the background while the rest of the system zips along. Except that it then issues a SATA or SAS “flush cache” command and waits until everything in its cache has been committed.

In tests this paranoid behaviour lead to running at 20% throughput or less.

Please generate and paste your ad code here. If left empty, the ad location will be highlighted on your blog pages with a reminder to enter your code. Mid-Post

Now, if you’re backing an emulated Windows disk you’re always at risk of data corruption, because FAT and NTFS are corruptable. And, dare I say it, crash rather too often. Let’s face it, if you’re worried about stuff like that you wouldn’t be running Windows – never mind as a VM, So lets be sensible about it.

So why was I reading the nfsd code? Well the obvious answer to this performance problem would be to simply ignore NTFS commit commands coming from the client. This is better than killing off all synchronous writes using the tunable vfs.zfs.cache_flush_disable because ZFS itself might be updating its uberblock and have a valid reason for doing it.

My plan was to hack the code – I’ve seen this done elsewhere. But wanting to do things properly I thought I should make it a system tunable. So I took a look at where the synchronous writes were happening – vdev_disk.c and vdev_geom.c (depending on whether you were hitting the raw drive or the GEOM). Lo and behold there was a global called nfs_sync that was compared along with the SYNC flag, and if either were true the sync request was ignored.  So where did nfs_async come from? Digging further back it comes from nfs_nfsdserv.c , where it’s set by a system tuneable – vfs.nfsd.async. Now that’s an interesting name! Follow the stable auto variable in nfsrvd_write() and the nfs_async global if you want to see what I’m on about.

A quick Google for vfs.nfsd.async revealed – nothing. I seem to have found another useful tunable that’s yet to be documented. although it’s been in the source since at least 10.0. So I’ll get on to documenting after I’ve done a few more tests.

But if you’re having Windows/NFS problems, especially with ESXi, try setting  vfs.nfsd.async instead of crudely disabling cache flushing with vfs.zfs.cache_flush_disable. Let me know how you get on.

Incidentally, you can disable synchronous writes to a dataset using the “sync=disabled” ZFS option. It helps, but not much. I’m still digging to find out why.
Or you could just use Virtualbox instead.

 

FreeBSD, ZFS and Denial of Service

I’ve been using ZFS since FreeBSD 8, and it has it’s uses. It’s pretty be wonderful and all that, but I was actually pretty happy with UFS, and switching to ZFS isn’t a no-brainer.

So what’s the up-side to ZFS? Well you get more error checking and correction and it’s great for managing huge filing systems. You can snapshot and roll back, and do lots of other wonderful stuff with datasets and rive arrays. And it’s more “auto” when it comes to allocating disk space. But call me old fashioned if you like; I don’t like “auto” if I can avoid it.

Penguinistas might not “get” this next bit, but on a UNIX system you didn’t normally have One Big Disk. Instead you had several, and even if you only had one, you’d partition the slice it up so it looked like several. And then, of course, you’d mount disks or partitions on to the root filing system wherever you wanted them to appear.

For reliability, you could also create mirrors and striped RAIDs, put a FS on them and mount them wherever you wanted. And demount them, and mount them somewhere else, and so on.

ZFS does all this good stuff, but automatically, and often as One Big Disk. A good thing? Well… if you must. But there are a few points you might want to consider before diving in.

First off, I like to know where and on which disk my data actually resides. I’m really uneasy with ZFS deciding for me. If ZFS loses it, I want to know where to find it. I also like having a FS on each drive or partition, so I can pull the drive out and mount it wherever I want to get data off – or move it from machine to machine. It’s my data, I’ll do what I want to with it, dammit! You can do this virtually with ZFS datasets, but you can’t unplug a dataset and hold it in your hand. Datasets, of course, are fluid rather than fixed in size, so you don’t need to guess how much space to allocate.

Secondly, with UFS I get to decide what hardware is used for each kind of file. Parts of the FS that are rarely used can be put on slow, cheap, huge disks. The database goes on a velociraptor or better, and the swap partitions – well! Okay, you can use multiple zpools for difference performance situations but then you’re using it like UFS.

Thirdly, there’s a price for all this ZFS wonderfulness. Apart from the software overhead, the Copy-on-Write business needs a lot of RAM to maintain good performance. Fragmentation no the physical drive is guaranteed. If you’re running software (e.g. a database) that uses random access files and lots of transaction, UFS with its in-place modification wins out. A DBMS will take care of its own consistency and storage optimisation, and it has the edge as it knows what the data represents at the application level.

But what of the Denial of Service problem in the headline? Okay, it’s been a bit of a ramble, but this is something you must consider.

There are always management issues with One Big Disk. Linux users seem oblivious to this, but this doesn’t mean putting everything on a big partition is a great plan – even if you’re using a single disk in practice.

With the old way of having multiple partitions, each with an FS, mounted on the directory tree, when an FS on a partition/drive filled up, it is was full. You couldn’t create more files on it. You either have to delete unwanted stuff, or you can mount a bigger drive in its place. With One Big Disk, when it’s full, it’s also full. The difference is that you can’t write any data anywhere on the entire FS. And this is where DoS comes it.

Take, for example, /var/log. Any UNIX admin with a bit of sense will have this in its own partition. If some script kiddie then did something that caused a lot of log file activity, eventually you’d run out of space in /var/log. But the rest of the system would still be alive. With UFS the default installation process created partitions with sensible sizes. Using the One Big Disk principle, ZFS satisfies the requests of any disk-eating process until there isn’t a single byte left anywhere, and then rolls over saying the zpool is full. Or it would say it if there was a monitor connected to the server in a data centre miles away, and there was someone there to look at it.

With ZFS you can set a limit to the size on a dataset-by-dataset basis and prevent this sort of thing from happening. But it doesn’t happen by default, so set your quotas manually if you’re plonking the OS, and in particular /var on it.

Okay, this might sound a bit anti-ZFS, and I’ve yet to have a disaster with a ZFS system that’s required me to move drives around, so I don’t really know how possible it is when the chips are down. And ZFS has is a nice unified way of doing stuff, rather than fiddling around with geom and the FS separately. But after a couple of years with FreeBSD 10, where it became practical to boot from ZFS, shouldn’t I be feeling a bit more enthusiastic about it?

Having a ZFS pool attached as a data store rather than as a boot device is, of course, a different story. That’s when you see the benefits. But it does also eat resources, so I want the benefits to be worth it for the particular application. For the time being I’m putting the OS on UFS, usually with a data partition for databases to thrash, and userland putting simple files on ZFS – best of both worlds.

FreeBSD 10.0 and ZFS

It’s finally here: FreeBSD 10.0 with ZFS. I’ve been pretty happy for many years with twin-drive systems protected using gmirror and UFS. It does what I want. If a disk fails it drops it out and sends me an email, but otherwise carries on. When I put a replacement blank disk it can re-build the mirror. If I take one disk out, put it into another machine and boot it, it’ll wake up happy. It’s robust!

So why mess around with ZFS, the system that puts your drives in to a pool and decides where things are stored, so you don’t have to worry your pretty little head about it? The snag is that the old ways are dying out, and sooner or later you’ll have no choice.

Unfortunately, the transition hasn’t been that smooth. First off you have to consider 2Tb+ drives and how you partition them. MBR partition tables have difficulties with the number of sectors, although AF drives with larger sectors can bodge around this. It can get messy though, as many systems expect 512b sectors, not 4k, so everything has to be AF-aware. In my experience, it’s not worth the hassle.

The snag with the new and limitless “GPT” scheme is that it keeps safe copies of the partition at the end of the disk, as well as the start. This tends to be where gmirror stores its meta-data too. You can’t mix gmirror and GPT. Although the code is hackable, I’ve got better things to do.

So the good new is that it does actually work as a replacement for gmirror. To test it I stuck two new 3Tb AF drives into a server and installed 10.0 using the new procedure, selecting the menu option zfs on root option and GPT partitioning. This is shown in the menu as “Experimental”, but seems to work. What you end up with, if you select two drives and say you want a zfs mirror, is just that.

Being the suspicious type, I pulled each of the drives in turn to see what had happened, and the system continues without a beat just like gmirror did. There were also a nice surprises when I stuck the drives back in and “onlined” them:

First-off the re-build was almost instant. Secondly, HP’s “non-hot-swap” drive bays work just fine for hot-swap under FreeBSD/ZFS. I’d always suspected this was a Windoze nonsense. All good news.

So why is the re-build so fast? It’s obvious when you consider what’s going on. The GEOM system works a block level. If the mirror is broken it has no way of telling which blocks are valid, so the only option is to copy them all. A major feature of ZFS, however, is that the directories and files have validation codes in the blocks above, going all the way to the root. Therefore, by starting at the root and chaining down, it’s easy to find the blocks containing changed data, and copy them. Nice! Getting rid of separate volume managers and file systems has its advantages.

So am I comfortable with ZFS? Not yet, but I’m a lot happier with it when its a complete, integrated solution. Previously I’d only been using on data drives in multi-drive configurations, as although it was possible to install root on ZFS, it was a real PITA.

IP Expo 2011 – what was fun

That’s IP Expo over with for another year. I’ve never quite get what the show is about, but it’s one I seriously consider attending. It’s lack of focus is probably what makes it intersting. I’ve always suspected that some exhibition organiser kept reading about IP and decided it was a buzzword lacking its own show and started one. Anything connected to an IP network is fair game, and these days this means almost everything.

The Violin memory box is an amazing piece of kit – a massive, high-performance thumb drive connected via fibre channel. They’ve done a lot of work basically striping data across flash modules which boosts performance, avoids hitting the same flash chip repetitively and gives redundancy – I believe they can lose six modules before it bites and its hot swappable.

There were quite a lot of other storage solutions on show, some interesting, some very much the same. One company is using ZFS, which is a technology I’ve had my eye on for some time.

Prize for the fund gadget is Pelco’s thermal imaging camera – at less than £2K for the low-res version it suddenly becomes affordable, and it certainly works well enough. Still on CCTV, someone had a monitor connected to a web cam and some software to identity faces. Spooky. This put a mug-shot of everyone looking at the camera down the side of the screen, recorded how long they were standing there and guessed their sex and age. It actually took ten years of most people, which helped with the feel-good but this technology obviously works and an obvious application is snooping on people looking at shop windows to work out what attracts the right kind of demographic (why else would they have developed it). I should point out that this was showing off the screen – the web-cam and face recognition was a crowd-puller

Another interesting bit of kit is an LG stand-alone vmware terminal. This basicall allows you to virtualise your PC and use them on a thin client. The implications of this for managability are obvious – keep your PC environment in a server room, where it can be cloned and configured at will, and leave a dumb-terminal in the front line. If the terminal breaks or is stolen – no problem whatsoever. The snag? Well the terminals aren’t cheap and they could do with toughened glass.