ESXi, NFS, ZFS and vfs.nfsd.async

So there I was, reading the source code to FreeBSD’s nfsd (as you do), trying to figure out why ESXi’s performance was so bad when used with an NFS datastore in a ZFS dataset. Actually, I had some idea. There’s a lot out there on the interweb about whether it’s safe to tell ZFS to ignore requests to flush the write cache, using the sysctl tunable vfs.zfs.cache_flush_disable. (For what it’s worth, I’d say that if your drives are on a UPS it’s fine.)
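For the record, here’s roughly how you’d flip it – a sketch only, and depending on your FreeBSD version the tunable may need setting from the loader rather than at runtime:

    # /boot/loader.conf – tell ZFS not to issue cache-flush commands to the drives
    vfs.zfs.cache_flush_disable="1"

    # or, if your version exposes it as a writable sysctl, at runtime:
    sysctl vfs.zfs.cache_flush_disable=1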

But why does ESXi suck so badly in this respect with NFS-connected datastores? What is all this excessive cache flushing about? I decided to install it on an HP Microserver and get down to some serious debugging.

Okay, here is how ZFS writes work. When you write something it doesn’t actually write it out; it puts it in the ZIL. This is an Intent Log – i.e. writes intended to happen. Not exactly a write cache, but it has the same effect, and because of the way ZFS works it’s perfectly safe as far as data corruption goes. If a transaction is waiting in the ZIL when the music stops, that transaction is lost but the disk isn’t trashed. (NB. It’s also possible to put the ZIL on a log drive rather than in RAM – I won’t discuss this here.)

This should speed things up, right? Normally it does, but not when NFS is being abused. Let me explain. NFS has a commit operation: the client can tell the server to flush everything in a transaction to permanent storage and not return until it’s finished. Sometimes you really need this – say, when you’re updating the super-block in a database structure. Most of the time you don’t.

Enter ESXi running brain-dead Windows guest machines. How does it know whether what they’re writing is a super-block or not? It doesn’t. So its solution (as far as I can tell) is to send NFS a commit after every single write and hang around until it’s done. There’s no point in having the ZIL at all, as it has to be flushed every time. Putting the ZIL on disk is even worse, as you get an extra write/read for each transaction. I’ve seen people trying to put fast SSDs on the system to overcome this – best of luck with that.

Moving further down the chain, FreeBSD, being POSIX compliant whenever possible, passes the request for a synchronous write all the way to the disk. Send a block to a SATA or SAS drive and it will initially be cached, right? The write then completes and the data is actually written in the background while the rest of the system zips along. Except that FreeBSD then issues a SATA or SAS “flush cache” command and waits until everything in the drive’s cache has been committed.

In tests, this paranoid behaviour led to throughput dropping to 20% of normal, or less.

Now, if you’re backing an emulated Windows disk you’re always at risk of data corruption, because FAT and NTFS are corruptible – and, dare I say it, Windows crashes rather too often. Let’s face it, if you were worried about stuff like that you wouldn’t be running Windows in the first place, never mind as a VM. So let’s be sensible about it.

So why was I reading the nfsd code? Well, the obvious answer to this performance problem would be to simply ignore NFS commit commands coming from the client. This is better than crudely disabling all cache flushes with the tunable vfs.zfs.cache_flush_disable, because ZFS itself might be updating its uberblock and have a perfectly valid reason for wanting a flush.

My plan was to hack the code – I’ve seen this done elsewhere. But wanting to do things properly, I thought I should make it a system tunable. So I took a look at where the synchronous writes were happening – vdev_disk.c and vdev_geom.c (depending on whether you’re hitting a raw drive or GEOM). Lo and behold, there was a global called nfs_async that was checked along with the sync flag, and if it was set the sync request was ignored. So where did nfs_async come from? Digging further back, it comes from nfs_nfsdserv.c, where it’s set by a system tunable – vfs.nfsd.async. Now that’s an interesting name! Follow the stable auto variable in nfsrvd_write() and the nfs_async global if you want to see what I’m on about.
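If you want to follow the trail yourself, a couple of greps over the source tree will get you there (assuming your sources are in the usual /usr/src):

    # where the nfs_async global lives and what reads it
    grep -rn nfs_async /usr/src/sys/fs/nfsserver/
    # and the sysctl plumbing in the file that sets it
    grep -n SYSCTL /usr/src/sys/fs/nfsserver/nfs_nfsdserv.c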

A quick Google for vfs.nfsd.async revealed – nothing. I seem to have found another useful tunable that’s yet to be documented, although it’s been in the source since at least 10.0. So I’ll get on to documenting it after I’ve done a few more tests.

But if you’re having Windows/NFS problems, especially with ESXi, try setting vfs.nfsd.async instead of crudely disabling cache flushing with vfs.zfs.cache_flush_disable. Let me know how you get on.
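As it’s undocumented, check the OID actually exists on your build before relying on it. Something along these lines should do – the usual caveat applies, you’re trading a little durability for a lot of speed:

    # confirm the tunable is there and see its current value
    sysctl -d vfs.nfsd.async
    sysctl vfs.nfsd.async
    # turn it on for this boot
    sysctl vfs.nfsd.async=1
    # and keep it across reboots
    echo 'vfs.nfsd.async=1' >> /etc/sysctl.conf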

Incidentally, you can disable synchronous writes to a dataset using the “sync=disabled” ZFS option. It helps, but not much. I’m still digging to find out why.
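For what it’s worth, sync is a per-dataset property, so you can confine the experiment to the datastore itself – the dataset name below is made up, substitute your own:

    # disable synchronous write semantics for just the ESXi datastore
    zfs set sync=disabled tank/esxi-datastore
    # and put it back when you’ve finished experimenting
    zfs set sync=standard tank/esxi-datastore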
Or you could just use VirtualBox instead.

NHS not exactly target of “cyber-attack”

The Security and Intelligence Committee takes all this cyber-thingy stuff very seriously.

I got home, put on BBC News and there was some dope being interviewed about a “cyber-attack on the NHS”, blithering on about their N3 network and how secure it is. I turned over to Sky, and there was someone from AlienVault talking sense, but not detail. Followed by the chair of the Security and Intelligence Committee, Dominic Grieve, blustering on about how seriously the government takes cyber-security but admitting he didn’t know anything about technology, in case it wasn’t obvious. I have never met anyone in parliament who does (see previous rants).

So what’s actually happening? It’s not an attack on the NHS. It’s a bunch of criminals taking advantage of a bug in Microsoft’s server software – almost certainly the one patched by MS17-010. An exploit for it was used by the NSA in America (Equation Group) until someone snaffled it and leaked it (allegedly the Shadow Brokers). It’s been used in a family of ransomware called WannaCrypt, and it’s being used to extort money all over the place. I see no reason to believe the NHS has been targeted specifically. It’s targeting everyone vulnerable, all over the world. Poorer countries, where more people are running old software or bootleg versions that don’t receive updates, are worst hit.

So why is the news full of it being the NHS, and only the NHS? One reason is that Microsoft issued the MS17-010 patch a good while back, and the NHS didn’t apply it. Why? Because they’re still using Windows XP, and Microsoft didn’t issue the patch for Windows XP. Simple.

A lot (repeat, A LOT) of companies use older Microsoft systems because (a) they’ve already bought them, so why pay again; and (b) Microsoft abandoned backward compatibility with Windows 7, so a lot of legacy software (dating back to the 1980s) won’t run any more. Upgrading isn’t so simple.

There’s a lot of money (from Crapita Illogica (CGI), Atos and G4S – amongst others) in flogging dodgy Microsoft-based IT to government projects. Microsoft Servers are considered Job Security for people who can only understand how to use a wizard, but know it’ll break down regularly and they’ll be called upon to reinstall it.

No one who knows how computers work would ever use Microsoft servers except as a last resort.

Update 13-May-2017

Guess what? Microsoft has now released a patch for older versions of their server software (i.e. Server 2003 and Windows XP). That was jolly quick; it’s like they had it already but didn’t release it, to punish those who refused to “upgrade”.

Blue Whale Challenge

This is a blue whale (the one hanging in the Marine Life Hall at the American Museum of Natural History). Nothing to do with the latest chain letter hoax.
People seem to be getting really worked up about a so-called “Blue Whale Challenge” social media game. And understandably so – it’s a game where vulnerable children are targeted and given progressively escalating challenges, culminating in something that will kill them.

I first saw this a couple of months ago, and each time it turns up the lurid details have been embellished further. It sounds too macabre to be true. And it isn’t.

About a year ago someone in Russia published an on-line article hoping to explain the high number of teenage suicides in the country, and blaming it on the Internet. Apparently a statistically significant number of teenagers belonging to one particular on-line group had died; the on-line group must therefore be to blame.

Wrong! If you have an on-line group of depressed teenagers then you are going to have a higher proportion of suicides amongst them. The writers have confused cause and effect.

However, facts never got in the way of a good lurid story and this one seems to have bounced around Russia for most of 2016, where it morphed into an evil on-line challenge game. It then jumped the language gap to English in winter 2017.

The story spreads as a cautionary tale, with the suggestion that you should pass it on to everyone you know so they can check their kids for early signs that they’re being targeted (specifically, cutting a picture of a whale into their arm). In other words, a classic email urban legend. It’s only a matter of time before the neighbourhood watch people add it to their newsletters.

Update:

The Daily Mail has reported this as fact, so I must be wrong and it must be true. Or perhaps I’m right and they have nothing to back their carefully worded account. Wouldn’t be the first time…