Dell FS12-NV7 Review – Bargain FreeBSD/ZFS box

It seems just about everyone selling refurbished data centre kit has a load of Dell FS12-NV7’s to flog. Dell FS-what? You won’t find them in the Dell catalogue, that’s for sure. They look a bit like C2100s of some vintage, and they have a lot in common. But on closer inspection they’re obviously a “special” for an important customer. Given the number of them knocking around, it’s obviously a customer with big data, centres stuffed full of servers with a lot of processing to do. Here’s a hint: It’s not Google or Amazon.

So, should you be buying a weirdo box with no documentation whatsoever? I’d say yes, definitely. If you’re interests are anything like mine. In a 2U box you can get twin 4-core CPUs and 64Gb of RAM for £150 or less. What’s not to like? Ah yes, the complete lack of documentation.

Over the next few weeks I intend to cover that. And to start off this is my first PC review for nearly twenty years.

So the Dell FS12-NV7:

FS-12 looking at the back panel. Note the cowling across the CPUs

As I mentioned, it’s a 2U full length heavy metal box on rails. On the back there are the usual I/O ports: a 9-way RS-232, VGA, two 1Gb Ethernet, two USB2 and a PS/2 keyboard and mouse. The front is taken up by twelve 3.5″ hard drive bays, with the status lights and power button on one of the mounting ears to make room. Unlike other Dell servers, all the connections are on the back, only.

If you want to play with the metalwork, the rear panel is modular and can easily be unscrewed although in practice there’s not much scope for enhancement without changing the  motherboard.

The FS12 has a single 1U PSU

Speaking of metalwork, it comes with  a single 1U PSU. There’s space above it for a second, but the back panel behind the PSU bay would need swapping – or removing – if you wanted to add a second. The area above the existing unit is just about the only space left in the box, and I have thought of piling up a load of 2.5″ drives there.

Taking the top off is where the fun starts. Inside there’s large Gigabyte EATX motherboard – a Gigabyte GA-3CESL-RH. All the ones I’ve seen are rev 1.7, which is a custom version but its similar to a rev 1.4. It does have, of all things, a floppy disk controller and an IDE (PATA) connector. More generally usefully, there are two more USB headers, a second RS-232 and six SATA sockets (3Gb). At the back there’s either a BMC module, or a socket where it used to be. If you like DRAC, knock yourself out (you’re likely to be barely concious to begin with). Seriously, this is old DRAC and probably only works with IE 2.0 or something. (You can probably tell I haven’t bothered to try it). The BIOS also allows you to redirect the console to the serial port for remote starting.

The Ethernet ports are Marvel 88E1116 1Gb, and haven’t given me any trouble. The firmware supports PXE, and I’m pleased to say that WoL works with the FreeBSD drives.

Somebody has pinched the slots!

Unfortunately, while the original Gigabyte model sported twin PCI and three PCIe sockets, the connectors are missing from these examples. It’s hard to find anything with a bit of grunt that can also use with your old but interesting PCI cards. It should be possible to rework it by adding the sockets and smoothing caps and sockets; fortunately the SMD decoupling caps are already still there.  On the other had, you could find another motherboard with PCI sockets if that’s what you really want.

But grunt is what this box is all about, and there’s plenty of that.

This is board was designed for Opteron Socket-F processors; specifically the 2000 series (Barcelona and Shanghi). The first digit refers to the number of physical CPUs that work together (either 2 or 8), the second is a code for the number of cores (1=1, 2=2, 3=4, 4=6, 5=8). The last two digits are a speed code. It’s not the frequency, it’s the benchmark speed.  I’ve heard rumours that some of FS-12s contain six-core CPUs, but I’ve only seen the 2373EE myself. The EE is the low power consumption version. Sweet.

If I could choose any Opeteron Socket-F CPU, the 2373EE is almost as good as it gets. It’s a tad slower than some of the other models running at 2.1GHz , but has significantly lower power and cooling requirements and was one of the last they produced in the 45nm process. It would be possible to change it for a 2.3GHz version, or one with six cores, but otherwise pretty much every other Opteron would be a downgrade. In other words, don’t think you can hot-rod it with a faster processor – you’re unlikely to find a Socket-F CPU anyway. After these, AMD switched to the Bulldozer line in an AM3+ socket.

This isn’t to say the CPU is modern. It does have the AMD virtualisation instructions, so it’s good news if you want to run nested 64-bit operating systems or hypervisors. The thing it lacks that I’d like most are the AES instructions that appeared in Bulldozer onwards. If you’re doing a lot of crypto, this matters. If you’re not, it doesn’t. Naturally, it implements the AMD64 instruction set, as now used by Intel, and all the media processing bit-twiddle stuff if you can use it. AMD has traditionally been at the forefront of processing smarter, whereas Intel goes for brute force and cranks up the clock speed. This is why AMD has, in my opinion, made assembler programming fun again.

Please generate and paste your ad code here. If left empty, the ad location will be highlighted on your blog pages with a reminder to enter your code. Mid-Post

Eight very capable Opteron cores: a good start. This generation supported DDR2 ECC RAM, and these boxes have 16 sockets (eight per CPU). They should be able to support 8Gb DIMMs, although I haven’t been able to verify this. Gigabyte’s documentation on similar motherboards is inconclusive as the earlier boards were from an time when 4Gb was all you could get. Again, I haven’t tried this but they are designed to handle 512Mb DIMMs. 1Gb and 4Gb certainly work and these tend to be available with any FS-12 you buy. At one time DDR2 ECC RAM was rather expensive. Not now. It’s much cheaper than DDR3 because, to be blunt, you can’t use it in very much these days.

And this is what makes the FS12 such a good buy: For about £150 you can get an eight-core processor with 64Gb of RAM. Bargain! And that’s before you look at the disk options.

The FS12, like most Dell Servers, is set up to run Windows and as a result requires a separate volume manager, on hardware designed to pretend Windows is looking at a disk. So-called “hardware” RAID. This takes the form of two PERC6/i cards occupying both PCIe cards on a riser. Fine if you want to run Windows or some other lightweight operating system, but PERC cards are about as naff as you can get for anything Unix-like. They work in RAID mode only, hiding the drives from the OS, and these are just a bit to old to be re-flashed in to anything useful.

The drives fit into a front-loading 12-way array with a SAS/SATA backplane. This is built in to the case; you can’t detach it and use it separately. Not without an angle grinder anyway, although if you really wanted to this would be a practical proposition. Note well that this is a backplane; not an expander, enclosure or anything so complex. Some Dell 2U servers like this do have an expander, which takes four SAS channels of SAS on a single cable and expands them to twelve, but this is the 1:1 version. And it’s an old one at that, using SFF-8484 connectors. If you’ve been using SAS for years you may still never have seen an SFF-8484 (AKA 32-pin Multi-lane). These didn’t last long and were quickly replaced with  the far more sensible SFF-8487(AKA 36-pin Mini-SAS). However, if you can sort out the cables (as I will explain in a later post), this backplane has possibilities.

But as it stands you get a the PERCs and a 12-slot drive array that’s only good for Windows or Linux. Unless, that is, you remove the backplane and the PERCs and make use of the six 3Gb SATA sockets on the motherboard. You’ll have to leave the drives in place and connect the cables directly back, but how many drives do you need?

There is one unfortunate feature of these boxes that is hard to ignore: the cooling. It’s effective, but when you turn it on it sounds like a jet engine spooling up. And then it gets even louder. There a lot you can do about this and I’m experimenting with options, which I’ll explain in a later post, but in the mean time you need to give everyone ear defenders, or install it in an outbuilding and use a KVM extender. I’ve been knocking around data centres for over twenty years and I’ve never heard one this bad.

The cooling is actually accomplished by five fans. Two are 1U size in the PSU, and are probably as annoying as any other ~40mm fan. The real screamers are two 80mm and one 60mm fan positioned between the drive cage and the motherboard. A cowling directs the one 80mm fan across each CPU and its DIMMs and the 60mm gives airflow over the Northbridge and PCI slots. They all spin really fast – in excess of 10,000rpm, and although they have sense and control wires nothing seems to be adjusting them downwards to the required rate.

My suspicion is that either the customer didn’t care about noise but wanted to keep everything as cool as possible, or that whatever operating system was installed (ESX I suspect) had a custom daemon to control their speed via the SAS backplane. I shall be going in to cooling options later, but note that the motherboard has five monitored and software adjustable fan connectors that are currently not used.

So, in summary, you’re getting a lot for your money if its the kind of thing you want. It’s ideal as a high-performance Unix box with plenty of drive bays (preferably running BSD and ZFS). In this configuration it really shifts. Major bang-per-buck. Another idea I’ve had is using it for a flight simulator. That’s a lot of RAM and processors for the money. If you forego the SAS controllers in the PCIe slots and dump in a decent graphics card and sound board, it’s hard to see what’s could be better (and you get jet engine sound effects without a speaker).

So who should buy one of these? BSD geeks is the obvious answer. With a bit of tweaking they’re a dream. It can build-absolutely-everything in 20-30 minutes. For storage you can put fast SAS drives in and it goes like the wind, even at 3Gb bandwidth per drive. I don’t know if it works with FreeNAS but I can’t see why not – I’m using mostly FreeBSD 11.1 and the generic kernel is fine. And if you want to run a load of weird operating systems (like Windows XP) in VM format, it seems to work very well with the Xen hypervisor and Dom0 under FreeBSD. Or CentOS if you prefer.

So I shall end this review in true PCW style:

Pros:

  • Cheap
  • Lots of CPUs,
  • Lots of RAM
  • Lots of HD slots
  • Great for BSD/ZFS or VMs

Cons:

  • Noisy
  • no AES-NI
  • SAS needs upgrading
  • Limited PCI slots

As I’ve mentioned, the noise and SAS are easy and relatively cheap to fix, and thanks to BitCoin miners, even the PCI slot problem can be sorted. I’ll talk about this in a later post.

 

Don’t blame Amazon, it’s Corporation Tax that’s broken

Well it looks like Amazon has only paid £1.3M UK tax, based on turnover of £Sqillions. Much wringing of hands and cries of “Something should be done!”. The same goes for Google, Starbucks or any other international company doing well in the UK. But nothing is being done to solve the problem, and for various reasons depending on your economic policy outlook.

First off, it’s not true to say Amazon pays very little tax in the UK. It pays VAT and PAYE. Lots of it. What it doesn’t pay much of is corporation tax, which is the tax on profits. And if you were an international company, you wouldn’t either. For international companies, corporation tax is, for practical purposes, optional. Companies may opt to pay as much or as little as suits their purpose.

If this is news to you, it works like this: Take Starbucks, for example. They managed to make very little profit in the UK. Because of this they were paying little or no corporation tax, which may seem odd when consider their ubiquitous presence in the high street. The reason was simple: Starbucks in the UK bought its coffee from its Dutch operation and the price was so high it wiped out the profits here. In Holland they were minting it, selling coffee to the UK, but the Dutch government took a liberal view on how much tax it should pay on these profits. Basically they were allowing Starbucks to pay a cut of what should have been UK corporation tax, and trouser the rest.

If Starbucks can do this simply by finding a foreign government prepared to sell out for a share of the profits, how easy is it for a Internet company with no physical product?

Basically, corporation tax would be a farce, were it not so serious. The problem is that it’s still paid in full by our local companies, putting them at an obvious disadvantage to foreign competition. It does more damage than good.

There are two solutions:

The left-wing idea is to make more new law against tax dodging. Somehow. And if international companies don’t like it, they can take their jobs, investment, VAT payments, PAYE payments and business rates and go somewhere else (e.g. Ireland). They’ll be gutted.

Back in the real world, if you have an unenforceable tax that damages local companies the smart thing to do is abandon it. But there is a problem with this – how do you make up the revenue you’re currently collecting from UK businesses (those that remain)? The obvious answer, and one the Conservatives won’t stomach, is to raise personal income tax. This isn’t actually a problem, because foreign companies will just have to cover it to keep take-home incomes stable (or lose staff) and local companies can afford to give everyone a pay rise out of the money that would have gone in corporation tax. Levelling the playing field won’t be painless in the short term, but this no reason to avoid it.

So Labour has a busted ideological plan and the Conservatives would be annihilated if they raised taxes. Something needs to break the deadlock, because newspapers naming and shaming global companies that are simply playing by the rules we gave them is no answer. Labour banging on about alleged “tax cuts for the rich” isn’t going to help. Neither will Conservative pledges not to raise any taxes. It’s not a question of raising or reducing taxes, it’s a question of balancing them properly.

Meanwhile the Irish government is laughing at us, all the way to the bank.

 

ZFS is not always the answer. Bring back gmirror!

The ZFS bandwaggon has momentum, but ZFS isn’t for everyone. UFS2 has a number of killer advantages in some applications.

ZFS is great if you want to store a very large number of normal files safely. It’s copy-on-write (COW) is a major advantage for backup, archiving and general data safety, and datasets allow you to fine-tune almost any way you can think of. However, in a few circumstances, UFS2 is better. In particular, large random-access files do badly with COW.

Unlike traditional systems, a block in a file isn’t overwritten in place, it always ends up at a different location. If a file started off contiguous it’ll pretty soon be fragmented to hell and performance will go off a cliff. Obvious victims will be databases and VM hard disk images. You can tune for these, but to get acceptable performance you need to throw money and resources to bring ZFS up to the same level. Basically you need huge RAM caches, possibly an SLOG, and never let your pool get more than 50% full. If you’re unlucky enough to end up at 80% full ZFS turns off speed optimisations to devote more RAM to caching as things are going to get very bad fragmentation-wise.

If these costs are a problem, stuck with UFS. And for redundancy, there is still good old GEOM Mirror (gmirror). Unfortunately the documentation of this now-poor relation has lagged a bit, and what once worked as standard, doesn’t. So here are some tips.

The most common use of gmirror (with me anyway) is a twin-drive host. Basically I don’t want things to fail when a hard disk dies, so I add a second redundant drive. Such hosts (often 1U servers) don’t have space for more than two drives anyway – and it pays to keep things simple.

Setting up a gmirror is really simple. You create one using the “gmirror label” command. There is no “gmirror create” command; it really is called “label”, and it writes the necessary metadata label so that mirror will recognise it (“gmirror destroy” is present and does exactly what you might expect).

So something like:

gmirror label gm0 ada1 ada2

will create a device called /dev/mirror/gm0 and it’ll contain ada1’s contents mirrored on to ada2 (once it’s copied it all in the background). Just use /dev/mirror/gm0 as any other GEOM (i.e. disk). Instead of calling it gm0 I could have called it gm1, system, data, flubnutz or anything else that made sense, but gm0 is a handy reminder that it’s the first geom mirror on the system and it’s shorter to type.

The eagle eyed might have noticed I used ada1 and ada2 above. You’ve booted off ada0, right? So what happens if you try mirroring yourself with “gmirror label gm0 ada0 ada1“? Well this used to work, but in my experience it doesn’t any more. And on a twin-drive system, this is exactly what you want to do. But it is still possible, read on…

How to set up a twin-drive host booting from a geom mirror

First off, before you do anything (even installing FreeBSD) you need to set up your disks. Since the IBM XT, hard disks have been partitioned using an MBR (Master Boot Record) at the start. This is really old, naff, clunky and Microsoft. Those in the know have been using the far superior GPT system for ages, and it’s pretty cross-platform now. However, it doesn’t play nice with gmirror, so we’re going to use MBR instead. Trust me on this.

For the curious, know that GPT keeps a copy of the partition table at the beginning and end of the disk, but MBR only has one, stored at the front. gmirror keeps its metadata at the end of the disk, well away from the MBR but unfortunately in exactly the same spot as the spare GPT. You can hack the gmirror code so it doesn’t do this, or frig around with mirroring geoms rather than whole disks and somehow get it to boot, but my advice is to stick to MBR partitioning or BSDlabels, which is an extension. There’s not a lot of point in ever mounting your BSD boot drive on a non-BSD system, so you’re not losing much whatever you choose.

Speaking of metadata, both GPT and gmirror can get confused if they find any old tables or labels on a “new” disk. GPT will find old backup partition tables and try to restore them for you, and gmirror will recognise old drives as containing precious data and dig its heels in when you try to overwrite it. Both gpart and gmirror have commands to erase their metadata, but I prefer to use dd to overwrite the whole disk with zeros anyway before re-use. This checks that the disk is actually good, which is nice to know up-front. You could just erase the start and end if you were in a hurry and wanted to calculate the offsets.

The next thing you’ll need to do is load the geom_mirror kernel module. Either recompile the kernel with it added, or if this fills you with horror,  just add ‘load_geom_mirror=”yes”‘ to /boot/loader.conf. This does bring it in early enough in the process to let you boot from it. The loader will boot from one drive or the other and then switch to mirror mode when it’s done.

So, at this point, you’ve set up FreeBSD as you like on one drive (ada0), selecting BSDlabels or MBR as the partition method and UFS as the file system. You’ve set it to load the geom_mirror module in loader.conf.  You’re now looking at a root prompt on the console, and I’m assuming your drives are ada0 and ada1, and you want to call your mirror gm0.

Try this:

gmirror label gm0 ada0

Did it work? Well it used to once, but now you’ll probably get an error message saying it could not write metadata to ada0. If (when) this happens I know of one answer, which I found after trying everything else. Don’t be tempted to try everything else yourself (such as seeing if it works with ada1). Anything you do will either fail if you’re lucky, or make things worse. So just reboot, and select single-user mode from the loader menu.

Once you’re at the prompt, type the command again, and this time it should say that gm0 is created. My advice is to now reboot rather than getting clever.

When you do reboot it will fail to mount the root partition and stop, asking for help to find it. Don’t panic. We know where it’s gone. Mount it with “ufs:/dev/mirror/gm0s1a” or whatever slice you had it on if you’ve tried to be clever. Forgot to make a note? Don’t worry, somewhere on the boot long visible on the screen it actually tell you the name of the partition it couldn’t find.

After this you should be “in”. And to avoid this inconvenience next time you boot you’ll need to tweak /etc/fstab using an editor of your choice, although real computer nerds only use vi. What you need to do is replace all references to the actual drive with the gm0 version. Therefore /dev/ada0s1a should be edited to read /dev/mirror/gm0s1a. On a current default install, which no longer partitions the drive, this will only apply the root mount point and the swap file.

Save this, reboot (to test) and you should be looking good. Now all that remains is to add the second drive (ada1 in the example) with the line:

gmirror insert gm0 ada1

You can see the effect by running:

gmirror status

Unless your drive is very small, gm0 will be DEGRADED and it will say something about being rebuilt. The precise wording has changed over time. Rebuilding takes hours, not seconds so leave it. Did I mention it’s a good idea to do this when the system isn’t busy?

ZFS Optimal Array Size

So there I was looking at a pile of eight drives and an empty storage array, and wondering how to cofigure it for best performance under ZFS. “Everyone knows” the formula right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say your data drives should be a power of two (i.e. 2,4,8,16) plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery:ZFS; although it didn’t originate there I’ll call it the Lucas rule anyway

I have my own rule – redundancy should be two drives or 30%. Why? Well drives in an array have a really nasty habit of failing two at a time. It’s not sods law, it’s a real phenomenon caused by the stress of re-silvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2, 6+2. From there on I go to raidz3 with 7+3, 8+3, 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.

However, If you merge my rule with the Lucas rule the only valid sizes are 2+2 and 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice” I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now the trouble with IT “best practice” guides is they’re written by technicians based on observation, not OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations you won’t know when the reason has become irrelevant. Unfortuantely, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – lets just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error correcting data happens – in very simple terms. Suppose you have a sequence of numbers such as:

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit when added to the remaining four gives you a total ending in zero. For example, supposing ‘3’ when missing. Add up the remainder and you get 17. You need 3 more to get to a zero, so the missing number must be 3.

Digression over. ZFS calculates a block of error correction data for the blocks of data it’s just written and adds this as the last block in the sequence. If If ZFS blocks and sectors were the same size, this would be fine writing another sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4k sectors instead of the traditional 512b. So, suppose you had 2k ZFS blocks on a 4k sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read it, overwrite half, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow Lucas’ rule, and make sure your data drives are a power of two.

Except…

This may have been true once, but now we have variable ZFS blocks sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And, now we have lx4 compression, the physical block sizes are variable anyway.

For those not in the know about this, lz4 compression is a no-brainer. It wont’ compress stuff it can’t, and its fast. Most files will compress to at least 2:1, often more – which means when you read a block only half the data needs to travel down the bus to get in memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that the block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but this is more-than eclipsed by the improved transfer rate.

So in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I I do commend my rule about the number of redundant drives. This is based on a hardware issue, and update to the software is going to fix this any time soon.

ESXi, NFS, ZFS and vfs.nfsd.async

So there I was, reading the source code to FreeBSD’s nfsd (as you do), trying to figure out why ESXi’s performance was so bad when used with an NFS datastore in a ZFS dataset. Actually, I had some idea. There’s a lot out there on the interweb about whether it’s safe to tweak it to ignore requests to flush the write cache using the sysctl tunable vfs.zfs.cache_flush_disable. (For what it’s worth, I’d say that if your drives are on a UPS it’s fine).

But why does ESXis suck so badly in this respect with NFS connected datastores? What is this excessive cache flushing all about? I decided to install it on an HP Microserver and get to some serious debugging.

Okay, here is how ZFS writes work. When you write something it doesn’t actually write, it puts it in the ZIL. This is an Intent Log – i.e. writes intended to happen.  Not exactly a write cache, but it has the same effect, and because of the way ZFS works it’s perfectly safe for avoiding data corruption. If a transaction is waiting in the ZIL when the music stops, the transaction is lots but the disk isn’t trashed. (NB. It’s also possible to put a ZIL on a log drive rather than RAM – I won’t discuss this here).

This should speed things up, right? Normally it does, but not when NFS is being abused. Let me explain. NFS has a transaction commit instruction. The client can tell NFS to flush everything in a transaction to permanent storage and not return until it’s finished. Sometimes you really need this, like if you’re updating the super-block in a database structure. Most of the time you don’t.

Enter ESXi running brain-dead Windows guest machines. How does it know when they’re writing something it isn’t a super-block? It doesn’t. So its solution (as far as I can tell) is to send NFS a commit after every single write and hang around waiting until it’s done it. There’s no point in having the ZIL at all, as it needs to be flushed every time. Putting the ZIL on disk is even worse, as you get an extra write/read for each transaction. I’ve seen people trying to put fast SSDs on the system to try and overcome this – best of luck with that.

As you move further down the chain, FreeBSD, being POSIX compliant whenever possible, will pass on the request for a synchronous write all the way to the disk. Send a block to a SATA or SAS drive and it will initially be cached, right? The write will then complete and the data actually written in the background while the rest of the system zips along. Except that it then issues a SATA or SAS “flush cache” command and waits until everything in its cache has been committed.

In tests this paranoid behaviour lead to running at 20% throughput or less.

Now, if you’re backing an emulated Windows disk you’re always at risk of data corruption, because FAT and NTFS are corruptable. And, dare I say it, crash rather too often. Let’s face it, if you’re worried about stuff like that you wouldn’t be running Windows – never mind as a VM, So lets be sensible about it.

So why was I reading the nfsd code? Well the obvious answer to this performance problem would be to simply ignore NTFS commit commands coming from the client. This is better than killing off all synchronous writes using the tunable vfs.zfs.cache_flush_disable because ZFS itself might be updating its uberblock and have a valid reason for doing it.

My plan was to hack the code – I’ve seen this done elsewhere. But wanting to do things properly I thought I should make it a system tunable. So I took a look at where the synchronous writes were happening – vdev_disk.c and vdev_geom.c (depending on whether you were hitting the raw drive or the GEOM). Lo and behold there was a global called nfs_sync that was compared along with the SYNC flag, and if either were true the sync request was ignored.  So where did nfs_async come from? Digging further back it comes from nfs_nfsdserv.c , where it’s set by a system tuneable – vfs.nfsd.async. Now that’s an interesting name! Follow the stable auto variable in nfsrvd_write() and the nfs_async global if you want to see what I’m on about.

A quick Google for vfs.nfsd.async revealed – nothing. I seem to have found another useful tunable that’s yet to be documented. although it’s been in the source since at least 10.0. So I’ll get on to documenting after I’ve done a few more tests.

But if you’re having Windows/NFS problems, especially with ESXi, try setting  vfs.nfsd.async instead of crudely disabling cache flushing with vfs.zfs.cache_flush_disable. Let me know how you get on.

Incidentally, you can disable synchronous writes to a dataset using the “sync=disabled” ZFS option. It helps, but not much. I’m still digging to find out why.
Or you could just use Virtualbox instead.

 

NHS not exactly target of “cyber-attack”

The Security and Intelligence Committee takes all this cyber-thingy stuff very seriously.

I got home, put on BBC News and there was some dope being interviewed about a “cyber-attack on the NHS”, blithering on about their M3 network and how secure it is. I turned over to Sky, and there was someone from Alienvault talking sense, but not detail. Followed by the chair of the Security and Intelligence Committee, Dominic Grieve, blustering on about how seriously the government took cyber-security but admitting he didn’t know anything about technology, in case it wasn’t obvious. I have never met anyone in parliament who does (see previous rants).

So what’s actually happening? It’s not an attack on the NHS. It’s a bunch of criminals taking advantage of a bug in Microsoft’s server software. Almost certainly MS17-010. An attack based on this exploit was used by NSA in America (Equation Group) until someone snaffled it and leaked it (allegedly Shadow Brokers). It’s been used in a family of ransomware called WannaCrypt, and it’s being used to extort money all over the place. I see no reason to believe the NHS has been targeted specifically. It’s targeting everyone vulnerable, all over the world. Poorer countries where they are running  more old software, or running bootleg version that don’t receive updates,  are worst hit.

So why is the news full of it being the NHS, and only the NHS? One reason is that Microsoft issued a patch for MS17-010 a good while back. And the NHS didn’t apply it. Why? Because they’re still using Windows XP and Microsoft didn’t issue the patch for Windows XP. Simple.

A lot (repeat A LOT) of companies use older Microsoft systems because (a) they’ve bought them, why should they pay again; and (b) Microsoft abandoned backward compatibility with Windows 7, so a lot of legacy software (dating back to the 1980’s) won’t run any more. Upgrading isn’t so simple.

There’s a lot of money (from Crapita Illogica (CGI), Atos and G4S – amongst others) in flogging dodgy Microsoft-based IT to government projects. Microsoft Servers are considered Job Security for people who can only understand how to use a wizard, but know it’ll break down regularly and they’ll be called upon to reinstall it.

No one who knows how computers work would ever use Microsoft servers except as a last resort.

Update 13-May-2017

Guess what? Microsoft has now released a patch for older versions of their server software (ie. Server 2003 and Windows XP). That was jolly quick; it’s like they had it already but didn’t release it to punish those who refused to “upgrade”.

Blue Whale Challenge

Blue Whale at the Marine Life Hall, American Museum of Natural History
This is a blue whale. Nothing to do with the latest chain letter hoax.
People seem to be getting really worked up about a so-called “Blue Whale Challenge” social media game. And understandably so – it’s a game where vulnerable children are targeted and given progressive challenge, culminating in something that will kill them.

I saw this first a couple of months ago, and each time it turns up the lurid details have been embellished further. It sounds too macabre to be true. And it’s not.

About a year ago someone in Russia published an on-line article hoping to explain the high number of teenage suicides in the country, and blaming it on the Internet. Apparently a statistically significant number of teenagers belonging to one particular on-line group had died; the on-line group must therefore be to blame.

Wrong! If you have an on-line group of depressed teenagers then you are going to have a higher proportion of suicides amongst them. The writers have confused cause and effect.

However, facts never got in the way of a good lurid story and this one seems to have bounced around Russia for most of 2016, where it morphed into an evil on-line challenge game. It then jumped the language gap to English in winter 2017.

The story spreads as a cautionary tale, with the suggestion that you should pass it on to everyone you know so they can check their kids for early signs they are being targeted (specifically, cutting a picture of a whale in to their arm). In other words, a classic email urban legend. It’s only a matter of time before the neighbourhood watch people add it to their newsletters.

Update:

The Daily Mail has reported this as fact, so I must be wrong and it must be true. Or perhaps I’m right and they have nothing to back their carefully worded account. Wouldn’t be the first time…

 

 

More Fraud on Amazon Marketplace

Fancy a roll of sellotape for £215.62? Amazon has this and 708,032 other products listed by a seller called linkedeu, who’s full range can be found here:
https://www.amazon.co.uk/s?merchant=AA722TCREQZHH.

This isn’t the first time sellers like this have appeared, and it won’t be the last. However, this time I’ve reported it to Amazon and I intend to time their response. How could they let some fraudster list nearly quarter of a million items without anyone checking?

The seller does have a business address in California, but I suspect this is fake too, and the name and address may well be a legitimate company.

 

ParentPay seriously broken (again)

400 Bad Request
ParentPay, the Microsoft-based school payment system that’s the bane of so many parents’ lives, has yet another problem. Since Saturday, every time I go to their web site I get a page back that displays as above. Eh? Where does this page come from – it’s not a browser message. A look at the source reveals what they’re up to:

<html>
<head><title>400 Request Header Or Cookie Too Large</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<center>Request Header Or Cookie Too Large</center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

 

Okay, but what the hell is wrong? This is using Chrome Version 56.0 on a Windows platform. Can ParentPay not cope with its standard request header? If a cookie is too large, the only culprit can be ParentPay itself for storing too much in its own cookie.

I’ve given them three days to fix it.

Unfortunately, parents of children at schools are forced to use this flaky web site and hand over their credit card details. How much confidence do I have in their technology? Take a guess!

Solution

So what to do about this? Well they have the URL https://parentpay.com, so I tried that too. It redirected to the original site, with a slightly different error message sent from the remote server – one that omitted mention of cookies. So it was definitely Chrome’s header? Upgrade Chrome for 56.0 to 57.0, just in case…. No dice.

A look at the cookies it stored was interesting. 67 cookies belonging to this site? I know Microsoft stuff is flabby, but this is ridiculous! Rather than trawling through them, I just decided to delete the lot.

That worked.

It appears ParentPay’s bonkers ASP code had stored more data in my browser than it was prepared to accept back. Stunning!

 

BT Internet Mail Fail (again)

BT Internet’s email system is broken AGAIN. It rejects everything it gets as “spam” (554 Message rejected, policy (3.2.1.1) – Your message looks like SPAM or has been reported as SPAM please read…)

Having checked against blacklists, and sent perfectly innocuous test text messages to friends account, it’s definitely busted.

My advice to anyone using BT Internet for important email is to get a proper account with a proper provider (or handle your email in-house if your name is not Fred and you don’t work from a shed).