ZFS Optimal Array Size

So there I was, looking at a pile of eight drives and an empty storage array, and wondering how to configure it for best performance under ZFS. “Everyone knows” the formula, right? The best performance in a raidz array comes if you use 2^D+P drives. That’s to say, your data drives should be a power of two (i.e. 2, 4, 8, 16), plus however many redundant (parity) drives for the raidz level you desire. This is mentioned quite often in the Lucas book FreeBSD Mastery: ZFS; although it didn’t originate there, I’ll call it the Lucas rule anyway.

I have my own rule – redundancy should be two drives or 30%. Why? Well, drives in an array have a really nasty habit of failing two at a time. It’s not sod’s law; it’s a real phenomenon caused by the stress of resilvering shaking out any other drives that are “on the edge”. This means I go for configurations such as 4+2, 5+2 and 6+2. From there on I go to raidz3 with 7+3, 8+3 and 9+3. As there’s no raidz4, 12 drives is the limit – for 14 drives I’d have two vdevs (LUNs) of 5+2 each.
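In code, one plausible reading of that rule looks like this – my own illustration (the function name and the 12-drive ceiling are how I’d express it, not anything from any ZFS tool):

import math
from typing import Optional

def parity_for(data_drives: int) -> Optional[int]:
    # Parity is 30% of the data drives, rounded up, floored at two.
    p = max(2, math.ceil(0.3 * data_drives))
    if p > 3 or data_drives + p > 12:
        return None   # no raidz4 exists, so split into two vdevs instead
    return p

for d in range(4, 11):
    p = parity_for(d)
    print(f"{d} data drives ->", f"raidz{p} ({d}+{p})" if p else "two vdevs")

That reproduces the 4+2 through 9+3 progression above, and punts to multiple vdevs beyond it.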

However, if you merge my rule with the Lucas rule, the only valid sizes are 2+2, 4+2 and 8+3. And I had just eight drives to play with.

I was curious – how was the Lucas rule derived? I dug out the book, and it doesn’t say. Anywhere. Having a highly developed suspicion of anything described as “best practice”, I decided to test it on my rag-bag collection of drives in the Dell backplane, and guess what? No statistically significant difference.

Now, the trouble with IT “best practice” guides is that they’re written by technicians based on observation, not by OS programmers who know how stuff actually works. The first approach has a lot of merit, but unless you know the reason for your observations, you won’t know when that reason has become irrelevant. Unfortunately, as an OS programmer, I now had a duty to figure out what this reason might have been.

After wading through the code and finding nothing much helpful, I did what I should have done first and considered the low-level disk layout. It’s actually quite simple.

Your stuff is written to disk in a series of blocks, right? In a striped array, each drive gets a block in turn to spread the load. No problem there. Well, there will be a problem if your ZFS block size doesn’t match the block size on the drives, but that’s a complication I’m going to overlook – let’s just assume you got that bit right.

So where does the optimal number of disks come from? I contend that on a striped vdev there never was one. The problem only comes when you add redundant drives.

I’m going to digress here to explain how error-correcting data works – in very simple terms. Suppose you have a sequence of numbers such as:

5 8 2 3

Each number is stored on a separate piece of paper, and to guard against loss you add a fifth number so that when you add them all up you get a total ending in zero. In this example, the total of the first 4 is 18. You can add an extra 2 to make the total 20, which ends in zero, so the fifth number is going to be 2.

5 8 2 3 2

Now, if we lose any one of those five numbers we can work out what it must have been – just work out which digit, when added to the remaining four, gives you a total ending in zero. For example, suppose the ‘3’ went missing. Add up the remainder and you get 17. You need 3 more to reach a total ending in zero, so the missing number must be 3.
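Here’s that digression as runnable code – a toy check-digit scheme, nothing to do with ZFS’s actual parity maths:

data = [5, 8, 2, 3]
check = (10 - sum(data) % 10) % 10   # total is 18, so the check digit is 2
stored = data + [check]              # [5, 8, 2, 3, 2] – total 20, ends in zero

# Lose any single value (here the '3') and reconstruct it from the rest:
remaining = [5, 8, 2, 2]             # the four surviving numbers, total 17
recovered = (10 - sum(remaining) % 10) % 10
print(recovered)                     # 3 – the missing number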

Digression over. ZFS calculates a block of error-correction data for the blocks of data it’s just written, and adds this as the last block in the sequence. If ZFS blocks and sectors were the same size, this would be fine – writing one extra sector is quick. But ZFS blocks no longer match sectors. In fact, they’re tunable over a wide range. We’ve also got 4k sectors instead of the traditional 512b. So, suppose you had 2k ZFS blocks on a 4k-sector disk? Your parity data could end up being just half a sector, meaning that ZFS has to read the sector, overwrite half of it, and write it back rather than just writing it. This sucks. But if you choose the number of disks carefully, you end up with parity blocks that do fit. So, always make sure you follow the Lucas rule, and make sure your data drives are a power of two.
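The arithmetic behind that is easy to check for yourself. A back-of-envelope sketch (the 128k record and 4k sector are just illustrative choices):

RECORD = 128 * 1024     # one ZFS block of 128k, a common recordsize
SECTOR = 4096           # a 4k-native drive

sectors = RECORD // SECTOR              # 32 sectors per block
for d in (3, 4, 5, 6, 8):               # data drives, parity excluded
    fit = "clean" if sectors % d == 0 else "padding/read-modify-write"
    print(f"D={d}: {sectors / d:.2f} sectors per drive -> {fit}")

Only the power-of-two values of D divide the block into a whole number of sectors per drive; everything else leaves a ragged edge.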

Except…

This may have been true once, but now we have variable ZFS block sizes, and they tend to be much larger than the sector size anyway. In this situation the “magic” configurations no longer matter. And now that we have lz4 compression, the physical block sizes are variable anyway.

For those not in the know, lz4 compression is a no-brainer. It won’t try to compress stuff it can’t, and it’s fast. Most files will compress to at least 2:1, often more – which means when you read a block, only half the data needs to travel down the bus to get into memory. Everything suddenly goes twice as fast, at the expense of one core having to do some work. It’s true that the block and sector sizes are nowhere near matching, and this is bound to have a performance hit, but it’s more than eclipsed by the improved transfer rate.
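As a back-of-envelope figure – the bus speed here is an assumption for illustration, not a measurement:

BUS_MB_S = 550    # assumed link speed to the drive, purely illustrative
RATIO = 2.0       # the 2:1 compression quoted above

print(f"effective read rate: ~{BUS_MB_S * RATIO:.0f} MB/s of uncompressed data")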

So, in summary, forget the 2^D+P “best practice” formula. It was only valid in the early days. Have whatever config you like, but I do commend my rule about the number of redundant drives. That one is based on a hardware issue, and no software update is going to fix it any time soon.

ESXi, NFS, ZFS and vfs.nfsd.async

So there I was, reading the source code to FreeBSD’s nfsd (as you do), trying to figure out why ESXi’s performance was so bad with an NFS datastore in a ZFS dataset. Actually, I had some idea. There’s a lot out there on the interweb about whether it’s safe to tweak ZFS to ignore requests to flush the write cache, using the sysctl tunable vfs.zfs.cache_flush_disable. (For what it’s worth, I’d say that if your drives are on a UPS it’s fine.)

But why does ESXi suck so badly in this respect with NFS-connected datastores? What is this excessive cache flushing all about? I decided to install it on an HP Microserver and get to some serious debugging.

Okay, here is how ZFS writes work. When you write something, it doesn’t actually get written; it goes into the ZIL. This is an Intent Log – i.e. writes intended to happen. Not exactly a write cache, but it has the same effect, and because of the way ZFS works it’s perfectly safe as far as data corruption goes. If a transaction is waiting in the ZIL when the music stops, the transaction is lost but the disk isn’t trashed. (NB. It’s also possible to put the ZIL on a log drive rather than RAM – I won’t discuss this here.)

This should speed things up, right? Normally it does, but not when NFS is being abused. Let me explain. NFS has a transaction commit instruction: the client can tell NFS to flush everything in a transaction to permanent storage and not return until it’s finished. Sometimes you really need this – like when you’re updating the superblock in a database structure. Most of the time you don’t.

Enter ESXi running brain-dead Windows guest machines. How does it know, when they’re writing something, that it isn’t a superblock? It doesn’t. So its solution (as far as I can tell) is to send NFS a commit after every single write and hang around waiting until it’s done. There’s no point in having the ZIL at all, as it has to be flushed every time. Putting the ZIL on disk is even worse, as you get an extra write/read for each transaction. I’ve seen people trying to put fast SSDs on the system to overcome this – best of luck with that.
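To see why this hurts, here’s a toy model of the arithmetic – nothing to do with the real nfsd, and the 5ms flush cost is an assumption:

import time

class ToyNfsServer:
    # A write goes into a pending list (ZIL-ish); only commit() stalls.
    def __init__(self, flush_cost=0.005):       # assumed 5ms per cache flush
        self.pending, self.flush_cost = [], flush_cost

    def write(self, data):
        self.pending.append(data)               # fast: just logged as intent

    def commit(self):
        time.sleep(self.flush_cost)             # stand-in for the flush stall
        self.pending.clear()

def run(writes, commit_every):
    srv = ToyNfsServer()
    start = time.perf_counter()
    for i in range(writes):
        srv.write(b"x" * 4096)
        if (i + 1) % commit_every == 0:
            srv.commit()
    srv.commit()
    return time.perf_counter() - start

print("commit per 100 writes:", run(200, 100))  # a few flushes: fast
print("commit every write:  ", run(200, 1))     # 200 flushes: crawls

The per-write case spends almost all its time stalled in flushes, which is exactly the shape of the problem.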

As you move further down the chain, FreeBSD, being POSIX-compliant wherever possible, will pass the request for a synchronous write all the way down to the disk. Send a block to a SATA or SAS drive and it will initially be cached, right? The write then completes and the data is actually written in the background while the rest of the system zips along. Except that, for a synchronous write, it then issues a SATA or SAS “flush cache” command and waits until everything in the drive’s cache has been committed.

In tests, this paranoid behaviour led to running at 20% throughput or less.

Now, if you’re backing an emulated Windows disk you’re always at risk of data corruption, because FAT and NTFS are corruptible – and, dare I say it, Windows crashes rather too often anyway. Let’s face it, if you were worried about stuff like that you wouldn’t be running Windows in the first place – never mind as a VM. So let’s be sensible about it.

So why was I reading the nfsd code? Well, the obvious answer to this performance problem would be to simply ignore NFS commit commands coming from the client. This is better than crudely killing off all cache flushes with the tunable vfs.zfs.cache_flush_disable, because ZFS itself might be updating its uberblock and have a valid reason for doing it.

My plan was to hack the code – I’ve seen this done elsewhere. But, wanting to do things properly, I thought I should make it a system tunable. So I took a look at where the synchronous writes were happening – vdev_disk.c and vdev_geom.c (depending on whether you’re hitting the raw drive or GEOM). Lo and behold, there was a global called nfs_async that was checked along with the SYNC flag, and if it was set the sync request was ignored. So where did nfs_async come from? Digging further back, it comes from nfs_nfsdserv.c, where it’s set by a system tunable – vfs.nfsd.async. Now that’s an interesting name! Follow the stable auto variable in nfsrvd_write() and the nfs_async global if you want to see what I’m on about.

A quick Google for vfs.nfsd.async revealed – nothing. I seem to have found another useful tunable that’s yet to be documented, although it’s been in the source since at least 10.0. So I’ll get on to documenting it after I’ve done a few more tests.

But if you’re having Windows/NFS problems, especially with ESXi, try setting vfs.nfsd.async instead of crudely disabling cache flushing with vfs.zfs.cache_flush_disable. Let me know how you get on.
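For reference, that’s a one-liner at runtime, or a line in /etc/sysctl.conf to survive a reboot (assuming the tunable behaves as described above – remember, it’s undocumented):

sysctl vfs.nfsd.async=1        # at runtime
vfs.nfsd.async=1               # or persistently, in /etc/sysctl.conf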

Incidentally, you can disable synchronous writes on a dataset using the “sync=disabled” ZFS option. It helps, but not much. I’m still digging to find out why.

Or you could just use VirtualBox instead.


NHS not exactly target of “cyber-attack”

The Security and Intelligence Committee takes all this cyber-thingy stuff very seriously.

I got home, put on BBC News and there was some dope being interviewed about a “cyber-attack on the NHS”, blithering on about their N3 network and how secure it is. I turned over to Sky, and there was someone from AlienVault talking sense, but not detail. Followed by the chair of the Security and Intelligence Committee, Dominic Grieve, blustering on about how seriously the government took cyber-security, but admitting he didn’t know anything about technology, in case it wasn’t obvious. I have never met anyone in parliament who does (see previous rants).

So what’s actually happening? It’s not an attack on the NHS. It’s a bunch of criminals taking advantage of a bug in Microsoft’s server software – almost certainly MS17-010. An exploit based on this bug was used by the NSA in America (Equation Group) until someone snaffled it and leaked it (allegedly Shadow Brokers). It’s been used in a family of ransomware called WannaCrypt, and it’s being used to extort money all over the place. I see no reason to believe the NHS has been targeted specifically. It’s targeting everyone vulnerable, all over the world. Poorer countries, where more old software is in use, or bootleg versions that don’t receive updates, are worst hit.

So why is the news full of it being the NHS, and only the NHS? One reason is that Microsoft issued a patch for MS17-010 a good while back. And the NHS didn’t apply it. Why? Because they’re still using Windows XP and Microsoft didn’t issue the patch for Windows XP. Simple.

A lot (repeat: A LOT) of companies use older Microsoft systems because (a) they’ve bought them, so why should they pay again; and (b) Microsoft abandoned backward compatibility with Windows 7, so a lot of legacy software (dating back to the 1980s) won’t run any more. Upgrading isn’t so simple.

There’s a lot of money (from Crapita Illogica (CGI), Atos and G4S – amongst others) in flogging dodgy Microsoft-based IT to government projects. Microsoft Servers are considered Job Security for people who can only understand how to use a wizard, but know it’ll break down regularly and they’ll be called upon to reinstall it.

No one who knows how computers work would ever use Microsoft servers except as a last resort.

Update 13-May-2017

Guess what? Microsoft has now released a patch for older versions of their software (i.e. Server 2003 and Windows XP). That was jolly quick; it’s almost like they had it already but didn’t release it, to punish those who refused to “upgrade”.

Blue Whale Challenge

Blue Whale at the Marine Life Hall, American Museum of Natural History
This is a blue whale. Nothing to do with the latest chain letter hoax.
People seem to be getting really worked up about the so-called “Blue Whale Challenge” social media game. And understandably so – it’s supposedly a game in which vulnerable children are targeted and given progressive challenges, culminating in something that will kill them.

I saw this first a couple of months ago, and each time it turns up the lurid details have been embellished further. It sounds too macabre to be true. And it’s not.

About a year ago someone in Russia published an on-line article hoping to explain the high number of teenage suicides in the country, and blaming it on the Internet. Apparently a statistically significant number of teenagers belonging to one particular on-line group had died; the on-line group must therefore be to blame.

Wrong! If you have an on-line group of depressed teenagers then you are going to have a higher proportion of suicides amongst them. The writers have confused cause and effect.

However, facts never got in the way of a good lurid story and this one seems to have bounced around Russia for most of 2016, where it morphed into an evil on-line challenge game. It then jumped the language gap to English in winter 2017.

The story spreads as a cautionary tale, with the suggestion that you should pass it on to everyone you know so they can check their kids for early signs of being targeted (specifically, cutting a picture of a whale into their arm). In other words, a classic email urban legend. It’s only a matter of time before the neighbourhood watch people add it to their newsletters.

Update:

The Daily Mail has reported this as fact, so I must be wrong and it must be true. Or perhaps I’m right and they have nothing to back their carefully worded account. Wouldn’t be the first time…


More Fraud on Amazon Marketplace

Fancy a roll of sellotape for £215.62? Amazon has this and 708,032 other products listed by a seller called linkedeu, whose full range can be found here:
https://www.amazon.co.uk/s?merchant=AA722TCREQZHH.

This isn’t the first time sellers like this have appeared, and it won’t be the last. However, this time I’ve reported it to Amazon, and I intend to time their response. How could they let some fraudster list nearly three-quarters of a million items without anyone checking?

The seller does have a business address in California, but I suspect this is fake too – the name and address may well belong to some unconnected legitimate company.


ParentPay seriously broken (again)

400 Bad Request
ParentPay, the Microsoft-based school payment system that’s the bane of so many parents’ lives, has yet another problem. Since Saturday, every time I go to their web site I get back a page that displays as above. Eh? Where does this page come from? It’s not a browser message. A look at the source reveals what they’re up to:

<html>
<head><title>400 Request Header Or Cookie Too Large</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<center>Request Header Or Cookie Too Large</center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->


Okay, but what the hell is wrong? This is Chrome 56.0 on a Windows platform. Can ParentPay not cope with its standard request header? If a cookie is too large, the only culprit can be ParentPay itself, for storing too much in its own cookies.

I’ve given them three days to fix it.

Unfortunately, parents of children at schools are forced to use this flaky web site and hand over their credit card details. How much confidence do I have in their technology? Take a guess!

Solution

So what to do about this? Well, they also have the URL https://parentpay.com, so I tried that. It redirected to the original site, with a slightly different error message from the remote server – one that omitted any mention of cookies. So it was definitely Chrome’s header? Upgrade Chrome from 56.0 to 57.0, just in case… No dice.

A look at the cookies it had stored was interesting: 67 cookies belonging to this one site? I know Microsoft stuff is flabby, but this is ridiculous! Rather than trawling through them, I just deleted the lot.

That worked.

It appears ParentPay’s bonkers ASP code had stored more data in my browser than it was prepared to accept back. Stunning!
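The arithmetic is simple enough: stock nginx allows 8k for a long header line (large_client_header_buffers), and the browser sends every cookie for the site in a single Cookie: header. Rough numbers – the 150-byte average is a guess, not a measurement:

cookies = 67        # what Chrome had accumulated for the site
avg_size = 150      # assumed average bytes per name=value pair
print(cookies * avg_size)   # ~10050 bytes – over an 8192-byte buffer

Overflow the buffer and nginx sends back exactly the 400 page above.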


BT Internet Mail Fail (again)

BT Internet’s email system is broken AGAIN. It rejects everything it gets as “spam” (554 Message rejected, policy (3.2.1.1) – Your message looks like SPAM or has been reported as SPAM please read…)

Having checked against blacklists, and sent perfectly innocuous test messages to friends’ accounts, I can say it’s definitely busted.

My advice to anyone using BT Internet for important email is to get a proper account with a proper provider (or handle your email in-house if your name is not Fred and you don’t work from a shed).

MAG Airports web site exploitable for mailbombing attacks

Last July I was surprised to receive an email of “special offers” from Manchester Airport. I’ve only ever been to Manchester once, and I drove. It was actually sent to a random email address; was the company just sending out random spam?

I checked, and visiting their web site produced a JavaScript pop-up asking you to enter your email address to receive special offers. I wondered if I’d accidentally confirmed acceptance on to the wrong mailing list, so I checked. No. This sign-up doesn’t bother to confirm that you actually own the email address entered; it just starts spamming whoever you point it at.

It got worse. A look at the code showed it was easy for someone to make a load of calls to their site and add as many bogus addresses as they liked at the rate of several every second.

And it gets even worse – a quick look at the sites for other airports operated by MAG had identical pop-up sign-ups (Stansted, Bournemouth and East Midlands).

Naturally I called them to let them know what a bunch of silly arses they were. After being passed around from one numpty to another, I was promised a call back. “Okay, but I’ll go public if you don’t bother”.

Guess what? That was last July, and they haven’t bothered. They did eventually remove the pop-up box – but they didn’t disable it. The code is still there, on a domain owned by MAG Airports, and you can still use it to do multiple sign-ups with no verification.

So what are they doing wrong? Two things:

  1. Who in their right mind would allow unlimited sign-ups to a newsletter without verifying that the owner of the email address actually wanted it? (See the sketch of confirmed opt-in after this list.) Were they really born yesterday? Even one of the MD’s kids writing their web site wouldn’t have made such an elementary mistake.
  2. Their cyber-security incident reporting mechanisms need a lot of work. Companies that don’t have a quick way of hearing about security problems are obviously not doing themselves or the public any favours.
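For the avoidance of doubt, confirmed opt-in isn’t hard. A minimal sketch, with a hypothetical send_email() standing in for a real mailer – this is the general technique, not MAG’s actual code:

import secrets

pending = {}          # token -> address awaiting confirmation
mailing_list = set()  # only confirmed owners end up here

def send_email(to, body):
    print(f"would mail {to}: {body}")   # stand-in for a real mailer

def request_signup(email):
    token = secrets.token_urlsafe(32)
    pending[token] = email
    send_email(email, f"Confirm at https://example.com/confirm?t={token}")

def confirm(token):
    email = pending.pop(token, None)
    if email is None:
        return False            # unknown or reused token: nothing happens
    mailing_list.add(email)     # only the inbox owner can reach this line
    return True

Nobody gets spammed unless they can click a link that only the real owner of the inbox would have received.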

One assumes that MAG Airports doesn’t have any meaningful cybersecurity department; nor any half-way competent web developers. I’d be delighted to hear from them otherwise.

In the meantime, if you want to add all your enemies to their spamming list, here’s the URL format to do it:

Okay, perhaps not. But if it’s not fixed by the next time I’m speaking at a conference, it’s going on the demo list.


New DVLA on-line system is broken

Why can’t companies implementing government on-line systems ever get anything right? And if they must mess things up, why can’t they do it in private? The new DVLA system is broken. They ought to have tested it in-house instead of launching a beta version on the public. Seriously, do they not know what a beta version is for?

My experience – I went through and entered all the details, paid, and got this:


It’s now impossible to tell whether it’s taken payment from the card or not. Okay, this appears to be an external system that’s screwed up, BUT it’s not been handled properly. Basic rule of data communications: assume the link will be corrupted, and cope with it.
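The standard way to cope is an idempotency key: tag the payment with a unique token, and if the link dies, ask the gateway what happened to that token rather than guessing. A sketch against a hypothetical gateway API (charge() and lookup() are illustrative, not whatever the DVLA actually uses):

import uuid

def pay(gateway, amount_pence):
    key = str(uuid.uuid4())                # one key per logical payment
    try:
        return gateway.charge(amount_pence, idempotency_key=key)
    except ConnectionError:
        # The link broke mid-transaction: we genuinely don't know the outcome.
        if gateway.lookup(idempotency_key=key) == "settled":
            return "paid"                  # it went through; don't charge twice
        return gateway.charge(amount_pence, idempotency_key=key)

Done that way, the web page can always tell the user definitively whether their card was charged.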

Baofeng DMR handheld – the DM-5R

DM-5R Plus

In 2016 Baofeng released the DM-5R – what sounded like a fantastic DMR radio at a very attractive price. One of the best features was that it maintained the same form factor as the UV-5R, meaning accessories were cheap and plentiful. In fact it was completely compatible as an analogue transceiver, but with DMR too.

Only one huge problem – it only implemented Tier-1, which basically meant it could only talk to other DM-5Rs – not to the Motorola or Motorola-compatible Tier-2 units.

Suppliers insisted that Baofeng was going to release a software update for it. I’m on record elsewhere as being sceptical of this, as I’ve never seen a way to update the software on any Baofeng radio, even when they’ve introduced killer bugs into the wild.

Apparently I was wrong(-ish), and a firmware update has appeared for the promised $10. Furthermore, a DM-5R Plus has also turned up on the market, with the Tier-2 software already installed. I don’t have confirmed specifications (i.e. the unit in my hand), but there’s some question about the battery. Sometimes it’s listed as 1.5Ah, other times 2Ah. BL-5 battery packs (the UV-5R standard) are 1.8Ah. I really hope they haven’t been crazy enough to come up with a new battery format.

Battery aside, what’s not to like? If it’s Tier-2/Motorola compatible, then I’m sure I’ll love it. But how compatible is it? Questions remain. Take this announcement from DMR-UK (link likely to expire), quoting a Phoenix repeater keeper:

“I have now heard a station using the DM-5R on the Phoenix network. I can confirm that although the radio appeared to work (apart from having very low audio) it was actually occupying both time slots on the originating repeater. This confirms that even though the so-called Tier 2 update had been done it was still working as a Tier 1 radio.”

This is unattributed, and it’s not clear whether the transceiver was a DM-5R Plus or an upgraded DM-5R. I don’t even know if an upgraded DM-5R becomes identical to a 5R Plus. This will become clear over time.

That Baofeng didn’t get the complex firmware right first time comes as no surprise. But do I want to risk it? Only if they promise a free fix – and they really don’t have a good track record there.