Following on from Basic UNIX file commands, here’s a bit there wasn’t time for on changing metadata on files.
These two commands change a file's permissions and ownership. Permissions are information associated with a file that decides who can do what with it. This was once called the file mode, which is why the command is chmod (CHange MODe). Files also have owners, both individual users and groups of users, and the command to change these is chown (CHange OWNer).
chown is easiest, so I’ll start there. To make a file belong to fred the command is:
chown fred myfile
To change the owning group to accounts:
chown :accounts myfile
And to change both at once:
chown fred:accounts myfile
Changing a file's permissions is more tricky, and there are several ways of doing it, but this is probably the easiest to remember. You'll recall that each file has three sets of permissions: Owner, Group and Other. The permissions themselves are read, write and execute (i.e. it's an executable program).
chmod can set or clear a load of permissions in one go, and the format is basically who the permission applies to (user, group or other), then '+' or '-' for set or clear, then the permissions themselves. What? It's probably easier to explain with a load of examples:
chmod u+w myfile
Allows the user of the file to write to it (u means user/owner)
chmod g+w myfile
Allows any user in the group the file belongs to write to it.
chmod o+r myfile
Allows any user who is not in the file's group or the owner to read it. (o means "other").
You can combine these options
chmod ug+rw myfile
Allows the owner and the group to read and write the file.
chmod go-w myfile
Prevents anyone but the user from being able to modify the file.
If you want to run a program you've just written called myprog:
chmod +x myprog
If you don't specify anything before the +/-, chmod applies the change to everyone (subject to your umask).
You might notice an 'x' permission on a directory – in this case it means the directory is searchable (you can enter it and get at the files within) by whoever has the permission.
I was asked to explain basic Unix shell file manipulation commands, so here goes.
If you’re familiar with MS-DOS, or Windows CMD.EXE and PowerShell (or even CP/M) you’ll know how to manipulate files and directories on the command line. It’s tempting to think that the Unix command line is the same, but there are a few differences that aren’t immediately apparent.
There are actually two main command lines (or shells) for Unix: sh and csh. Others are mainly compatible with these two, with the most common clones being bash and tcsh respectively. Fortunately they’re all the same when it comes to basic commands.
Directory Concepts
Files are organised into groups called “directories”, which are often called “Folders” on Macintosh and Windows. It’s not a great analogy, but it’s visual on a GUI. Unlike the real world, a directory may contain additional directories as well as files. These directories (or sub-directories) can also contain files and more directories and so on. If you drew a diagram you’d end up with something looking like a tree, with directories being branches coming off branches and the files themselves being leaves. All good trees start with a root from which the rest branches off, and this is no different. The start of a Unix directory tree is known as the root.
Unix has a concept called the Current Working Directory. When a program is looking for a file it is assumed it will be found in the Working Directory if no other location is specified.
Users on a Unix system have an assigned Home Directory, and their Working Directory is initially set to this when they log on.
Users may create whatever files and sub-directories they need within their Home Directory, and the system will allow them to do whatever they want with anything they create as it's owned by them. It's possible for a normal user to see other directories on the system, in fact it's necessary, but generally they won't be able to modify files outside their home directory.
Here’s an example of a directory tree. It starts with the root, /, and each level down adds to the directory “path” to get to the directory.
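Something like this, for instance (a made-up layout – the directory "new" here would have the path /usr/home/fred/new):
/
├── bin
├── etc
└── usr
    └── home
        ├── fred
        │   ├── myfile.txt
        │   └── new
        └── jane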
If you want to know what your current working directory is, the first command we’ll need is “pwd” – or “Print Working Directory”. If you’re ever unsure, use pwd to find out where you are.
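For example (the path shown is just an illustration – you'll see your own home directory):
pwd
/usr/home/fred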
Unix commands tend to be short to reduce the amount of typing needed. Many are two letters, and few are longer than four.
The thing you’re most likely to want to do is see a list of files in the current directory. This is achieved using the ls command, which is a shortened form of LiSt.
Typing ls will list the names of all the files and directories, sorted into ASCII order and arranged into as many columns as will fit on the terminal. You may be surprised to see files that begin with "X" ahead of files beginning with "a", but upper case "X" has a lower ASCII value than lower case "a". Digits 0..9 have a lower value still.
ls has lots of flags to control its behaviour, and you should read the documentation if you want to know more of them.
If you want more detail about the files, pass ls the ‘-l’ flag (that’s lower-case L, and means “long form”). You’ll get output like this instead:
drw-r----- 2 fjl devs 2 Aug 28 13:17 Release
drw-r----- 2 fjl devs 29 Dec 26 2019 Debug
-rw-r----- 1 fjl devs 2176 Feb 17 2012 editor.h
-rw-r----- 1 fjl devs 28190 Feb 7 2012 fbas.c
-rw-r----- 1 fjl devs 10197 Feb 17 2012 fbas.h
-rw-r----- 1 fjl devs 5590 Feb 17 2012 fbasexpr.c
-rw-r----- 1 fjl devs 7556 Feb 3 2012 fbasheap.c
-rw-r----- 1 fjl devs 7044 Feb 4 2012 fbasio.c
-rw-r----- 1 fjl devs 4589 Feb 3 2012 fbasline.c
-rw-r----- 1 fjl devs 4069 Feb 3 2012 fbasstr.c
-rw-r----- 1 fjl devs 4125 Feb 3 2012 fbassym.c
-rw-r----- 1 fjl devs 13934 Feb 3 2012 fbastok.c
drw-r----- 3 fjl devs 3 Dec 26 2019 ipch
-rw-r----- 1 fjl devs 3012 Feb 17 2012 token.h
I’m going to skip the first column for now and look at the third and fourth.
fjl devs
This shows the name of the user who owns the file, followed by the group that owns the file. Unix users can be members of groups, and sometimes it’s useful for a file to be effectively used by a group rather than one user. For example, if you have an “accounts” group and all your accounts staff belong to it, a file can be part of the “accounts” group so everyone can work on it.
Now we’ve covered users and groups we can return to the first column. It shows the file flags, which are various attributes concerning the file. If there’s a ‘-’ then the flag isn’t set. The last nine flags are three sets of three permissions for the file.
The first set are the file owner’s permissions (or rights to use the file).
The second set are the file group’s permissions.
The third are the permissions for any user who isn’t either the owner or in the file’s group
Each group of three represents, in order:
r – can the file be read (but not changed).
w – can the file be written to. If not set it means you can only read it.
x – can the file be run (i.e. is it a program).
So:
– rw- --- --- means only the user can read/write the file.
– rw- r-- --- means the file can be read by anyone in its group but only written to by the owner.
– rwx r-x --- means the file is a program, probably written by its owner. Others in the group can run it, but no one else can even read it.
There are other special characters that might appear in the first field for advanced purposes but I'm covering the basics here, and you could write a book on ls.
I've missed off the first '-', which isn't a permission but indicates the type of the file. If it's a '-' it's just a regular file. A 'd' means it's actually a directory. You'll sometimes see 'c' and 's' on modern systems, which are device nodes (disk drives, terminals and so on) and network sockets. Unix treats everything like a file so disk drives and network sockets can be found in the directory tree too. You'll probably see 'l' (lower case L) which means it's a symbolic link – a bit like a .LNK file in Windows.
This brings us to the second column, which is a number. It is the number of links to the file – how many times it appears in the directory tree – and in most cases this will be one. I'll deal with links later.
The last three columns should be easy to guess: Length, date and finally the name of the file; at least in the case of a regular file.
There are many useful and not so useful options supported by ls. Here are a few that might be handy.
-d
By default, if you give ls a directory name it will show you the contents of the directory. If you want to see the directory itself, most likely because you want to see its permissions, specify -d.
-t
Sort output by time instead of ASCII
-r
Reverse the order of sort. -rtl is useful as it will sort your files with the newest at the end of the list.
-h
Instead of printing a file size in bytes, which could be a very long number, display it in "human readable" format, which restricts it to three characters followed by a suffix: B=bytes, K=KB, M=MB and so on.
-F
This is very handy if you’re not using -l, as with just the name printed you can’t tell regular and special files apart. This causes a single character to be added to the file name: ‘*’ means it’s a program (has the x flag set), ‘/’ means it’s a directory and ‘@’ means it’s a symbolic link. Less often you’ll see ‘=’ for a socket, ‘|’ for a FIFO (obsolete) and ‘%’ for a whiteout file (insanity involving union mounts).
Finally, ls takes arguments. By default it lists everything in the current directory; if you give it a list of files and directories, it will just list those.
Where src is a directory, ls src will list the files in that directory. (Remember ls -d if you just want information on the directory itself.)
ls src
List everything in the src and obj directories:
ls src obj
Now you can find the names of the files, how do you look at what’s in them? To display the contents of a text file the simple method is cat.
cat test.c
This prints the contents of test.c. You might just want to see the first few lines, so instead try:
head test.c
Only the first ten lines (by default) are printed.
If you want to see the last ten lines, try:
tail test.c
If you want to go down the file a screen full at a time, use:
more test.c
It stops and waits for you to press the space bar after every screen. If you’ve read enough, hit ‘q’ to quit.
less test.c
This is the latest greatest file viewer and it allows you to scroll up and down a file using the arrow keys. It’s got a lot of options.
So far we've stayed in our home directory, where we have kept all our files. But eventually you're going to need to organise your files into a hierarchical structure of directories.
To make a directory called “new” type:
mkdir new
This is a mnemonic for "make directory".
To change your working directory use the chdir command (Change Directory)
chdir new
Most people use the abbreviated synonym for chdir, “cd”, so this is equivalent:
cd new
Once you’re there, type “pwd” to prove we’ve moved:
pwd
If you type ls now you won’t see any files, because it’s empty.
You can also specify the directory explicitly, such as:
cd /usr/home/fred/new
If you don’t start with the root ‘/’, cd will usually start looking for the name of the new directory in the current working directory.
To move back one level up the directory tree use this command:
cd ..
You’ll be back in your home directory.
To get rid of the “new” directory use the rmdir command (ReMove DIRectory)
rmdir new
This only works on empty directories, so if there were any files in it you’d have to delete them first. There are other more dangerous commands that will destroy directories and all their contents but it’s better to stick with the safer ones!
To remove an individual file use the rm (ReMove) command, in this case the file being named “unwanted”:
rm unwanted
Normally files are created by applications, but if you want a file to experiment on the easiest way to create one is “touch filename”, which creates an empty file called “filename”. You can also use the echo command:
echo "This is my new text file, do you like it?" > myfile.txt
Echo prints stuff to the screen, but “> myfile.txt” tells Unix to put the output of the echo command into “myfile.txt” instead of displaying it on the screen. We’ll use “echo” more later.
You can display the contents with:
cat myfile.txt
One thing you’re going to want to do pretty soon is copy a file, which is achieved using the cp (CoPy) command:
cp myfile.txt copy-of-myfile.txt
This makes a copy of the file and calls it copy-of-myfile.txt
You can also copy it into a directory
mkdir new
cp myfile.txt new
To see it there, type:
ls -l new
To see the original and the copy, try:
ls -l myfile.txt new
If you wanted to delete the copy in “new” use the command:
rm new/myfile.txt
Perhaps, instead of copying your file into “new” you wanted to move it there, so you ended up with only one copy. This is one use of the mv (MoVe) command:
mv myfile.txt new
The file will disappear from your working directory and end up in “new”.
How do you rename a file? There’s no rename command, but mv does it for you. When all is said and done, all mv is doing is changing the name and location of a file.
cd new
mv myfile.txt myfile.text
That’s better – much less Microsoft, much more Unix.
Wildcards
So far we’ve used commands on single files and directories, but most of these commands work with multiple files in one go. We’ve given them a single parameter but we could have used a list.
For example, if we wanted to remove three files called “test”, “junk” and “foo” we could use the command:
rm test junk foo
If you're dealing with a lot of files you can have the shell create a list of names instead of typing them all. You do this by specifying a sort of "template", and all the files matching the template will be added to the list.
This might seem the same as Windows, but it’s not – be careful. With Windows the command does the pattern matching according to its context, but the Unix shell has no context and you may end up matching more than you intended, which is unfortunate if you’re about to delete stuff.
The matching against the template is called globbing, and uses the special characters '*' and '?' in its simplest form.
‘?’ matches any single character, whereas ‘*’ matches zero or more characters. All other characters except ‘[‘ match themselves. For example:
“?at” would match cat, bat and rat. It would not match “at” as it must have a first character. Neither will it match “cats” as it’s expecting exactly three characters.
“cat*” would match cat, cats, caterpillar and so on.
“*cat*” would match all of the above, as well as “scatter”, “application” and “hellcat”.
You can also specify a list of allowable letters to match between square brackets [ and ], which means any single character will do. You can specify a range, so [0-9] will match any digit. Putting a ‘!’ in front negates the match, so [!0-9] will match any single character that is NOT a digit. If you want to match a two-digit number use [0-9][0-9].
To test globbing out safely, I recommend using the echo command. It works like this:
echo Hello world
This prints out Hello world. Useful, eh? But technically what it’s doing is taking all the arguments (aka parameters) one by one and printing them. The first argument is “Hello” so it prints that. The second is “world” so it prints a space and prints that, until there are no arguments left.
Suppose we type this:
echo a*
The Unix shell globs it using the * special character and produces a list of all files that start with the letter 'a'.
You can use this, for example, to specify all the ‘C’ files ending in .c:
echo *.c
If you want to include .h files in this, use
echo *.c *.h
Practice with echo to see how globbing works as it’s non-destructive!
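For instance, assuming you had files with names like chapter01.txt, chapter02.txt and so on (purely hypothetical names):
echo chapter[0-9][0-9].txt
echo [!0-9]*
The first matches the numbered chapter files; the second matches everything that doesn't start with a digit.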
You can also use ls, although this goes on to expand directories into their contents, which can be confusing.
When you have a command that takes a source and destination, such as cp (CoPy), it will interpret everything in the list as a file to be processed apart from the last, which it will expect to be a directory. For example:
cp test junk foo rubbish
Will copy “test”, “junk” and “foo” into an existing directory rubbish.
Now for a practical example. Suppose you have a ‘C’ project where everything is in one directory. .c files, .h files, .o files as well as the program itself. You want to sort this out so the source is in one directory and the objects in another.
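One way of going about it, as a sketch (assuming everything is in the current directory and you want the new directories called src and obj):
mkdir src obj
echo *.c *.h
mv *.c *.h src
mv *.o obj
The echo line is just there to check what the globs will match before committing to the mv.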
Although some of this is BSD specific, the principles apply to any Unix or Linux.
When you install your Unix-like OS across several disks, either with a mirror or RAID system (particularly ZFS RAIDZ), you'll be asked if you want to set up a swap partition, and if you want it mirrored.
The default (for FreeBSD) is to add a swap partition on every disk and not mirror it. This is actually the most efficient configuration apart from having dedicated swap drives, but is also a spectacularly bad idea. More on this later.
What is a swapfile/drive anyway?
The name is a hangover from early swapping multi-tasking systems. Only a few programs could fit in main memory, so when their time allocation ran out they were swapped with others on a disk until it was their turn again.
These days we have "virtual memory", where a Memory Management Unit (MMU) arranges things so that blocks of memory known as pages are stored on disk when not in use and automatically loaded when needed again. This is much more effective than swapping out entire programs but needs MMU hardware, which was once complex, slow and expensive.
So the swap partition should really be called the paging partition now, and Microsoft actually got the name right on Windows. But we still call it the swap partition.
What you need to remember is that parts of a running program's memory may be in the swap partition instead of RAM at any time, and that includes parts of the operating system.
Strategies
There are several ideas for swap partitions in the 2020s.
No swap partition
Given RAM is so cheap, you can decide not to bother with one, and this is a reasonable approach. Virtual memory is slow, and if you can, get RAM instead. It can still pay to have one though, as some pages of memory are rarely, if ever, used again once created. Parts of a large program that aren’t actually used, and so on. The OS can recognise this and page them out, using the RAM for something useful.
You may also encounter a situation where the physical RAM runs out, which will mean no further programs can be run and those already running won't be able to allocate any more. This leads to two problems: firstly, "developers" don't often program for running out of memory and their software doesn't handle the situation gracefully. Secondly, if the program you need to run is your login shell you'll be locked out of your server.
For these reasons I find it better to have a swap partition, but install enough RAM that it’s barely used. As a rule of thumb, I go for having the same swap space as there is physical RAM.
Dedicated Swap Drive(s)
This is the classic gold standard. Use a small, fast (and expensive) drive, preferably short-stroked, so your virtual memory goes as fast as possible. If you're really using VM this is probably the way to go, and having multiple dedicated drives spreads the load and increases performance.
Swap partition on single drive
If you’ve got a single drive system, just create a swap partition. It’s what most installers do.
Use a swap file
You don’t need a drive or even a partition. Unix treats devices and files the same, so you can create a normal file and use that.
You can swap on any number of files or drives, and use “swapoff” to stop using a particular one.
Unless you’re going for maximum performance, this has a lot going for it. You can allocate larger or smaller swap files as required and easily reconfigure a running system. Also, if your file system is redundant, your swap system is too.
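On FreeBSD, creating and enabling a swap file goes something like this (the path and size are only examples – check the Handbook for your release):
truncate -s 4G /usr/swap0
chmod 0600 /usr/swap0
echo 'md99 none swap sw,file=/usr/swap0,late 0 0' >> /etc/fstab
swapon -aL
swapinfo -h will then show you what swap is actually in use.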
Multiple swap partitions
This is what the FreeBSD installer will offer by default if you set up a ZFS mirror or RAIDZ. It spreads the load across all drives. The only problem is that the whole point of a redundant drive system is that it will keep going after a hardware failure. With a bit of swap space on every drive, the system will fail if any of the drives fails, even if the filing system carries on. Any process with RAM paged out to swap gets knocked out, including the operating system. It’s like pulling out RAM chips and hoping it’s not going to crash. SO DON’T DO IT.
If you are going to use a partition on a data drive, just use one. On an eight-drive system the chance of a failure on one of eight drives is eight times higher than on one specific unit, so you reduce the probability of failure considerably by putting all your eggs in one basket. Counterintuitive? Consider that if one basket falls on a distributed swap, they all do anyway.
Mirrored swap drives/partitions
This is sensible. The FreeBSD installer will do this if you ask it, using geom mirror. I've explained gmirror in posts passim, and there is absolutely no problem mixing it with ZFS (although you might want to read earlier posts to avoid complications with GPT). But the installer will do it automatically, so just flip the option. It's faster than a swap file, although this will only matter if your job mix actually uses virtual memory regularly. If you have enough RAM, it shouldn't.
You might think that mirroring swap drives is slower – and to an extent it is. Everything has to be written twice, and the page-out operation will only complete when both drives have been updated. However, on a page-in the throughput is doubled, given the mirror can read from either drive to satisfy the request. The chances are there will be about the same number of page-ins, or slightly more, so it's not the huge performance hit it might seem at first glance.
Summary
No swap
Pros: Simple; fastest.
Cons: Wastes RAM; can lead to serious problems if you run out of RAM.

Dedicated swap drive(s)
Pros: Simple; optimal performance.
Cons: Each drive is a single point of failure for the whole system.

Multiple swap partitions
Pros: Improved performance; lower cost than dedicated drives.
Cons: Each drive is a single point of failure for the whole system.

Single swap partition (multi-drive system)
Pros: Simple; lower probability of the single point of failure occurring.
Cons: Reduced performance; still has a single point of failure.

Mirrored drives or partitions
Pros: No single point of failure for the whole system.
Cons: Reduced performance.

Swap file
Pros: Flexible, even on a live system; redundancy the same as the drive array.
Cons: Reduced performance.
Quick summary of different swap/paging device strategies.
Conclusion
Having swap partitions on multiple drives increases your risk of a fault taking down a server that would otherwise keep running. Either use mirrored swap partitions/drives, or use a swap file on redundant storage. The choice depends on the amount of virtual memory you use in normal circumstances.
There are two mysterious things on ZFS that cause a lot of confusion: the ZIL and the SLOG. This article is about what they are and why you should care, or not care, about them. But I'll come to them later. Instead I'll start with POSIX, and what it says about writing stuff to disk files.
When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.
With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won't get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn't guarantee it's written, but that's the expectation.
You'd want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.
Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).
To get around this, ZFS maintains a ZIL – ZFS Intent Log.
In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it's not a physical thing; it's a concept or list. It contains transaction groups that need to be completed in order.
The ZIL can be physically stored in three possible places…
In-Memory (Volatile Storage):
This is the default location. Initially, all write operations are buffered in RAM. This is where they are held before being committed to persistent storage. This kind of ZIL is volatile because it's not backed by any permanent storage until written to disk.
Volatility doesn't matter, because ZFS guarantees consistency with transaction groups (TXGs). If the power goes off and the in-RAM ZIL is lost, the transactions are simply never applied; but the file system is in a consistent state.
In-Pool (Persistent Storage):
Without a dedicated log device, the ZIL entries are written to the main storage pool in transaction groups. This happens for both synchronous and asynchronous writes but is more critical for synchronous writes to ensure data integrity in case of system crashes or power failures.
SLOG (Separate Intent Log Device):
For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically low-latency, high-speed storage such as a short-stroked Raptor, enterprise SSD or NVRAM. ZFS writes the log entries there before they're committed to the pool's main storage.
By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction is stored in non-volatile storage before the write returns, and if you're doing a lot of synchronous writes then storing them on a high-speed SLOG device helps. But only if the SLOG device is substantially faster than an array of standard drives. And it only matters if you do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still.
I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.
If you still think you need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricey, but it doesn't need to be very large. You can add it to a zpool with:
zpool add poolname log /dev/daX
Where daX is the drive name, obviously.
As I mentioned, the SLOG doesn't need to be large at all. It only has to cope with five seconds of writes, as that's the maximum amount of time data is "allowed" to reside there. If you're using NFS over 10Gbit Ethernet the throughput isn't going to be above 1.25GB a second. Assuming that's flat-out synchronous writes, multiplying that by five seconds is less than 8GB. Any more would be unused.
If you've got a really critical system you can add mirrored SLOG drives to a pool thus:
zpool add poolname log mirror /dev/daX /dev/daY
You can also remove them with something like:
zpool remove poolname /dev/daY
This may be useful if adding an SLOG doesn't give you the performance boost you were hoping for. It's very niche!
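If you want to see whether the log device is actually earning its keep, zpool iostat can break activity down per vdev (poolname being whatever your pool is called):
zpool iostat -v poolname 5
The SLOG shows up as its own line under "logs", so you can watch how much synchronous write traffic it's really absorbing.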
I’ve had a problem with mysql failing to start with error 35 in the log file:
InnoDB: Unable to lock ./ibdata1, error: 35 InnoDB: Check that you do not already have another mysqld process InnoDB: using the same InnoDB data or log files.
What to do? Google and you get a lot of Linux people saying that the answer is to reboot the box. Hmm. Well you don’t have to.
What causes the error is mysqld crashing, usually when system resources are exhausted. Rebooting will, indeed, unlock ibdata1 but so will killing the process that locks it. Yet the server isn’t running, so how can this be? Well actually part of it is – just not the part the service manager sees.
Run “ps -auxww | grep mysql” and you’ll find a few more. Send them a kill, wait for it to work and then restart. Obviously you can only do this and expect it to work if you’ve sorted out the resource problem.
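As a rough sketch (the PID and service name are examples – use whatever ps actually shows and however your system starts MySQL):
ps auxww | grep mysqld
kill 12345
service mysql-server start
Give the stray process a few seconds to exit before restarting; only reach for kill -9 if it really won't die.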
This is about how to run Docker on Debian Linux, not why you should want to. But it deserves an answer.
Supposing you're running FreeBSD and someone really, really, really wants you to run something that's only available as a Docker container? The only practical way is on a Linux VM running under bhyve. RHEL is expensive (and I no longer have an employer willing to stand me a developer licence), CentOS is no more. If you want to stay mainstream that leaves Debian and Arch. In my experience, Debian runs easily enough under bhyve, so Debian it is.
So log in to your new Debian installation as root and run the following, which took a while to work out so this is really a cheat sheet…
apt update
apt install curl ca-certificates
# Get docker GPG key (make sure the keyrings directory exists and the key is readable)
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg \
 -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
# This adds the latest Docker repo info to your APT sources list
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" \
| tee /etc/apt/sources.list.d/docker.list > /dev/null
apt update
# Finally install Docker
apt install docker-ce docker-ce-cli containerd.io -y
# You can check it's there by running docker --version
systemctl enable docker
You can check it's running with systemctl status docker, and stop it with systemctl stop docker.
If you’re going to run this as a non-root user (probably a good idea) you’ll probably need to add yourself to the docker group:
usermod -aG docker your-user-id
This is just the Linux way of adding you to the /etc/group file.
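You'll need to log out and back in for the new group membership to take effect. After that, a quick smoke test using Docker's tiny official test image:
docker run hello-world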
If you’re in the data recovery, forensics or just storage maintenance business (including as an amateur) you probably already know about ddrescue. Released about twenty years ago by Antonio Diaz Diaz, it was a big improvement over the original concept dd_rescue from Kurt Garloff in 1999. They copy disk images (which are just files in Unix) trying to get as much data extracted when the drive itself has faults.
If you’re using Windows rather than Unix/Linux then you probably want to get someone else to recover your data. This article assumes FreeBSD.
The advantage of using either of these over dd or cp is that they expect to find bad blocks in a device and can retry or skip over them. dd will either give up at the first error or, if told to ignore errors, plough on without keeping track of what was lost, and cp will just stop. ddrescue is particularly good at retrying failed blocks, and reducing the block size to recover every last readable scrap – and it treats mechanical drives that are on their last legs as gently as possible.
If you're new to it, the manual for ddrescue can be found at https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html
However, for most use cases the command is simple. Assuming the device you want to copy is /dev/da1 and you're calling it thumbdrive, the command would be:
ddrescue /dev/da1 thumbdrive.img thumbdrive.map
The device data would be stored in thumbdrive.img, with ongoing state information stored in thumbdrive.map. This state information is important, as it allows ddrescue to pick up where it left off.
However, ddrescue was written before USB flash drives (pen drives, thumb drives or whatever you call them). That's not to say it doesn't work, but they have a few foibles of their own. It's still good enough that I haven't modified the ddrescue base code to cope, instead using a bit of a shell script to do the necessary.
USB flash drives seem to fail in a different way to Winchester disks. If a block of Flash EPROM can’t be read it’s going to produce a read error – fair enough. But they have complex management software running on them that attempts to make Flash EPROM look like a disk drive, and this isn’t always that great in failure mode. In fact I’ve found plenty of examples where they come across a fault and crash rather than returning an error, meaning you have to turn them off and on to get anything going again (i.e. unplug them and put them back in).
So it doesn't matter how clever ddrescue is – if it hits a bad block and the USB drive controller crashes then it's going to be waiting forever for a response, and you'll just have to come and reset everything manually and resume. One of the great features of ddrescue is that it can be stopped and restarted at any time, so continuing after this happens is "built in".
In reality you’re going to end up unplugging your USB flash drive many times during recovery. But fortunately, it is possible to turn a USB device off and on again without unplugging it using software. Most USB hardware has software control over its power output, and it’s particularly easy on operating systems like FreeBSD to do this from within a shell script. But first you have to figure out what’s where in the device map – specifically which device represents your USB drive in /dev and which USB device it is on the system. Unfortunately I can’t find a way of determining it automatically, even on FreeBSD. Here’s how you do it manually; if you’re using a version of Linux it’ll be similar.
When you plug a USB storage device into the system it will appear as /dev/da0 for the first one, /dev/da1 for the second and so on. You can read/write to this device like a file. Normally you'd mount it so you can read the files stored on it, but for data recovery this isn't necessary.
So how do you know which /dev/da## is your media? The easy way to tell is that it'll appear on the console when you first plug it in. If you don't have access to the console it'll be in /var/log/messages. You'll see something like this:
Jun 10 17:54:24 datarec kernel: umass0 on uhub5
kernel: umass0: <vendor 0x13fe USB DISK 3.0, class 0/0, rev 2.10/1.00, addr 2> on usbus1
kernel: umass0: SCSI over Bulk-Only; quirks = 0x8100
kernel: umass0:7:0: Attached to scbus7
kernel: da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
kernel: da0: < USB DISK 3.0 PMAP> Removable Direct Access SPC-4 SCSI device
kernel: da0: Serial Number 070B7126D1170F34
kernel: da0: 40.000MB/s transfers
kernel: da0: 59088MB (121012224 512 byte sectors)
kernel: da0: quirks=0x3
kernel: da0: Write Protected
So this is telling us that it's da0 (i.e. /dev/da0).
The hardware identification is "<vendor 0x13fe USB DISK 3.0, class 0/0, rev 2.10/1.00, addr 2> on usbus1", which means it's on USB bus 1, address 2.
You can confirm this using the usbconfig utility with no arguments:
ugen5.1: at usbus5, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
...snip...
ugen1.1: at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
ugen1.2: at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (300mA)
There it is again, last line.
usbconfig has lots of useful commands, but the ones we're interested in are power_off and power_on. No prizes for guessing what they do. However, unless you specify a target it'll switch off every USB device on the system – including your keyboard, probably.
There are two ways of specifying the target, but I'm using the -d method. We're after device 1.2 so the target is -d 1.2.
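So, to power-cycle just this one device:
usbconfig -d 1.2 power_off
usbconfig -d 1.2 power_on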
Try it and make sure you can turn your USB device off and on again. You'll have to wait for it to come back online, of course.
There are ways of doing this on Linux by installing extra utilities such as hub-ctrl. You may also be able to do it by writing stuff to /sys/bus/usb/devices/usb#/power/level” – see the manual that came with your favourite Linux distro.
The next thing we need to do is provide an option for ddrescue so that it actually times out if the memory stick crashes. The default is to wait forever. The --timeout=25 or -T 25 option (depending on which form you prefer) sees to that, making it exit if it hasn't managed a successful read for 25 seconds. This isn't entirely what we're after, as a failed read would also indicate that the drive hadn't crashed, and unfortunately there's no such tweak for ddrescue – but failed reads tend to be quick, so you'd expect a good read within a reasonable time anyway.
So as an example of putting it all into action, here’s a script for recovering a memory stick called duracell (because it’s made by Duracell) on USB bus 1 address 2.
#!/bin/sh
while ! ddrescue -T 25 -u /dev/da0 duracell.img duracell.map
do
    echo ddrescue returned $?
    usbconfig -d 1.2 power_off
    sleep 5
    usbconfig -d 1.2 power_on
    sleep 15
    echo Restarting
done
A few notes on the above. Firstly, ddrescue's return code isn't defined. However, it appears to do what one might expect, so the above loop will drop out if it ever completes. I've set the timeout for time since last good read to 25 seconds, which seems about right. Turning off the power for 5 seconds and then waiting 15 seconds for the system to recognise it may be a bit long – tune as required. I'm also using the -u option to tell ddrescue to only go forward through the drive as it's easier to read the status when it's always incrementing. Going backwards and forwards makes sense with mechanical drives, but not flash memory.
Aficionados of ddrescue might want to consider disabling scraping and/or trimming (probably trimming) but I've seen it recover data with both enabled. Data recovery is an art, so tweak away as you see fit – I wanted to keep this example simple.
Now this system isn't perfect. I'm repurposing ddrescue, which does a fine job on mechanical drives, to recover data from a very different animal. I may well write a special version for USB flash drives but this method does actually work quite well. Let me know how you get on.
How do you force a string into proper case in a Unix shell script? (That is to say, capitalise the first letter and make the rest lower case). Bash4 has a special feature for doing it, but I’d avoid using it because, well, I want to be Unix/POSIX compatible.
It's actually very easy once you've realised tr won't do it all for you. The tr utility has no concept of where it is in the input stream, but combining tr with cut works a treat.
I came across this problem when I was writing a few lines to automatically create directory layouts for interpreted languages (in this case the Laminas framework). Languages of this type like capitalisation of class names, but other names have to be lower case.
Before I get started, a note about expressing character ranges in tr. Unfortunately different systems have done it in different ways. The following examples assume BSD Unix (and POSIX). Unix System V required ranges to be in square brackets – e.g. A-Z becomes "[A-Z]". And the quotes are absolutely necessary to stop the shell globbing once you've introduced the square brackets!
Also, if you're using a strange character set, consider using "[:lower:]" and "[:upper:]" instead of A-Z if your version of tr supports it (most do). It's more compatible with foreign character sets although I'd argue it's not so easy on the eye!
Anyway, these examples use A-Z to specify ASCII characters 0x41 to 0x5A – adjust to suit your tr if your Unix is really old.
To convert a string ($1) into lower case, use this:
lower=$(echo $1 | tr A-Z a-z)
To convert it into upper case, use the reverse:
upper=$(echo $1 | tr a-z A-Z)
To capitalise the first letter and force the rest to lower case, split using cut and force the first character to be upper and the rest lower:
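Something along these lines should do it (the variable names are just for illustration):
first=$(echo $1 | cut -c1 | tr a-z A-Z)
rest=$(echo $1 | cut -c2- | tr A-Z a-z)
proper="$first$rest"
So "fRED" becomes "Fred".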
This is tested on FreeBSD in /bin/sh, but should work on all BSD and bash-based Linux systems using international character sets.
You could, if you wanted to, use sed to split up a multi-word string and change each word to proper case, but I’ll leave that as an exercise to the reader.
Unless you’ve been living in a very Linux-free environment for a while, you’ll know about systemd – the collection of daemons intended to replace the System V init system commonly found on Linux, with something more complicated. I’m not a fan of System V startup, but they might have done better by going for the Research Unix or BSD /etc/rc approach for robustness, simplicity and compatibility. But Linux, to many, is a launcher stub for graphical desktops running LibreOffice and games, and these probably work better with systemd syntax when controlled by a simple GUI.
Systemd is more than an init system – in fact it has daemons for everything from the keyboard to DNS resolution – and network interface configuration (networkd).
This nightmare came out of Red Hat, and Linux distributions like Debian, Ubuntu, openSUSE, Arch, and their derivatives have started using it. One result, amongst other things, is that it’s suddenly not possible to configure networks the way you used to using ifconfig and /etc/resolv.conf.
You can install the missing ifconfig and suchlike using a package called net-tools, which is present on most major Linux distributions and is installed in the appropriate way (dnf, apt, yum etc). This may be the best way to keep scripts working.
Otherwise, you might be hoping systemd-networkd has simplified things, with less to type. But I’m afraid not.
So for those who are struggling, here’s a cheat sheet.
Names
The first thing you'll have to remember is that systemd-networkd doesn't call your Ethernet interfaces eth0, eth1 and so on. It doesn't even call them by their driver name+number, BSD style. Instead it munges a name from indices provided by the firmware, PCIe slot number and even the MAC address. Look out for some very strange interface names.
The idea is that the NIC/port has a predictable name, which is great in theory. I can see two problems: Firstly this doesn’t really help you find the RJ45 any better unless you have a schematic. Secondly, if you pull the system from one host and put it in another it all goes to hell in a handcart anyway. On the plus side I guess it means that adding or removing a NIC isn’t going to change the name of the existing ports.
For what it's worth, eno# is an onboard device, ens# is a hot-plug/PCI slot index, and enp#s# encodes the PCI bus and slot number. enx followed by the MAC address (e.g. enx001122334455) is also possible, but this behaviour seems to be turned off on most systems. If it can't determine anything it will fall back to eth#.
There are ways of selecting the old behaviour using kernel parameters or knobbling the /etc/systemd/network/… something "default", depending on the system, but you should check that out in the man page. Oh, hang on, this is Linux – there are probably no man pages.
Cheat Sheet
Old → New
ifconfig eth0 192.168.1.2/24 → ip addr add 192.168.1.2/24 dev eth0
ifconfig eth0 192.168.1.2 delete → ip addr del 192.168.1.2/24 dev eth0
ifconfig eth0 netmask 255.255.255.0 → ? set address and netmask together, as above ?
ifconfig eth0 mtu 5000 → ip link set eth0 mtu 5000
ifconfig eth0 down (or up) → ip link set eth0 down (or up)
ifconfig → ip a
netstat → ss
netstat -r → ip route show
route → ip r
route add default 192.168.1.254 → ip route add default via 192.168.1.254
arp -a → ip n
ifconfig eth0 name wan1 → ? not possible from the command line ?
The last entry in the table is about renaming an interface, which given the user-hostile names now generated is even more useful. I haven’t figured out how to do this from the command line, but the assumption is that all interface configuration is done in configuration files by default, which brings us neatly on to these.
Configuring at startup
At one time you could just edit /etc/network/interfaces, and it might still work (it does in the latest Debian, for example). In BSD you stick simple definitions in rc.conf, but that's too easy. Anyway, /etc/network/interfaces could look something like this:
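Purely as an illustration (the interface name and addresses are made up):
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.1.2
    netmask 255.255.255.0
    gateway 192.168.1.254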
After editing the configuration file(s) you could restart:
/etc/init.d/networking [start | stop | restart]
But some systemd Linux distributions are different. Systemd-networkd has a directory tree full of configuration stuff and I can only scratch the surface here.
Basically a load of *.network files stored in /etc/systemd/network/ get run in sort order. It's normal to prefix each file with two digits and a dash to set this order. I don't think there's any reason not to use a single file, but in the Linux world people don't, often choosing to make the rest of the filename the NIC name, such as "04-enp0s5.network", although the name you choose is only for your reference (or that of some GUI configuration tool).
To force every NIC to configure using dhcp create a file 02-dhcpall.network:
[Match]
Name=en*
[Network]
DHCP=yes
Note the wildcard in Name=en*
On the other hand if you want to make one specific card static, have a file which you might want to call 01-enp5s2.network:
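Something along these lines (the interface name and addresses are examples):
[Match]
Name=enp5s2

[Network]
Address=192.168.1.2/24
Gateway=192.168.1.254
DNS=192.168.1.1 8.8.8.8
Domains=example.com test.example.com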
This should be fairly self-explanatory. You can specify multiple Address= lines (aliases) but for some reason DNS servers tend to be listed on one line, although multiple lines do work in my experience. I’ve used IPv4 in the examples but IPv6 works too.
Domains=example.com test.example.com is basically the DNS search domain list (as normally found in resolv.conf). As systemd has its own resolver, systemd-resolved, it's no longer just a matter of editing one file, and it's also less flexible.
You can restart systemd-networkd with:
systemctl restart systemd-networkd
If you haven’t made any mistakes you might still be connected to your server.
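To see what systemd-networkd makes of your interfaces, networkctl is the companion tool (the interface name here is just an example):
networkctl list
networkctl status enp5s2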
Intel's latest wheeze for its CPUs is Software Defined Silicon (SDSi). The deal is that you buy the CPU at one price and then pay extra for a license to enable more stuff.
Basically, the CPU has an interface that you can access if you have an Authentication Key Certificate (AKC) and have purchased a Capability Activation Payload (CAP) code. This will then enable extra stuff that was previously disabled. Quite what the extra stuff is remains to be seen – it could be extra instructions or enabling extra cores on a multi-core chip, or enabling more of the cache. In other words, you buy extra hardware that’s disabled, and pay extra to use it. What’s even more chilling is that you could be continuously paying licenses for the hardware you’ve bought or it’ll stop working.
It's not actually defining the silicon in software like an FPGA, as you'd expect from the euphemistic name. Software Defined Uncrippling would be more honest, but a harder sell.
But this is nothing new. I remember IBM doing this with disk drives in the 1970's. If you upgraded your drive to double the capacity an IBM tech turned up and removed a jumper, enabling the remaining cylinders. Their justification was that double the capacity meant double the support risk – and this stuff was leased.
Fast forward 20 years to Intel CPUs. Before the Intel 80486 chips you could provide whatever input clock you wanted to your 80386, just choosing how fast it went. Intel would guarantee the chip to run at a certain speed, but that was the only limiting factor. Exceed this speed at your own risk.
The thing was that the fast and slow CPUs were theoretically identical. It's often the case with electronic components. However, manufacturing tolerances mean that not all components end up being the same, so they're batch tested when they come off the line. Those that pass the toughest test get stamped with a higher speed and go in the fast bucket, where they're sold for more. Those that work just fine at a lower speed go into the slower bucket and sell for less. Fair enough. Except…
It's also the nature of chip manufacture that the process improves over time, so more of the output meets the higher test – eventually every chip is a winner. You don't get any of the early-run slow chips, but you're contracted to sell them anyway. The answer is to throw some of the fast chips into the slow bucket and sell them cheap, whilst selling others at the premium price to maintain your margins.
In the early 1990's I wrote several articles about how to take advantage of this in PCW, after real-world testing of many CPUs. It later became known as overclocking. I also took the matter up with Intel at the time, and they explained that their pricing had nothing to do with manufacturing costs, and everything to do with supply and demand. Fair enough – they were honest about it. This is why AMD gives you more bang-per-buck – they choose to make things slightly better and cheaper because that maximises their profits too.
With the introduction of the 80486, the CPU clock speed was set in the package so the chip would only run at the speed you paid for. SDSi is similar, except you can adjust the setting by paying more at a later date. It also makes technical sense – producing large quantities of just one chip has huge economies of scale. The yield improves, and you just keep the fab working. In order to have a product range you simply knobble some chips to make them less desirable. And using software to knobble them is the ultimate, as you can decide at the very last minute how much you want to sell the chip for, long after it's packaged and has left the factory.
All good? Well not by me. This only works if you're in a near monopoly position in the first place. Microsoft scalps its customers with licenses and residual income, and Intel wants in on that game. It's nothing about being best, it's about holding your customers to ransom for buying into your tech in the first place. This hasn't hurt Microsoft's bottom line, and I doubt it'll hurt Intel's either.