Unix chmod and chown

Following on from Basic UNIX file commands, here’s a bit there wasn’t time for on changing metadata on files.

These two commands change the files permissions and ownership. Permissions is information associated with a file that decides who can do what with it. This was once called the file mode, which is why the command is chmod (Change MODe). Files also have owners, both individuals and group of users, and the command to change this is chown (CHOWNer).

chown is easiest, so I’ll start there. To make a file belong to fred the command is:

chown fred myfile

To change the owning group to accounts:

chown :accounts myfile

And to change both at once:

chown fred:accounts myfile

Changing a files permissions is more tricky, and there are several ways of doing it, but this is probably the easiest to remember. You’ll recall that each file has three sets of permissions: Owner, Group and Other. The permissions themselves are read, write and execute (i.e. it’s an executable program).

chown can set or clear a load of permissions in one go, and the format is basically the type of permission, and ‘+’ or ‘-’ for set or clear, and the permissions themselves. What? It’s probably easier to explain with a load of examples.:

chmod u+w myfile

Allows the user of the file to write to it (u means user/owner)

chmod g+w myfile

Allows any user in the group the file belongs to write to it.

chmod o+r myfile

Allows any user who is not in the files group or the owner to read it. (o means “other”).

You can combine these options

chmod ug+rw

Allows the owner and the group to read and write the file.
chmod go-w

Prevents anyone but the user from being able to modify the file.

If you want to run a program you’ve just written called myprog.

chmod +x myprog

If you don’t specify anything before the +/- chmod assumes you mean yourself.

You might notice an ‘x’ permission on a directory – in this case it means the directory is searchable to whoever has the permission.

Basic UNIX file commands

I was asked me to explain basic Unix shell file manipulation commands, so here goes.

If you’re familiar with MS-DOS, or Windows CMD.EXE and PowerShell (or even CP/M) you’ll know how to manipulate files and directories on the command line. It’s tempting to think that the Unix command line is the same, but there are a few differences that aren’t immediately apparent.

There are actually two main command lines (or shells) for Unix: sh and csh. Others are mainly compatible with these two, with the most common clones being bash and tcsh respectively. Fortunately they’re all the same when it comes to basic commands.

Directory Concepts

Files are organised into groups called “directories”, which are often called “Folders” on Macintosh and Windows. It’s not a great analogy, but it’s visual on a GUI. Unlike the real world, a directory may contain additional directories as well as files. These directories (or sub-directories) can also contain files and more directories and so on. If you drew a diagram you’d end up with something looking like a tree, with directories being branches coming off branches and the files themselves being leaves.
All good trees start with a root from which the rest branches off, and this is no different. The start of a Unix directory tree is known as the root.

Unix has a concept called the Current Working Directory. When a program is looking for a file it is assumed it will be found in the Working Directory if no other location is specified.

Users on a Unix system have an assigned Home Directory, and their Working Directory is initially set to this when the log on.

Users may create whatever files and sub-directories they need within their Home Directory, and the system will allow the do whatever they want with anything they create as it’s owned by them. It’s possible for a normal user to see other directories on the system, in fact it’s necessary, but generally they won’t be able to modify files outside their home directory.

Here’s an example of a directory tree. It starts with the root, /, and each level down adds to the directory “path” to get to the directory.

If you want to know what your current working directory is, the first command we’ll need is “pwd” – or “Print Working Directory”. If you’re ever unsure, use pwd to find out where you are.

Unix commands tend to be short to reduce the amount of typing needed. Many are two letters, and few are longer than four.

The thing you’re most likely to want to do is see a list of files in the current directory. This is achieved using the ls command, which is a shortened form of LiSt.

Typing ls will list the names of the all the files and directories, sorted into ASCII order and arranged into as many columns as will fit on the terminal. You may be surprised to see files that begin with “X” are ahead of files beginning with “a”, but upper case “X” has a lower ASCII value than lower case “a”. Digits 0..9 have a lower value still.

ls has lots of flags to control it’s behaviour, and you should read the documentation if you want to know more of them.

If you want more detail about the files, pass ls the ‘-l’ flag (that’s lower-case L, and means “long form”). You’ll get output like this instead:

drw-r-----  2 fjl  devs        2 Aug 28 13:17 Release
drw-r-----  2 fjl  devs       29 Dec 26  2019 Debug
-rw-r-----  1 fjl  devs     2176 Feb 17  2012 editor.h
-rw-r-----  1 fjl  devs    28190 Feb  7  2012 fbas.c
-rw-r-----  1 fjl  devs    10197 Feb 17  2012 fbas.h
-rw-r-----  1 fjl  devs     5590 Feb 17  2012 fbasexpr.c
-rw-r-----  1 fjl  devs     7556 Feb  3  2012 fbasheap.c
-rw-r-----  1 fjl  devs     7044 Feb  4  2012 fbasio.c
-rw-r-----  1 fjl  devs     4589 Feb  3  2012 fbasline.c
-rw-r-----  1 fjl  devs     4069 Feb  3  2012 fbasstr.c
-rw-r-----  1 fjl  devs     4125 Feb  3  2012 fbassym.c
-rw-r-----  1 fjl  devs    13934 Feb  3  2012 fbastok.c
drw-r-----  3 fjl  devs        3 Dec 26  2019 ipch
-rw-r-----  1 fjl  devs     3012 Feb 17  2012 token.h
:

I’m going to skip the first column for now and look at the third and fourth.

fjl devs

This shows the name of the user who owns the file, followed by the group that owns the file. Unix users can be members of groups, and sometimes it’s useful for a file to be effectively used by a group rather than one user. For example, if you have an “accounts” group and all your accounts staff belong to it, a file can be part of the “accounts” group so everyone can work on it.

Now we’ve covered users and groups we can return to the first column. It shows the file flags, which are various attributes concerning the file. If there’s a ‘-’ then the flag isn’t set. The last nine flags are three sets of three permissions for the file.

The first set are the file owner’s permissions (or rights to use the file).

The second set are the file group’s permissions.

The third are the permissions for any user who isn’t either the owner or in the file’s group

Each group of three represents, in order:

r – Can the file be read, but not changed.
w – can the file be written to. If not set it means you can only read it.
x – can the file be run (i.e. is it a program)

So:

– rw- — — means only the user can read/write the file.

– rw- r– — means the file can be read by anyone in it’s group but only written to by the owner.

– rwx r-x — means the file is a program, probably written by its owner. Others in the group can run it, but no one else can even read it.

There are other special characters that might appear in the first filed for advanced purposes but I’m covering the basics here, and you could write a book on ls.

I’ve missed off the first ‘-’, which isn’t a permission but indicates the type of the file. If it’s a ‘-’ it’s just a regular file. A ‘d’ means it’s actually a directory. You’ll sometimes see ‘c’ and ‘s’ on modern systems, which are normally disk drives and network connections. Unix treats everything like a file so disk drives and network sockets can be found in the directory tree too. You’ll probably see ‘l’ (lower case L) which means it’s a symbolic link – a bit like a .LNK file in Windows.

This brings us to the second column, which is a number. It is the number of times the file exists in the directory tree thanks to their being links, and most cases this will be one. I’ll deal with links later.

The last three columns should be easy to guess: Length, date and finally the name of the file; at least in the case of a regular file.

There are many useful and not so useful options supported by ls. Here are a few that might be handy.

-d

By default, if you give ls a directory name it will show you the contents of the directory. If you want to see the directory itself, most likely because you want to see its permissions, specify -d.

-t

Sort output by time instead of ASCII

-r

Reverse the order of sort. -rtl is useful as it will sort your files with the newest at the end of the list.

-h

Instead of printing a file size in the bytes, which could be a very long number, display in “human readable” format, which restricts it to three characters followed by a suffix: B=bytes, K=Kb, M=Mb and so on.

-F

This is very handy if you’re not using -l, as with just the name printed you can’t tell regular and special files apart. This causes a single character to be added to the file name: ‘*’ means it’s a program (has the x flag set), ‘/’ means it’s a directory and ‘@’ means it’s a symbolic link. Less often you’ll see ‘=’ for a socket, ‘|’ for a FIFO (obsolete) and ‘%’ for a whiteout file (insanity involving union mounts).

Finally, ls takes arguments. By default it lists everything if you give it a list of files and directories, it will just list them.

Where src is a directory, ls src will list the files in that directory. Remember ls -d if you just want information on the directory?

ls src

List everything in the src and obj directories:

ls src obj

Now you can find the names of the files, how do you look at what’s in them? To display the contents of a text file the simple method is cat.

cat test.c

This prints the contents of test.c. You might just want to see the first few lines, so instead try:

head test.c

Only the first ten lines (by default) are printed.

If you want to see the last ten lines, try:

tail test.c

If you want to go down the file a screen full at a time, use.

more test.c

It stops and waits for you to press the space bar after every screen. If you’ve read enough, hit ‘q’ to quit.

less test.c

This is the latest greatest file viewer and it allows you to scroll up and down a file using the arrow keys. It’s got a lot of options.

So far we’ve stayed in our home directory, where we have kept all our files. But evenrtually you’re going to need to organise your files in a hierarchical structure in directories.

To make a directory called “new” type:

mkdir new

This is an mnemonic for “make directory”.

To change your working directory use the chdir command (Change Directory)

chdir new

Most people use the abbreviated synonym for chdir, “cd”, so this is equivalent:

cd new

Once you’re there, type “pwd” to prove we’ve moved:

pwd

If you type ls now you won’t see any files, because it’s empty.

You can also specify the directory explicitly, such as:

cd /usr/home/fred/new

If you don’t start with the root ‘/’, cd will usually start looking for the name of the new directory in the current working directory.

To move back one level up the directory level use this command:

cd ..

You’ll be back in your home directory.

To get rid of the “new” directory use the rmdir command (ReMove DIRectory)

rmdir new

This only works on empty directories, so if there were any files in it you’d have to delete them first. There are other more dangerous commands that will destroy directories and all their contents but it’s better to stick with the safer ones!

To remove an individual file use the rm (ReMove) command, in this case the file being named “unwanted”:

rm unwanted

Normally files are created by applications, but if you want a file to experiment on the easiest way to create one is “touch filename”, which creates an empty file called “filename”. You can also use the echo command:

echo “This is my new text file, do you like it?” > myfile.txt

Echo prints stuff to the screen, but “> myfile.txt” tells Unix to put the output of the echo command into “myfile.txt” instead of displaying it on the screen. We’ll use “echo” more later.

You can display the contents with:

cat myfile.txt

One thing you’re going to want to do pretty soon is copy a file, which is achieved using the cp (CoPy) command:

cp myfile.txt copy-of-myfile.txt

This makes a copy of the file and calls it copy-of-myfile.txt

You can also copy it into a directory

mkdir new
cp myfile.txt new

To see it there, type:

ls -l new

To see the original and the copy, try:

ls - newfile.txt new

If you wanted to delete the copy in “new” use the command:

rm new/myfile.txt

Perhaps, instead of copying your file into “new” you wanted to move it there, so you ended up with only one copy. This is one use of the mv (MoVe) command:

mv myfile.txt new

The file will disappear from your working directory and end up in “new”.

How do you rename a file? There’s no rename command, but mv does it for you. When all is said and done, all mv is doing is changing the name and location of a file.

cd new
mv myfile.txt myfile.text

That’s better – much less Microsoft, much more Unix.

Wildcards

So far we’ve used commands on single files and directories, but most of these commands work with multiple files in one go. We’ve given them a single parameter but we could have used a list.

For example, if we wanted to remove three files called “test”, “junk” and “foo” we could use the command:

rm test junk foo

If you’re dealing with a lot of files you have have the shell create a list of names instead of typing them all. You do this by specifying a sort of “template”, and all the files matching the template will be added to the list.

This might seem the same as Windows, but it’s not – be careful. With Windows the command does the pattern matching according to its context, but the Unix shell has no context and you may end up matching more than you intended, which is unfortunate if you’re about to delete stuff.

The matching against the template is called globbing, and uses the special characters ‘*’ and ‘?’ in it’s simplest form.

‘?’ matches any single character, whereas ‘*’ matches zero or more characters. All other characters except ‘[‘ match themselves. For example:

“?at” would match cat, bat and rat. It would not match “at” as it must have a first character. Neither will it match “cats” as it’s expecting exactly three characters.

“cat*” would match cat, cats, caterpillar and so on.

“*cat*” would match all of the above, as well as “scatter”, “application” and “hellcat”.

You can also specify a list of allowable letters to match between square brackets [ and ], which means any single character will do. You can specify a range, so [0-9] will match any digit. Putting a ‘!’ in front negates the match, so [!0-9] will match any single character that is NOT a digit. If you want to match a two-digit number use [0-9][0-9].

To test globbing out safely, I recommend the use of the echo command for safety. It works like this:

echo Hello world

This prints out Hello world. Useful, eh? But technically what it’s doing is taking all the arguments (aka parameters) one by one and printing them. The first argument is “Hello” so it prints that. The second is “world” so it prints a space and prints that, until there are no arguments left.

Suppose we type this:

echo a*

The Unix shell globs it using the * special character produces a list of all files that start with the letter ‘a’.

You can use this, for example, to specify all the ‘C’ files ending in .c:

echo *.c

If you want to include .h files in this, use

echo *.c *.h

Practice with echo to see how globbing works as it’s non-destructive!

You can also use ls, although this goes on to expand directories into their contents, which can be confusing.

When you have a command that has a source and destination, such as cp (CoPy), they will interpret the everything in the list as a file to be processed apart from the last, which it will expect to be a directory. For example:

cp test junk foo rubbish

Will copy “test”, “junk” and “foo” into an existing directory rubbish.

Now for a practical example. Suppose you have a ‘C’ project where everything is in one directory. .c files, .h files, .o files as well as the program itself. You want to sort this out so the source is in one directory and the objects in another.

Create two directories like this:

mkdir src obj
mv *.c *.h src
mv *.o obj

All done!

Mirrored swap devices

Although some of this is BSD specific, the principles apply to any Unix or Linux.

When you install you Unix like OS across several disks, either with a mirror or RAID system (particularly ZFS RAIDZ) you’ll be asked if you want to set up a swap partition, and if you want it mirrored.

The default (for FreeBSD) is to add a swap partition on every disk and not mirror it. This is actually the most efficient configuration apart from having dedicated swap drives, but is also a spectacularly bad idea. More on this later.

What is a swapfile/drive anyway?

The name is a hangover from early swapping multi tasking systems. Only a few programs could fit in main memory, so when their time allocation ran out they were swapped with others on a disk until it was their turn again.

These days we have “virtual memory”, where a Memory Management Unit (MMU) fixed it so blocks of memory known as pages are stored on disk when not in use and automatically loaded when needed again. This is much more effective than swapping out entire programs but needs MMU hardware, which was once complex, slow and expensive.

So the swap partition should really be called the paging partition now, and Microsoft actually got the name right on Windows. But we still call it the swap partition.

What you need to remember is that parts of a running programs memory may be in the swap partition instead of RAM at any time, and that includes parts of the operating system.

Strategies

There are several ideas for swap partitions in the 2020s.

No swap partition

Given RAM is so cheap, you can decide not to bother with one, and this is a reasonable approach. Virtual memory is slow, and if you can, get RAM instead. It can still pay to have one though, as some pages of memory are rarely, if ever, used again once created. Parts of a large program that aren’t actually used, and so on. The OS can recognise this and page them out, using the RAM for something useful.

You may also encounter a situation where the physical RAM runs out, which will mean no further programs can be run and those already running won’t be able to allocate any more. This leads to two problems: Firstly “Developers” don’t often program for running out of memory and their software doesn’t handle the situation gracefully. Secondly, if the program your need to run is you login shell you’ll be locked out of your server.

For these reasons I find it better to have a swap partition, but install enough RAM that it’s barely used. As a rule of thumb, I go for having the same swap space as there is physical RAM.

Dedicated Swap Drive(s)

This is the classic gold standard. Use a small fast drive (and expensive), preferably short stroked, so your virtual memory goes as fast as possible. If you’re really using VM this is probably the way to go, and having multiple dedicated drives spreads the load and increases performance.

Swap partition on single drive

If you’ve got a single drive system, just create a swap partition. It’s what most installers do.

Use a swap file

You don’t need a drive or even a partition. Unix treats devices and files the same, so you can create a normal file and use that.

truncate -s 16G /var/swapfile
swapon /var/swapfile

You can swap on any number of files or drives, and use “swapoff” to stop using a particular one.

Unless you’re going for maximum performance, this has a lot going for it. You can allocate larger or smaller swap files as required and easily reconfigure a running system. Also, if your file system is redundant, your swap system is too.

Multiple swap partitions

This is what the FreeBSD installer will offer by default if you set up a ZFS mirror or RAIDZ. It spreads the load across all drives. The only problem is that the whole point of a redundant drive system is that it will keep going after a hardware failure. With a bit of swap space on every drive, the system will fail if any of the drives fails, even if the filing system carries on. Any process with RAM paged out to swap gets knocked out, including the operating system. It’s like pulling out RAM chips and hoping it’s not going to crash. SO DON’T DO IT.

If you are going to use a partition on a data drive, just use one. On an eight drive system the chances of a failure on one of eight drives is eight times higher than one one specific unit, so you reduce the probability of failure considerably by putting all your eggs in one basket. Counterintuitive? Consider that if one basket falls on a distributed swap, they all do anyway.

Mirrored swap drives/partitions

This is sensible. The FreeBSD installer will do this if you ask it, using geom mirror. I’ve explained gmirror in posts passem, and there is absolutely no problem mixing it with ZFS (although you might want to read earlier posts to avoid complications with GPT). But the installer will do it automatically, so just flip the option. It’s faster than a swap file, although this will only matter if your job mix actually uses virtual memory regularly. If you have enough RAM, it shouldn’t.

You might think that mirroring swap drives is slower – and to an extent it is. Everything has to be written twice, and the page-out operation will only complete when both drives have been updated. However, on a page-in the throughput is doubled, given the mirror can read either drive to satisfy the request. The chances are there will be about the same, or slightly more page-ins so it’s not the huge performance hit it might seem at first glance.

Summary

MethodProsCons
No swapSimple
Fastest
Wastes RAM
Can lead to serious problems if you run out of RAM
Dedicated Swap Drive(s)Simple
Optimal performance
Each drive is a single point of failure for the whole system
Multiple Swap PartitionsImproved performance
Lower cost than dedicated
Each drive is a single point of failure for the whole system
Single swap partition (multi-drive system)Simple
Lower probability of single point of failure occurring.
Reduced performance
Still has single point of failure
Mirrored drives or partitionsNo single point of failure for the whole systemReduced performance
Swap fileFlexible even on live system
Redundancy the same as drive array
Reduced performance
Quick summary of different swap/paging device strategies.

Conclusion

Having swap paritions on multiple drives increases your risk of a fault taking down a server that would otherwise keep running. Either use mirrored swap partitions/drives, or use a swap file on redundant storage. The choice depends on the amount of virtual memory you use in normal circumstances.

How hot is your FreeBSD machine?

FreeBSD may be the hottest operating system available, but having hot hardware isn’t so good. Modern drives and CPUs can report their temperature, but it’s not easy to see it.

I’ve produced an example script that will report the temperature of whatever it can, with the idea that it can form the basis of whatever you really need to do. I have a multi-server monitoring system using the same methods.

Getting the CPU core temperature is a simple sysctl variable, but it only appears if you have the coretemp.ko module loaded or compiled into the kernel. But coretemp.ko only works for Intel processors; for AMD you need amdtemp.ko instead.

The script tries to determine the CPU type the best way I know how, and loads the appropriate module if necessary. If it loads the module, it unloads the module at the end to leave things as it found them. You can omit the unload so it’s loaded on first, or put the module permanently in loader.conf if you prefer. But this is only an example, and I prefer to keep my scripts independent of host configuration.

Next you need a way to get the temperature from all your drives. This isn’t built in to FreeBSD but you can use the excellent the excellent smartmontools https://www.smartmontools.org by Bruce Allen and Christian Franke.

This was originally intended to access the SMART reporting on ATA drives, but will now extract information for SAS units too. To smartmontools, and the “smartctl” utility in particular, you can build from ports with:

cd /usr/ports/sysutils/smartmontools
make install clean

You can also install it as a binary package with: “pkg install smartmontools”

The script tries to enumerate the drives and CPUs on the system. ATA drives are in /dev/ and begin “ada”, SCSI and USB drives begin “da”. The trick is to figure out which devices are drives and which are partitions or slices within a drive – I don’t have a perfect method.

SCSI drives that aren’t disks (i.e. tape) start “sa”, and return some really weird stuff when smartctl queries them. I’ve tested the script with standard LTO tape drives, but you’ll probably need to tweak it for other things. (Do let me know).

Figuring out the CPUs is tricky, as discrete CPUs and cores within a single chip appear the same. The script simply goes on the cores, which results in the same temperature being reported for each.

You can override any of the enumeration by simply assigning the appropriate devices to a list, but where’s the fun? Seriously, this example shows how you can enumerate devices and it’s useful when you’re monitoring disparate hosts using the same script.

Finally, there are three loops that read the temperature for each device type into into “temp” and then print it. Do whatever you want – call “shutdown -p now” if you think something’s too hot; autodial the fire brigade or, as I do, send yourself an email.

The Script

#!/bin/sh
# FreeBSD Temperature monitoring example script
# (c) FJL 2024 frank@fjl.co.uk
# Please feel free to use this as an example for a few techniques
# including enumerating devices on a host and extracting temperature
# information.

# Full path to utilities in case run with no PATH set
GREP=/usr/bin/grep
SMARTCTL=/usr/local/sbin/smartctl
CUT=/usr/bin/cut
SYSCTL=/sbin/sysctl

# Load the AMD CPU monitoring driver if necessary
if [ ! $($SYSCTL -n dev.cpu.0.temperature 2>/dev/null) ]
then
# Let's try to find out if we have Intel or
# AMD processor and select correct kernel module
if $SYSCTL -n hw.model | $GREP AMD >/dev/null
then
tempmodule=amdtemp
else
tempmodule=coretemp
fi
# Load the CPU temp kernel module
kldload $tempmodule
# Set command to unload it when we're done (optional)
unload="kldunload $tempmodule"
fi
# Enumerate SATA, USB and SAS disks - everything
# in /dev/ starting da or ada
disks=$(find /dev -depth 1 -type c \( -name da[0-9+] -o -name ada[0-9+] \) | cut -c 6- | sort )

# Enumerate other SCSI devices, starting in sa.
# Normally tape drives. May need tweaking!
scsis=$(find /dev -depth 1 -type c \( -name sa[0-9+] \) | cut -c 6- | sort)

# Enumerate the CPUs
cpus="$(seq 0 $(expr $($SYSCTL -n hw.ncpu ) - 1))"

# Print all the disks
for disk in $disks
do
temp=$($SMARTCTL -a /dev/$disk | $GREP Temperature_Celsius | $CUT -w -f 10)
echo "$disk: ${temp}C"
done

# Print all the SCSI devices (e.g. tapes)
# NB. This will probably take a lot of fiddling as SCSI units return all sorts to smartctl
# Note the -T verypermissive. see man smartctl for details.
for scsi in $scsis
do
temp=$($SMARTCTL -a -T verypermissive /dev/$scsi | $GREP "Current Drive Temperature" | cut -w -f 4)
echo "$scsi: ${temp}C"
done

# Print all the CPUs
for cpu in $cpus
do
temp=$($SYSCTL -n dev.cpu.$cpu.temperature | $CUT -f 1 -d .)
echo "CPU$cpu: ${temp}C"
done

# Unload the CPU temp kernel module if we loaded it (optional)
$unload

Example Output

ada0: 35C
ada1: 39C
ada2: 39C
ada3: 38C
da0: 24C
da1: 25C
sa0: 33C
CPU0: 48C
CPU1: 48C

Why people obsess about the ZFS SLOG, but shouldn’t

There are two mysteries things on ZFS that cause a lot of confusion: The ZIL and the SLOG. This article is about what they are and why you should care, or not care about them. But I’ll come to them later. Instead I’ll start with POSIX, and what it says about writing stuff to disk files.

When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.

With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won’t get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn’t guarantee it’s written, but that’s the expectation.

You’d want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.

Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).

To get around this, ZFS maintains a ZIL – ZFS Intent Log.

In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it’s not a physical thing; it’s a concept or list. It contains transaction groups that need to be completed in order.

The ZIL can be physically stored in three possible places…

In-Memory (Volatile Storage):

This is the default location. Initially, all write operations are buffered in RAM. This is where they are held before being committed to persistent storage. This kind ofZIL is volatile because it’s not backed by any permanent storage until written to disk.

Volatility doesn’t matter, because ZFS guarantees consistency with transaction groups (TXGs). The power goes off and the in-RAM ZIL is lost, the transactions are never applied; but the file system is in a consistent state.

In-Pool (Persistent Storage):

Without a dedicated log device, the ZIL entries are written to the main storage pool in transaction groups . This happens for both synchronous and asynchronous writes but is more critical for synchronous writes to ensure data integrity in case of system crashes or power failures.

SLOG (Separate Intent Log Device):

For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically a low-latency, high-speed storage like a short-stroked Rapter, enterprise SSD or NVRAM. ZFS writes the log entries before they’re committed to the pool’s main storage.

By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction is stored in non-volatile storage before the write returns, and if you’re doing a lot of synchronous writes then storing them on a high-speed SLOG device helps. But only if the SLOG device is substantially faster than an array of standard drives. And it only matters if you do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still

I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.

If you still think yo need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricy, but it doesn’t need to be very large. You can add it to a zpool with:

zpool add poolname
log /dev/daX

Where daX is the drive name, obviously.

As I mentioned, the SLOG doesn’t need to be large at all. It only has to cope with five seconds of writes, as that’s the maximum amount of time data is “allowed” to reside there. If you’re using NFS over 10Gbit Ethernet the throughput isn’t going to be above 1.25Gb a seconds. Assuming that’s flat-out synchronous writes, multiplying that by five seconds is less than 8Gb. Any more would be unused.

If you’ve got a really critical system you can add mirrored SLOG drives to a pool thus:

zpool add poolname
log /dev/daX /dev/daY

You can also remove them with something like:

zpool remove
poolname log /dev/daY

This may be useful if adding an SLOG doesn’t give you the performance boost you were hoping for. It’s very niche!

Proper Case in a shell script

How do you force a string into proper case in a Unix shell script? (That is to say, capitalise the first letter and make the rest lower case). Bash4 has a special feature for doing it, but I’d avoid using it because, well, I want to be Unix/POSIX compatible.

It’s actually very easy once you’ve realised tr won’t do it all for you. The tr utility has no concept on where in the input stream it is, but combining tr with cut works a treat.

I came across this problem when I was writing a few lines to automatically create directory layouts for interpreted languages (in this case the Laminas framework). Languages of this type like capitalisation of class names, but other names have be lower case.

Before I get started, I note about expressing character ranges in tr. Unfortunately different systems have done it in different ways. The following examples assume BSD Unix (and POSIX). Unix System V required ranges to be in square brackets – e.g. A-Z becomes “[A-Z]”. And the quotes are absolutely necessary to stop the shell globing once you’ve introduced the square brackets!

Also, if you’re using a strange character set, consider using \[:lower:\] and \[:upper:\] instead of A-Z if your version of tr supports it (most do). It’s more compatible with foreign character sets although I’d argue it’s not so easy on the eye!

Anyway, these examples use A-Z to specify ASCII characters 0x41 to 0x5A – adjust to suit your tr if your Unix is really old.

To convert a string ($1) into lower case, use this:

lower=$(echo $1 | tr A-Z a-z)

To convert it into upper case, use the reverse:

upper=$(echo $1 | tr a-z A-Z)

To capitalise the first letter and force the rest to lower case, split using cut and force the first character to be upper and the rest lower:

proper=$(echo $1 | cut -c 1 | tr a-z A-Z)$(echo $1 | cut -c 2- | tr A-Z a-z)

A safer version would be:

proper=$(echo $1 | cut -c 1 | tr "[:lower:]" "[:upper:]")$(echo $1 | cut -c 2- | tr "[:upper:]" [":lower:"])

This is tested on FreeBSD in /bin/sh, but should work on all BSD and bash-based Linux systems using international character sets.

You could, if you wanted to, use sed to split up a multi-word string and change each word to proper case, but I’ll leave that as an exercise to the reader.

Reply-To: gmail spam and Spamassassin

Over the last few months I’ve noticed huge increase is spam with a “Reply To:” field set to a gmail address. What the miscreants are doing is hijacking a legitimate mail server (usually a Microsoft one) and pumping out spam advertising a service of some kind. These missives only work if the mark is able to reply, and as even a Microsoft server will be locked down sooner or later, so they’ll never get the reply.

The reason for sending this way is, of course, spam from a legitimate mail server isn’t going to be blacklisted or blocked. SPF and other flags will be good. So these spams are likely to land in inboxes, and a few marks will reply based on the law of numbers.

To get the reply they’re using the email “Reply-To:” field, which will direct the reply to an alternative address – one which Google is happy to supply them for nothing.

The obvious way of detecting this would be to examine the Reply-To: field, and if it’s gmail whereas the original sender isn’t, flag it as highly suspect.

I was about to write a Spamassassin rule to do just this, when I discovered there is one already – and it’s always been there. The original idea came from Henrik Krohns in 2009, but it’s time has now definitely arrived. However, in a default install, it’s not enabled – and for a good reason (see later). The rule you want is FREEMAIL_FORGED_REPLYTO, and it’s found in 20_freemail.cf

Enabling FREEMAIL_FORGED_REPLYTO in Spamassassin

If you check 20_freemail.cf you’ll see the rules require Mail::SpamAssassin::Plugin::FreeMail, The FreeMail.pm plugin is part of the standard install, but it’s very likely disabled. To enable this (or any other plugin) edit the init.pre file in /usr/local/etc/mail/spamassassin/ Just add the following to the end of the file:

# Freemail checks
#
loadplugin Mail::SpamAssassin::Plugin::FreeMail FreeMail.pm

You’ll then need to add a list of what you consider to be freemail accounts in your local.cf (/usr/local/etc/mail/spamassassin/local.cf). As an example:

freemail_domains aol.* gmail.* gmail.*.* outlook.com hotmail.* hotmail.*.*

Note the use of ‘*’ as a wildcard. ‘?’ matches a single character, but neither match a ‘.’. It’s not a regex! There’s also a local.cf setting “freemail_whitelist”, and other things documented in FreeMail.pm.

Then restart spamd (FreeBSD: service spamd restart) and you’re away. Except…

The problem with this Rule

If you look at 20_freemail.cf you’ll see the weighting is very low (currently 0.1). If this is such a good rule, why so little? The fact is that there’s a lot of spam appearing in this form, and it’s the best heuristic for detecting it, but it’s also going to lead to false positives in some cases.

Consider those silly “contact forms” beloved by PHP Web Developers. They send an email from a web server but with a “faked” reply address to the person filling in the form. This becomes indistinguishable from the heuristic used to spot the spammers.

If you know this is going to happen you can, of course add an exception. You can even have the web site use a local submission port and send it to a local mailbox without filtering. But in a commercial hosting environment this gets a bit complicated – you don’t know what Web Developers are doing. (How could you? They often don’t).

If you have control over your users, it’s probably safe to up the weighting. I’d say 3.0 is a good starting point. But it may be safer to leave it at 0.1 and examine the results for what would have been false positives.

Networking FreeBSD Jails

Or port forwarding to a jail

I’ve already explained how easy FreeBSD jails are to set up and use without resorting to installing heavy management tools, but today I thought I’d add a bit about networking. Specifically, how do you pass traffic arriving on a particular port to a service running inside a jail?

It’s actually very easy. All you need is a very local network inside FreeBSD, natted to the one outside.

Suppose you have your jail.conf set up as per my previous article. Here’s an excerpt:

tom { ip4.addr = 192.168.0.2 ; }
dick { ip4.addr = 192.168.0.3 ; }
harry { ip4.addr = 192.168.0.4 ; }

The defaults were set earlier in the file; the only thing that’s unique about each jail is the IP4 address and the name. What I didn’t say at the time was that 192.168.0.0 could have been on an internal network.

To define your local network just define it in rc.conf:

cloned_interfaces="lo1"
ipv4_addrs_lo1="192.168.0.1-14/28"

This creates another local loopback interface and assigns a range of IPv4 addresses to it. This can be as large as you wish, but I’ve defined 1..14 (with appropriate subnet mask) because they’ll be listed every time you run ifconfig!

Next you’re going to need something to do the natting. pf is your friend here. I struggled for years using ipfw before I discovered pf.

Enable pf in rc.conf too:

pf_enable="yes"

And you’ll need an /etc/pf.conf file to do the magic. I like pf – it’s easier for my brain to understand than most. Here’s an example file:

PUB_IP="192.168.1.217"
INT="bge0"
JAIL_NET="192.168.0.0/24"
TOM="192.168.0.2"
DICK="192.168.0.3"
HARRY="192.168.0.4"
scrub in all
nat pass on $INT from $JAIL_NET to any -> $PUB_IP
block on $INT proto tcp from any to $PUBIP port 111
rdr pass on $INT proto tcp from any to $PUBIP port 3306 -> $TOM
rdr pass on $INT proto tcp from any to $PUBIP port {21,80,443} -> $DICK
rdr pass on $INT proto tcp from any to $PUBIP port 81 -> $HARRY port 80

So what’s going on?

I’ve used a few macros. PUB_IP is your public IP address, and INT is the interface it’s on. pf may figure some of this out, but I’m being explicit.

TOM, DICK and HARRY are the IPv4 addresses of the jails.

Next I’m scrubbing all interfaces (normally a good idea, but you don’t have to). But the next line is important – it uses nat to allow stuff on your jail network to talk to the outside world.

The following line is where you might want to block more stuff – in this case NFS on port 111. Then we’re back to jail things for the final three lines. They’re pretty self-explanatory, but here’s an explanation anyway.

Let’s say the tom jail is running a MariaDB server on port 3306. The first line takes anything arriving on port 3306 and sends it to tom’s jail IP. Simple. It can reply because of the nat line earlier.

dick is running a web and ftp server, so ports 21,80 and 443 are sent there. The pf syntax lets you do nice stuff like this with the {..}

Finally we come to harry. Here we’re running an http server on port 80, but to make it accessible externally we’re mapping it to port 81 as otherwise it would clash with dick. In other words, if you don’t specify a destination port in the redirect it will assume the same as the source port.

And that’s it! When you jail is started you will see an interface lo1 with the IP address defined in /etc/jail.conf and assuming you have something sensible in /etc/resolv.conf you’ll have a jail that looks like it’s running behind a NAT router with port forwarding.

Of course, if you don’t need to map a jailed service to an external IP address, don’t! Jails can access services on each other using their own virtual network.

FreeBSD in Godden Green

What is going on with FreeBSD in Godden Green in Kent, UK? Jobsite has been spamming me with junior/mid-level programmer roles mentioning FreeBSD for months now, and I’m getting curious!

I have an alert set up so whenever FreeBSD is mentioned I get a ping, as I like to know what’s going on. This isn’t one of the usual suspect AFAIK – they might even be interesting!

ZFS is not always the answer. Bring back gmirror!

The ZFS bandwaggon has momentum, but ZFS isn’t for everyone. UFS2 has a number of killer advantages in some applications.

ZFS is great if you want to store a very large number of normal files safely. It’s copy-on-write (COW) is a major advantage for backup, archiving and general data safety, and datasets allow you to fine-tune almost any way you can think of. However, in a few circumstances, UFS2 is better. In particular, large random-access files do badly with COW.

Unlike traditional systems, a block in a file isn’t overwritten in place, it always ends up at a different location. If a file started off contiguous it’ll pretty soon be fragmented to hell and performance will go off a cliff. Obvious victims will be databases and VM hard disk images. You can tune for these, but to get acceptable performance you need to throw money and resources to bring ZFS up to the same level. Basically you need huge RAM caches, possibly an SLOG, and never let your pool get more than 50% full. If you’re unlucky enough to end up at 80% full ZFS turns off speed optimisations to devote more RAM to caching as things are going to get very bad fragmentation-wise.

If these costs are a problem, stuck with UFS. And for redundancy, there is still good old GEOM Mirror (gmirror). Unfortunately the documentation of this now-poor relation has lagged a bit, and what once worked as standard, doesn’t. So here are some tips.

The most common use of gmirror (with me anyway) is a twin-drive host. Basically I don’t want things to fail when a hard disk dies, so I add a second redundant drive. Such hosts (often 1U servers) don’t have space for more than two drives anyway – and it pays to keep things simple.

Setting up a gmirror is really simple. You create one using the “gmirror label” command. There is no “gmirror create” command; it really is called “label”, and it writes the necessary metadata label so that mirror will recognise it (“gmirror destroy” is present and does exactly what you might expect).

So something like:

gmirror label gm0 ada1 ada2

will create a device called /dev/mirror/gm0 and it’ll contain ada1’s contents mirrored on to ada2 (once it’s copied it all in the background). Just use /dev/mirror/gm0 as any other GEOM (i.e. disk). Instead of calling it gm0 I could have called it gm1, system, data, flubnutz or anything else that made sense, but gm0 is a handy reminder that it’s the first geom mirror on the system and it’s shorter to type.

The eagle eyed might have noticed I used ada1 and ada2 above. You’ve booted off ada0, right? So what happens if you try mirroring yourself with “gmirror label gm0 ada0 ada1“? Well this used to work, but in my experience it doesn’t any more. And on a twin-drive system, this is exactly what you want to do. But it is still possible, read on…

How to set up a twin-drive host booting from a geom mirror

First off, before you do anything (even installing FreeBSD) you need to set up your disks. Since the IBM XT, hard disks have been partitioned using an MBR (Master Boot Record) at the start. This is really old, naff, clunky and Microsoft. Those in the know have been using the far superior GPT system for ages, and it’s pretty cross-platform now. However, it doesn’t play nice with gmirror, so we’re going to use MBR instead. Trust me on this.

For the curious, know that GPT keeps a copy of the partition table at the beginning and end of the disk, but MBR only has one, stored at the front. gmirror keeps its metadata at the end of the disk, well away from the MBR but unfortunately in exactly the same spot as the spare GPT. You can hack the gmirror code so it doesn’t do this, or frig around with mirroring geoms rather than whole disks and somehow get it to boot, but my advice is to stick to MBR partitioning or BSDlabels, which is an extension. There’s not a lot of point in ever mounting your BSD boot drive on a non-BSD system, so you’re not losing much whatever you choose.

Speaking of metadata, both GPT and gmirror can get confused if they find any old tables or labels on a “new” disk. GPT will find old backup partition tables and try to restore them for you, and gmirror will recognise old drives as containing precious data and dig its heels in when you try to overwrite it. Both gpart and gmirror have commands to erase their metadata, but I prefer to use dd to overwrite the whole disk with zeros anyway before re-use. This checks that the disk is actually good, which is nice to know up-front. You could just erase the start and end if you were in a hurry and wanted to calculate the offsets.

The next thing you’ll need to do is load the geom_mirror kernel module. Either recompile the kernel with it added, or if this fills you with horror,  just add ‘load_geom_mirror=”yes”‘ to /boot/loader.conf. This does bring it in early enough in the process to let you boot from it. The loader will boot from one drive or the other and then switch to mirror mode when it’s done.

So, at this point, you’ve set up FreeBSD as you like on one drive (ada0), selecting BSDlabels or MBR as the partition method and UFS as the file system. You’ve set it to load the geom_mirror module in loader.conf.  You’re now looking at a root prompt on the console, and I’m assuming your drives are ada0 and ada1, and you want to call your mirror gm0.

Try this:

gmirror label gm0 ada0

Did it work? Well it used to once, but now you’ll probably get an error message saying it could not write metadata to ada0. If (when) this happens I know of one answer, which I found after trying everything else. Don’t be tempted to try everything else yourself (such as seeing if it works with ada1). Anything you do will either fail if you’re lucky, or make things worse. So just reboot, and select single-user mode from the loader menu.

Once you’re at the prompt, type the command again, and this time it should say that gm0 is created. My advice is to now reboot rather than getting clever.

When you do reboot it will fail to mount the root partition and stop, asking for help to find it. Don’t panic. We know where it’s gone. Mount it with “ufs:/dev/mirror/gm0s1a” or whatever slice you had it on if you’ve tried to be clever. Forgot to make a note? Don’t worry, somewhere on the boot long visible on the screen it actually tell you the name of the partition it couldn’t find.

After this you should be “in”. And to avoid this inconvenience next time you boot you’ll need to tweak /etc/fstab using an editor of your choice, although real computer nerds only use vi. What you need to do is replace all references to the actual drive with the gm0 version. Therefore /dev/ada0s1a should be edited to read /dev/mirror/gm0s1a. On a current default install, which no longer partitions the drive, this will only apply the root mount point and the swap file.

Save this, reboot (to test) and you should be looking good. Now all that remains is to add the second drive (ada1 in the example) with the line:

gmirror insert gm0 ada1

You can see the effect by running:

gmirror status

Unless your drive is very small, gm0 will be DEGRADED and it will say something about being rebuilt. The precise wording has changed over time. Rebuilding takes hours, not seconds so leave it. Did I mention it’s a good idea to do this when the system isn’t busy?