There are two mysteries things on ZFS that cause a lot of confusion: The ZIL and the SLOG. This article is about what they are and why you should care, or not care about them. But I’ll come to them later. Instead I’ll start with POSIX, and what it says about writing stuff to disk files.
When you write to disk it can either be synchronous or asynchronous. POSIX (Portable Operating System Interface) has requirements for writes through various system calls and specifications.
With an asynchronous write the OS takes the data you give it and returns control to the application immediately, promising to write the data as soon as possible in the background. No delay. With a synchronous write the application won’t get control back until the data is actually written to the disk (or non-volatile storage of some kind). More or less. Actually, POSIX.1-2017 (IEEE Standard 1003.1-2017) doesn’t guarantee it’s written, but that’s the expectation.
You’d want synchronous writes for critical complex files, such as a database, where the internal structure would break if a transaction was only half written, and a database engine needs to know that one write has occurred before making another.
Writes to ZFS can be long and complicated, requiring multiple blocks be updated for a single change. This is how it maintains its very high integrity. However, this means it can take a while to write even the simplest thing, and a synchronous write could take ages (in computer terms).
To get around this, ZFS maintains a ZIL – ZFS Intent Log.
In ZFS, the ZIL primarily serves to ensure the consistency and durability of write operations, particularly for synchronous writes. But it’s not a physical thing; it’s a concept or list. It contains transaction groups that need to be completed in order.
The ZIL can be physically stored in three possible places…
In-Memory (Volatile Storage):
This is the default location. Initially, all write operations are buffered in RAM. This is where they are held before being committed to persistent storage. This kind ofZIL is volatile because it’s not backed by any permanent storage until written to disk.
Volatility doesn’t matter, because ZFS guarantees consistency with transaction groups (TXGs). The power goes off and the in-RAM ZIL is lost, the transactions are never applied; but the file system is in a consistent state.
In-Pool (Persistent Storage):
Without a dedicated log device, the ZIL entries are written to the main storage pool in transaction groups . This happens for both synchronous and asynchronous writes but is more critical for synchronous writes to ensure data integrity in case of system crashes or power failures.
SLOG (Separate Intent Log Device):
For better performance with synchronous writes, you can add a dedicated device to serve as the SLOG. This device is typically a low-latency, high-speed storage like a short-stroked Rapter, enterprise SSD or NVRAM. ZFS writes the log entries before they’re committed to the pool’s main storage.
By storing the pending TXGs on disk, either in the pool or on an SLOG, ZFS can meet the POSIX requirement that the transaction is stored in non-volatile storage before the write returns, and if you’re doing a lot of synchronous writes then storing them on a high-speed SLOG device helps. But only if the SLOG device is substantially faster than an array of standard drives. And it only matters if you do a lot of synchronous writes. Caching asynchronous writes in RAM is always going to be faster still
I’d contend that the only times synchronous writes feature heavily are databases and virtual machine disks. And then there’s NFS, which absolutely loves them. See ESXi NFS ZFS and vfs-nfsd-async for more information if this is your problem.
If you still think yo need an SLOG, install a very fast drive. These days an NVMe SLC NAND device makes sense. Pricy, but it doesn’t need to be very large. You can add it to a zpool with:
zpool add poolname log /dev/daX
Where daX is the drive name, obviously.
As I mentioned, the SLOG doesn’t need to be large at all. It only has to cope with five seconds of writes, as that’s the maximum amount of time data is “allowed” to reside there. If you’re using NFS over 10Gbit Ethernet the throughput isn’t going to be above 1.25Gb a seconds. Assuming that’s flat-out synchronous writes, multiplying that by five seconds is less than 8Gb. Any more would be unused.
If you’ve got a really critical system you can add mirrored SLOG drives to a pool thus:
zpool add poolname log /dev/daX /dev/daY
You can also remove them with something like:
zpool remove poolname log /dev/daY
This may be useful if adding an SLOG doesn’t give you the performance boost you were hoping for. It’s very niche!