Stateless provisioning of stateful nodes: examples with Warewulf 4

This article was also published via the CIQ blog on 30 November 2022.

When deploying Warewulf 4, we often encounter expectations that Warewulf should support stateful provisioning. Typically these expectations are born from experience with another system (such as Foreman, XCAT, or even Warewulf 3) that supported writing a provisioned operating system to the local disk of each compute node.

Warewulf 4 intentionally omits this kind of stateful provisioning from its feature set, following experiences from Warewulf 3: the code for stateful provisioning was complex, and required a disproportionate amount of maintenance compared to the number of sites using it.

For the most part, we think that arguments for stateful provisioning are better addressed within Warewulf 4's stateless provisioning process. I'd like to go over three such common use cases here, and show how each can be addressed to provision nodes with local state using Warewulf 4.

Local scratch

The first thing to understand is that stateless provisioning does not mean diskless nodes. For example, you may have a local disk that you want to provide as a scratch file system.

Warewulf compute nodes run a small wwclient agent that assists with the init process during boot and deploys the node's overlays during boot and runtime. wwclient reads its own initialization scripts from /warewulf/init.d/, so we can place startup scripts there to take actions during boot.

My test nodes here are KVM instances with a virtual disk at /dev/vda. This wwclient init script looks for a "local-scratch" file system and, if it does not exist, creates one on the local disk.

# /warewulf/init.d/70-mkfs-local-scratch

# KVM disks require a kernel module
modprobe virtio_blk

# My test nodes have their local disk at /dev/vda
target=/dev/vda

fs=$(findfs LABEL=local-scratch)
if [ $? -eq 0 ]
then
    echo "local-scratch filesystem already exists: ${fs}"
else
    echo "Creating local-scratch filesystem on ${target}"
    mkfs.ext4 -FL local-scratch "${target}"
fi

wwclient runs this script before it passes init on to systemd, so it is also processed before fstab. So we can mount the "local-scratch" file system just like any other disk in fstab.

LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0

The Warewulf 4 overlay system allows us to deploy customized files to nodes or groups of nodes (via profiles) at boot. For this example, I've placed my customized fstab and init script in a "local-scratch" overlay and included it as a system overlay, alongside the default wwinit overlay.

# wwctl overlay list -a local-scratch
OVERLAY NAME                   FILES/DIRS  
local-scratch                  /etc/        
local-scratch                  /etc/fstab.ww
local-scratch                  /warewulf/   
local-scratch                  /warewulf/init.d/
local-scratch                  /warewulf/init.d/70-mkfs-local-scratch

# wwctl profile set --system wwinit,local-scratch default
# wwctl overlay build

Because local-scratch is listed after wwinit in the "system" overlay list (see above), its fstab overrides the definition in the wwinit overlay. 70-mkfs-local-scratch is placed alongside other init scripts, and is processed in lexical order.

A node booting with this overlay will create (if it does not exist) a "local-scratch" file system and mount it at "/mnt/scratch", potentially for use by compute jobs.

Disk partitioning

But perhaps you want to do something more complex. Perhaps you have a single disk, but you want to allocate part of it for scratch (as above) and part of it as swap space. Perhaps contrary to popular opinion, we actively encourage the use of swap space in an image-netboot environment like Warewulf 4: a swap partition that is at least as large as the booted image allows Linux to write idle portions of the image to disk, freeing up system memory for compute jobs.

So let's expand on the above pattern to actually partition a disk, rather than just format it.

# /warewulf/init.d/70-parted

# KVM disks require a kernel module
modprobe virtio_blk

# My test nodes have their local disk at /dev/vda
disk=/dev/vda

local_swap=$(findfs LABEL=local-swap)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" ] && [ -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-scratch: ${local_scratch}"
else
    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB -1

    # virtio disk partitions appear as /dev/vda1, /dev/vda2
    local_swap=${disk}1
    local_scratch=${disk}2

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

This new init script looks for the expected "local-scratch" and "local-swap" and, if either of them is not found, uses parted to partition the disk and creates them. As before, this is done before fstab is processed, so we can configure these with fstab the standard way.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0

This configuration went into a new parted overlay, allowing us to configure some nodes for "local-scratch" only, and some nodes for this partitioned layout.

# wwctl overlay list -a parted
OVERLAY NAME                   FILES/DIRS  
parted                         /etc/        
parted                         /etc/fstab.ww
parted                         /warewulf/   
parted                         /warewulf/init.d/
parted                         /warewulf/init.d/70-parted

# wwctl profile set --system wwinit,parted default
# wwctl overlay build

(Note: I installed parted in my system image to support this; but the same could also be done with sfdisk, which is included in the image by default.)
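If you'd rather avoid adding parted to the image, the same two-partition layout can likely be expressed as an sfdisk script. This is an untested sketch, assuming the same /dev/vda disk as above:

```shell
# sfdisk equivalent of the parted commands above: a 2GiB swap
# partition, then the remainder of the disk as a Linux partition.
# WARNING: destructive; only run against the intended disk.
sfdisk /dev/vda <<EOF
label: gpt
,2GiB,S
,,L
EOF
```

sfdisk's script input uses `start,size,type` fields, where `S` and `L` are shorthand for swap and Linux partition types; omitting the size uses the remaining space.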

Persistent storage for logs

Another common use case we hear concerns the persistence of logs on the compute nodes. Particularly in a failure event, where a node must be rebooted, it can be useful to have retained logs on the compute host so that they can be investigated when the node is brought back up: in a default stateless deployment, these logs are lost on reboot.

We can extend from the previous two examples to deploy a "local-log" file system to retain these logs between reboots.

(Note: generally we advise against retaining logs on compute nodes: instead, you should deploy something like Elasticsearch, Splunk, or even just a central rsyslog instance.)

# /warewulf/init.d/70-parted

# KVM disks require a kernel module
modprobe virtio_blk

# My test nodes have their local disk at /dev/vda
disk=/dev/vda

local_swap=$(findfs LABEL=local-swap)
local_log=$(findfs LABEL=local-log)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" ] && [ -n "${local_log}" ] && [ -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-log: ${local_log}"
    echo "Found local-scratch: ${local_scratch}"
else
    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB 4GB \
        mkpart primary ext4 4GB -1

    # virtio disk partitions appear as /dev/vda1, /dev/vda2, /dev/vda3
    local_swap=${disk}1
    local_log=${disk}2
    local_scratch=${disk}3

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-log on ${local_log}"
    mkfs.ext4 -FL local-log "${local_log}"

    echo "Populating local-log from image /var/log/"
    mkdir -p /mnt/log/ \
      && mount "${local_log}" /mnt/log \
      && rsync -a /var/log/ /mnt/log/ \
      && umount /mnt/log/ \
      && rmdir /mnt/log

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

For the most part, this follows the same pattern as the "parted" example above, but adds a step to initialize the new "local-log" file system from the directory structure in the image.

Finally, the new file system is added to fstab, after which logs will be persisted on the local disk.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0
LABEL=local-log /var/log ext4 defaults,nofail 0 0

Some applications may write logs outside of /var/log; but, in these instances, it's probably easier to configure the application to write to /var/log than to try to capture all the places where logs might be written.
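If you do centralize logs instead, as suggested above, the compute-node side can be as small as a single rsyslog forwarding rule. This is a sketch; "loghost" is a placeholder for your central log server:

```
# /etc/rsyslog.d/90-forward.conf
# Forward all facilities and priorities over TCP (@@) to the
# central collector; use a single @ for UDP instead.
*.* @@loghost:514
```

This fragment could itself be deployed with a Warewulf overlay, just like the fstab and init scripts above.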

The future

There are a few more use cases that we sometimes hear brought up in the context of stateful node provisioning:

  • How can we use Ansible to configure compute nodes?
  • How can we configure custom kernels and kernel modules per node?
  • Isn't stateless provisioning slower than having the OS deployed on disk?

If you'd like to hear more about these or other potential corner-cases for stateless provisioning, get in touch! We'd love to hear from you, learn about the work you're doing, and address any of the challenges you're having.

Warewulf Deep Dive | CIQ Webinar

I'm really excited about Warewulf 4! I'm also really excited about OpenHPC, Apptainer, and containerizing HPC workloads, particularly MPI. Today I presented my most recent work in these areas in the CIQ webinar.

We're doing these webinars every week! I hope you'll join us Thursdays at 11:00 Pacific time, via YouTube or LinkedIn.

Twilight Princess | Game Older

Family heirlooms of the adult gamer.

Andi beat a game! And it's a Zelda game! Come hang with us for a bit while we discuss the special place the Zelda series has in each of our hearts.

Media used

Also included is background music and marketing from a variety of Zelda games, including Link to the Past, Twilight Princess, and Breath of the Wild.

Apptainer Signatures | CIQ Webinar

I had the pleasure of participating in my first CIQ webinar today! Check it out if you'd like to learn a bit about Apptainer's support for cryptographic signatures, using well-established PGP infrastructure and paradigms.

I hope you'll join us for our next session! We're live every Thursday at 11:00 Pacific time, streaming to YouTube and LinkedIn.

Migrating to LDAP PAM Pass Through Auth

The Research Computing authentication path is more complex than I'd like.

  • We start with pam_sss which, of course, authenticates against sssd.

  • Because we have users from multiple home institutions, both internal and external, sssd is configured with multiple domains.

  • Two of our configured domains authenticate against Duo and Active Directory. To support this we run two discrete instances of the Duo authentication proxy, one for each domain.

  • The Duo authentication proxy can present either an LDAP or RADIUS interface. We went with RADIUS. So sssd is configured with auth_provider = proxy, with a discrete pam stack for each domain. This pam stack uses pam_radius to authenticate against the correct Duo authentication proxy.

  • The relevant Duo authentication proxy then performs AD authentication to the relevant authoritative domain and, on success, performs Duo authentication for second factor.

All of this technically works, and has been working for some time. However, we've increasingly seen a bug in sssd's proxy authentication provider, which manifests as incorrect tracking of active authentication threads.

The problem

[sssd[be[]]] [dp_attach_req] (0x0400): Number of active DP request: 32

sssd maintains a number of pre-forked children for performing this proxy authentication. This defaults to 10, and is configurable per-domain as proxy_max_children. Somewhere in sssd a bug exists that either prevents these children from being closed properly or fails to decrement the active thread count when they are closed. When the "Number of active DP request" exceeds proxy_max_children, sssd will no longer perform authentication for the affected domain.
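For context, this is roughly where the relevant settings live in sssd.conf. The domain name and PAM target here are illustrative, not our actual configuration:

```
[domain/example]
auth_provider = proxy
proxy_pam_target = curc-twofactor-duo
proxy_max_children = 10
```

Raising proxy_max_children only delays the failure when the active-request counter leaks, which is part of why we wanted to remove the proxy provider entirely.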

We have reported this issue to Red Hat but, eight months on, we still don't have a fix. Meanwhile, I'm interested in simplifying our authentication path, hopefully removing the proxy authentication provider from our configuration in the process, and making sssd optional for authentication in our environment.

Our solution

We use 389 Directory Server as our local LDAP server. 389 includes the capability to proxy authentication via PAM. A previous generation of the RC LDAP service used this to perform authentication, but only in a way that supported a single authentication path. However, with some research and experimentation, we have managed to configure our instance with different proxy authentication paths for each of our child domains.

First we simply activate the PAM Pass Through Auth plugin by setting nsslapd-pluginEnabled: on in the existing LDAP entry.

dn: cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject
objectClass: pamConfig
cn: PAM Pass Through Auth
nsslapd-pluginPath: libpam-passthru-plugin
nsslapd-pluginInitfunc: pam_passthruauth_init
nsslapd-pluginType: betxnpreoperation
nsslapd-pluginEnabled: on
nsslapd-pluginloadglobal: true
nsslapd-plugin-depends-on-type: database
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: ldapserver
nsslapd-pluginId: pam_passthruauth
nsslapd-pluginVendor: 389 Project
nsslapd-pluginDescription: PAM pass through authentication plugin

The specifics of authentication can be specified at this level as well, if we're able to express our desired behavior in a single configuration. However, the plugin supports multiple simultaneous configurations expressed as nested LDAP entries.

dn: PAM,cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: pamConfig
objectClass: top
cn: PAM
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN ENTRY
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: curc-twofactor-duo
pamFilter: (&(objectClass=posixAccount)(!(homeDirectory=/home/*@*)))

dn: PAM,cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: pamConfig
objectClass: top
cn: PAM
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN ENTRY
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: csu
pamFilter: (&(objectClass=posixAccount)(homeDirectory=/home/*@*))

Our two sets of users are authenticated using different PAM stacks, as before. Only now this proxy authentication is happening within the LDAP server, rather than within sssd. This may seem like a small difference, but there are multiple benefits:

  • The proxy configuration exists, and need only be maintained, within the LDAP server. It does not require all login nodes to run sssd and a complex, multi-tiered PAM stack.

  • The LDAP "PAM Pass Through Auth" plugin does not have the same bug as the sssd proxy authentication method, bypassing our immediate problem.

  • Applications that do not support PAM authentication, such as XDMoD, Foreman, and Grafana, can now be configured with simple LDAP authentication, and need not know anything of the complexity of authenticating our multiple domains.

For now I'm differentiating our different user types based on the name of their home directory, because it happens to include the relevant domain suffix. In the future we expect to update usernames in the directory to match and would then likely update this configuration to use uid.
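If we do move to domain-qualified usernames, the pamFilter in the relevant pamConfig entry might change to match on uid instead. This is a hypothetical example; the suffix shown is illustrative:

```
pamFilter: (&(objectClass=posixAccount)(uid=*@colorado.edu))
```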

Cleaning up a few remaining issues

However, when I first tied this back into sssd, I DOS'd our LDAP server.

debug_level = 3

description = CU Boulder Research Computing
id_provider = ldap
auth_provider = ldap
chpass_provider = none

enumerate = false
entry_cache_timeout = 300

ldap_id_use_start_tls = True
ldap_tls_reqcert = allow
ldap_uri = ldap://
ldap_search_base = dc=rc,dc=int,dc=colorado,dc=edu
ldap_user_search_base = ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu
ldap_group_search_base = ou=UCB,ou=Groups,dc=rc,dc=int,dc=colorado,dc=edu

This seemed simple enough: when I would try to authenticate using this configuration, I would enter my password as usual and then respond to a Duo "push." But the authentication never cleared in sssd, and I would keep receiving Duo pushes until I stopped sssd. This despite the fact that I could authenticate with ldapsearch as expected.

$ ldapsearch -LLL -x -ZZ -D uid=[redacted],ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu -W '(uid=[redacted])' dn
Enter LDAP Password:
dn: uid=[redacted],ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu

I eventually discovered that sssd has a six-second timeout for "calls to synchronous LDAP APIs," including BIND. This timeout is entirely reasonable--even generous--for operations that do not have a manual intervention component. But when BIND includes time to send a notification to a phone, unlock the phone, and acknowledge the notification in an app, it is easy to exceed this timeout. sssd gives up and tries again, prompting a new push that won't be received until the first is addressed; each timed-out attempt just begets another.

Thankfully, this timeout is also configurable as ldap_opt_timeout in the relevant sssd domain section. I went with ldap_opt_timeout = 90, which is likely longer than anyone will need.

There is still the matter of the DOS'd LDAP server, however. I suspect I had exhausted the number of directory server threads with pending, long-lived (due to the manual intervention required and the compounding timeouts) BIND requests.

The number of threads Directory Server uses to handle simultaneous connections affects the performance of the server. For example, if all threads are busy handling time-consuming tasks (such as add operations), new incoming connections are queued until a free thread can process the request.

Red Hat suggests that nsslapd-threadnumber should be 32 for an eight-CPU system like ours; so for now I simply increased our setting from 16 to this recommended value. If we continue to experience thread exhaustion in real-world use, we can always increase the number of threads again.
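The change itself is a one-attribute modification to cn=config, applied with ldapmodify:

```
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 32
```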

digging into BeeGFS striping

I did some work today figuring out how BeeGFS actually writes its data to disk. I shudder to think that we’d actually use this knowledge; but I still found it interesting, so I want to share.

First, I created a simple striped file in the rcops allocation.

[root@boss2 rcops]# beegfs-ctl --createfile testfile --numtargets=2 --storagepoolid=2
Operation succeeded.

This file will stripe across two targets (chosen by BeeGFS at random) and is using the default 1M chunksize for the rcops storage pool. You can see this with beegfs-ctl --getentryinfo.

[root@boss2 rcops]# beegfs-ctl --getentryinfo /mnt/beegfs/rcops/testfile --verbose
EntryID: 9-5F7E8E87-1
Metadata buddy group: 1
Current primary metadata node: bmds1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 2; actual: 2
+ Storage targets:
  + 826 @ boss1 [ID: 1]
  + 834 @ boss2 [ID: 2]
Chunk path: uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1
Dentry path: 50/4/0-5BEDEB51-1/

I wrote an easily-recognized dataset to the file: 1M of A, then 1M of B, and so on.

[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("A"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("B"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("C"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("D"*(1024*1024))' >>testfile

This gives me a 4M file, precisely 1024*1024*4=4194304 bytes.

[root@boss2 rcops]# du --bytes --apparent-size testfile
4194304     testfile

Those two chunk files, as identified by beegfs-ctl --getentryinfo, are at /data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 and /data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1. (boss106/rcops doesn’t have a storage directory as part of an experiment to see how difficult it would be to remove them. I guess we never put it back.) The boss1 target, 826, is first in the list, so that’s where the file starts.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none

If we skip 1M (1024*1024 bytes), we see that that’s where the file changes to C.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024))) count=5 status=none

And we can see that actually is precisely where it starts by stepping back a little.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024)-2)) count=5 status=none

Cool. So we’ve found the end of the first chunk (made of A) and the start of the third chunk (made of C). That means the second and fourth chunks are over in 834. Which they are.

[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 skip=$(((1024*1024-2))) status=none

So, in theory, if we wanted to bypass BeeGFS and re-construct files from their chunks, we could do that. It sounds like a nightmare, but we could do it. In a worst-case scenario.
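As a sketch of what that worst-case reconstruction might look like, here is a small shell function that interleaves fixed-size stripes from two chunk files in target order. The chunk paths and 1M chunksize from the example above would be its inputs; nothing here is BeeGFS-specific beyond that layout:

```shell
# Reassemble a two-target RAID0-striped file from its chunk files.
# Usage: reassemble CHUNK0 CHUNK1 CHUNKSIZE > output
# CHUNK0 must be the first storage target in the stripe list.
reassemble() {
    c0=$1; c1=$2; cs=$3
    # The first target holds stripe 0, so its chunk file bounds the
    # number of stripes in the whole file.
    size0=$(stat -c %s "$c0")
    nstripes=$(( (size0 + cs - 1) / cs ))
    i=0
    while [ "$i" -lt "$nstripes" ]; do
        # Stripe i lives at offset i*cs in each chunk file; dd emits
        # nothing if the second chunk file has no data at that offset.
        dd if="$c0" bs="$cs" skip="$i" count=1 status=none
        dd if="$c1" bs="$cs" skip="$i" count=1 status=none
        i=$((i + 1))
    done
}
```

With the chunk files from this example, something like `reassemble "$chunk826" "$chunk834" $((1024*1024)) > testfile.reassembled` should reproduce the original ABCD pattern.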

It’s this kind of transparency and inspectability that still makes me really like BeeGFS, despite everything we’ve been through with it.

Wireguard on Raspberry Pi OS

Recently I fell victim to an attack on a security vulnerability in SaltStack that left much of my homelab infected with cryptominers. When I rebuilt the environment I found myself in the market for a VPN solution.

I have used OpenVPN for a little while, but I found it inconvenient enough to set up and use that I only used it when absolutely necessary to bridge between otherwise private networks.

But I had been hearing good things about WireGuard, so I performed a test deployment. First between two disparate servers. Then on a workstation. Then another. Each time the software deployed easily and remained reliably available, particularly in contrast to the unreliability I had become accustomed to with the Cisco VPN I use for work.

So I came to the last system in my network: a first-generation Raspberry Pi B+. WireGuard isn't available in the Raspberry Pi OS (née Raspbian) repository, but I found articles describing how to install the packages from either Debian backports or unstable. I generally avoid mixing distributions, but I followed the directions as proof of concept.

The base wireguard package installed successfully, and little surprise: it is a DKMS package, after all. However, binaries from wireguard-tools immediately segfaulted. (I expect this is because the CPU in the first-generation B+ isn't supported by Debian.)

But then I realized that APT makes source repositories as accessible as binary repositories, and I would worry less about packages I had compiled myself than about binaries built for another distribution:

First add the Debian Buster backports repository, including its signing key. (You can verify the key fingerprint at

sudo apt-key adv --keyserver --recv-keys 0x80D15823B7FD1561F9F7BCDDDC30D7C23CBBABEE
echo 'deb-src buster-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update

Install the devscripts package (so we can use debuild to build the WireGuard packages) and any build dependencies for WireGuard itself.

sudo apt install devscripts
sudo apt build-dep wireguard

Finally, download, build, and install WireGuard.

apt source wireguard
(cd wireguard-*; debuild -us -uc)
sudo apt install ./wireguard_*.deb ./wireguard-tools_*.deb

At this point you should have a fully functional WireGuard deployment, with working wireguard-tools binaries.

Ōkami | Game Older

Issun must die

Jonathon was finally left with no excuse to not play Ōkami, when Cam joined the crew, Steam library in tow. Like the game, we spend a lot of time hanging out and talking too much.

Media used

  • Ōkami OST

  • "Harushiden," halc,

Games mentioned

  • Ōkami

  • Myst (series)

  • Death Stranding

  • Spyro the Dragon

  • The Witness

  • Viewtiful Joe

  • God Hand

  • Vanquish

  • Metal Gear Rising: Revengeance

  • The Legend of Zelda: Ocarina of Time

  • The Legend of Zelda: Twilight Princess

  • Bayonetta

  • Nier: Automata

  • Snake Pass

  • Crash Bandicoot

  • Katamari Damacy

  • Chibi-Robo!

Other references

sprint backlog - 18 February 2020

Research Computing team goals for the period 18 February - 3 March, 2020. If you have any questions or comments please contact

Intro to Python workshop

Research Computing is presenting its regular Intro to Python course.

RCAMP portal testing framework

The RC Account Management Portal (RCAMP) handles account requests and group membership in the RC environment. In order to help us better update and develop the portal and its dependencies we are rebuilding and enhancing its automated test infrastructure.

Internal training for upcoming CC* hybrid cloud environment

RC is developing a hybrid "cloud" environment with support from the NSF Campus Cyberinfrastructure (CC*) program. Development of this environment is ongoing; but our team is also taking this time to learn more about Amazon EC2 and OpenStack virtual machines in order to better support our users when the platform is ready.

Better staff access to fail2ban on login nodes

RC login nodes are protected from brute-force attacks using fail2ban: if a login node sees a sequence of login failures from the same source, that source is "banned" from all login node access for a period of time. During a training, however, when such authentication failures are common from multiple people in the same room, it is inconvenient to wait for the ban to expire. RC system administrators have the ability to cancel such a ban, but they are not usually present at trainings. To better support this use case, we will be delegating the ability to cancel such bans to the rest of the RC team.
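The delegation itself could likely be done with a sudoers rule along these lines; the jail name and group here are assumptions, not our actual configuration:

```
# /etc/sudoers.d/fail2ban-unban (sketch)
# Allow RC staff to lift a ban without full root access
%rc-staff ALL=(root) NOPASSWD: /usr/bin/fail2ban-client set sshd unbanip *
```

A member of the group would then run, for example, `sudo fail2ban-client set sshd unbanip 198.51.100.7`.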

PetaLibrary monthly status reports

A monthly email status report is sent out to PetaLibrary allocation owners and contacts; but this report has fallen out of date, and has not been updated to reflect changes in the PetaLibrary infrastructure. We are updating this reporting script so that all PetaLibrary allocations are reported, irrespective of their deployment location.

Updated MPI in rebuilt Core Software

Our efforts to update our core software stack are ongoing, with our next goal being to install up-to-date Intel MPI and OpenMPI.

RC trainings review

Finally, to better plan future RC trainings and other user support activities, we are reviewing the trainings, office hours, and consults that we've supported in CY2019.