Posts about warewulf

Splitting Warewulf Images Between PXE and NFS

This article was also published via the CIQ blog on 6 December 2022.

Warewulf 4 introduced compatibility with the OCI container ecosystem, which greatly streamlines the process of defining, importing, and maintaining compute node images compared to other systems--even compared to Warewulf 3! But one aspect of compute node images remains unchanged: they can quickly grow in size.

Warewulf (and the technique of PXE-booting a node image more broadly) expects that a compute node image will remain relatively small. Larger sets of software, such as an Environment Modules stack or a Spack installation, are typically deployed via a central NFS share, which the booted compute node mounts at runtime. Even OpenHPC, with software packaged as operating system packages, supports this paradigm: packages are installed on the head node, land in /opt, and are then shared from the head node to the compute nodes.

There are still benefits to maintaining this software as part of a compute node image, but such an image can quickly grow to tens of gigabytes, which makes network booting difficult.

In this article I'll demonstrate how a full software stack can be managed together with a given compute node image while the resultant payload is split in place between PXE-served netbooting and an NFS-mounted file system.


NOTE

This procedure depends on support for /etc/warewulf/excludes, which was broken in Warewulf v4.3.0.


The root image

First, I start with the standard Rocky Linux 8 image as published by HPCng.

[root@wwctl1 ~]# wwctl container import docker://docker.io/warewulf/rocky:8 rocky-8-split

Installing some software

Using the OpenHPC project as a source, I install a set of typical scientific software. Most OpenHPC packages install software in /opt for distribution via NFS, which is what we're going to do: just a little bit differently than usual.

[root@wwctl1 ~]# wwctl container shell rocky-8-split
[rocky-8-split] Warewulf> dnf -y install 'dnf-command(config-manager)'
[rocky-8-split] Warewulf> dnf config-manager --set-enabled powertools
[rocky-8-split] Warewulf> dnf -y install epel-release http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm
[rocky-8-split] Warewulf> dnf -y install valgrind-ohpc {netcdf,pnetcdf,hypre,boost}-gnu9-mpich-ohpc

After installing the software our image is approaching 2GB. This isn't egregious (and the compressed image as sent over the network is even smaller), but gives us a point of comparison for what comes next.

[root@wwctl1 ~]# du -h /var/lib/warewulf/container/rocky-8-split.img{,.gz}
1.8G    /var/lib/warewulf/container/rocky-8-split.img
651M    /var/lib/warewulf/container/rocky-8-split.img.gz

Excluding the software from the final image

Warewulf consults /etc/warewulf/excludes within the image itself to define files that should not be included in the built image. For our example here, I exclude the full contents of /opt/, in anticipation that we'll be mounting it via NFS instead.

[rocky-8-split] Warewulf> cat /etc/warewulf/excludes
/boot
/usr/share/GeoIP
/opt/*

After rebuilding the image with /opt/* excluded, the image is smaller, and further software installed under /opt no longer increases the size of the image delivered over PXE.

[root@wwctl1 ~]# du -h /var/lib/warewulf/container/rocky-8-split.img{,.gz}
1.1G    /var/lib/warewulf/container/rocky-8-split.img
483M    /var/lib/warewulf/container/rocky-8-split.img.gz
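If the exclusion list is maintained by a script rather than by hand, it helps to make the edit idempotent. The following is a minimal sketch, assuming a plain one-pattern-per-line file as shown above; add_exclude is a hypothetical helper name, not part of Warewulf.

```shell
# add_exclude FILE PATTERN  (hypothetical helper, not a Warewulf command)
# Append PATTERN to FILE unless an identical line is already there.
add_exclude() {
    file=$1
    pattern=$2
    # Create the file if it is missing so grep has something to read.
    [ -e "$file" ] || : > "$file" || return 1
    # -x matches whole lines only; -F treats the pattern literally,
    # so the "*" in "/opt/*" is not interpreted as a regex operator.
    if grep -qxF -- "$pattern" "$file"; then
        return 0
    fi
    printf '%s\n' "$pattern" >> "$file"
}
```

Inside the container shell this could be called as add_exclude /etc/warewulf/excludes '/opt/*'. Running it a second time leaves the file unchanged. The sketch assumes the existing file ends with a newline.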

Exporting the software via NFS

With the software in /opt excluded from the image, we need to export it via NFS instead. This is relatively easily done, though we must discover and hard-code paths to the container directory.

[root@wwctl1 ~]# readlink -f $(wwctl container show rocky-8-split)/opt
/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt

Add an NFS export to /etc/warewulf/warewulf.conf, restart the Warewulf server, and configure NFS with wwctl. Note that I've specified mount: false for this export, as I want to control which nodes will mount it: presumably nodes that aren't using this image should not mount this image's software.

nfs:
  export paths:
  - path: /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt
    export options: rw,sync,no_root_squash
    mount: false

[root@wwctl1 ~]# systemctl restart warewulfd
[root@wwctl1 ~]# wwctl configure nfs
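A quick sanity check after this step is to confirm that the path actually appears in the server's exports file. This is a sketch under the assumption that wwctl configure nfs writes its exports to /etc/exports on the Warewulf server; has_export is a hypothetical helper name.

```shell
# has_export PATH [EXPORTS_FILE]  (hypothetical helper)
# Succeed if PATH is the first field of some line in the exports file.
# Assumption: the exports live in /etc/exports unless another file is given.
has_export() {
    path=$1
    exports=${2:-/etc/exports}
    # Compare the whole first field so that a parent or child directory
    # of an exported path does not count as a match.
    awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' "$exports"
}
```

For example, has_export /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt || echo "not exported" would flag a missing export before any compute node tries to mount it.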

Mounting the software on the compute node

We can mount this new NFS share just like any other, by listing it in fstab.

Warewulf typically configures fstab as part of the wwinit overlay. In order to mount this NFS share without setting mount: true for all nodes, I copy fstab.ww to a new overlay and add an additional entry.

[root@wwctl1 ~]# wwctl overlay list -a rocky-8-split
OVERLAY NAME                   FILES/DIRS
rocky-8-split                  /etc/
rocky-8-split                  /etc/fstab.ww

[root@wwctl1 ~]# wwctl overlay show rocky-8-split /etc/fstab.ww | tail -n1
{{ .Ipaddr }}:/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt /opt nfs defaults 0 0

I can add the new overlay to our wwinit list, and the fstab in rocky-8-split will override the one in wwinit. (Note: --wwinit was specified as --system in Warewulf 4.3.0.)

[root@wwctl1 ~]# wwctl profile set --wwinit wwinit,rocky-8-split default
[root@wwctl1 ~]# wwctl profile set --container rocky-8-split default

From a compute node, we can see that /opt is mounted via NFS as expected.

[root@compute1 ~]# findmnt /opt
TARGET SOURCE                                                      FSTYPE OPTIONS
/opt   10.0.0.3:/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt nfs4   rw,relatime,vers=4.2,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.4,local_lock=none,addr=10.0.0.3

We can further confirm that /opt is empty on the local, PXE-deployed file system.

[root@compute1 ~]# mount -o bind / /mnt
[root@compute1 ~]# du -s /mnt/opt
0   /mnt/opt
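The same check can be scripted, for example as part of a node health check. The sketch below reads a mounts table in the /proc/mounts format (device, mount point, file system type, options); fstype_of and is_nfs_mount are hypothetical helper names.

```shell
# fstype_of MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Print the file system type of the last entry mounted at MOUNTPOINT.
fstype_of() {
    awk -v m="$1" '
        $2 == m { t = $3 }          # later entries override earlier ones
        END { if (t == "") exit 1; print t }
    ' "${2:-/proc/mounts}"
}

# is_nfs_mount MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Succeed if MOUNTPOINT is currently backed by NFS.
is_nfs_mount() {
    case $(fstype_of "$@") in
        nfs|nfs4) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On a compute node, is_nfs_mount /opt || echo "/opt is not on NFS" >&2 would report a node that booted without the share.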

Future work

As demonstrated here, split PXE/NFS images can be implemented with functionality already in Warewulf; but future Warewulf development may simplify this process further:

Container path variables in warewulf.conf

We could support referring to compute node images in warewulf.conf. For example, it would be nice to be able to replace

nfs:
  export paths:
  - path: /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt
    export options: rw,sync,no_root_squash
    mount: false

with something like

nfs:
  export paths:
  - path: {{ containers['rocky-8-split'] }}/opt
    export options: rw,sync,no_root_squash
    mount: false

This way, our configuration would not have to hard-code the path to the container chroot.

Move NFS mount settings to nodes and profiles

Right now, NFS client settings are stored in warewulf.conf as mount options, mount, and implicitly via path; but if these settings were moved to nodes and profiles we could configure per-profile and per-node NFS client behavior without having to manually edit or override fstab.

Stateless provisioning of stateful nodes: examples with Warewulf 4

This article was also published via the CIQ blog on 30 November 2022.

When deploying Warewulf 4, we often encounter expectations that Warewulf should support stateful provisioning. Typically these expectations are born from experience with another system (such as Foreman, xCAT, or even Warewulf 3) that supported writing a provisioned operating system to the local disk of each compute node.

Warewulf 4 intentionally omits this kind of stateful provisioning from its feature set, following experiences from Warewulf 3: the code for stateful provisioning was complex, and required a disproportionate amount of maintenance compared to the number of sites using it.

For the most part, we think that arguments for stateful provisioning are better addressed within Warewulf 4's stateless provisioning process. I'd like to go over three such common use cases here, and show how each can be addressed to provision nodes with local state using Warewulf 4.

Local scratch

The first thing to understand is that stateless provisioning does not mean diskless nodes. For example, you may have a local disk that you want to provide as a scratch file system.

Warewulf compute nodes run a small wwclient agent that assists with the init process during boot and deploys the node's overlays during boot and runtime. wwclient reads its own initialization scripts from /warewulf/init.d/, so we can place startup scripts there to take actions during boot.

My test nodes here are KVM instances with a virtual disk at /dev/vda. This wwclient init script looks for a "local-scratch" file system and, if it does not exist, creates one on the local disk.

#!/bin/sh
#
# /warewulf/init.d/70-mkfs-local-scratch

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

fs=$(findfs LABEL=local-scratch)
if [ $? -eq 0 ]
then
    echo "local-scratch filesystem already exists: ${fs}"
else
    target=/dev/vda
    echo "Creating local-scratch filesystem on ${target}"
    mkfs.ext4 -FL local-scratch "${target}"
fi

wwclient runs this script before it passes init on to systemd, so the script also completes before fstab is processed. That means we can mount the "local-scratch" file system just like any other disk in fstab.

LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0
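Because a script like the one above formats a disk, it can be worth rehearsing the decision before letting it run mkfs. The following is a dry-run sketch only: plan_labeled_fs is a hypothetical name, and it prints what would happen instead of touching the disk. It also shows the if var=$(command) form, which tests the exit status of findfs directly rather than through $?.

```shell
# plan_labeled_fs LABEL TARGET  (hypothetical dry-run helper)
# Report whether a file system with LABEL exists, or where it would be made.
plan_labeled_fs() {
    label=$1
    target=$2
    # The assignment's exit status is that of findfs, which is
    # non-zero when no device carries the requested label.
    if dev=$(findfs "LABEL=${label}" 2>/dev/null); then
        echo "keep: ${label} already exists on ${dev}"
    else
        echo "create: ${label} would be created on ${target}"
    fi
}
```

Calling plan_labeled_fs local-scratch /dev/vda on a node shows which branch the real script would take.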

The Warewulf 4 overlay system allows us to deploy customized files to nodes or groups of nodes (via profiles) at boot. For this example, I've placed my customized fstab and init script in a "local-scratch" overlay and included it as a system overlay, alongside the default wwinit overlay.

# wwctl overlay list -a local-scratch
OVERLAY NAME                   FILES/DIRS  
local-scratch                  /etc/        
local-scratch                  /etc/fstab.ww
local-scratch                  /warewulf/   
local-scratch                  /warewulf/init.d/
local-scratch                  /warewulf/init.d/70-mkfs-local-scratch

# wwctl profile set --system wwinit,local-scratch default
# wwctl overlay build

Because local-scratch is listed after wwinit in the "system" overlay list (see above), its fstab overrides the definition in the wwinit overlay. 70-mkfs-local-scratch is placed alongside other init scripts, and is processed in lexical order.

A node booting with this overlay will create (if it does not exist) a "local-scratch" file system and mount it at "/mnt/scratch", potentially for use by compute jobs.
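The numeric prefix in 70-mkfs-local-scratch matters because of the lexical ordering mentioned above, which is the same ordering a shell glob produces. The following is a minimal sketch of how a runner could walk such a directory. It is not Warewulf's actual implementation, and run_init_dir is a hypothetical name.

```shell
# run_init_dir DIR  (hypothetical; illustrates ordering only)
# Run every regular file in DIR with sh, in the order the glob yields.
run_init_dir() {
    dir=$1
    # Pathname expansion returns names sorted, so 10-* runs before 70-*.
    for script in "$dir"/*; do
        # Skip subdirectories and the literal pattern left by an empty dir.
        [ -f "$script" ] || continue
        sh "$script" || echo "warning: ${script} exited with status $?" >&2
    done
}
```

With this ordering, a script that needs the scratch file system to exist can simply use a prefix higher than 70.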

Disk partitioning

But perhaps you want to do something more complex: say you have a single disk, and you want to allocate part of it for scratch (as above) and part of it as swap space. Contrary to what may be popular opinion, we actively encourage the use of swap space in an image-netboot environment like Warewulf 4: a swap partition that is at least as big as the image to be booted allows Linux to write idle portions of the image to disk, freeing up system memory for compute jobs.

So let's expand on the above pattern to actually partition a disk, rather than just format it.

#!/bin/sh
#
# /warewulf/init.d/70-parted

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

local_swap=$(findfs LABEL=local-swap)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" ] && [ -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-scratch: ${local_scratch}"
else
    disk=/dev/vda
    local_swap="${disk}1"
    local_scratch="${disk}2"

    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB -1

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

This new init script looks for the expected "local-scratch" and "local-swap" labels and, if either of them is not found, uses parted to partition the disk and then creates both. As before, this is done before fstab is processed, so we can configure these with fstab the standard way.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0

This configuration went into a new parted overlay, allowing us to configure some nodes for "local-scratch" only, and some nodes for this partitioned layout.

# wwctl overlay list -a parted
OVERLAY NAME                   FILES/DIRS  
parted                         /etc/        
parted                         /etc/fstab.ww
parted                         /warewulf/   
parted                         /warewulf/init.d/
parted                         /warewulf/init.d/70-parted

# wwctl profile set --system wwinit,parted default
# wwctl overlay build

(Note: I installed parted in my system image to support this; but the same could also be done with sfdisk, which is included in the image by default.)
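The swap-sizing guidance earlier in this section (a swap partition at least as big as the booted image) can be turned into a quick calculation. This is a sketch, assuming the uncompressed image size in bytes is the number of interest; swap_gib_for_image is a hypothetical helper name.

```shell
# swap_gib_for_image BYTES  (hypothetical helper)
# Print the smallest whole number of GiB that is >= BYTES.
swap_gib_for_image() {
    bytes=$1
    gib=1073741824              # 2^30 bytes
    # Integer ceiling division: round any partial GiB up.
    echo $(( (bytes + gib - 1) / gib ))
}
```

With GNU stat, the image built for a node could be measured with swap_gib_for_image "$(stat -c %s /var/lib/warewulf/container/rocky-8-split.img)". Note that parted treats GB as 10^9 bytes and GiB as 2^30 bytes, so a result from this function would be passed to parted with a GiB suffix.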

Persistent storage for logs

Another common use case we hear concerns the persistence of logs on the compute nodes. Particularly in a failure event, where a node must be rebooted, it can be useful to have retained logs on the compute host so that they can be investigated when the node is brought back up: in a default stateless deployment, these logs are lost on reboot.

We can extend from the previous two examples to deploy a "local-log" file system to retain these logs between reboots.

(Note: generally we advise not retaining logs on compute nodes: instead, you should deploy something like Elasticsearch, Splunk, or even just a central rsyslog instance.)

#!/bin/sh
#
# /warewulf/init.d/70-parted

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

local_swap=$(findfs LABEL=local-swap)
local_log=$(findfs LABEL=local-log)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" ] && [ -n "${local_log}" ] && [ -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-log: ${local_log}"
    echo "Found local-scratch: ${local_scratch}"
else
    disk=/dev/vda
    local_swap="${disk}1"
    local_log="${disk}2"
    local_scratch="${disk}3"

    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB 4GB \
        mkpart primary ext4 4GB -1

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-log on ${local_log}"
    mkfs.ext4 -FL local-log "${local_log}"

    echo "Populating local-log from image /var/log/"
    mkdir -p /mnt/log/ \
      && mount "${local_log}" /mnt/log \
      && rsync -a /var/log/ /mnt/log/ \
      && umount /mnt/log/ \
      && rmdir /mnt/log

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

For the most part, this follows the same pattern from the "parted" example above; but it adds a step to initialize the new "local-log" file system from the directory structure in the image.

Finally, the new file system is added to fstab, after which logs will be persisted on the local disk.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0
LABEL=local-log /var/log ext4 defaults,nofail 0 0

Some applications may write logs outside of /var/log; but, in these instances, it's probably easier to configure the application to write to /var/log than to try to capture all the places where logs might be written.
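One portability note on the partitioning scripts above: they derive partition names by appending a number to the disk path, which works for /dev/vda or /dev/sda. On Linux, disks whose names end in a digit (NVMe and MMC devices, for example) insert a "p" before the partition number. The sketch below handles both cases; part_dev is a hypothetical helper name.

```shell
# part_dev DISK NUMBER  (hypothetical helper)
# Print the device path of partition NUMBER on DISK.
part_dev() {
    disk=$1
    num=$2
    case $disk in
        # Names ending in a digit get a "p" separator: nvme0n1 -> nvme0n1p1
        *[0-9]) echo "${disk}p${num}" ;;
        # Names ending in a letter do not: vda -> vda1
        *)      echo "${disk}${num}" ;;
    esac
}
```

In the scripts above, local_swap=$(part_dev "${disk}" 1) would then work unchanged on either kind of device.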

The future

There are a few more use cases that we sometimes hear brought up in the context of stateful node provisioning:

  • How can we use Ansible to configure compute nodes?
  • How can we configure custom kernels and kernel modules per node?
  • Isn't stateless provisioning slower than having the OS deployed on disk?

If you'd like to hear more about these or other potential corner-cases for stateless provisioning, get in touch! We'd love to hear from you, learn about the work you're doing, and address any of the challenges you're having.