
Splitting Warewulf Images Between PXE and NFS

This article was also published via the CIQ blog on 6 December 2022.

Warewulf 4 introduced compatibility with the OCI container ecosystem, which greatly streamlines the process of defining, importing, and maintaining compute node images compared to other systems--even compared to Warewulf 3! But one aspect of compute node images remains unchanged: they can quickly grow in size.

Warewulf (and the technique of PXE-booting a node image more broadly) expects that a compute node image will remain relatively small. Larger sets of software, like you might provide via an Environment Modules stack or, perhaps, via Spack, are typically deployed via a central NFS share, which is then mounted at runtime by the booted compute node. Even OpenHPC, with software distributed as operating system packages, supports this paradigm, with packages installed on the head node, landing in /opt, and then being shared from the head node to compute nodes.

However, there are still benefits to maintaining this software as part of a compute node image; the trouble is that such an image can quickly grow to tens of gigabytes, making network booting difficult.

In this article I'll demonstrate how a full software stack can be managed together with a given compute node image, but the resultant payload can be split in-place between PXE-served netbooting and an NFS-mounted file system.


NOTE

This procedure depends on support for /etc/warewulf/excludes, which was broken in Warewulf v4.3.0.


The root image

First, I start with the standard Rocky Linux 8 image as published by HPCng.

[root@wwctl1 ~]# wwctl container import docker://docker.io/warewulf/rocky:8 rocky-8-split

Installing some software

Using the OpenHPC project as a source, I install a set of typical scientific software. Most OpenHPC packages install software in /opt for distribution via NFS, which is what we're going to do: just a little bit differently than usual.

[root@wwctl1 ~]# wwctl container shell rocky-8-split
[rocky-8-split] Warewulf> dnf -y install 'dnf-command(config-manager)'
[rocky-8-split] Warewulf> dnf config-manager --set-enabled powertools
[rocky-8-split] Warewulf> dnf -y install epel-release http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm
[rocky-8-split] Warewulf> dnf -y install valgrind-ohpc {netcdf,pnetcdf,hypre,boost}-gnu9-mpich-ohpc

After installing the software, our image approaches 2GB. This isn't egregious (and the compressed image sent over the network is even smaller), but it gives us a point of comparison for what comes next.

[root@wwctl1 ~]# du -h /var/lib/warewulf/container/rocky-8-split.img{,.gz}
1.8G    /var/lib/warewulf/container/rocky-8-split.img
651M    /var/lib/warewulf/container/rocky-8-split.img.gz

Excluding the software from the final image

Warewulf consults /etc/warewulf/excludes within the image itself to define files that should not be included in the built image. For our example here, I exclude the full contents of /opt/, in anticipation that we'll be mounting it via NFS instead.

[rocky-8-split] Warewulf> cat /etc/warewulf/excludes
/boot
/usr/share/GeoIP
/opt/*

Rebuilding the image with /opt/* excluded reduces its size, and further software installation will no longer increase the final size of the image delivered over PXE.
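
If the image isn't rebuilt automatically when you exit the container shell, you can trigger a rebuild explicitly before comparing sizes:

[root@wwctl1 ~]# wwctl container build rocky-8-split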

[root@wwctl1 ~]# du -h /var/lib/warewulf/container/rocky-8-split.img{,.gz}
1.1G    /var/lib/warewulf/container/rocky-8-split.img
483M    /var/lib/warewulf/container/rocky-8-split.img.gz

Exporting the software via NFS

With the software in /opt excluded from the image, we need to export it via NFS instead. This is relatively easy, though we must discover and hard-code the path to the container directory.

[root@wwctl1 ~]# readlink -f $(wwctl container show rocky-8-split)/opt
/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt

Add an NFS export to /etc/warewulf/warewulf.conf, restart the Warewulf server, and configure NFS with wwctl. Note that I've specified mount: false for this export, as I want to control which nodes will mount it: presumably nodes that aren't using this image should not mount this image's software.

nfs:
  export paths:
  - path: /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt
    export options: rw,sync,no_root_squash
    mount: false
[root@wwctl1 ~]# systemctl restart warewulfd
[root@wwctl1 ~]# wwctl configure nfs
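
To confirm that the export is live, we can ask the NFS server directly (assuming standard nfs-utils tooling on the Warewulf host):

[root@wwctl1 ~]# exportfs -v | grep rocky-8-split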

Mounting the software on the compute node

We can mount this new NFS share just like any other, by listing it in fstab.

Warewulf typically configures fstab as part of the wwinit overlay. In order to mount this NFS share without setting mount: true for all nodes, I copy fstab.ww to a new overlay and add an additional entry.
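
One way to create the new overlay is directly on the file system (a sketch, assuming the default site overlay location under /var/lib/warewulf/overlays):

[root@wwctl1 ~]# mkdir -p /var/lib/warewulf/overlays/rocky-8-split/etc
[root@wwctl1 ~]# cp /var/lib/warewulf/overlays/wwinit/etc/fstab.ww \
    /var/lib/warewulf/overlays/rocky-8-split/etc/fstab.ww
[root@wwctl1 ~]# wwctl overlay edit rocky-8-split /etc/fstab.ww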

[root@wwctl1 ~]# wwctl overlay list -a rocky-8-split
OVERLAY NAME                   FILES/DIRS
rocky-8-split                  /etc/
rocky-8-split                  /etc/fstab.ww

[root@wwctl1 ~]# wwctl overlay show rocky-8-split /etc/fstab.ww | tail -n1
{{ .Ipaddr }}:/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt /opt nfs defaults 0 0

I can add the new overlay to our wwinit list, and the fstab in rocky-8-split will override the one in wwinit. (Note: --wwinit was specified as --system in Warewulf 4.3.0.)

[root@wwctl1 ~]# wwctl profile set --wwinit wwinit,rocky-8-split default
[root@wwctl1 ~]# wwctl profile set --container rocky-8-split default

From a compute node, we can see that /opt is mounted via NFS as expected.

[root@compute1 ~]# findmnt /opt
TARGET SOURCE                                                      FSTYPE OPTIONS
/opt   10.0.0.3:/var/lib/warewulf/chroots/rocky-8-split/rootfs/opt nfs4   rw,relatime,vers=4.2,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.4,local_lock=none,addr=10.0.0.3

We can further confirm that /opt is empty on the local, PXE-deployed file system.

[root@compute1 ~]# mount -o bind / /mnt
[root@compute1 ~]# du -s /mnt/opt
0   /mnt/opt

Future work

As demonstrated here, we can implement split PXE/NFS images using functionality already present in Warewulf; but future Warewulf development may simplify this process further:

Container path variables in warewulf.conf

We could support referring to compute node images in warewulf.conf. For example, it would be nice to be able to replace

nfs:
  export paths:
  - path: /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt
    export options: rw,sync,no_root_squash
    mount: false

with something like

nfs:
  export paths:
  - path: {{ containers['rocky-8-split'] }}/opt
    export options: rw,sync,no_root_squash
    mount: false

This way, our configuration would not have to hard-code the path to the container chroot.

Move NFS mount settings to nodes and profiles

Right now, NFS client settings are stored in warewulf.conf as mount options, mount, and implicitly via path; but if these settings were moved to nodes and profiles, we could configure per-profile and per-node NFS client behavior without having to manually edit or override fstab.
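
Purely hypothetically, such a per-profile mount might look something like this (invented syntax, just to illustrate the idea):

nodeprofiles:
  default:
    nfs mounts:
    - path: /var/lib/warewulf/chroots/rocky-8-split/rootfs/opt
      mount point: /opt
      mount options: defaults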

Stateless provisioning of stateful nodes: examples with Warewulf 4

This article was also published via the CIQ blog on 30 November 2022.

When deploying Warewulf 4, we often encounter expectations that Warewulf should support stateful provisioning. Typically these expectations are born from experience with another system (such as Foreman, XCAT, or even Warewulf 3) that supported writing a provisioned operating system to the local disk of each compute node.

Warewulf 4 intentionally omits this kind of stateful provisioning from its feature set, following experiences from Warewulf 3: the code for stateful provisioning was complex, and required a disproportionate amount of maintenance compared to the number of sites using it.

For the most part, we think that arguments for stateful provisioning are better addressed within Warewulf 4's stateless provisioning process. I'd like to go over three such common use cases here, and show how each can be addressed to provision nodes with local state using Warewulf 4.

Local scratch

The first thing to understand is that stateless provisioning does not mean diskless nodes. For example, you may have a local disk that you want to provide as a scratch file system.

Warewulf compute nodes run a small wwclient agent that assists with the init process during boot and deploys the node's overlays during boot and runtime. wwclient reads its own initialization scripts from /warewulf/init.d/, so we can place startup scripts there to take actions during boot.

My test nodes here are KVM instances with a virtual disk at /dev/vda. This wwclient init script looks for a "local-scratch" file system and, if it does not exist, creates one on the local disk.

#!/bin/sh
#
# /warewulf/init.d/70-mkfs-local-scratch

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

fs=$(findfs LABEL=local-scratch)
if [ $? = 0 ]
then
    echo "local-scratch filesystem already exists: ${fs}"
else
    target=/dev/vda
    echo "Creating local-scratch filesystem on ${target}"
    mkfs.ext4 -FL local-scratch "${target}"
fi

wwclient runs this script before it passes init on to systemd, so it is processed before fstab; we can therefore mount the "local-scratch" file system just like any other disk in fstab.

LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0

The Warewulf 4 overlay system allows us to deploy customized files to nodes or groups of nodes (via profiles) at boot. For this example, I've placed my customized fstab and init script in a "local-scratch" overlay and included it as a system overlay, alongside the default wwinit overlay.

# wwctl overlay list -a local-scratch
OVERLAY NAME                   FILES/DIRS  
local-scratch                  /etc/        
local-scratch                  /etc/fstab.ww
local-scratch                  /warewulf/   
local-scratch                  /warewulf/init.d/
local-scratch                  /warewulf/init.d/70-mkfs-local-scratch

# wwctl profile set --system wwinit,local-scratch default
# wwctl overlay build

Because local-scratch is listed after wwinit in the "system" overlay list (see above), its fstab overrides the definition in the wwinit overlay. 70-mkfs-local-scratch is placed alongside other init scripts, and is processed in lexical order.

A node booting with this overlay will create (if it does not exist) a "local-scratch" file system and mount it at "/mnt/scratch", potentially for use by compute jobs.
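
We can spot-check this from the booted node; on my KVM test nodes the file system lands directly on /dev/vda, as created by the script above.

# findfs LABEL=local-scratch
/dev/vda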

Disk partitioning

But perhaps you want to do something more complex. Perhaps you have a single disk, but you want to allocate part of it for scratch (as above) and part of it as swap space. Perhaps contrary to popular opinion, we actively encourage the use of swap space in an image-netboot environment like Warewulf 4: a swap partition that is at least as big as the image to be booted allows Linux to write idle portions of the image to disk, freeing up system memory for compute jobs.

So let's expand on the above pattern to actually partition a disk, rather than just format it.

#!/bin/sh
#
# /warewulf/init.d/70-parted

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

local_swap=$(findfs LABEL=local-swap)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" -a -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-scratch: ${local_scratch}"
else
    disk=/dev/vda
    local_swap="${disk}1"
    local_scratch="${disk}2"

    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB -1

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

This new init script looks for the expected "local-scratch" and "local-swap" and, if either of them is not found, uses parted to partition the disk and creates them. As before, this is done before fstab is processed, so we can configure these with fstab the standard way.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0

This configuration went into a new parted overlay, allowing us to configure some nodes for "local-scratch" only, and some nodes for this partitioned layout.

# wwctl overlay list -a parted
OVERLAY NAME                   FILES/DIRS  
parted                         /etc/        
parted                         /etc/fstab.ww
parted                         /warewulf/   
parted                         /warewulf/init.d/
parted                         /warewulf/init.d/70-parted

# wwctl profile set --system wwinit,parted default
# wwctl overlay build

(Note: I installed parted in my system image to support this; but the same could also be done with sfdisk, which is included in the image by default.)
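
For reference, a rough sfdisk equivalent of the two-partition layout might look like this (an untested sketch using sfdisk's script input format, where "S" and "L" are the swap and Linux type shortcuts):

# printf 'label: gpt\n,2GiB,S\n,,L\n' | sfdisk /dev/vda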

Persistent storage for logs

Another common use case we hear concerns the persistence of logs on the compute nodes. Particularly in a failure event, where a node must be rebooted, it can be useful to have retained logs on the compute host so that they can be investigated when the node is brought back up: in a default stateless deployment, these logs are lost on reboot.

We can extend from the previous two examples to deploy a "local-log" file system to retain these logs between reboots.

(Note: generally we advise not retaining logs on compute nodes: instead, you should deploy something like Elasticsearch, Splunk, or even just a central rsyslog instance.)
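
(A central rsyslog target can be as simple as a single forwarding rule in each node's rsyslog configuration; "loghost" here is a hypothetical collector, and @@ forwards over TCP.)

*.* @@loghost:514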

#!/bin/sh
#
# /warewulf/init.d/70-parted

PATH=/usr/sbin:/usr/bin

# KVM disks require a kernel module
modprobe virtio_blk

local_swap=$(findfs LABEL=local-swap)
local_log=$(findfs LABEL=local-log)
local_scratch=$(findfs LABEL=local-scratch)

if [ -n "${local_swap}" -a -n "${local_log}" -a -n "${local_scratch}" ]
then
    echo "Found local-swap: ${local_swap}"
    echo "Found local-log: ${local_log}"
    echo "Found local-scratch: ${local_scratch}"
else
    disk=/dev/vda
    local_swap="${disk}1"
    local_log="${disk}2"
    local_scratch="${disk}3"

    echo "Writing partition table to ${disk}"
    parted --script --align=optimal ${disk} -- \
        mklabel gpt \
        mkpart primary linux-swap 0 2GB \
        mkpart primary ext4 2GB 4GB \
        mkpart primary ext4 4GB -1

    echo "Creating local-swap on ${local_swap}"
    mkswap --label=local-swap "${local_swap}"

    echo "Creating local-log on ${local_log}"
    mkfs.ext4 -FL local-log "${local_log}"

    echo "Populating local-log from image /var/log/"
    mkdir -p /mnt/log/ \
      && mount "${local_log}" /mnt/log \
      && rsync -a /var/log/ /mnt/log/ \
      && umount /mnt/log/ \
      && rmdir /mnt/log

    echo "Creating local-scratch on ${local_scratch}"
    mkfs.ext4 -FL local-scratch "${local_scratch}"
fi

For the most part, this follows the same pattern as the "parted" example above, but adds a step to initialize the new "local-log" file system from the directory structure in the image.

Finally, the new file system is added to fstab, after which logs will be persisted on the local disk.

LABEL=local-swap swap swap defaults,nofail 0 0
LABEL=local-scratch /mnt/scratch ext4 defaults,X-mount.mkdir,nofail 0 0
LABEL=local-log /var/log ext4 defaults,nofail 0 0

Some applications may write logs outside of /var/log; but, in these instances, it's probably easier to configure the application to write to /var/log than to try to capture all the places where logs might be written.

The future

There are a few more use cases that we sometimes hear brought up in the context of stateful node provisioning:

  • How can we use Ansible to configure compute nodes?
  • How can we configure custom kernels and kernel modules per node?
  • Isn't stateless provisioning slower than having the OS deployed on disk?

If you'd like to hear more about these or other potential corner-cases for stateless provisioning, get in touch! We'd love to hear from you, learn about the work you're doing, and address any of the challenges you're having.

The SSH agent

This is one part in a series on OpenSSH client configuration. Also read Elegant OpenSSH Configuration and Secure OpenSSH Defaults.

In another SSH client article, we may have generated a new ssh key for use in ssh public-key authentication.

$ ssh-keygen -t rsa -b 4096 # if you don't already have a key

SSH public-key authentication has intrinsic benefits; but many see it as a mechanism for non-interactive login: you don’t have to remember, or type, a password.

This behavior is dependent, however, on having a non-encrypted private key. This is a security risk, because the non-encrypted private key may be compromised, either by accidental mishandling of the file or by unauthorized intrusion into the client system. In almost all cases, ssh private keys should be encrypted with a passphrase.

$ ssh-keygen -t rsa -b 4096 -f test
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:

If you already have a private key that is not encrypted, use the -p argument to ssh-keygen to set a passphrase.

$ ssh-keygen -p -f ~/.ssh/id_rsa

Now the private key is protected by a passphrase, which you’ll be prompted for each time you use it. This is better than a password, because the passphrase is not transmitted to the server; but we’ve lost the ability to authenticate without having to type anything.

ssh-agent

OpenSSH provides a dedicated agent process for the sole purpose of handling decrypted ssh private keys in-memory. Most Unix and Linux desktop operating systems (including OS X) start and maintain a per-user SSH agent process automatically.

$ pgrep -lfu $USER ssh-agent
815 /usr/bin/ssh-agent -l

Using the ssh-add command, you can decrypt your ssh private key by entering your passphrase once, adding the decrypted key to the running agent.

$ ssh-add ~/.ssh/id_rsa # the path to the private key may be omitted for default paths
Enter passphrase for /Users/user1234/.ssh/id_rsa:
Identity added: /Users/user1234/.ssh/id_rsa (/Users/user1234/.ssh/id_rsa)

The decrypted private key remains resident in the ssh-agent process.

$ ssh-add -L
ssh-rsa [redacted] /Users/user1234/.ssh/id_rsa
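
(ssh-add can also limit how long decrypted keys stay resident: the -t flag sets a lifetime after which an identity is automatically removed, and -D removes all identities immediately.)

$ ssh-add -t 4h ~/.ssh/id_rsa # expire this identity after four hours
$ ssh-add -D # remove all identities from the agent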

This is better than a non-encrypted on-disk private key for two reasons: first, the decrypted private key exists only in memory, not on disk. This makes it more difficult to mishandle, and it cannot be recovered without re-entering the passphrase once the workstation is powered off. Second, client applications (like OpenSSH itself) no longer require direct access to the private key, encrypted or otherwise, nor must you provide your (secret) key passphrase to client applications: the agent moderates all use of the key itself.

The OpenSSH client automatically uses the agent process identified by the SSH_AUTH_SOCK environment variable; but you generally don’t have to worry about it: your workstation environment should configure it for you.

$ echo $SSH_AUTH_SOCK
/private/tmp/com.apple.launchd.L311i5Nw5J/Listeners

At this point, there’s nothing more to do. With your ssh key added to the agent process, you’re back to not needing to type in a password (or passphrase), but without the risk of a non-encrypted private key stored permanently on disk.

Secure OpenSSH defaults

This is one part in a series on OpenSSH client configuration. Also read Elegant OpenSSH configuration and The SSH agent.

It’s good practice to harden our ssh client with some secure “defaults”. Starting your configuration file with the following directives will apply the directives to all (*) hosts.

(These are listed as multiple Host * stanzas, but they can be combined into a single stanza in your actual configuration file.)

If you prefer, follow along with an example of a complete ~/.ssh/config file.

Require secure algorithms

OpenSSH supports many encryption and authentication algorithms, but some of those algorithms are known to be weak to cryptographic attack. The Mozilla project publishes a list of recommended algorithms that exclude algorithms that are known to be insecure.

Host *
HostKeyAlgorithms ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,ssh-rsa,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp521,ecdsa-sha2-nistp384,ecdsa-sha2-nistp256
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,umac-128@openssh.com
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1

(More information on the available encryption and authentication algorithms, and how a recommended set is derived, is available in this fantastic blog post, “Secure secure shell.”)

Hash your known_hosts file

Every time you connect to an SSH server, your client caches a copy of the remote server’s host key in a ~/.ssh/known_hosts file. If your ssh client is ever compromised, this list can expose the remote servers to attack using your compromised credentials. Be a good citizen and hash your known hosts file.

Host *
HashKnownHosts yes

(Hash any existing entries in your ~/.ssh/known_hosts file by running ssh-keygen -H. Don’t forget to remove the backup ~/.ssh/known_hosts.old.)

$ ssh-keygen -H
$ rm -i ~/.ssh/known_hosts.old

No roaming

Finally, disable the experimental “roaming” feature to mitigate exposure to a pair of potential vulnerabilities, CVE-2016-0777 and CVE-2016-0778.

Host *
UseRoaming no

Dealing with insecure servers

Some servers are old enough that they may not support the newer, more secure algorithms listed above. In the RC environment, for example, the login and other Internet-accessible systems provide relatively modern ssh algorithms; but the hosts in the rc.int.colorado.edu domain may not.

To support connection to older hosts while requiring newer algorithms by default, override these settings earlier in the configuration file.

# Internal RC hosts are running an old version of OpenSSH
Match host=*.rc.int.colorado.edu
MACs hmac-sha1,umac-64@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96

Elegant OpenSSH configuration

This is one part in a series on OpenSSH client configuration. Also read Secure OpenSSH defaults and The SSH agent.

The OpenSSH client is very robust, very flexible, and very configurable. Many times I see people struggling to remember server-specific ssh flags or arcane, manual multi-hop procedures. I even see entire scripts written to automate the process.

But the vast majority of what you might want ssh to do can be abstracted away with some configuration in your ~/.ssh/config file.

All (or, at least, most) of these configuration directives are fully documented in the ssh_config manpage.

If you prefer, follow along with an example of a complete ~/.ssh/config file.

HostName

One of the first annoyances people have–and one of the first things people try to fix–when using a command-line ssh client is having to type in long hostnames. For example, the Research Computing login service is available at login.rc.colorado.edu.

$ ssh login.rc.colorado.edu

This particular name isn’t too bad; but coupled with usernames and especially when used as part of an scp, these fully-qualified domain names can become cumbersome.

$ scp -r /path/to/src/ user1234@login.rc.colorado.edu:dest/

OpenSSH supports host aliases through pattern-matching in Host directives.

Host login*.rc
HostName %h.colorado.edu

Host *.rc
HostName %h.int.colorado.edu

In this example, %h is substituted with the name specified on the command-line. With a configuration like this in place, connections to login.rc are directed to the full name login.rc.colorado.edu.

$ scp -r /path/to/src/ user1234@login.rc:dest/

Failing that, other references to hosts with a .rc suffix are directed to the internal Research Computing domain. (We’ll use these later.)

(The .rc domain segment could be moved from the Host pattern to the HostName value; but leaving it in the alias helps to distinguish the Research Computing login nodes from other login nodes that you may have access to. You can use arbitrary aliases in the Host directive, too; but then the %h substitution isn’t useful: you have to enumerate each targeted host.)
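
For example, a completely arbitrary alias (a hypothetical name) with its target enumerated explicitly:

Host rc
HostName login.rc.colorado.edu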

User

Unless you happen to use the same username on your local workstation as you have on the remote server, you likely specify a username using either the @ syntax or the -l argument to the ssh command.

$ ssh user1234@login.rc

As with specifying a fully-qualified domain name, tracking and specifying a different username for each remote host can become burdensome, especially during an scp operation. Record the correct username in your ~/.ssh/config file instead.

Match host=*.rc.colorado.edu,*.rc.int.colorado.edu
User user1234

Now all connections to Research Computing hosts use the specified username by default, without it having to be specified on the command-line.

$ scp -r /path/to/src/ login.rc:dest/

Note that we’re using a Match directive here, rather than a Host directive. The host= argument to Match matches against the derived hostname, so it reflects the real hostname as determined using the previous Host directives. (Make sure the correct HostName is established earlier in the configuration, though.)

ControlMaster

Even if the actual command is simple to type, authenticating to the host may require manual intervention. The Research Computing login nodes, for example, require two-factor authentication using a password or pin coupled with a one-time VASCO password or Duo credential. If you want to open multiple connections–or, again, copy files using scp–having to authenticate with multiple factors quickly becomes tedious. (Even having to type in a password at all may be unnecessary; but we’ll assume, as is the case with the Research Computing login example, that you can’t use public-key authentication.)

OpenSSH supports sharing a single network connection for multiple ssh sessions.

Match host=login.rc.colorado.edu
ControlMaster auto
ControlPath ~/.ssh/.socket_%h_%p_%r
ControlPersist 4h

With ControlMaster and ControlPath defined, the first ssh connection authenticates and establishes a session normally; but future connections join the active connection, bypassing the need to re-authenticate. The optional ControlPersist option causes this connection to remain active for a period of time even after the last session has been closed.

$ ssh login.rc
user1234@login.rc.colorado.edu's password:
[user1234@login01 ~]$ logout

$ ssh login.rc
[user1234@login01 ~]$

(Note that many arguments to the ssh command are effectively ignored after the initial connection is established. Notably, if X11 was not forwarded with -X or -Y during the first session, you cannot use the shared connection to forward X11 in a later session. In this case, use the -S none argument to ssh to ignore the existing connection and explicitly establish a new connection.)
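
For example, to bypass an existing shared connection and forward X11 over a fresh one:

$ ssh -S none -Y login.rc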

ProxyCommand

But what if you want to get to a host that isn’t directly available from your local workstation? The hosts in the rc.int.colorado.edu domain referenced above may be accessible from a local network connection; but if you are connecting from elsewhere on the Internet, you won’t be able to access them directly.

Except that OpenSSH provides the ProxyCommand option which, when coupled with the OpenSSH client presumed to be available on the intermediate server, supports arbitrary proxy connections through to remotely-accessible servers.

Match host=*.rc.int.colorado.edu
ProxyCommand ssh -W %h:%p login.rc.colorado.edu

Even though you can’t connect directly to Janus compute nodes from the Internet, for example, you can connect to them from a Research Computing login node; so this ProxyCommand configuration allows transparent access to hosts in the internal Research Computing domain.

$ ssh janus-compile1.rc
[user1234@janus-compile1 ~]$

And it even works with scp.

$ echo 'Hello, world!' >/tmp/hello.txt
$ scp /tmp/hello.txt janus-compile1.rc:/tmp
hello.txt                                     100%   14     0.0KB/s   00:00

$ ssh janus-compile1.rc cat /tmp/hello.txt
Hello, world!

Public-key authentication

If you tried the example above, chances are that you were met with an unexpected password prompt that didn’t accept any password that you used. That’s because most internal Research Computing hosts don’t actually support interactive authentication, two-factor or otherwise. Connections from a CURC login node are authorized by the login node; but a proxied connection must authenticate from your local client.

The best way to authenticate your local workstation to an internal CURC host is using public-key authentication.

If you don’t already have an SSH key, generate one now.

$ ssh-keygen -t rsa -b 4096 # if you don't already have a key

Now we have to copy the (new?) public key to the remote CURC ~/.ssh/authorized_keys file. RC provides a global home directory, so copying to any login node will do. Targeting a specific login node is useful, though: the ControlMaster configuration for login.rc.colorado.edu tends to confuse ssh-copy-id.

$ ssh-copy-id login01.rc

(The ssh-copy-id command doesn’t come with OS X, but there’s a third-party port available on GitHub. It’s usually available on a Linux system, too. Alternatively, you can just edit ~/.ssh/authorized_keys manually.)

User-selectable authentication methods using pam_authtok

Research Computing is in the process of migrating and expanding our authentication system to support additional authentication methods. Historically we’ve supported VASCO IDENTIKEY time-based one-time-password and pin to provide two-factor authentication.

$ ssh user1234@login.rc.colorado.edu
user1234@login.rc.colorado.edu's password: <pin><otp>

[user1234@login04 ~]$

But the VASCO tokens are expensive, get lost or left at home, have a battery that runs out, and have an internal clock that sometimes falls out-of-sync with the rest of the authentication system. For these and other reasons we’re provisioning most new accounts with Duo, which provides iOS and Android apps but also supports SMS and voice calls.

Unlike VASCO, Duo is only a single authentication factor; so we’ve also added support for upstream CU-Boulder campus password authentication to be used in tandem.

This means that we have to support both authentication mechanisms–VASCO and password+Duo–simultaneously. A naïve implementation might just stack these methods together.

auth sufficient pam_radius_auth.so try_first_pass # VASCO authenticates over RADIUS
auth requisite  pam_krb5.so try_first_pass # CU-Boulder campus password
auth required   pam_duo.so

This generally works: VASCO authentication is attempted first over RADIUS. If that fails, authentication is attempted against the campus password and, if that succeeds, against Duo.

Unfortunately, this generates spurious authentication failures in VASCO when using Duo to authenticate: the VASCO method fails, then Duo authentication is attempted. Users who have both VASCO and Duo accounts (e.g., all administrators) may generate enough failures to trigger the break-in mitigation security system, and the VASCO account may be disabled. This same issue exists if we reverse the authentication order to try Duo first, then VASCO: VASCO users might then cause their campus passwords to become disabled.

Instead, we need to enable users to explicitly specify which authentication method they’re using.

Separate sssd domains

Our first attempt to provide explicit access to different authentication methods was to provide multiple redundant sssd domains.

[domain/rc]
description = Research Computing
proxy_pam_target = curc-twofactor-vasco


[domain/duo]
description = Research Computing (identikey+duo authentication)
enumerate = false
proxy_pam_target = curc-twofactor-duo

This allows users to log in normally using VASCO, while password+Duo authentication can be requested explicitly by logging in as ${user}@duo.

$ ssh -l user1234@duo login.rc.colorado.edu

This works well enough for the common case of shell access over SSH: login is permitted and, since both the default rc domain and the duo alias domain are both backed by the same LDAP directory, NSS sees no important difference once a user is logged in using either method.

This works because POSIX systems store the uid number returned by PAM and NSS, and generally resolve the uid number to the username on-demand. Not all systems work this way, however. For example, when we attempted to use this authentication mechanism to authenticate to our prototype JupyterHub (web) service, jobs dispatched to Slurm retained the ${user}@duo username format. Slurm also uses usernames internally, and the ${user}@duo username is not populated within Slurm: only the base ${user} username.

Expecting that we would continue to find more unexpected side-effects of this implementation, we started to look for an alternative mechanism that doesn’t modify the specified username.

pam_authtok

In general, a user provides two pieces of information during authentication: a username (which we’ve already determined we shouldn’t modify) and an authentication token or password. We should be able to detect, for example, a prefix to that authentication token to determine what authentication method to use.

$ ssh user1234@login.rc.colorado.edu
user1234@login.rc.colorado.edu's password: duo:<password>

[user1234@login04 ~]$

But we found no such pam module that would allow us to manipulate the authentication token… so we wrote one.

auth [success=1 default=ignore] pam_authtok.so prefix=duo: strip prompt=password:

auth [success=done new_authtok_reqd=done default=die] pam_radius_auth.so try_first_pass

auth requisite pam_krb5.so try_first_pass
auth [success=done new_authtok_reqd=done default=die] pam_duo.so

Now our PAM stack authenticates against VASCO by default; but, if the user provides a password with a duo: prefix, authentication skips VASCO and authenticates the supplied password, followed by Duo push. Our actual production PAM stack is a bit more complicated, supporting a redundant vasco: prefix as well, for forward-compatibility should we change the default authentication mechanism in the future. We can also extend this mechanism to add arbitrary additional authentication mechanisms in the future.

Two software design methods

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

–C.A.R. Hoare, The 1980 ACM Turing Award Lecture

Why hasn’t my (Slurm) job started?

A job can be blocked from being scheduled for the following reasons:

  • There are insufficient resources available to start the job, either due to active reservations, other running jobs, component status, or system/partition size.

  • Other higher-priority jobs are waiting to run, and the job’s time limit prevents it from being backfilled.

  • The job’s time limit extends into an upcoming reservation (e.g., scheduled preventative maintenance).

  • The job is associated with an account that has reached or exceeded its GrpCPUMins.

Display a list of queued jobs sorted in the order considered by the scheduler using squeue.

squeue --sort=-p,i --priority --format '%7T %7A %10a %5D %.12L %10P %10S %20r'

Reason codes

A list of reason codes [1] is available as part of the squeue manpage. [2]

Common reason codes:

  • ReqNodeNotAvail

  • AssocGrpJobsLimit

  • AssocGrpCPUMinsLimit

  • Resources

  • QOSResourceLimit

  • Priority

  • AssociationJobLimit

  • JobHeldAdmin

How are jobs prioritized?

PriorityType=priority/multifactor

Slurm prioritizes jobs using the multifactor plugin [3] based on a weighted summation of age, size, QOS, and fair-share factors.

Use the sprio command to inspect each weighted priority value separately.

sprio [-j jobid]

Age Factor

PriorityWeightAge=1000
PriorityMaxAge=14-0

The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster’s current limits.

The weighted age priority is calculated as PriorityWeightAge[1000]*[0..1] as the job age approaches PriorityMaxAge[14-0], or 14 days. As such, an hour of wait-time is equivalent to ~2.976 priority (1000 / (14 days * 24 hours) = 1000/336).

Job Size Factor

PriorityWeightJobSize=2000

The job size factor correlates to the number of nodes or CPUs the job has requested. The weighted job size priority is calculated as PriorityWeightJobSize[2000]*[0..1] as the job size approaches the entire size of the system. A job that requests all the nodes on the machine will get a job size factor of 1.0, with an effective weighted job size priority of 28 wait-days (except that job age priority is capped at 14 days).

Quality of Service (QOS) Factor

PriorityWeightQOS=1500

Each QOS can be assigned a priority: the larger the number, the greater the job priority will be for jobs that request this QOS. This priority value is then normalized to the highest priority of all the QOS’s to become the QOS factor. As such, the weighted QOS priority is calculated as PriorityWeightQOS[1500]*QosPriority[0..1000]/MAX(QOSPriority[1000]).

QOS          Priority  Weighted priority  Wait-days equivalent
-----------  --------  -----------------  --------------------
admin            1000               1500                  21.0
janus               0                  0                   0.0
janus-debug       400                600                   8.4
janus-long        200                300                   4.2

Fair-share factor

PriorityWeightFairshare=2000
PriorityDecayHalfLife=14-0

The fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

The simplified formula for calculating the fair-share factor for usage that spans multiple time periods and subject to a half-life decay is:

F = 2**(-NormalizedUsage/NormalizedShares)

Each account is granted an equal share, and historic records of use decay with a half-life of 14 days. As such, the weighted fair-share priority is calculated as PriorityWeightFairshare[2000]*[0..1] depending on the account’s historic use of the system relative to its allocated share.

A fair-share factor of 0.5 indicates that the account’s jobs have used exactly the portion of the machine that they have been allocated, and assigns the job an additional 1000 priority (the equivalent of ~336 wait-hours, or 14 days). A fair-share factor above 0.5 indicates that the account’s jobs have consumed less than their allocated share and assigns the job up to 2000 additional priority, for an effective relative 14 wait-day priority boost. A fair-share factor below 0.5 indicates that the account’s jobs have consumed more than their allocated share of the computing resources, and the added priority will approach 0 depending on the account’s history relative to its equal share of the system, for an effective relative 14 wait-day priority penalty.
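
To inspect the values that feed this calculation, Slurm provides the sshare command, which reports per-account shares, usage, and the resulting fair-share factor:

$ sshare -a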

The curc::sysconfig::scinet Puppet module

I’ve been working on a new module, curc::sysconfig::scinet, which will generally do the Right Thing™ when configuring a host on the CURC science network, with as little configuration as possible.

Let’s look at some examples.

login nodes

class { 'curc::sysconfig::scinet':
  location => 'comp',
  mgt_if   => 'eth0',
  dmz_if   => 'eth1',
  notify   => Class['network'],
}

This is the config used on a new-style login node like login05 and login07. (What makes them new-style? Mostly just that they’ve had their interfaces cleaned up to use eth0 for “mgt” and eth1 for “dmz”.)

Here’s the routing table that this produced on login07:

$ ip route list
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32
10.225.128.0/24 via 10.225.160.1 dev eth0
192.12.246.0/24 dev eth1  proto kernel  scope link  src 192.12.246.39
10.225.0.0/20 via 10.225.160.1 dev eth0
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110
default via 192.12.246.1 dev eth1  metric 100
default via 10.225.160.1 dev eth0  metric 110

Connections to “mgt” subnets use the “mgt” interface eth0, either by the link-local route or the static routes via comp-mgt-gw (10.225.160.1). Connections to the “general” subnet (a.k.a. “vlan 2049”), as well as the rest of the science network (“data” and “svc” networks) also use eth0 by static route. The default eth0 route is configured by DHCP, but the interface has a default metric of 110, so it doesn’t conflict with or supersede eth1’s default route, which is configured with a lower metric of 100.

Speaking of eth1, the “dmz” interface is configured statically, using information retrieved from DNS by Puppet.

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1
TYPE=Ethernet
DEVICE=eth1
BOOTPROTO=static
HWADDR=00:50:56:88:2E:36
ONBOOT=yes
IPADDR=192.12.246.39
NETMASK=255.255.255.0
GATEWAY=192.12.246.1
METRIC=100
IPV4_ROUTE_METRIC=100

Usually the routing priority of the “dmz” interface would mean that inbound connections to the “mgt” interface from outside of the science network would be blocked when the “dmz”-bound response is filtered by rp_filter; but curc::sysconfig::scinet also configures routing policy for eth0, so traffic on that interface always returns from that interface.

$ ip rule show | grep 'lookup 1'
32764:  from 10.225.160.32 lookup 1
32765:  from all iif eth0 lookup 1

$ ip route list table 1
default via 10.225.160.1 dev eth0

This allows me to ping login07.rc.int.colorado.edu from my office workstation.

$ ping -c 1 login07.rc.int.colorado.edu
PING login07.rc.int.colorado.edu (10.225.160.32) 56(84) bytes of data.
64 bytes from 10.225.160.32: icmp_seq=1 ttl=62 time=0.507 ms

--- login07.rc.int.colorado.edu ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.507/0.507/0.507/0.000 ms

Because the default route for eth0 is actually configured, outbound routing from login07 is resilient to failure of the “dmz” link.

# ip route list | grep -v eth1
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32
10.225.128.0/24 via 10.225.160.1 dev eth0
10.225.0.0/20 via 10.225.160.1 dev eth0
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110
default via 10.225.160.1 dev eth0  metric 110

Traffic destined to leave the science network simply proceeds to the next preferred (and, in this case, only remaining) default route, comp-mgt-gw.

DHCP, DNS, and the FQDN

Tangentially, it’s important to note that the DHCP configuration of eth0 will tend to re-write /etc/resolv.conf and the search path it defines, with the effect of causing the FQDN of the host to change to login07.rc.int.colorado.edu. Because login nodes are logically (and historically) external hosts, not internal hosts, they should prefer their external identity to their internal identity. As such, we override the domain search path on login nodes to cause them to discover their rc.colorado.edu FQDNs first.

# cat /etc/dhcp/dhclient-eth0.conf
supersede domain-search "rc.colorado.edu", "rc.int.colorado.edu";

PetaLibrary/repl

The PetaLibrary/repl GPFS NSD nodes replnsd{01,02} are still in the “COMP” datacenter, but only attach to “mgt” and “data” networks.

class { 'curc::sysconfig::scinet':
  location         => 'comp',
  mgt_if           => 'eno2',
  data_if          => 'enp17s0f0',
  other_data_rules => [ 'from 10.225.176.61 table 2',
                        'from 10.225.176.62 table 2',
                        ],
  notify           => Class['network_manager::service'],
}

This config produces the following routing table on replnsd01:

$ ip route list
default via 10.225.160.1 dev eno2  proto static  metric 110
default via 10.225.176.1 dev enp17s0f0  proto static  metric 120
10.128.0.0/12 via 10.225.160.1 dev eno2  metric 110
10.128.0.0/12 via 10.225.176.1 dev enp17s0f0  metric 120
10.225.0.0/20 via 10.225.160.1 dev eno2
10.225.0.0/16 via 10.225.160.1 dev eno2  metric 110
10.225.0.0/16 via 10.225.176.1 dev enp17s0f0  metric 120
10.225.64.0/20 via 10.225.176.1 dev enp17s0f0
10.225.128.0/24 via 10.225.160.1 dev eno2
10.225.144.0/24 via 10.225.176.1 dev enp17s0f0
10.225.160.0/24 dev eno2  proto kernel  scope link  src 10.225.160.59  metric 110
10.225.160.49 via 10.225.176.1 dev enp17s0f0  proto dhcp  metric 120
10.225.176.0/24 dev enp17s0f0  proto kernel  scope link  src 10.225.176.59  metric 120

…with the expected interface-consistent policy-targeted routing tables.

$ ip route list table 1
default via 10.225.160.1 dev eno2

$ ip route list table 2
default via 10.225.176.1 dev enp17s0f0

Static routes for “mgt” and “data” subnets are defined for their respective interfaces. As on the login nodes above, default routes are specified for both interfaces as well, with the lower-metric “mgt” interface eno2 being preferred. (This is configurable using the mgt_metric and data_metric parameters.)

Perhaps the most notable aspect of the PetaLibrary/repl network config is the provisioning of the GPFS CES floating IP addresses 10.225.176.{61,62}. These addresses are added to the enp17s0f0 interface dynamically by GPFS, and are not defined with curc::sysconfig::scinet; but the config must reference these addresses to implement proper interface-consistent policy-targeted routing tables. The version of Puppet deployed at CURC lacks the semantics to infer these rules from a more semantic data_ip parameter, so the other_data_rules parameter is used instead.

other_data_rules => [ 'from 10.225.176.61 table 2',
                      'from 10.225.176.62 table 2',
                      ],

Blanca/ICS login node

Porting the Blanca login node would be great because it has “dmz”, “mgt”, and “data” interfaces, so it would exercise the full gamut of the module’s features.
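
A hypothetical config for such a node might exercise all three interface parameters at once (an untested sketch; interface names are placeholders):

class { 'curc::sysconfig::scinet':
  location => 'comp',
  mgt_if   => 'eth0',
  data_if  => 'eth1',
  dmz_if   => 'eth2',
  notify   => Class['network'],
}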

Linux policy-based routing

How could Linux policy routing be so poorly documented? It’s so useful, so essential in a multi-homed environment… I’d almost advocate for its inclusion as default behavior.

What is this, you ask? To understand, we have to start with what Linux does by default in a multi-homed environment. So let’s look at one.

$ ip addr
[...]
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 78:2b:cb:66:75:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.128.80/24 brd 10.225.128.255 scope global eth2
    inet6 fe80::7a2b:cbff:fe66:75c0/64 scope link
       valid_lft forever preferred_lft forever
[...]
6: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether e4:1d:2d:14:93:60 brd ff:ff:ff:ff:ff:ff
    inet 10.225.144.80/24 brd 10.225.144.255 scope global eth5
    inet6 fe80::e61d:2dff:fe14:9360/64 scope link
       valid_lft forever preferred_lft forever

So we have two interfaces, eth2 and eth5. They’re on separate subnets, 10.225.128.0/24 and 10.225.144.0/24 respectively. In our environment, we refer to these as “spsc-mgt” and “spsc-data.” The practical circumstance is that one of these networks is faster than the other, and we would like bulk data transfer to use the faster “spsc-data” network.

If the client system also has an “spsc-data” network, everything is fine. The client addresses the system using its data address, and the link-local route prefers the data network.

$ ip route list 10.225.144.0/24
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80

Our network environment covers a number of networks, however. So let’s say our client lives in another data network–“comp-data.” Infrastructure routing directs the traffic to the -data interface of our server correctly, but the default route on the server prefers the -mgt interface.

$ ip route list | grep ^default
default via 10.225.128.1 dev eth2

For this simple case we have two options. We can either change our default route to prefer the -data interface, or we can enumerate intended -data client networks with static routes using the data interface. Since changing the default route simply leaves us in the same situation for the -mgt network, let’s define some static routes.

$ ip route add 10.225.64.0/20 via 10.225.144.1 dev eth5
$ ip route add 10.225.176.0/24 via 10.225.144.1 dev eth5

So long as we can enumerate the networks that should always use the -data interface of our server to communicate, this basically works. But what if we want to support clients that don’t themselves have separate -mgt and -data networks? What if we have a single client–perhaps with only a -mgt network connection–that should be able to communicate individually with the server’s -mgt interface and its -data interface? In the most pathological case, what if we have a host that is only connected to the spsc-mgt (10.225.128.0/24) network, but we want that client to be able to communicate with the server’s -data interface? In this case, the link-local route will always prefer the -mgt network for the return path.

Policy-based routing

The best case would be to have the server select an outbound route based not on a static configuration, but in response to the incoming path of the traffic. This is the feature enabled by policy-based routing.

Linux policy routing allows us to define distinct and isolated routing tables, and then select the appropriate routing table based on the traffic context. In this situation, we have three different routing contexts to consider. The first of these are the routes to use when the server initiates communication.

$ ip route list table main
10.225.128.0/24 dev eth2  proto kernel  scope link  src 10.225.128.80
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.128.1 dev eth2

A separate routing table defines routes to use when responding to traffic from the -mgt interface.

$ ip route list table 1
default via 10.225.128.1 dev eth2

The last routing table defines routes to use when responding to traffic from the -data interface.

$ ip route list table 2
default via 10.225.144.1 dev eth5

With these separate routing tables defined, the last step is to define the rules that select the correct routing table.

$ ip rule list
0:  from all lookup local
32762:  from 10.225.144.80 lookup 2
32763:  from all iif eth5 lookup 2
32764:  from 10.225.128.80 lookup 1
32765:  from all iif eth2 lookup 1
32766:  from all lookup main
32767:  from all lookup default
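
These tables and rules can be created at runtime with the ip command; a sketch matching the configuration above:

$ ip route add default via 10.225.128.1 dev eth2 table 1
$ ip route add default via 10.225.144.1 dev eth5 table 2
$ ip rule add iif eth2 lookup 1
$ ip rule add from 10.225.128.80 lookup 1
$ ip rule add iif eth5 lookup 2
$ ip rule add from 10.225.144.80 lookup 2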

Despite a lack of documentation, all of these rules may be codified in Red Hat “sysconfig”-style “network-scripts” using interface-specific route- and rule- files.

$ cat /etc/sysconfig/network-scripts/route-eth2
default via 10.225.128.1 dev eth2
default via 10.225.128.1 dev eth2 table 1

$ cat /etc/sysconfig/network-scripts/route-eth5
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.144.1 dev eth5 table 2

$ cat /etc/sysconfig/network-scripts/rule-eth2
iif eth2 table 1
from 10.225.128.80 table 1

$ cat /etc/sysconfig/network-scripts/rule-eth5
iif eth5 table 2
from 10.225.144.80 table 2

Changes to the RPDB made with these commands do not become active immediately. It is assumed that after a script finishes a batch of updates, it flushes the routing cache with ip route flush cache.
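
That is, after a batch of manual changes like those above:

$ ip route flush cache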

References