I'm really excited about Warewulf 4! I'm also really excited about OpenHPC, Apptainer, and containerizing HPC workloads, particularly MPI. Today I presented my most recent work in these areas in the CIQ webinar.
I had the pleasure of participating in my first CIQ webinar today! Check it out if you'd like to learn a bit about Apptainer's support for cryptographic signatures, using well-established PGP infrastructure and paradigms.
The Research Computing authentication path is more complex than I'd like.
We start with pam_sss, which, of course, authenticates against sssd.
Because we have users from multiple home institutions, both internal and external, sssd is configured with multiple domains.
Two of our configured domains authenticate against Duo and Active Directory. To support this we run two discrete instances of the Duo authentication proxy, one for each domain.
The Duo authentication proxy can present either an LDAP or RADIUS interface; we went with RADIUS. So sssd is configured with auth_provider = proxy, with a discrete PAM stack for each domain. This PAM stack uses pam_radius to authenticate against the correct Duo authentication proxy.
The relevant Duo authentication proxy then performs AD authentication to the relevant authoritative domain and, on success, performs Duo authentication for second factor.
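As a rough sketch of how these pieces fit together (the domain name, PAM target name, and file paths here are illustrative, not our production values):

```ini
# /etc/sssd/sssd.conf (excerpt) -- hypothetical domain
[domain/example.edu]
id_provider = ldap
auth_provider = proxy
proxy_pam_target = duo-example        ; discrete PAM stack for this domain

# /etc/pam.d/duo-example -- authenticates via RADIUS against this
# domain's Duo authentication proxy
# auth required pam_radius_auth.so
```

With one such PAM target per domain, each domain's authentication is routed through its own Duo authentication proxy.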
All of this technically works, and has been working for some time. However, we've increasingly seen a bug in sssd's proxy authentication provider, which manifests as incorrect monitoring or management of authentication threads.
[sssd[be[rc.colorado.edu]]] [dp_attach_req] (0x0400): Number of active DP request: 32
sssd maintains a number of pre-forked children for performing this proxy authentication. This defaults to 10 threads, and is configurable as proxy_max_children. Somewhere in sssd a bug exists that either prevents threads from being closed properly or fails to decrement the active thread count when they are closed. When the "Number of active DP request" count exceeds proxy_max_children, sssd will no longer perform authentication for the affected domain.
We have reported this issue to Red Hat, but eight months on we still don't have a fix. Meanwhile, I'm interested in simplifying our authentication path, hopefully removing the proxy provider from our configuration in the process, and making sssd optional for authentication in our environment.
We use 389 Directory Server as our local LDAP server. 389 includes the capability to proxy authentication via PAM. A previous generation of the RC LDAP service used this to perform authentication, but only in a way that supported a single authentication path. However, with some research and experimentation, we have managed to configure our instance with different proxy authentication paths for each of our child domains.
First we simply activate the PAM Pass Through Auth plugin by setting nsslapd-pluginEnabled: on in the existing LDAP entry.
dn: cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject
objectClass: pamConfig
cn: PAM Pass Through Auth
nsslapd-pluginPath: libpam-passthru-plugin
nsslapd-pluginInitfunc: pam_passthruauth_init
nsslapd-pluginType: betxnpreoperation
nsslapd-pluginEnabled: on
nsslapd-pluginloadglobal: true
nsslapd-plugin-depends-on-type: database
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: ldapserver
nsslapd-pluginId: pam_passthruauth
nsslapd-pluginVersion: 22.214.171.124
nsslapd-pluginVendor: 389 Project
nsslapd-pluginDescription: PAM pass through authentication plugin
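If you drive the directory from the command line, toggling the plugin can be done with ldapmodify. This is a sketch; it assumes Directory Manager credentials and a local server, and note that 389 generally requires a restart of the dirsrv instance for plugin changes to take effect.

```shell
ldapmodify -H ldap://localhost -D "cn=Directory Manager" -W <<EOF
dn: cn=PAM Pass Through Auth,cn=plugins,cn=config
changetype: modify
replace: nsslapd-pluginEnabled
nsslapd-pluginEnabled: on
EOF
```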
The specifics of authentication can be specified at this level as well, if we're able to express our desired behavior in a single configuration. However, the plugin supports multiple simultaneous configurations expressed as nested LDAP entries.
dn: cn=colorado.edu PAM,cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: pamConfig
objectClass: top
cn: colorado.edu PAM
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN ENTRY
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: curc-twofactor-duo
pamFilter: (&(objectClass=posixAccount)(!(homeDirectory=/home/*@*)))

dn: cn=colostate.edu PAM,cn=PAM Pass Through Auth,cn=plugins,cn=config
objectClass: pamConfig
objectClass: top
cn: colostate.edu PAM
pamMissingSuffix: ALLOW
pamExcludeSuffix: cn=config
pamIDMapMethod: RDN ENTRY
pamIDAttr: uid
pamFallback: FALSE
pamSecure: TRUE
pamService: csu
pamFilter: (&(objectClass=posixAccount)(homeDirectory=/home/*@colostate.edu))
Our two sets of users are authenticated using different PAM stacks, as before. Only now this proxy authentication is happening within the LDAP server, rather than within sssd. This may seem like a small difference, but there are multiple benefits:
The proxy configuration exists, and need be maintained, only within the LDAP server. It does not require all login nodes to run sssd with a complex, multi-tiered PAM stack.
The LDAP "PAM Pass Through Auth" plugin does not have the same bug as the sssd proxy authentication provider, bypassing our immediate problem.
Applications that do not support PAM authentication, such as XDMoD, Foreman, and Grafana, can now be configured with simple LDAP authentication, and need not know anything of the complexity of authenticating our multiple domains.
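One way to check this end-to-end is a simple bind with ldapwhoami, which exercises the same pass-through path an application would. This is a sketch; the uid shown is a hypothetical placeholder.

```shell
# A successful simple bind here means the directory server proxied the
# password through the matching PAM stack (and, transitively, Duo).
ldapwhoami -x -ZZ -H ldap://ldap.rc.int.colorado.edu \
    -D uid=someuser,ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu -W
```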
For now I'm differentiating our different user types based on the name of their home directory, because it happens to include the relevant domain suffix. In the future we expect to update usernames in the directory to match, and would then likely update this configuration to filter on the username instead.
Cleaning up a few remaining issues
However, when I first tied this back into sssd, I DOS'd our LDAP server.
[domain/rc.colorado.edu]
debug_level = 3
description = CU Boulder Research Computing
id_provider = ldap
auth_provider = ldap
chpass_provider = none
min_id = 1000
enumerate = false
entry_cache_timeout = 300
ldap_id_use_start_tls = True
ldap_tls_reqcert = allow
ldap_uri = ldap://ldap.rc.int.colorado.edu
ldap_search_base = dc=rc,dc=int,dc=colorado,dc=edu
ldap_user_search_base = ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu
ldap_group_search_base = ou=UCB,ou=Groups,dc=rc,dc=int,dc=colorado,dc=edu
This seemed simple enough: when I would try to authenticate using this
configuration, I would enter my password as usual and then respond to
a Duo "push." But the authentication never cleared in sssd, and I
would keep receiving Duo pushes until I stopped sssd. This despite the
fact that I could authenticate with
ldapsearch as expected.
$ ldapsearch -LLL -x -ZZ -D uid=[redacted],ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu -W '(uid=[redacted])' dn
Enter LDAP Password:
dn: uid=[redacted],ou=UCB,ou=People,dc=rc,dc=int,dc=colorado,dc=edu
I eventually discovered that sssd has a six-second timeout for "calls to synchronous LDAP APIs," including BIND. This timeout is entirely reasonable--even generous--for operations that do not have a manual intervention component. But when BIND includes time to send a notification to a phone, unlock the phone, and acknowledge the notification in an app, it is easy to exceed this timeout. sssd gives up and tries again, prompting a new push that won't be received until the first is addressed. In this way, each timeout just triggers another push, and they pile up behind one another.
Thankfully, this timeout is also configurable in the relevant sssd domain section. I went with 90 seconds, which is likely longer than anyone will need.
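For reference, a sketch of the resulting domain section. The option name here is my reading of the sssd-ldap documentation for the synchronous-LDAP-API timeout, so verify it against your sssd version:

```ini
[domain/rc.colorado.edu]
; ...
; Allow time for the out-of-band Duo push during BIND.
; (Assumes ldap_opt_timeout is the option governing this timeout.)
ldap_opt_timeout = 90
```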
There is still the matter of the fact that this DOS'd the LDAP server, however. I suspect I had exhausted the number of directory server threads with pending, long-lived (due to the manual intervention required, or the timeout) BIND requests.
The number of threads Directory Server uses to handle simultaneous connections affects the performance of the server. For example, if all threads are busy handling time-consuming tasks (such as add operations), new incoming connections are queued until a free thread can process the request.
Red Hat suggests that nsslapd-threadnumber should be 32 for an eight-CPU system like ours; so for now I simply increased it from 16 to the recommended 32. If we continue to experience thread exhaustion in real-world use, we can always increase the number of threads again.
I did some work today figuring out how BeeGFS actually writes its data to disk. I shudder to think that we’d actually use this knowledge; but I still found it interesting, so I want to share.
First, I created a simple striped file in the rcops allocation.
[root@boss2 rcops]# beegfs-ctl --createfile testfile --numtargets=2 --storagepoolid=2
Operation succeeded.
This file will stripe across two targets (chosen by BeeGFS at random) and uses the default 1M chunksize for the rcops storage pool. You can see this with beegfs-ctl --getentryinfo:
[root@boss2 rcops]# beegfs-ctl --getentryinfo /mnt/beegfs/rcops/testfile --verbose
EntryID: 9-5F7E8E87-1
Metadata buddy group: 1
Current primary metadata node: bmds1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 2; actual: 2
+ Storage targets:
  + 826 @ boss1 [ID: 1]
  + 834 @ boss2 [ID: 2]
Chunk path: uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1
Dentry path: 50/4/0-5BEDEB51-1/
I write an easily-recognized dataset to the file: first 1M of A, then 1M of B, and so on.
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("A"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("B"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("C"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("D"*(1024*1024))' >>testfile
This gives me a 4M file, precisely 1024*1024*4=4194304 bytes.
[root@boss2 rcops]# du --bytes --apparent-size testfile
4194304 testfile
Those two chunk files, as identified by the chunk path above, live under each storage target’s data directory. (The target on boss1 doesn’t have a storage directory in its path as part of an experiment to see how difficult it would be to remove them. I guess we never put it back.) Target 826, is first in the list, so that’s where the file starts.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
AAAAA
If we skip 1M (1024*1024 bytes) we see that that’s where the next chunk stored on this target begins.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024))) count=5 status=none
CCCCC
And we can see that that actually is precisely where it starts by stepping back a little.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024)-2)) count=5 status=none
AACCC
Cool. So we’ve found the end of the first chunk (made of A) and the start of the third chunk (made of C). That means the second and fourth chunks are over in 834. Which they are.
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
BBBBB
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 skip=$(((1024*1024-2))) status=none
BBDDD
So, in theory, if we wanted to bypass BeeGFS and re-construct files from their chunks, we could do that. It sounds like a nightmare, but we could do it. In a worst-case scenario.
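Such a re-assembly can be sketched out entirely with standard tools. This is a toy demonstration, not a recovery procedure: the chunk file names are local stand-ins for the real per-target paths above, and the two chunks are simulated rather than copied off storage targets.

```shell
# Sketch: re-assemble a RAID0-striped file from its two chunk files.
chunksize=$((1024 * 1024))

# Simulate the two chunk files from the example: target 826 holds
# stripes 0 and 2 (A, C); target 834 holds stripes 1 and 3 (B, D).
{
    head -c "$chunksize" /dev/zero | tr '\0' 'A'
    head -c "$chunksize" /dev/zero | tr '\0' 'C'
} > chunk-826
{
    head -c "$chunksize" /dev/zero | tr '\0' 'B'
    head -c "$chunksize" /dev/zero | tr '\0' 'D'
} > chunk-834

# RAID0 re-assembly: take one chunksize block from each target in turn.
: > reassembled
for stripe in 0 1; do
    for chunk in chunk-826 chunk-834; do
        dd if="$chunk" bs="$chunksize" skip="$stripe" count=1 status=none >> reassembled
    done
done
```

The result is the original 4M file: 1M each of A, B, C, and D, in order.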
It’s this kind of transparency and inspectability that still makes me really like BeeGFS, despite everything we’ve been through with it.
Recently I fell victim to an attack on a security vulnerability in SaltStack that left much of my homelab infected with cryptominers. When I rebuilt the environment I found myself in the market for a VPN solution.
I have used OpenVPN for a little while, but I found it inconvenient enough to set up and use that I only used it when absolutely necessary to bridge between otherwise private networks.
But I had been hearing good things about WireGuard, so I performed a test deployment. First between two disparate servers. Then on a workstation. Then another. Each time the software deployed easily and remained reliably available, particularly in contrast to the unreliability I had become accustomed to with the Cisco VPN I use for work.
So I came to the last system in my network: a first-generation Raspberry Pi B+. WireGuard isn't available in the Raspberry Pi OS (née Raspbian) repository, but I found articles describing how to install the packages from either Debian backports or unstable. I generally avoid mixing distributions, but I followed the directions as proof of concept.
The wireguard package installed successfully, and little surprise: it is a DKMS package, after all. However, binaries from wireguard-tools immediately segfaulted. (I expect this is because the CPU in the first-generation B+ isn't supported by Debian.)
But then I realized that APT makes source repositories as accessible as binary repositories, and compiling my own WireGuard packages would worry me less as well:
First add the Debian Buster backports repository, including its signing key. (You can verify the key fingerprint at debian.org.)
Then install the devscripts package (so we can use debuild to build the WireGuard packages) and any build dependencies for WireGuard.
Finally, download, build, and install WireGuard.
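The steps above can be sketched roughly as follows. This is an outline under assumptions (Buster backports as the source, default package names); consult backports.debian.org for the repository's signing key, which must be imported before apt will trust it.

```shell
# Add Buster backports, both binary and source repositories.
echo 'deb http://deb.debian.org/debian buster-backports main' \
    | sudo tee /etc/apt/sources.list.d/buster-backports.list
echo 'deb-src http://deb.debian.org/debian buster-backports main' \
    | sudo tee -a /etc/apt/sources.list.d/buster-backports.list
sudo apt-get update

# devscripts provides debuild; build-dep pulls in WireGuard's build deps.
sudo apt-get install devscripts
sudo apt-get build-dep -t buster-backports wireguard

# Fetch the source package, build unsigned packages, and install them.
apt-get source -t buster-backports wireguard
(cd wireguard-*/ && debuild -us -uc)
sudo dpkg -i wireguard*.deb
```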
At this point you should have a fully functional WireGuard deployment.
Research Computing team goals for the period 18 February - 3 March, 2020. If you have any questions or comments please contact email@example.com.
Intro to Python workshop
Research Computing is presenting its regular Intro to Python course.
RCAMP portal testing framework
The RC Account Management Portal (RCAMP) handles account requests and group membership in the RC environment. In order to help us better update and develop the portal and its dependencies we are rebuilding and enhancing its automated test infrastructure.
Internal training for upcoming CC* hybrid cloud environment
RC is developing a hybrid "cloud" environment with support from the NSF Campus Cyberinfrastructure (CC*) program. Development of this environment is ongoing; but our team is also taking this time to learn more about Amazon EC2 and OpenStack virtual machines in order to better support our users when the platform is ready.
Better staff access to fail2ban on login nodes
RC login nodes are protected from brute-force attacks using fail2ban: if a login node sees a sequence of login failures from the same source, that source is "banned" from all login node access for a period of time. During a training, however, when such authentication failures are common from multiple people in the same room, it is inconvenient to wait for the ban to expire. RC system administrators have the ability to cancel such a ban, but they are not usually present at trainings. To better support this use case, we will be delegating the ability to cancel such bans to the rest of the RC team.
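One way such a delegation could look (a sketch; the jail name, group name, and address are illustrative, not our actual configuration):

```shell
# Lift a ban manually with fail2ban-client:
sudo fail2ban-client set sshd unbanip 192.0.2.10

# /etc/sudoers.d/fail2ban-unban -- delegate only the unban subcommand
# to a (hypothetical) rc-staff group:
# %rc-staff ALL = (root) /usr/bin/fail2ban-client set * unbanip *
```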
PetaLibrary monthly status reports
A monthly email status report is sent out to PetaLibrary allocation owners and contacts; but this report has fallen out of date, and has not been updated to reflect changes in the PetaLibrary infrastructure. We are updating this reporting script so that all PetaLibrary allocations are reported, irrespective of their deployment location.
Updated MPI in rebuilt Core Software
Our efforts to update our core software stack are ongoing, with our next goal being to install up-to-date Intel MPI and OpenMPI.
RC trainings review
Finally, to better plan future RC trainings and other user support activities, we are reviewing the trainings, office hours, and consults that we've supported in CY2019.
I really only love God as much as I love the person I love the least.
~ Dorothy Day, journalist