CU Research Computing

Tech. Team weekly review for 7 August, 2017

This week at the University of Colorado Research Computing Tech. Team...

Final Deployment of HPCF UPS

The HPCF UPS has been completely installed and deployed, and Summit should now be protected from upstream power irregularities by the full force of an array of lithium-ion batteries!

There's still work to do in the datacenter, but it's largely cleanup of hot-aisle containment and remaining power-distribution work in the north pod, neither of which should affect production for Summit.

It's been a lot of work, and the datacenter operations team and their contractors aren't quite done yet; but we shouldn't need to take any more outages. Thanks to our entire user community for their patience!

XSEDE SSO hub authentication progress

Work continues on the effort to provide access to Summit via the XSEDE SSO hub. We've successfully deployed AMIE (the XSEDE local account and accounting transaction transport), and are now working to configure a set of login nodes that would support the required GSI-SSH authentication.

The XSEDE SSO hub is expected to be the mechanism by which members of the greater RMACC community may access Summit.

PetaLibrary/2 RFP progress

We had the optional pre-bid call for the ongoing PetaLibrary/2 RFP on Friday, and saw encouraging supplier engagement and some good question-and-answer. Notable results of the session will be added to the RFP solicitation as formal "question and answer" entries.

https://bids.sciquest.com/apps/Router/PublicEvent?CustomerOrg=Colorado

Tech. Team weekly review for 31 July, 2017

This week at the University of Colorado Research Computing Tech. Team...

New HPC Storage Admin, Patricia Egeland

We're bringing on a new team member, Patricia Egeland, as HPC Storage Administrator, starting Tuesday. Patricia will share general system administration duties with the other RC tech. team operational staff; but will carry primary responsibility for RC data storage systems, notably RC Core Storage and the PetaLibrary. She also plans to contribute to ongoing and upcoming software development efforts at RC, and we're looking forward to seeing more interest in that space both internally and in our user community.

Patricia worked most recently as a systems analyst for the Dark Energy Survey (DES) Science Portal, and previously worked as a server, system, and application administrator for the CERN Compact Muon Solenoid (CMS) experiment.

We couldn't be more pleased to have Patricia as a member of the RC tech. team! Please join me in welcoming her if you have an opportunity to work with her.

Final deployment of HPCF UPS

We'll have the final of our three UPS-related HPCF outages starting on Wednesday, and closing out in the afternoon on Friday. During this outage we'll be...

  • re-routing the power cabling between the UPS infrastructure and the in-row power-distribution infrastructure for future maintainability;

  • decommissioning the legacy UPS;

  • installing additional in-row power distribution infrastructure;

  • and bringing the new UPS into full production.

The new HPCF UPS will provide not only power conditioning (remediating a power quality issue that has led to several past Summit compute outages) but also at least 15 minutes of UPS-backed runtime in the event of a complete utility power outage (which should eventually provide us sufficient time to power the system off in a controlled manner).

PetaLibrary/2 RFP goes live

Research Computing's successful PetaLibrary service is getting a refresh! Or, at least, that's the intent. We're publishing an RFP today for a new, unified infrastructure, which should extend the life of the PetaLibrary, simplify our service offerings, make the infrastructure more maintainable, and eventually allow us to add additional features and services.

Date                           Milestone
-----------------------------  ---------------------
Monday, 31 July 2017           RFP posted online
Friday, 4 August 2017 (09:00)  Optional pre-bid call
Friday, 11 August 2017         Written questions due
Tuesday, 5 September 2017      RFP responses due

https://bids.sciquest.com/apps/Router/PublicEvent?CustomerOrg=Colorado

XSEDE SSO hub authentication progress

RMACC Summit is intended, as its full name implies, to be an RMACC resource, not just a CU or CSU resource. We've planned from the beginning to support access to Summit through XSEDE credentials, but this has required additional (though already planned) service development at XSEDE. Those services are ready for beta testing now, and CU is on hand as an early adopter for their new "single sign-on hub for L3 service providers" service (XCI-36). We're working on deploying this now, and will hopefully be able to start bringing on early-adopters from the RMACC community soon.

Misc. other things

  • We're rebuilding the RC login environment. We've been through a few prototype efforts, but the current plan is to start by deploying a new tutorial login node, tlogin1, which will also be the first recipient of new XSEDE authentication services.

  • We're continuing to develop our internal "curc-bench" automated benchmarking utility for validating the performance of RC HPC resources over time (notably after we make changes). Development is primarily driven by Aaron Holt.

  • We had to rebuild Sneffels (originally "the viz cluster") after a security incident. That work is largely done, and service has been restored, but OIT is still reviewing viz1 as part of our incident response process.

  • We're updating the Globus software for our data-transfer service, starting with dtn01. We're further taking this opportunity to re-build our DTN configuration in general, which should lead to better and more reliable data-transfer performance due to the correction of a number of networking irregularities on these servers. This work is being done primarily by Dan Milroy.

User-selectable authentication methods using pam_authtok

Research Computing is in the process of migrating and expanding our authentication system to support additional authentication methods. Historically we’ve supported VASCO IDENTIKEY time-based one-time passwords, used together with a PIN, to provide two-factor authentication.

$ ssh user1234@login.rc.colorado.edu
user1234@login.rc.colorado.edu's password: <pin><otp>

[user1234@login04 ~]$

But the VASCO tokens are expensive, get lost or left at home, have a battery that runs out, and have an internal clock that sometimes falls out-of-sync with the rest of the authentication system. For these and other reasons we’re provisioning most new accounts with Duo, which provides iOS and Android apps but also supports SMS and voice calls.

Unlike VASCO, Duo is only a single authentication factor; so we’ve also added support for upstream CU-Boulder campus password authentication to be used in tandem.

This means that we have to support both authentication mechanisms–VASCO and password+Duo–simultaneously. A naïve implementation might just stack these methods together.

auth sufficient pam_radius_auth.so try_first_pass # VASCO authenticates over RADIUS
auth requisite  pam_krb5.so try_first_pass # CU-Boulder campus password
auth required   pam_duo.so

This generally works: VASCO authentication is attempted first over RADIUS. If that fails, authentication is attempted against the campus password and, if that succeeds, against Duo.

Unfortunately, this generates spurious authentication failures in VASCO when using Duo to authenticate: the VASCO method fails, then Duo authentication is attempted. Users who have both VASCO and Duo accounts (e.g., all administrators) may generate enough failures to trigger the break-in mitigation security system, and the VASCO account may be disabled. This same issue exists if we reverse the authentication order to try Duo first, then VASCO: VASCO users might then cause their campus passwords to become disabled.

Instead, we need to enable users to explicitly specify which authentication method they’re using.

Separate sssd domains

Our first attempt to provide explicit access to different authentication methods was to provide multiple redundant sssd domains.

[domain/rc]
description = Research Computing
proxy_pam_target = curc-twofactor-vasco


[domain/duo]
description = Research Computing (identikey+duo authentication)
enumerate = false
proxy_pam_target = curc-twofactor-duo

This allows users to log in normally using VASCO, while password+Duo authentication can be requested explicitly by logging in as ${user}@duo.

$ ssh -l user1234@duo login.rc.colorado.edu

This works well enough for the common case of shell access over SSH: login is permitted and, since the default rc domain and the duo alias domain are both backed by the same LDAP directory, NSS sees no important difference once a user is logged in using either method.

This works because POSIX systems store the uid number returned by PAM and NSS, and generally resolve the uid number to the username on-demand. Not all systems work this way, however. For example, when we attempted to use this authentication mechanism to authenticate to our prototype JupyterHub (web) service, jobs dispatched to Slurm retained the ${user}@duo username format. Slurm also uses usernames internally, and the ${user}@duo username is not populated within Slurm: only the base ${user} username.
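
Both usernames resolve to the same uid, so anything that works with uid numbers behaves identically; it’s only software that stores the username string itself, like Slurm, that notices the difference. An illustrative check (assuming sssd’s default name@domain lookup syntax):

$ id -u user1234
$ id -u user1234@duo    # same uid either way: both domains are backed by the same LDAP directory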

Expecting that we would continue to find more unexpected side-effects of this implementation, we started to look for an alternative mechanism that doesn’t modify the specified username.

pam_authtok

In general, a user provides two pieces of information during authentication: a username (which we’ve already determined we shouldn’t modify) and an authentication token or password. We should be able to detect, for example, a prefix to that authentication token to determine what authentication method to use.

$ ssh user1234@login.rc.colorado.edu
user1234@login.rc.colorado.edu's password: duo:<password>

[user1234@login04 ~]$

But we found no such PAM module that would allow us to manipulate the authentication token… so we wrote one.

auth [success=1 default=ignore] pam_authtok.so prefix=duo: strip prompt=password:
auth [success=done new_authtok_reqd=done default=die] pam_radius_auth.so try_first_pass
auth requisite pam_krb5.so try_first_pass
auth [success=done new_authtok_reqd=done default=die] pam_duo.so

Now our PAM stack authenticates against VASCO by default; but, if the user provides a password with a duo: prefix, authentication skips VASCO and authenticates the supplied password, followed by Duo push. Our actual production PAM stack is a bit more complicated, supporting a redundant vasco: prefix as well, for forward-compatibility should we change the default authentication mechanism in the future. We can also extend this mechanism to add arbitrary additional authentication methods.
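
For completeness, a login using the vasco: prefix would look just like the earlier examples, with the same PIN+OTP token format (illustrative transcript):

$ ssh user1234@login.rc.colorado.edu
user1234@login.rc.colorado.edu's password: vasco:<pin><otp>

[user1234@login04 ~]$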

Why hasn’t my (Slurm) job started?

A job can be blocked from being scheduled for the following reasons:

  • There are insufficient resources available to start the job, either due to active reservations, other running jobs, component status, or system/partition size.

  • Other higher-priority jobs are waiting to run, and the job’s time limit prevents it from being backfilled.

  • The job’s time limit would overlap an upcoming reservation (e.g., scheduled preventative maintenance)

  • The job is associated with an account that has reached or exceeded its GrpCPUMins.

Use squeue to display the list of queued jobs sorted in the order considered by the scheduler:

squeue --sort=-p,i --priority --format '%7T %7A %10a %5D %.12L %10P %10S %20r'
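
To dig into a specific job, the standard Slurm commands below are also useful (the job ID here is just a placeholder):

squeue -j 1234567 --start     # the scheduler’s estimated start time, if it has computed one
scontrol show job 1234567     # the full job record, including the Reason= field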

Reason codes

A list of reason codes [1] is available as part of the squeue manpage. [2]

Common reason codes:

  • ReqNodeNotAvail

  • AssocGrpJobsLimit

  • AssocGrpCPUMinsLimit

  • resources

  • QOSResourceLimit

  • Priority

  • AssociationJobLimit

  • JobHeldAdmin

How are jobs prioritized?

PriorityType=priority/multifactor

Slurm prioritizes jobs using the multifactor plugin [3] based on a weighted summation of age, size, QOS, and fair-share factors.

Use the sprio command to inspect each weighted priority value separately.

sprio [-j jobid]
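
To display the configured weight given to each factor:

sprio -w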

Age Factor

PriorityWeightAge=1000
PriorityMaxAge=14-0

The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster’s current limits.

The weighted age priority is calculated as PriorityWeightAge[1000]*[0..1] as the job age approaches PriorityMaxAge[14-0], or 14 days. As such, an hour of wait-time is equivalent to ~2.976 priority.
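(That figure is simply the weight spread over the maximum age: 1000 / (14 days × 24 hours) ≈ 2.976 priority per hour of eligible wait.)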

Job Size Factor

PriorityWeightJobSize=2000

The job size factor correlates to the number of nodes or CPUs the job has requested. The weighted job size priority is calculated as PriorityWeightJobSize[2000]*[0..1] as the job size approaches the entire size of the system. A job that requests all the nodes on the machine will get a job size factor of 1.0, with an effective weighted job size priority of 28 wait-days (except that job age priority is capped at 14 days).
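(For example, a job requesting half of the machine would get a job size factor of roughly 0.5, or about 1000 weighted priority, the age-factor equivalent of 14 days of wait.)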

Quality of Service (QOS) Factor

PriorityWeightQOS=1500

Each QOS can be assigned a priority: the larger the number, the greater the job priority will be for jobs that request this QOS. This priority value is then normalized to the highest priority of all the QOS’s to become the QOS factor. As such, the weighted QOS priority is calculated as PriorityWeightQOS[1500]*QosPriority[0..1000]/MAX(QOSPriority[1000]).

QOS          Priority  Weighted priority  Wait-days equivalent
-----------  --------  -----------------  --------------------
admin            1000               1500                  21.0
janus               0                  0                   0.0
janus-debug       400                600                   8.4
janus-long        200                300                   4.2

Fair-share factor

PriorityWeightFairshare=2000
PriorityDecayHalfLife=14-0

The fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

The simplified formula for calculating the fair-share factor for usage that spans multiple time periods and is subject to a half-life decay is:

F = 2**(-NormalizedUsage/NormalizedShares)
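
For example, an account whose NormalizedUsage equals its NormalizedShares gets F = 2**(-1) = 0.5, the break-even value discussed below.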

Each account is granted an equal share, and historic records of use decay with a half-life of 14 days. As such, the weighted fair-share priority is calculated as PriorityWeightFairshare[2000]*[0..1] depending on the account’s historic use of the system relative to its allocated share.

A fair-share factor of 0.5 indicates that the account’s jobs have used exactly the portion of the machine that they have been allocated, and assigns the job an additional 1000 priority (the equivalent of roughly 336 wait-hours, or 14 days). A fair-share factor above 0.5 indicates that the account’s jobs have consumed less than their allocated share, and assigns the job up to 2000 additional priority, for an effective relative 14 wait-day priority boost. A fair-share factor below 0.5 indicates that the account’s jobs have consumed more than their allocated share of the computing resources, and the added priority approaches 0 depending on how far the account’s usage exceeds its equal share, for an effective relative 14 wait-day priority penalty.

The curc::sysconfig::scinet Puppet module

I’ve been working on a new module, curc::sysconfig::scinet, which will generally do the Right Thing™ when configuring a host on the CURC science network, with as little configuration as possible.

Let’s look at some examples.

login nodes

class { 'curc::sysconfig::scinet':
  location => 'comp',
  mgt_if   => 'eth0',
  dmz_if   => 'eth1',
  notify   => Class['network'],
}

This is the config used on a new-style login node like login05 and login07. (What makes them new-style? Mostly just that they’ve had their interfaces cleaned up to use eth0 for “mgt” and eth1 for “dmz”.)

Here’s the routing table that this produced on login07:

$ ip route list
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32
10.225.128.0/24 via 10.225.160.1 dev eth0
192.12.246.0/24 dev eth1  proto kernel  scope link  src 192.12.246.39
10.225.0.0/20 via 10.225.160.1 dev eth0
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110
default via 192.12.246.1 dev eth1  metric 100
default via 10.225.160.1 dev eth0  metric 110

Connections to “mgt” subnets use the “mgt” interface eth0, either by the link-local route or the static routes via comp-mgt-gw (10.225.160.1). Connections to the “general” subnet (a.k.a. “vlan 2049”), as well as the rest of the science network (“data” and “svc” networks) also use eth0 by static route. The default eth0 route is configured by DHCP, but the interface has a default metric of 110, so it doesn’t conflict with or supersede eth1’s default route, which is configured with a lower metric of 100.
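
One quick way to check which route a given destination will use is ip route get (the destination addresses below are just illustrative):

$ ip route get 10.225.128.10    # a “mgt”-side destination: leaves via eth0 and comp-mgt-gw
$ ip route get 198.51.100.10    # an external destination: prefers the lower-metric default on eth1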

Speaking of eth1, the “dmz” interface is configured statically, using information retrieved from DNS by Puppet.

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1
TYPE=Ethernet
DEVICE=eth1
BOOTPROTO=static
HWADDR=00:50:56:88:2E:36
ONBOOT=yes
IPADDR=192.12.246.39
NETMASK=255.255.255.0
GATEWAY=192.12.246.1
METRIC=100
IPV4_ROUTE_METRIC=100

Ordinarily, the routing priority of the “dmz” interface would mean that inbound connections to the “mgt” interface from outside the science network would break: the response would be routed out the “dmz” interface and dropped by rp_filter. But curc::sysconfig::scinet also configures routing policy for eth0, so traffic that arrives on that interface always returns from that interface.

$ ip rule show | grep 'lookup 1'
32764:  from 10.225.160.32 lookup 1
32765:  from all iif eth0 lookup 1

$ ip route list table 1
default via 10.225.160.1 dev eth0

This allows me to ping login07.rc.int.colorado.edu from my office workstation.

$ ping -c 1 login07.rc.int.colorado.edu
PING login07.rc.int.colorado.edu (10.225.160.32) 56(84) bytes of data.
64 bytes from 10.225.160.32: icmp_seq=1 ttl=62 time=0.507 ms

--- login07.rc.int.colorado.edu ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.507/0.507/0.507/0.000 ms

Because the default route for eth0 is actually configured, outbound routing from login07 is resilient to failure of the “dmz” link.

# ip route list | grep -v eth1
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32
10.225.128.0/24 via 10.225.160.1 dev eth0
10.225.0.0/20 via 10.225.160.1 dev eth0
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110
default via 10.225.160.1 dev eth0  metric 110

Traffic destined to leave the science network simply proceeds to the next preferred (and, in this case, only remaining) default route, comp-mgt-gw.

DHCP, DNS, and the FQDN

Tangentially, it’s important to note that the DHCP configuration of eth0 will tend to re-write /etc/resolv.conf and the search path it defines, with the effect of causing the FQDN of the host to change to login07.rc.int.colorado.edu. Because login nodes are logically (and historically) external hosts, not internal hosts, they should prefer their external identity to their internal identity. As such, we override the domain search path on login nodes to cause them to discover their rc.colorado.edu FQDNs first.

# cat /etc/dhcp/dhclient-eth0.conf
supersede domain-search "rc.colorado.edu", "rc.int.colorado.edu";
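
After the override, the host should discover its external name first; a quick check (illustrative output):

# hostname --fqdn
login07.rc.colorado.edu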

PetaLibrary/repl

The PetaLibrary/repl GPFS NSD nodes replnsd{01,02} are still in the “COMP” datacenter, but only attach to “mgt” and “data” networks.

class { 'curc::sysconfig::scinet':
  location         => 'comp',
  mgt_if           => 'eno2',
  data_if          => 'enp17s0f0',
  other_data_rules => [ 'from 10.225.176.61 table 2',
                        'from 10.225.176.62 table 2',
                        ],
  notify           => Class['network_manager::service'],
}

This config produces the following routing table on replnsd01:

$ ip route list
default via 10.225.160.1 dev eno2  proto static  metric 110
default via 10.225.176.1 dev enp17s0f0  proto static  metric 120
10.128.0.0/12 via 10.225.160.1 dev eno2  metric 110
10.128.0.0/12 via 10.225.176.1 dev enp17s0f0  metric 120
10.225.0.0/20 via 10.225.160.1 dev eno2
10.225.0.0/16 via 10.225.160.1 dev eno2  metric 110
10.225.0.0/16 via 10.225.176.1 dev enp17s0f0  metric 120
10.225.64.0/20 via 10.225.176.1 dev enp17s0f0
10.225.128.0/24 via 10.225.160.1 dev eno2
10.225.144.0/24 via 10.225.176.1 dev enp17s0f0
10.225.160.0/24 dev eno2  proto kernel  scope link  src 10.225.160.59  metric 110
10.225.160.49 via 10.225.176.1 dev enp17s0f0  proto dhcp  metric 120
10.225.176.0/24 dev enp17s0f0  proto kernel  scope link  src 10.225.176.59  metric 120

…with the expected interface-consistent policy-targeted routing tables.

$ ip route list table 1
default via 10.225.160.1 dev eno2

$ ip route list table 2
default via 10.225.176.1 dev enp17s0f0

Static routes for “mgt” and “data” subnets are defined for their respective interfaces. As on the login nodes above, default routes are specified for both interfaces as well, with the lower-metric “mgt” interface eno2 being preferred. (This is configurable using the mgt_metric and data_metric parameters.)

Perhaps the most notable aspect of the PetaLibrary/repl network config is the provisioning of the GPFS CES floating IP addresses 10.225.176.{61,62}. These addresses are added to the enp17s0f0 interface dynamically by GPFS, and are not defined with curc::sysconfig::scinet; but the config must reference these addresses to implement proper interface-consistent policy-targeted routing tables. The version of Puppet deployed at CURC lacks the semantics to infer these rules from a more expressive data_ip parameter, so the other_data_rules parameter is used instead.

other_data_rules => [ 'from 10.225.176.61 table 2',
                      'from 10.225.176.62 table 2',
                      ],

Blanca/ICS login node

Porting the Blanca login node would be great: it’s got “dmz”, “mgt”, and “data” interfaces, so it would exercise the full gamut of the module’s features.