CURCast: Thomas Hauser

Our fearless leader joins me in the studio to talk about moving to the US, the genesis of CU Research Computing, and our upcoming HPC resource, 'Summit.'

Music used

  • Fight CU from University of Colorado Boulder Department of Music
  • Western Showdown by Jay Man

Two software design methods

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

--C.A.R. Hoare, The 1980 ACM Turing Award Lecture

CURCast: Roger Goff

Roger Goff from Data Direct Networks visits to talk about the SFA14k, the storage system that serves the scratch filesystem for our upcoming 'Summit' system. We also end up on a number of tangents including procurement processes, Omni-Path, and the minutiae of IO design in a high-performance storage system.

Music used

  • Fight CU from University of Colorado Boulder Department of Music
  • Positivity Rocks by Jay Man

CURCast: Introduction and Sound Test

Listen to me talk into a microphone for a few minutes so I can get some practice with the recording studio and the editing software. And maybe you'll find out why I'm recording the podcast in the first place.

Music used

  • Fight CU from University of Colorado Boulder Department of Music
  • Prelude BWV 846 by Jay Man

Why hasn't my (Slurm) job started?

A job can be blocked from being scheduled for the following reasons:

  • There are insufficient resources available to start the job, either due to active reservations, other running jobs, component status, or system/partition size.

  • Other higher-priority jobs are waiting to run, and the job's time limit prevents it from being backfilled.

  • The job's time limit extends into an upcoming reservation (e.g., scheduled preventative maintenance).

  • The job is associated with an account that has reached or exceeded its GrpCPUMins.

To display a list of queued jobs sorted in the order considered by the scheduler, use squeue:

squeue --sort=-p,i --priority --format '%7T %7A %10a %5D %.12L %10P %10S %20r'

Reason codes

A list of reason codes is available as part of the squeue manpage.

Common reason codes:

  • ReqNodeNotAvail
  • AssocGrpJobsLimit
  • AssocGrpCPUMinsLimit
  • resources
  • QOSResourceLimit
  • Priority
  • AssociationJobLimit
  • JobHeldAdmin

How are jobs prioritized?

PriorityType=priority/multifactor

Slurm prioritizes jobs using the multifactor plugin based on a weighted summation of age, size, QOS, and fair-share factors.

Use the sprio command to inspect each weighted priority value separately.

sprio [-j jobid]
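As a back-of-the-envelope sketch (not an official Slurm tool), the weighted sum that sprio reports can be reproduced from the normalized factors and this cluster's configured weights; the factor values below are hypothetical.

```shell
# Sketch: Slurm's multifactor sum from normalized factors in [0..1], using
# this cluster's weights (PriorityWeightAge=1000, PriorityWeightJobSize=2000,
# PriorityWeightQOS=1500, PriorityWeightFairshare=2000).
multifactor_priority () {
    awk -v age="$1" -v size="$2" -v qos="$3" -v fs="$4" 'BEGIN {
        printf "%.0f\n", 1000*age + 2000*size + 1500*qos + 2000*fs
    }'
}

multifactor_priority 0.5 0.1 0.4 0.5   # -> 2300
```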

Age Factor

PriorityWeightAge=1000
PriorityMaxAge=14-0

The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster's current limits.

The weighted age priority is calculated as PriorityWeightAge[1000]*[0..1] as the job age approaches PriorityMaxAge[14-0], or 14 days. As such, an hour of wait-time is equivalent to ~2.976 priority.
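To make that arithmetic concrete, here is a small sketch (assuming the PriorityWeightAge=1000 and PriorityMaxAge=14-0 values above; 14 days is 336 hours):

```shell
# Sketch: weighted age priority for a job that has been eligible for a given
# number of wait-hours.
age_priority () {
    awk -v hours="$1" 'BEGIN {
        f = hours / 336                 # normalized age factor
        if (f > 1) f = 1                # capped at PriorityMaxAge
        printf "%.0f\n", 1000 * f       # PriorityWeightAge * factor
    }'
}

age_priority 1     # one wait-hour: ~3 priority
age_priority 336   # 14 days: the full 1000
age_priority 500   # still 1000; the age factor is capped
```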

Job Size Factor

PriorityWeightJobSize=2000

The job size factor correlates to the number of nodes or CPUs the job has requested. The weighted job size priority is calculated as PriorityWeightJobSize[2000]*[0..1] as the job size approaches the entire size of the system. A job that requests all the nodes on the machine will get a job size factor of 1.0, with an effective weighted job size priority of 28 wait-days (except that job age priority is capped at 14 days).
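A sketch of the same calculation, assuming PriorityWeightJobSize=2000 and a size factor of requested nodes over total nodes (the 500-node system size here is hypothetical):

```shell
# Sketch: weighted job size priority as a fraction of the whole machine.
size_priority () {
    awk -v req="$1" -v total="$2" 'BEGIN {
        printf "%.0f\n", 2000 * req / total   # PriorityWeightJobSize * factor
    }'
}

size_priority 500 500   # whole machine: the full 2000
size_priority 50  500   # 10% of the machine: 200
```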

Quality of Service (QOS) Factor

PriorityWeightQOS=1500

Each QOS can be assigned a priority: the larger the number, the greater the job priority will be for jobs that request this QOS. This priority value is then normalized to the highest priority of all the QOS's to become the QOS factor. As such, the weighted QOS priority is calculated as PriorityWeightQOS[1500]*QosPriority[0..1000]/MAX(QOSPriority[1000]).

QOS          Priority  Weighted priority  Wait-days equivalent
-----------  --------  -----------------  --------------------
admin            1000               1500                  21.0
janus               0                  0                   0.0
janus-debug       400                600                   8.4
janus-long        200                300                   4.2
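The table's "Weighted priority" column follows directly from the formula above; a sketch, assuming PriorityWeightQOS=1500 and the highest-priority QOS (admin) at 1000:

```shell
# Sketch: weighted QOS priority, normalizing QosPriority against the
# highest-priority QOS and scaling by PriorityWeightQOS.
qos_priority () {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 1500 * p / 1000 }'
}

qos_priority 1000   # admin: 1500
qos_priority 400    # janus-debug: 600
qos_priority 200    # janus-long: 300
```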

Fair-share factor

PriorityWeightFairshare=2000
PriorityDecayHalfLife=14-0

The fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.

The simplified formula for calculating the fair-share factor, for usage that spans multiple time periods and is subject to a half-life decay, is:

F = 2**(-NormalizedUsage/NormalizedShares)

Each account is granted an equal share, and historic records of use decay with a half-life of 14 days. As such, the weighted fair-share priority is calculated as PriorityWeightFairshare[2000]*[0..1] depending on the account's historic use of the system relative to its allocated share.

A fair-share factor of 0.5 indicates that the account's jobs have used exactly the portion of the machine that they have been allocated, and assigns the job an additional 1000 priority (the equivalent of 336 wait-hours, or 14 days). A fair-share factor above 0.5 indicates that the account's jobs have consumed less than their allocated share, and assigns the job up to 2000 additional priority, for an effective relative priority boost of up to 14 wait-days. A fair-share factor below 0.5 indicates that the account's jobs have consumed more than their allocated share; the added priority approaches 0 depending on how far the account's historic use exceeds its equal share of the system, for an effective relative priority penalty of up to 14 wait-days.
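The decay-weighted fair-share behavior can be sketched directly from the simplified formula above (assuming PriorityWeightFairshare=2000; the usage and share values below are hypothetical):

```shell
# Sketch: fair-share factor F = 2^(-NormalizedUsage/NormalizedShares), and
# the resulting weighted priority.
fairshare_priority () {
    awk -v usage="$1" -v shares="$2" 'BEGIN {
        f = 2 ^ (-usage / shares)       # fair-share factor (0..1]
        printf "%.0f\n", 2000 * f       # PriorityWeightFairshare * factor
    }'
}

fairshare_priority 0.25 0.25   # usage == share:   F = 0.5  -> 1000
fairshare_priority 0    0.25   # no usage:         F = 1.0  -> 2000
fairshare_priority 0.5  0.25   # double the share: F = 0.25 -> 500
```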

The curc::sysconfig::scinet Puppet module

I've been working on a new module, curc::sysconfig::scinet, which will generally do the Right Thing™ when configuring a host on the CURC science network, with as little configuration as possible.

Let's look at some examples.

login nodes

class { 'curc::sysconfig::scinet':
  location => 'comp',
  mgt_if   => 'eth0',
  dmz_if   => 'eth1',
  notify   => Class['network'],
}

This is the config used on a new-style login node like login05 and login07. (What makes them new-style? Mostly just that they've had their interfaces cleaned up to use eth0 for "mgt" and eth1 for "dmz".)

Here's the routing table that this produced on login07:

$ ip route list
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32 
10.225.128.0/24 via 10.225.160.1 dev eth0 
192.12.246.0/24 dev eth1  proto kernel  scope link  src 192.12.246.39 
10.225.0.0/20 via 10.225.160.1 dev eth0 
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110 
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110 
default via 192.12.246.1 dev eth1  metric 100 
default via 10.225.160.1 dev eth0  metric 110

Connections to "mgt" subnets use the "mgt" interface eth0, either by the link-local route or the static routes via comp-mgt-gw (10.225.160.1). Connections to the "general" subnet (a.k.a. "vlan 2049"), as well as the rest of the science network ("data" and "svc" networks) also use eth0 by static route. The default eth0 route is configured by DHCP, but the interface has a default metric of 110, so it doesn't conflict with or supersede eth1's default route, which is configured with a lower metric of 100.

Speaking of eth1, the "dmz" interface is configured statically, using information retrieved from DNS by Puppet.

$ cat /etc/sysconfig/network-scripts/ifcfg-eth1 
TYPE=Ethernet
DEVICE=eth1
BOOTPROTO=static
HWADDR=00:50:56:88:2E:36
ONBOOT=yes
IPADDR=192.12.246.39
NETMASK=255.255.255.0
GATEWAY=192.12.246.1
METRIC=100
IPV4_ROUTE_METRIC=100

Usually the routing priority of the "dmz" interface would mean that inbound connections to the "mgt" interface from outside of the science network would be blocked when the "dmz"-bound response is filtered by rp_filter; but curc::sysconfig::scinet also configures routing policy for eth0, so traffic on that interface always returns from that interface.

$ ip rule show | grep 'lookup 1'
32764:  from 10.225.160.32 lookup 1 
32765:  from all iif eth0 lookup 1

$ ip route list table 1
default via 10.225.160.1 dev eth0

This allows me to ping login07.rc.int.colorado.edu from my office workstation.

$ ping -c 1 login07.rc.int.colorado.edu
PING login07.rc.int.colorado.edu (10.225.160.32) 56(84) bytes of data.
64 bytes from 10.225.160.32: icmp_seq=1 ttl=62 time=0.507 ms

--- login07.rc.int.colorado.edu ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.507/0.507/0.507/0.000 ms

Because the default route for eth0 is actually configured, outbound routing from login07 is resilient to failure of the "dmz" link.

# ip route list | grep -v eth1
10.225.160.0/24 dev eth0  proto kernel  scope link  src 10.225.160.32 
10.225.128.0/24 via 10.225.160.1 dev eth0 
10.225.0.0/20 via 10.225.160.1 dev eth0 
10.225.0.0/16 via 10.225.160.1 dev eth0  metric 110 
10.128.0.0/12 via 10.225.160.1 dev eth0  metric 110 
default via 10.225.160.1 dev eth0  metric 110

Traffic destined to leave the science network simply proceeds to the next preferred (and, in this case, only remaining) default route, comp-mgt-gw.

DHCP, DNS, and the FQDN

Tangentially, it's important to note that the DHCP configuration of eth0 will tend to re-write /etc/resolv.conf and the search path it defines, with the effect of causing the FQDN of the host to change to login07.rc.int.colorado.edu. Because login nodes are logically (and historically) external hosts, not internal hosts, they should prefer their external identity to their internal identity. As such, we override the domain search path on login nodes to cause them to discover their rc.colorado.edu FQDNs first.

# cat /etc/dhcp/dhclient-eth0.conf 
supersede domain-search "rc.colorado.edu", "rc.int.colorado.edu";

PetaLibrary/repl

The PetaLibrary/repl GPFS NSD nodes replnsd{01,02} are still in the "COMP" datacenter, but attach only to the "mgt" and "data" networks.

class { 'curc::sysconfig::scinet':
  location         => 'comp',
  mgt_if           => 'eno2',
  data_if          => 'enp17s0f0',
  other_data_rules => [ 'from 10.225.176.61 table 2',
                        'from 10.225.176.62 table 2',
                        ],
  notify           => Class['network_manager::service'],
}

This config produces the following routing table on replnsd01...

$ ip route list
default via 10.225.160.1 dev eno2  proto static  metric 110 
default via 10.225.176.1 dev enp17s0f0  proto static  metric 120 
10.128.0.0/12 via 10.225.160.1 dev eno2  metric 110 
10.128.0.0/12 via 10.225.176.1 dev enp17s0f0  metric 120 
10.225.0.0/20 via 10.225.160.1 dev eno2 
10.225.0.0/16 via 10.225.160.1 dev eno2  metric 110 
10.225.0.0/16 via 10.225.176.1 dev enp17s0f0  metric 120 
10.225.64.0/20 via 10.225.176.1 dev enp17s0f0 
10.225.128.0/24 via 10.225.160.1 dev eno2 
10.225.144.0/24 via 10.225.176.1 dev enp17s0f0 
10.225.160.0/24 dev eno2  proto kernel  scope link  src 10.225.160.59  metric 110 
10.225.160.49 via 10.225.176.1 dev enp17s0f0  proto dhcp  metric 120 
10.225.176.0/24 dev enp17s0f0  proto kernel  scope link  src 10.225.176.59  metric 120

...with the expected interface-consistent policy-targeted routing tables.

$ ip route list table 1
default via 10.225.160.1 dev eno2

$ ip route list table 2
default via 10.225.176.1 dev enp17s0f0

Static routes for "mgt" and "data" subnets are defined for their respective interfaces. As on the login nodes above, default routes are specified for both interfaces as well, with the lower-metric "mgt" interface eno2 being preferred. (This is configurable using the mgt_metric and data_metric parameters.)

Perhaps the most notable aspect of the PetaLibrary/repl network config is the provisioning of the GPFS CES floating IP addresses 10.225.176.{61,62}. These addresses are added to the enp17s0f0 interface dynamically by GPFS, and are not defined with curc::sysconfig::scinet; but the config must reference these addresses to implement proper interface-consistent policy-targeted routing tables. The version of Puppet deployed at CURC lacks the semantics to infer these rules from a more expressive data_ip parameter, so the other_data_rules parameter is used instead.

  other_data_rules => [ 'from 10.225.176.61 table 2',
                        'from 10.225.176.62 table 2',
                        ],

Blanca/ICS login node

[porting the blanca login node would be great because it's got a "dmz", "mgt", and "data" interface; so it would exercise the full gamut of features of the module]

Linux policy-based routing

How could Linux policy routing be so poorly documented? It's so useful, so essential in a multi-homed environment... I'd almost advocate for its inclusion as default behavior.

What is this, you ask? To understand, we have to start with what Linux does by default in a multi-homed environment. So let's look at one.

$ ip addr
[...]
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 78:2b:cb:66:75:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.128.80/24 brd 10.225.128.255 scope global eth2
    inet6 fe80::7a2b:cbff:fe66:75c0/64 scope link 
       valid_lft forever preferred_lft forever
[...]
6: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether e4:1d:2d:14:93:60 brd ff:ff:ff:ff:ff:ff
    inet 10.225.144.80/24 brd 10.225.144.255 scope global eth5
    inet6 fe80::e61d:2dff:fe14:9360/64 scope link 
       valid_lft forever preferred_lft forever

So we have two interfaces, eth2 and eth5. They're on separate subnets, 10.225.128.0/24 and 10.225.144.0/24 respectively. In our environment, we refer to these as "spsc-mgt" and "spsc-data." The practical circumstance is that one of these networks is faster than the other, and we would like bulk data transfer to use the faster "spsc-data" network.

If the client system also has an "spsc-data" network, everything is fine. The client addresses the system using its data address, and the link-local route prefers the data network.

$ ip route list 10.225.144.0/24
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80

Our network environment covers a number of networks, however. So let's say our client lives in another data network--"comp-data." Infrastructure routing directs the traffic to the -data interface of our server correctly, but the default route on the server prefers the -mgt interface.

$ ip route list | grep ^default
default via 10.225.128.1 dev eth2

For this simple case we have two options. We can either change our default route to prefer the -data interface, or we can enumerate intended -data client networks with static routes using the data interface. Since changing the default route simply leaves us in the same situation for the -mgt network, let's define some static routes.

$ ip route add 10.225.64.0/20 via 10.225.144.1 dev eth5
$ ip route add 10.225.176.0/24 via 10.225.144.1 dev eth5

So long as we can enumerate the networks that should always use the -data interface of our server to communicate, this basically works. But what if we want to support clients that don't themselves have separate -mgt and -data networks? What if we have a single client--perhaps with only a -mgt network connection--that should be able to communicate individually with both the server's -mgt interface and its -data interface? In the most pathological case, what if we have a host that is connected only to the spsc-mgt (10.225.128.0/24) network, but we want that client to be able to communicate with the server's -data interface? In this case, the link-local route will always prefer the -mgt network for the return path.

Policy-based routing

The best case would be to have the server select an outbound route based not on a static configuration, but in response to the incoming path of the traffic. This is the feature enabled by policy-based routing.

Linux policy routing allows us to define distinct and isolated routing tables, and then select the appropriate routing table based on the traffic context. In this situation, we have three different routing contexts to consider. The first of these is the set of routes to use when the server itself initiates communication.

$ ip route list table main
10.225.128.0/24 dev eth2  proto kernel  scope link  src 10.225.128.80
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.128.1 dev eth2

A separate routing table defines routes to use when responding to traffic from the -mgt interface.

$ ip route list table 1
default via 10.225.128.1 dev eth2

The last routing table defines routes to use when responding to traffic from the -data interface.

$ ip route list table 2
default via 10.225.144.1 dev eth5

With these separate routing tables defined, the last step is to define the rules that select the correct routing table.
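On a live system, the equivalent tables and rules can be built by hand with something like the following (a sketch; requires root, and the table numbers 1 and 2 are just arbitrary labels):

```shell
# Populate the per-interface routing tables (1 for -mgt, 2 for -data).
ip route add default via 10.225.128.1 dev eth2 table 1
ip route add default via 10.225.144.1 dev eth5 table 2

# Select a table based on where the traffic came from: either the interface
# it arrived on, or the source address of a locally-originated reply.
ip rule add iif eth2 table 1
ip rule add from 10.225.128.80 table 1
ip rule add iif eth5 table 2
ip rule add from 10.225.144.80 table 2

# Flush the route cache so the new rules take effect immediately.
ip route flush cache
```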

$ ip rule list
0:  from all lookup local 
32762:  from 10.225.144.80 lookup 2 
32763:  from all iif eth5 lookup 2 
32764:  from 10.225.128.80 lookup 1 
32765:  from all iif eth2 lookup 1 
32766:  from all lookup main 
32767:  from all lookup default

Despite a lack of documentation, all of these rules may be codified in Red Hat "sysconfig"-style "network-scripts" using interface-specific route- and rule- files.

$ cat /etc/sysconfig/network-scripts/route-eth2
default via 10.225.128.1 dev eth2
default via 10.225.128.1 dev eth2 table 1

$ cat /etc/sysconfig/network-scripts/route-eth5
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.144.1 dev eth5 table 2

$ cat /etc/sysconfig/network-scripts/rule-eth2
iif eth2 table 1
from 10.225.128.80 table 1

$ cat /etc/sysconfig/network-scripts/rule-eth5
iif eth5 table 2
from 10.225.144.80 table 2

Note that changes to the RPDB made with ip commands do not become active immediately: after finishing a batch of updates, flush the routing cache with ip route flush cache.

Oceans (Where my feet may fall)

Spirit lead me where my trust is without borders. Let me walk upon the waters wherever You would call me.

Take me deeper than my feet could ever wander, and my faith will be made stronger in the presence of my Savior.

iTunes store

Two Compasses

A captain and his first mate were sailing on the open ocean. Each possessed a compass, with a third affixed to the helm. The first mate approached the captain, saying, "Sir: in my clumsiness this morning, I have dropped my compass into the sea! What should I do?" After a moment, the captain wordlessly turned and threw his own compass into the sea. The first mate watched and became enlightened.

The Psalms and Me

Every time I read the Psalms:

Hear my prayer, O LORD, Give ear to my supplications!

"Oh, maybe this will be a nice verse to uplift my friend!"

For the enemy has persecuted my soul; He has crushed my life to the ground; He has made me dwell in dark places, like those who have long been dead. Therefore my spirit is overwhelmed within me; My heart is appalled within me.

"Yes! I'll bet this message will really resonate with my friend! Sometimes we all feel downtrodden."

Answer me quickly, O LORD, my spirit fails; Do not hide Your face from me, Or I will become like those who go down to the pit.

"Yes! When we are at our lowest, we should run to God!"

And in Your lovingkindness, cut off my enemies And destroy all those who afflict my soul

"Um... David? We cool, bro?"

Awake to punish all the nations; Do not be gracious to any who are treacherous in iniquity.

"Hey... I don't know if I meant all that..."

Scatter them by Your power, and bring them down, O Lord, our shield.
Destroy them in wrath, destroy them that they may be no more
Deal with them as You did with Midian, [...] they became manure for the ground

"Hold on, there, man! Let's not get too crazy..."

How blessed will be the one who seizes and dashes your little ones Against the rock.

sigh

"Come on, David. Things were going so well. I mean, sure: you murdered a man to cover up the fact that you impregnated his wife... but you were sorry, right?"

Wash me thoroughly from my iniquity And cleanse me from my sin.

"See? There. That's more like it..."

O God, shatter their teeth in their mouth; Break out the fangs of the young lions, O LORD.

"No! Bad David! No biscuit!"

...

"Whatcha got for me, Jesus?"

You have heard that it was said, "You shall love your neighbor and hate your enemy. But I say to you, love your enemies and pray for those who persecute you, so that you may be sons of your Father who is in heaven; for He causes His sun to rise on the evil and the good, and sends rain on the righteous and the unrighteous. For if you love those who love you, what reward do you have?

"Aw, yeah. That's the stuff."

"David, have you met this guy? I think you should probably meet this guy."