Improve your multi-homed servers with policy routing

This article was first published in the Summer 2016 issue of Usenix ;login:

Traditional IP routing systems route packets by comparing the destinaton address against a predefined list of routes to each available subnet; but when multiple potential routes exist between two hosts on a network, the preferred route may be dependent on context that cannot be inferred from the destination alone. The Linux kernel, together with the iproute2 suite, supports the definition of multiple routing tables and a routing policy database to select the preferred routing table dynamically. This additional expressiveness can be used to avoid multiple routing pitfalls, including asymmetric routes and performance bottlenecks from suboptimal route selection.

Background

The CU-Boulder Research Computing environment spans three datacenters, each with its own set of special-purpose networks. A traditionally-routed host simultaneously connected to two or more these networks compounds network complexity by making only one interface (the default gateway) generaly available across network routes. Some cases can be addressed by defining static routes; but even this leads to asymmetric routing that is at best confusing and at worst a performance bottleneck.

Over the past few months we've been transitioning our hosts from a single-table routing configuration to a policy-driven, multi-table routing configuration. The end result is full bidirectional connectivity between any two interfaces in the network, irrespective of underlying topology or a host's default route. This has reduced the apparent complexity in our network by allowing the host and network to Do the Right Thing™ automatically, unconstrained by an otherwise static route map.

Linux policy routing has become an essential addition to host configuration in the University of Colorado Boulder "Science Network." It's so useful, in fact, that I'm surprised a basic routing policy isn't provided by default for multi-homed servers.

## The problem with traditional routing

The simplest Linux host routing scenario is a system with a single network interface.

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:88:56:1f brd ff:ff:ff:ff:ff:ff
    inet 10.225.160.38/24 brd 10.225.160.255 scope global dynamic ens192
       valid_lft 60184sec preferred_lft 60184sec

Such a typically-configured network with a single uplink has a single default route in addition to its link-local route.

# ip route list
default via 10.225.160.1 dev ens192
10.225.160.0/24 dev ens192  proto kernel  scope link  src 10.225.160.38

Traffic to hosts on 10.225.160.0/24 is delivered directly, while traffic to any other network is forwarded to 10.225.160.1.

A dual-homed host adds a second network interface and a second link-local route; but the original default route remains. (Figure 1.)

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:88:56:1f brd ff:ff:ff:ff:ff:ff
    inet 10.225.160.38/24 brd 10.225.160.255 scope global dynamic ens192
       valid_lft 86174sec preferred_lft 86174sec
3: ens224: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:88:44:18 brd ff:ff:ff:ff:ff:ff
    inet 10.225.176.38/24 brd 10.225.176.255 scope global dynamic ens224
       valid_lft 69193sec preferred_lft 69193sec

# ip route list
default via 10.225.160.1 dev ens192
10.225.160.0/24 dev ens192  proto kernel  scope link  src 10.225.160.38
10.225.176.0/24 dev ens224  proto kernel  scope link  src 10.225.176.38

The new link-local route provides access to hosts on 10.225.176.0/24, and is sufficient for a private network connecting a small cluster of hosts. In fact, this is the configuration that we started with in our Research Computing environment: .160.0/24 is a low-performance "management" network, while .176.0/24 is a high-performance "data" network.

figure1.svg

Figure 1 - A simple dual-homed server with a traditional default route

In a more complex network, however, link-local routes quickly become insufficient. In the CU Science Network, for example, each datacenter is considered a discrete network zone with its own set of "management" and "data" networks. For hosts in different network zones to communicate, a static route must be defined in each direction to direct performance-sensitive traffic across the high-performance network route. (Figure 2.)

server # ip route add 10.225.144.0/24 via 10.225.176.1
client # ip route add 10.225.176.0/24 via 10.225.144.0

Though managing these static routes can be tedious, they do sufficiently define connectivity between the relevant network pairs: "data" interfaces route traffic to each other via high-performance networks, while "management" interfaces route traffic to each other via low-performance networks. Other networks (e.g., the Internet) can only communicate with the hosts on their default routes; but this limitation may be acceptable for some scenarios.

figure2.svg

Figure 2 - A server and a client, with static routes between their data interfaces

Even this approach is insufficient, however, to allow traffic between "management" and "data" interfaces. This is particularly problematic when a client host is not equipped with a symmetric set of network interfaces. (Figure 3.) Such a client may only have a "management" interface, but should still communicate with the server's high-performance interface for certain types of traffic. (For example, a dual-homed NFS server should direct all NFS traffic over its high-performance "data" network, even when being accessed by a client that itself only has a low-performance "management" interface.) By default, the Linux rp_filter blocks this traffic, as the server's response to the client targets a different route than the incomming request; but even if rp_filter is disabled, this asymmetric route limits the server's aggregate network bandwidth to that of its lower-performing interface.

The server's default route could be moved to the "data" interface--in some scenarios, this may even be preferable--but this only displaces the issue: clients may then be unable to communicate with the server on its "management" interface, which may be preferred for certain types of traffic. (In Research Computing, for example, we prefer that administrative access and monitoring not compete with IPC and file system traffic.)

figure3.svg

Figure 3 - In a traditional routing configuration, the server would try to respond to the client via its default route, even if the request arrived on its data interface

Routing policy rules

Traditional IP routing systems route incoming packets based solely on the the intended destination; but the Linux iproute2 stack supports route selection based on additional packet metadata, including the packet source. Multiple discrete routing tables, similar to the virtual routing and forwarding (VRF) support found in dedicated routing appliances, define contextual routes, and a routing policy selects the appropriate routing table dynamically based on a list of rules.

In this example there are three different routing contexts to consider. The first of these--the "main" routing table--defines the routes to use when the server initiates communication.

server # ip route list table main
10.225.144.0/24 via 10.225.176.1 dev ens224
default via 10.225.160.1 dev ens192
10.225.160.0/24 dev ens192  proto kernel  scope link  src 10.225.160.38
10.225.176.0/24 dev ens224  proto kernel  scope link  src 10.225.176.38

A separate routing table defines routes to use when responding to traffic on the "management" interface. Since this table is concerned only with the default route's interface in isolation, it simply reiterates the default route.

server # ip route add default via 10.225.160.1 table 1
server # ip route list table 1
default via 10.225.160.1 dev ens192

Similarly, the last routing table defines routes to use when responding to traffic on the "data" interface. This table defines a different default route: all such traffic should route via the "data" interface.

server # ip route add default via 10.225.176.1 table 2
server # ip route list table 2
default via 10.225.176.1 dev ens224

With these three routing tables defined, the last step is to define routing policy to select the correct routing table based on the packet to be routed. Responses from the "management" address should use table 1, and responses from the "data" address should use table 2. All other traffic, including server-initiated traffic that has no outbound address assigned yet, uses the "main" table automatically.

server # ip rule add from 10.225.160.38 table 1
server # ip rule add from 10.225.176.38 table 2
server # ip rule list
0:  from all lookup local
32764:  from 10.225.176.38 lookup 2
32765:  from 10.225.160.38 lookup 1
32766:  from all lookup main
32767:  from all lookup default

With this routing policy in place, a single-homed client (or, in fact, any client on the network) may communicate with both the server's "data" and "management" interfaces independently and successfully, and the bidirectional traffic routes consistently via the appropriate network. (Figure 4.)

figure4.svg

Figure 4 - Routing policy allows the server to respond using its data interface for any request that arrived on its data interface, even if it has a different default route

Persisting the configuration

This custom routing policy can be persisted in the Red Hat "ifcfg" network configuration system by creating interface-specific route- and rule- files.

# cat /etc/sysconfig/network-scripts/route-ens192
default via 10.225.160.1 dev ens192
default via 10.225.160.1 dev ens192 table mgt

# cat /etc/sysconfig/network-scripts/route-ens224
10.225.144.0/24 via 10.225.176.1 dev ens224
default via 10.225.176.1 dev ens224 table data

# cat /etc/sysconfig/network-scripts/rule-ens192
from 10.225.160.38 table mgt

# cat /etc/sysconfig/network-scripts/rule-ens224
from 10.225.176.38 table data

The symbolic names mgt and data used in these examples are translated to routing table numbers as defined in the /etc/iproute2/rt_tables file.

# echo "1 mgt" >>/etc/iproute2/rt_tables
# echo "2 data" >>/etc/iproute2/rt_tables

Once the configuration is in place, activate it by restarting the network service. (e.g., systemctl restart network) You may also be able to achieve the same effect using ifdown and ifup on individual interfaces.

Red Hat's support for routing rule configuration has a confusing regression that merits specific mention. Red Hat (and its derivatives) has historically used a "network" initscript and subscripts to configure and manage network interfaces, and these scripts support the aforementioned rule- configuration files. Red Hat Enterprise Linux 6 introduced NetworkManager, a persistent daemon with additional functionality; however, NetworkManager did not support rule- files until version 1.0, released as part of RHEL 7.1. If you're currently using NetworkManager, but wish to define routing policy in rule- files, you'll need to either disable NetworkManager entirely or exempt specific interfaces from NetworkManager by specifying NM_CONTROLLED=no in the relevant ifcfg- files.

In a Debian-based distribution, these routes and rules can be persisted using post-up directives in /etc/network/interfaces.

Further improvements

We're still in the process of deploying this policy-based routing configuration in our Research Computing environment; and, as we do, we discover more cases where previously complex network requirements and special-cases are abstracted away by this relatively uniform configuration. We're simultaneously evaluating other potential changes, including the possibility of running a dynamic routing protocol (such as OSPF) on our multi-homed hosts, or of configuring every network connection as a simultaneous default route for fail-over. In any case, this experience has encouraged us to take a second look at our network configuration to re-evaluate what we had previously thought were inherent limitations of the stack itself.

User-selectable authentication methods using pam_authtok

Research Computing is in the process of migrating and expanding our authentication system to support additional authentication methods. Historically we've supported VASCO IDENTIKEY time-based one-time-password and pin to provide two-factor authentication.

$ ssh joan5896@login.rc.colorado.edu
joan5896@login.rc.colorado.edu's password: <pin><otp>

[joan5896@login04 ~]$

But the VASCO tokens are expensive, get lost or left at home, have a battery that runs out, and have an internal clock that sometimes falls out-of-sync with the rest of the authentication system. For these and other reasons we're provisioning most new account with Duo, which provides iOS and Android apps but also supports SMS and voice calls.

Unlike VASCO, Duo is only a single authentication factor; so we've also added support for upstream CU-Boulder campus password authentication to be used in tandem.

This means that we have to support both authentication mechanisms--VASCO and password+Duo--simultaneously. A naïve implementation might just stack these methods together.

auth sufficient pam_radius_auth.so try_first_pass # VASCO authenticates over RADIUS
auth requisite  pam_krb5.so try_first_pass # CU-Boulder campus password
auth required   pam_duo.so

This generally works: VASCO authentication is attempted first over RADIUS. If that fails, authentication is attempted against the campus password and, if that succeeds, against Duo.

Unfortunately, this generates spurious authentication failures in VASCO when using Duo to authenticate: the VASCO method fails, then Duo authentication is attempted. Users who have both VASCO and Duo accounts (e.g., all administrators) may generate enough failures to trigger the break-in mitigation security system, and the VASCO account may be disabled. This same issue exists if we reverse the authentication order to try Duo first, then VASCO: VASCO users might then cause their campus passwords to become disabled.

In stead, we need to enable users to explicitly specify which authentication method they're using.

Separate sssd domains

Our first attempt to provide explicit access to different authentication methods was to provide multiple redundant sssd domains.

[domain/rc]
description = Research Computing
proxy_pam_target = curc-twofactor-vasco


[domain/duo]
description = Research Computing (identikey+duo authentication)
enumerate = false
proxy_pam_target = curc-twofactor-duo

This allows users to log in normally using VASCO, while password+Duo authentication can be requested explicitly by logging in as ${user}@duo.

$ ssh -l joan5896@duo login.rc.colorado.edu

This works well enough for the common case of shell access over SSH: login is permitted and, since both the default rc domain and the duo alias domain are both backed by the same LDAP directory, NSS sees no important difference once a user is logged in using either method.

This works because POSIX systems store the uid number returned by PAM and NSS, and generally resolve the uid number to the username on-demand. Not all systems work this way, however. For example, when we attempted to use this authentication mechanism to authenticate to our prototype JupyterHub (web) service, jobs dispatched to Slurm retained the ${user}@duo username format. Slurm also uses usernames internally, and the ${user}@duo username is not populated within Slurm: only the base ${user} username.

Expecting that we would continue to find more unexpected side-effects of this implementation, we started to look for an alternative mechanism that doesn't modify the specified username.

pam_authtok

In general, a user provides two pieces of information during authentication: a username (which we've already determined we shouldn't modify) and an authentication token or password. We should be able to detect, for example, a prefix to that authentication token to determine what authentication method to use.

$ ssh joan5896@login.rc.colorado.edu
joan5896@login.rc.colorado.edu's password: duo:<password>

[joan5896@login04 ~]$

But we found no such pam module that would allow us to manipulate the authentication token... so we wrote one.

auth [success=1 default=ignore] pam_authtok.so prefix=duo: strip prompt=password:

auth [success=done new_authtok_reqd=done default=die] pam_radius_auth.so try_first_pass

auth requisite pam_krb5.so try_first_pass
auth [success=done new_authtok_reqd=done default=die] pam_duo.so

Now our PAM stack authenticates against VASCO by default; but, if the user provides a password with a duo: prefix, authentication skips VASCO and authenticates the supplied password, followed by Duo push. Our actual production PAM stack is a bit more complicated, supporting a redundant vasco: prefix as well, for forward-compatibility should we change the default authentication mechanism in the future. We can also extend this mechanism to add arbitrary additional authentication mechanisms in the future.