Linux policy-based routing

How could Linux policy routing be so poorly documented? It’s so useful, so essential in a multi-homed environment… I’d almost advocate for its inclusion as default behavior.

What is this, you ask? To understand, we have to start with what Linux does by default in a multi-homed environment. So let’s look at one.

$ ip addr
[...]
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 78:2b:cb:66:75:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.128.80/24 brd 10.225.128.255 scope global eth2
    inet6 fe80::7a2b:cbff:fe66:75c0/64 scope link
       valid_lft forever preferred_lft forever
[...]
6: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether e4:1d:2d:14:93:60 brd ff:ff:ff:ff:ff:ff
    inet 10.225.144.80/24 brd 10.225.144.255 scope global eth5
    inet6 fe80::e61d:2dff:fe14:9360/64 scope link
       valid_lft forever preferred_lft forever

So we have two interfaces, eth2 and eth5. They’re on separate subnets, 10.225.128.0/24 and 10.225.144.0/24 respectively. In our environment, we refer to these as “spsc-mgt” and “spsc-data.” In practice, one of these networks is faster than the other, and we would like bulk data transfer to use the faster “spsc-data” network.

If the client system also has an “spsc-data” network, everything is fine. The client addresses the system using its data address, and the link-local route prefers the data network.

$ ip route list 10.225.144.0/24
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80

Our network environment covers a number of networks, however. So let’s say our client lives in another data network, “comp-data.” Infrastructure routing correctly directs the client’s traffic to the server’s -data interface, but the server’s default route prefers the -mgt interface for replies.

$ ip route list | grep ^default
default via 10.225.128.1 dev eth2

For this simple case we have two options. We can either change our default route to prefer the -data interface, or we can enumerate intended -data client networks with static routes using the data interface. Since changing the default route simply leaves us in the same situation for the -mgt network, let’s define some static routes.

$ ip route add 10.225.64.0/20 via 10.225.144.1 dev eth5
$ ip route add 10.225.176.0/24 via 10.225.144.1 dev eth5

So long as we can enumerate the networks that should always use the server’s -data interface, this basically works. But what if we want to support clients that don’t themselves have separate -mgt and -data networks? What if a single client, perhaps with only a -mgt network connection, should be able to communicate individually with both the server’s -mgt interface and its -data interface? In the most pathological case, what if a host is connected only to the spsc-mgt network (10.225.128.0/24), but we want it to be able to reach the server’s -data interface? Here the link-local route will always prefer the -mgt network for the return path.
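We can watch the kernel make this choice with ip route get, which reports the route it would select for a given destination. A quick check, assuming a hypothetical client at 10.225.128.9 (exact output varies with iproute2 version), shows the reply leaving via eth2 even when the client addressed the server’s -data interface:

$ ip route get 10.225.128.9
10.225.128.9 dev eth2  src 10.225.128.80
    cache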

Policy-based routing

The best case would be to have the server select an outbound route based not on a static configuration, but in response to the incoming path of the traffic. This is the feature enabled by policy-based routing.

Linux policy routing allows us to define distinct and isolated routing tables, and then select the appropriate routing table based on the traffic context. In this situation, we have three different routing contexts to consider. The first is the set of routes to use when the server itself initiates communication.

$ ip route list table main
10.225.128.0/24 dev eth2  proto kernel  scope link  src 10.225.128.80
10.225.144.0/24 dev eth5  proto kernel  scope link  src 10.225.144.80
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.128.1 dev eth2

A separate routing table defines routes to use when responding to traffic from the -mgt interface.

$ ip route list table 1
default via 10.225.128.1 dev eth2

The last routing table defines routes to use when responding to traffic from the -data interface.

$ ip route list table 2
default via 10.225.144.1 dev eth5
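For reference, these per-table routes can be built by hand by appending the table argument to ip route add; a minimal sketch, assuming the gateways shown above:

$ ip route add default via 10.225.128.1 dev eth2 table 1
$ ip route add default via 10.225.144.1 dev eth5 table 2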

With these separate routing tables defined, the last step is to define the rules that select the correct routing table.

$ ip rule list
0:  from all lookup local
32762:  from 10.225.144.80 lookup 2
32763:  from all iif eth5 lookup 2
32764:  from 10.225.128.80 lookup 1
32765:  from all iif eth2 lookup 1
32766:  from all lookup main
32767:  from all lookup default
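Rules 0, 32766, and 32767 are the defaults present on every system; the four in between select our tables by source address or by incoming interface. They can be created by hand with ip rule add. A sketch that pins the same priorities explicitly (without a priority argument, ip rule assigns its own, counting down from the main rule):

$ ip rule add iif eth2 lookup 1 priority 32765
$ ip rule add from 10.225.128.80 lookup 1 priority 32764
$ ip rule add iif eth5 lookup 2 priority 32763
$ ip rule add from 10.225.144.80 lookup 2 priority 32762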

Despite a lack of documentation, all of these rules may be codified in Red Hat “sysconfig”-style “network-scripts” using interface-specific route-<interface> and rule-<interface> files.

$ cat /etc/sysconfig/network-scripts/route-eth2
default via 10.225.128.1 dev eth2
default via 10.225.128.1 dev eth2 table 1

$ cat /etc/sysconfig/network-scripts/route-eth5
10.225.64.0/20 via 10.225.144.1 dev eth5
10.225.176.0/24 via 10.225.144.1 dev eth5
default via 10.225.144.1 dev eth5 table 2

$ cat /etc/sysconfig/network-scripts/rule-eth2
iif eth2 table 1
from 10.225.128.80 table 1

$ cat /etc/sysconfig/network-scripts/rule-eth5
iif eth5 table 2
from 10.225.144.80 table 2

Changes to the routing policy database (RPDB) made with these commands do not become active immediately. It is assumed that after a script finishes a batch of updates, it flushes the routing cache with ip route flush cache.
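For example:

$ ip route flush cache

After that, ip route get can confirm the new behavior. Assuming the same hypothetical client as before, a reply sourced from the server’s -data address now selects the -data gateway (again, output format varies):

$ ip route get 10.225.128.9 from 10.225.144.80
10.225.128.9 from 10.225.144.80 via 10.225.144.1 dev eth5
    cache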
