In which Cisco SACK’d iptables

At some unknown point, ssh to at least some of our front-end nodes started spuriously failing when faced with a sufficiently large burst of traffic (e.g., cat-ing a large file to stdout). I was the only person complaining about it, though (no other team members, no users), so I blamed it on something specific to my environment and prioritized it as an annoyance rather than a real system problem.

Which is to say: I ignored the problem, hoping it would go away on its own.

Some time later I needed to access our system from elsewhere on the Internet, over a particularly poor connection at that. Suddenly I couldn’t even scp reliably. I returned to the office, determined to pinpoint a root cause, and did my best to turn my vague impression of “a network problem” into a repeatable test case: dd-ing a bunch of /dev/random into a file and scp-ing it to my local box.
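
For the record, generating the test file was nothing exotic; the size and the use of /dev/urandom (rather than blocking on /dev/random) are my own choices here, not anything significant:

$ dd if=/dev/urandom of=data bs=1M count=256    # ~256MB of random junk

Pulling it back down is where things fell apart: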

$ scp 10.129.4.32:data Downloads
data 0% 1792KB 1.0MB/s - stalled -^CKilled by signal 2.
$ scp 10.129.4.32:data Downloads
data 0% 256KB 0.0KB/s - stalled -^CKilled by signal 2.

Awesome: it fails from my desktop. I duplicated the failure on my netbook, which I then physically carried into the datacenter. I plugged directly into the DMZ and repeated the test with the same result; but when I moved my test machine to the INSIDE network, the problem disappeared.

Wonderful! The problem was clearly with the Cisco border switch/firewall, because (a) the problem went away when I bypassed it, and (b) the Cisco switch is Somebody Else’s Problem.

So I told Somebody Else that his Cisco was breaking our ssh. He checked his logs and claimed he saw nothing obviously wrong (e.g., no dropped packets, no warnings). He didn’t just punt the problem back to me, though: he came to my desk and, together, we trawled through some tcpdump output.
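
The trace below was taken on the front-end node itself; the capture was something along these lines (the interface name is a guess on my part):

$ tcpdump -n -i eth0 'host 10.68.58.2 and (port 22 or icmp)'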

15:48:37.752160 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520], length 0
15:48:37.752169 IP 10.129.4.32.ssh > 10.68.58.2.53760: Flags [.], seq 47766:55974, ack 3670, win 601, options [nop,nop,TS val 1751514521 ecr 511936353], length 8208
15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0
15:48:37.752240 IP 10.129.4.32 > 10.68.58.2: ICMP host 10.129.4.32 unreachable - admin prohibited, length 72

The sender, 10.129.4.32, was sending an ICMP error back to the receiver, 10.68.58.2. A niggling memory of this “admin prohibited” message reminded me of our iptables configuration.

-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m comment --comment "ssh" -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited

iptables rejects any packet that isn’t explicitly allowed or part of an established connection, answering with icmp-host-prohibited: exactly what we were seeing. In particular, it seemed to be rejecting any packet that carried SACK fields.
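
If you want to watch this happen, one way is to log whatever conntrack flags as INVALID just ahead of the catch-all REJECT; a sketch, in the same iptables-save style as above:

-A INPUT -m conntrack --ctstate INVALID -j LOG --log-prefix "conntrack INVALID: "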

When packets are dropped (or arrive out of order) on a modern TCP connection, the receiver sends a SACK message: “ACK X, SACK Y:Z”. The selective acknowledgement says that the bytes between X and Y are missing, but allows later segments to be acknowledged out of order, avoiding unnecessary retransmission of data that has already arrived. For some reason, such segments were not being identified by iptables as part of the ESTABLISHED connection.
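
As a concrete illustration (numbers invented, tcpdump-ish notation): a receiver that has everything through byte 1000, plus an out-of-order chunk covering bytes 2001 through 3000, replies with something like

ack 1001, win 65535, options [nop,nop,sack 1 {2001:3001}]

telling the sender to retransmit only 1001:2000 rather than everything from 1001 onward.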

A Red Hat bugzilla indicated that you should solve this problem by disabling SACK. That seems pretty stupid to me, though, so I went looking around in the iptables documentation instead. A netfilter patch seemed to indicate that iptables connection tracking should support SACK, so I contacted the author, a wonderful gentleman named Jozsef Kadlecsik, who confirmed that SACK should be totally fine passing through iptables in our kernel version. Instead, he indicated that problems like this usually implicate a misbehaving intermediate firewall appliance.
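
(For the record, the bugzilla’s workaround boils down to a single sysctl, something like the line below; I note it only so you know what I chose not to do.)

$ sudo sysctl -w net.ipv4.tcp_sack=0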

And so the cycle of blame was back on the Cisco… but why?

Let’s take a look at that tcpdump again.

15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0
15:48:37.752240 IP 10.129.4.32 > 10.68.58.2: ICMP host 10.129.4.32 unreachable - admin prohibited, length 72

ACK 36822, SACK 491353276:491354644… so bytes 36823:491353275 are missing? That’s quite a jump. Surely 491316453 bytes didn’t get sent in a few microseconds.

A Cisco support document holds the answer; or, at least, the beginning of one. By default, the Cisco firewall performs “TCP Sequence Number Randomization” on all TCP connections. That is to say, it modifies TCP sequence ids on incoming packets, and restores them to the original range on outgoing packets. So while the system receiving the SACK sees this:

15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0

…the system sending the SACK (and requesting the file) sees this:

15:49:42.638349 IP 10.68.58.2.53760 > 10.129.4.64.ssh: . ack 36822 win 65535 <nop,nop,timestamp 511936353 1751514520,nop,nop,sack 1 {38190:39558}>

The receiver says “I have 36822 and 38190:39558, but missed 36823:38189.” The sender hears “I have 36822 and 491353276:491354644, but missed 36823:491353275.” 491353276 is larger than the largest sequence id the sender has actually used on this connection, so iptables categorizes the packet as INVALID.
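
In fact, you can read the shift straight off the two captures, since both SACK edges differ by the same constant:

491353276 - 38190 = 491315086
491354644 - 39558 = 491315086

That constant is presumably the per-connection offset the firewall applied to the sequence ids, and exactly the offset that never got applied to the SACK option.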

But wait… if the Cisco is rewriting sequence ids, why is it only the SACK fields that differ between the sender and the receiver? If Cisco randomizes the sequence ids of packets passing through the firewall, surely the regular seq and ack fields should be different, too.

It’s a trick question: even though tcpdump reports 36822 on both ends of the connection, the actual sequence ids on the wire are different. By default, tcpdump normalizes sequence ids, starting at 1 for each new connection.

-S Print absolute, rather than relative, TCP sequence numbers.
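
Re-run the captures with -S and the seq/ack fields stop matching up between the two ends; something along these lines (interface, again, assumed):

$ tcpdump -S -n -i eth0 'host 10.68.58.2 and port 22'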

The Cisco doesn’t rewrite SACK fields to match its rewritten sequence ids. The raw values are passed along, conflicting with the ACK value and making the SACK blocks meaningless for the connection. In normal situations (that is, without iptables throwing out INVALID packets) this only serves to break SACK: the sender still gets the ACK and responds with a normal full retransmission. That’s a performance degradation, but not a catastrophic failure. However, because iptables is rejecting the INVALID packet entirely, the TCP stack doesn’t even get a chance to fall back to a full retransmission.

Because modern TCP stacks already choose hard-to-predict initial sequence numbers, sequence id forgery isn’t the threat it once was, so we’ve taken the advice of the Cisco article and disabled the randomization feature in the firewall altogether.

class-map TCP
  match port tcp range 1 65535
policy-map global_policy
  class TCP
    set connection random-sequence-number disable
service-policy global_policy global
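
(To double-check that the change took, something like show running-config policy-map on the firewall should list the disable line under the TCP class; I’m sketching the verification step here, not quoting it.)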

With that, the problem has finally disappeared.

$ scp 10.129.4.32:data Downloads
data 100% 256MB 16.0MB/s 00:16
$ scp 10.129.4.32:data Downloads
data 100% 256MB 15.1MB/s 00:17
$ scp 10.129.4.32:data Downloads
data 100% 256MB 17.1MB/s 00:15