In which Cisco SACK’d iptables
At some unknown point, ssh to at least some of our front-end nodes started spuriously failing when faced with a sufficiently large burst of traffic (e.g., cat-ing a large file to stdout). I was the only person complaining about it (no other team members, no users), so I blamed it on something specific to my environment and prioritized it as an annoyance rather than as a real system problem.

Which is to say: I ignored the problem, hoping it would go away on its own.
Some time later I needed to access our system from elsewhere on the Internet, and from a particularly poor connection as well. Suddenly I couldn't even scp reliably. I returned to the office, determined to pinpoint a root cause. I did my best to take my vague impression of "a network problem" and turn it into a repeatable test case: dd-ing a bunch of /dev/random into a file and scp-ing it to my local box.
$ scp 10.129.4.32:data Downloads
data   0% 1792KB   1.0MB/s - stalled -^CKilled by signal 2.
$ scp 10.129.4.32:data Downloads
data   0%  256KB   0.0KB/s - stalled -^CKilled by signal 2.
Awesome: it fails from my desktop. I duplicated the failure on my netbook, which I then physically carried into the datacenter. I plugged directly into the DMZ and repeated the test with the same result; but, if I moved my test machine to the INSIDE network, the problem disappeared.
Wonderful! The problem was clearly with the Cisco border switch/firewall, because (a) the problem went away when I bypassed it, and (b) the Cisco switch is Somebody Else’s Problem.
So I told Somebody Else that his Cisco was breaking our ssh. He checked his logs and claimed he saw nothing obviously wrong (e.g., no dropped packets, no warnings). He didn't just punt the problem back at me, though: he came to my desk and, together, we trawled through some tcpdump output.
15:48:37.752160 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520], length 0
15:48:37.752169 IP 10.129.4.32.ssh > 10.68.58.2.53760: Flags [.], seq 47766:55974, ack 3670, win 601, options [nop,nop,TS val 1751514521 ecr 511936353], length 8208
15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0
15:48:37.752240 IP 10.129.4.32 > 10.68.58.2: ICMP host 10.129.4.32 unreachable - admin prohibited, length 72
The sender, 10.129.4.32, was sending an ICMP error back to the receiver, 10.68.58.2. A niggling memory of this "admin prohibited" message reminded me of our iptables configuration.
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m comment --comment "ssh" -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
iptables rejects any packet that isn't explicitly allowed, or part of an existing connection, with icmp-host-prohibited, exactly as we were seeing. In particular, it seemed to be rejecting any packet that contained SACK fields.
When packets are dropped (or arrive out of order) in a modern TCP connection, the receiver sends a SACK message: "ACK X, SACK Y:Z". The "selective" acknowledgement indicates that segments between X and Y are missing, but allows later segments to be acknowledged out of order, avoiding unnecessary retransmission of already-received segments. For some reason, such segments were not being identified by iptables as part of the ESTABLISHED connection.
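To make that bookkeeping concrete, here is a minimal sketch (in Python, using half-open byte ranges and the numbers from the capture below; not how any real TCP stack is written) of how a sender interprets an "ACK X, SACK Y:Z" report:

```python
def missing_ranges(ack, sack_blocks):
    """Given the cumulative ACK and a list of (start, end) SACK blocks,
    return the half-open byte ranges the receiver has NOT yet seen."""
    gaps = []
    highest = ack
    for start, end in sorted(sack_blocks):
        if start > highest:
            gaps.append((highest, start))  # bytes [highest, start) are missing
        highest = max(highest, end)
    return gaps

# The receiver holds everything up to 36822 plus the block 38190:39558,
# so only the gap in between needs retransmitting.
print(missing_ranges(36822, [(38190, 39558)]))  # [(36822, 38190)]
```

Only the gap is resent; the already-received 38190:39558 block is not retransmitted, which is the whole point of SACK.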
A Red Hat bugzilla indicated that you should solve this problem by disabling SACK. That seems pretty stupid to me, though, so I went looking around in the iptables documentation instead. A netfilter patch seemed to indicate that iptables connection tracking should support SACK, so I contacted the author, a wonderful gentleman named Jozsef Kadlecsik, who confirmed that SACK should be totally fine passing through iptables in our kernel version. Instead, he indicated that problems like this usually implicate a misbehaving intermediate firewall appliance.
And so the cycle of blame was back on the Cisco… but why?
Let’s take a look at that tcpdump again.
15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0
15:48:37.752240 IP 10.129.4.32 > 10.68.58.2: ICMP host 10.129.4.32 unreachable - admin prohibited, length 72
ACK 36822, SACK 491353276:491354644… so bytes 36823:491353276 are missing? That’s quite a jump. Surely 491,316,453 bytes didn’t get sent in a few nanoseconds.
A Cisco support document holds the answer; or, at least, the beginning of one. By default, the Cisco firewall performs “TCP Sequence Number Randomization” on all TCP connections. That is to say, it modifies TCP sequence ids on incoming packets, and restores them to the original range on outgoing packets. So while the system receiving the SACK sees this:
15:48:37.752215 IP 10.68.58.2.53760 > 10.129.4.32.ssh: Flags [.], ack 36822, win 65535, options [nop,nop,TS val 511936353 ecr 1751514520,nop,nop,sack 1 {491353276:491354644}], length 0
…the system sending the SACK (the one requesting the file) sees this:
15:49:42.638349 IP 10.68.58.2.53760 > 10.129.4.64.ssh: . ack 36822 win 65535 <nop,nop,timestamp 511936353 1751514520,nop,nop,sack 1 {38190:39558}>
The receiver says “I have 36822 and 38190:39558, but missed 36823:38189.” The sender sees “I have 36822 and 491353276:491354644, but missed 36823:491353275.” 491353276 is larger than the largest sequence id sent by the provider, so iptables categorizes the packet as INVALID.
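The numbers line up once you assume a constant shift. A quick back-of-the-envelope check (in Python; the offset here is inferred by comparing the two captures, not read from the firewall):

```python
# Hypothetical randomization offset, inferred by comparing the SACK block
# the receiver sent (491353276:491354644) with the one the sender should
# have seen (38190:39558):
OFFSET = 491353276 - 38190  # 491315086

receiver_sack = (491353276, 491354644)  # as captured on the receiver's side

# Undoing the shift recovers a block that fits the sender's sequence space:
print(tuple(edge - OFFSET for edge in receiver_sack))  # (38190, 39558)
```

Both edges of the SACK block translate back cleanly with the same offset, which is exactly what you would expect if an appliance were shifting sequence ids in one direction and forgetting to shift the SACK option back.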
But wait… if the Cisco is rewriting sequence ids, why is it only the SACK fields that differ between the sender and the receiver? If Cisco randomizes the sequence ids of packets that pass through the firewall, surely the regular SYN and ACK id fields should be different, too.
It’s a trick question: even though tcpdump reports 36822 on both ends of the connection, the actual sequence ids on the wire are different. By default, tcpdump normalizes sequence ids, starting at 1 for each new connection.
-S Print absolute, rather than relative, TCP sequence numbers.
The Cisco doesn’t rewrite SACK fields to coincide with its rewritten sequence ids. The raw values are passed on, conflicting with the ACK value and corrupting the packet. In normal situations (that is, without iptables throwing out invalid packets) this only serves to break SACK; the provider still gets the ACK and responds with a normal full retransmission. It’s a performance degradation, but not a catastrophic failure. However, because iptables is rejecting the INVALID packet entirely, the TCP stack doesn’t even get a chance to try a full retransmission.
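Netfilter’s view of this can be sketched as a simple plausibility test (a simplification for illustration, not the actual conntrack code): a SACK block that acknowledges data beyond anything the sender has transmitted cannot belong to the connection, so the packet is marked INVALID and, under our ruleset, REJECTed.

```python
def sack_plausible(sack_blocks, highest_seq_sent):
    """Simplified conntrack-style sanity check: every SACK block must
    refer to data the sender has actually put on the wire."""
    return all(end <= highest_seq_sent for _start, end in sack_blocks)

# The sender had sent up to (relative) sequence 55974 in the capture above...
print(sack_plausible([(38190, 39558)], 55974))          # True:  plausible
print(sack_plausible([(491353276, 491354644)], 55974))  # False: INVALID
```

The crucial difference is DROP versus REJECT: had the packet merely been dropped or ignored, the sender would eventually time out and retransmit everything; the active ICMP rejection kills the connection outright.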
Because TCP sequence id forgery isn’t a problem under modern TCP stacks, we’ve taken the advice of the Cisco article and disabled the randomization feature in the firewall altogether.
class-map TCP
  match port tcp range 1 65535
policy-map global_policy
  class TCP
    set connection random-sequence-number disable
service-policy global_policy global
With that, the problem has finally disappeared.
$ scp 10.129.4.32:data Downloads
data  100%  256MB  16.0MB/s  00:16
$ scp 10.129.4.32:data Downloads
data  100%  256MB  15.1MB/s  00:17
$ scp 10.129.4.32:data Downloads
data  100%  256MB  17.1MB/s  00:15