 -----                 -----
|     |-- 1500 MTU ->|     |
|  A  |              |  B  |
|     |<-- 500 MTU --|     |
 -----                 -----
Here’s the problem: when transferring large volumes of data from A to B over a TCP connection, the transfer rate is much slower than expected given the 100Mbps link speed from A to B. The question is why, and what can be done about it? Read on for the discussion, and a proposed and tested solution.
Maximum Segment Size
It was pointed out to me that the Maximum Segment Size (MSS) for the connection would be 460 bytes, based on the smaller of the MTUs minus IP and TCP headers (20 bytes each). When a new TCP connection is established, each endpoint determines the largest “segment” (transport-layer unit of transmitted data) it will send based on the largest underlying packet it thinks can be sent without causing IP fragmentation. Commonly, this is done by inspecting the link-layer MTU of the outbound interface and subtracting the size of the IP and TCP headers. The two hosts then swap their respective MSS values, and the smaller value gets picked by both sides as the MSS for the connection (this is called “MSS negotiation,” which is a bit of a misnomer since each side decides this independently - hint, this fact will be important later!).
So let's apply the MSS determination process to the above network scenario. Computer A sees that its outbound interface has a link-layer MTU of 1500 bytes and subtracts 40 bytes of headers to get an MSS of 1460 bytes. Likewise, computer B sees a link-layer MTU of 500 bytes and gets an MSS of 460 bytes. A and B swap these values during the TCP handshake, and both independently decide that 460 bytes, the smaller of the two values, should be the connection MSS.
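To make the selection process concrete, here is a minimal sketch in C (my own illustration, with function names of my choosing; header sizes assume IPv4 and TCP without options, as above):

#include <stdio.h>

#define IP_HDR_LEN  20  /* IPv4 header, no options */
#define TCP_HDR_LEN 20  /* TCP header, no options */

/* Largest TCP payload that fits in one unfragmented IP packet. */
static int local_mss(int link_mtu)
{
    return link_mtu - IP_HDR_LEN - TCP_HDR_LEN;
}

/* Each side independently takes the smaller of its own value and the
 * peer's advertised value; no true negotiation takes place. */
static int connection_mss(int my_mtu, int peer_advertised_mss)
{
    int mine = local_mss(my_mtu);
    return mine < peer_advertised_mss ? mine : peer_advertised_mss;
}

int main(void)
{
    printf("A decides: %d\n", connection_mss(1500, 460));  /* 460 */
    printf("B decides: %d\n", connection_mss(500, 1460));  /* 460 */
    return 0;
}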
We can observe these values being exchanged by simulating asymmetric MTUs on a single physical link between two computers running Linux. Suppose A has IP address 192.168.1.6 on interface eth1 and B has IP address 192.168.1.5 on interface eth0:
 -----                               -----
|  A  |(eth1 1.6) <--> (1.5 eth0)  |  B  |
 -----                               -----
First, let’s force the MTU of eth0 down to 500 bytes to simulate our B-A link:
<terminal 1 @ B>
# ifconfig eth0 mtu 500
Note that this command only sets the MTU for outbound link-layer frames; inbound frames up to 1500 payload bytes are still accepted. This is key for our simulation to work.
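On systems where ifconfig is deprecated in favor of iproute2, the equivalent command should work just as well:
# ip link set dev eth0 mtu 500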
Now let’s start a TCP connection between A and B and see what happens. We’ll pick A to be the TCP server and B to be the client. At B, we’ll start a packet capture using tcpdump, and then set up both TCP endpoints using netcat (nc). At A (the server), the -l flag tells netcat to listen for incoming connections and the -p 1234 specifies that it should use port 1234:
<terminal 1 @ B>
# tcpdump -i eth0 tcp port 1234
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
...
<terminal 1 @ A>
$ nc -l -p 1234
<terminal 2 @ B>
$ nc 192.168.1.6 1234
<terminal 1 @ B>
...
10:10:39.507934 IP 192.168.1.5.35687 > 192.168.1.6.1234: Flags [S], seq 4201136530, win 4600, options [mss 460,sackOK,TS val 561664375 ecr 0,nop,wscale 4], length 0
10:10:39.594333 IP 192.168.1.6.1234 > 192.168.1.5.35687: Flags [S.], seq 2050732648, ack 4201136531, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 850603265 ecr 561664375,sackOK,eol], length 0
10:10:39.594385 IP 192.168.1.5.35687 > 192.168.1.6.1234: Flags [.], ack 1, win 288, options [nop,nop,TS val 561664397 ecr 850603265], length 0
10:10:39.595374 IP 192.168.1.6.1234 > 192.168.1.5.35687: Flags [.], ack 1, win 65535, options [nop,nop,TS val 850603265 ecr 561664397], length 0
...
Observe that, during the TCP handshake, the SYN segment from 192.168.1.5 (B) to 192.168.1.6 (A) contains an advertised MSS of 460 bytes, followed by the SYN-ACK segment from A to B with an MSS of 1460 bytes. If we were to start sending bulk data from A to B or from B to A, we would see that segment payloads are no greater than 460 bytes, regardless of direction.
For data sent from B to A, this is a good thing. If B created TCP segments larger than 460 bytes, IP would be forced to fragment them, resulting in unnecessary overhead. However, this is wasteful for data sent from A to B, since it now takes roughly three segments to carry the data that could fit in one maximum-sized segment for the A-B link MTU. Not only does this add overhead for the extra IP and TCP headers on each segment, it also hurts TCP’s congestion-control behavior. As a TCP transfer gets going, it initially sends data very slowly so as not to congest the link. As data is successfully transferred and acknowledged, it gradually increases its transfer rate. Whenever a segment is lost (TCP assumes all loss is due to congestion), it backs off and slowly builds its rate up again. The problem is that the rate of this increase is proportional to the MSS, so a smaller MSS means a slower ramp-up in transfer rate. This turns out to be a major limiter of TCP performance. Again, this is usually considered an acceptable tradeoff to prevent network congestion; in unusual cases like the one above, however, it can do more harm than good.
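To get a feel for the effect, here is a back-of-the-envelope sketch (my own illustration, assuming the textbook congestion-avoidance rule that the window grows by roughly one MSS per round trip):

#include <stdio.h>

/* Count round trips needed to grow the congestion window from one
 * segment to a target size, adding one MSS per round trip. */
static int rtts_to_reach(int target_window, int mss)
{
    int cwnd = mss, rtts = 0;
    while (cwnd < target_window) {
        cwnd += mss;
        rtts++;
    }
    return rtts;
}

int main(void)
{
    /* Target: a 64KB window, purely for illustration. */
    printf("MSS  460: %d round trips\n", rtts_to_reach(65535, 460));   /* 142 */
    printf("MSS 1460: %d round trips\n", rtts_to_reach(65535, 1460));  /* 44 */
    return 0;
}

With the smaller MSS it takes roughly three times as many round trips to open the same window, mirroring the three-to-one segment count above.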
The goal then is to somehow make A think the connection MSS is 1460 bytes while B still thinks the MSS is 460 bytes, so that A-B data transfer is faster while B-A data transfer doesn’t incur fragmentation.
Forcing a higher MSS
A couple of ideas were tossed around for solving this. Both computers were assumed to run Linux or similar. One idea was to use setsockopt() with option TCP_MAXSEG to force the MSS at A. A drawback of this approach is that every program would have to be modified and recompiled (or setsockopt() would have to be shadowed using LD_PRELOAD). This is messy, but doable. The larger problem turned out to be that TCP_MAXSEG only sets an upper bound on the MSS; the kernel can still choose a smaller value based on what the remote endpoint advertises during the TCP handshake. It might be possible to hack around this behavior, but it would involve tweaking the TCP protocol implementation in ways that could have unintended side effects.
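For reference, the call in question looks something like this (a sketch only; as noted above, the kernel treats the value as a ceiling, not a floor):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cap the MSS on a socket; call before connect() or listen().
 * The kernel may still pick a smaller value based on the MSS the
 * remote endpoint advertises during the handshake. */
static int cap_mss(int sockfd, int mss)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}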
Another idea was to use the built-in Linux packet filter, controlled by iptables, which has a module called TCPMSS that can modify the MSS contained in a TCP segment using option --set-mss. However, this suffers from the same problem: the designers built in a safeguard such that the MSS in a segment cannot be raised; it can only be lowered. The module is normally used to artificially limit the MSS in certain misbehaving networks where an intermediate device causes the negotiation to choose an MSS that is too large. We're looking to solve the opposite problem!
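For reference, the module’s intended use looks something like the following, clamping the MSS down to the path MTU rather than raising it:
# iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu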
After some time spent searching forums, I couldn’t find anyone who had proposed another solution. Well… modifying a module used by iptables seemed more straightforward than modifying a TCP implementation, so it was time to dig into some kernel code.
On my Ubuntu system running a 3.2.0-48-generic-pae kernel, I downloaded the kernel source:
# apt-get install linux-source-3.2.0
This places the kernel source as a tarball in /usr/src/. Then:
# cd /usr/src
# tar jxf linux-source-3.2.0.tar.bz2
to unpack the kernel source in /usr/src/linux-source-3.2.0/. The Linux subsystem that underlies the iptables command is called “netfilter.” Its modules live in the kernel subdirectory net/netfilter/ and the file of interest is net/netfilter/xt_TCPMSS.c. Reading through the code, we find in function tcpmss_mangle_packet() starting at line 93:
...
/* Never increase MSS, even when setting it, as
* doing so results in problems for hosts that rely
* on MSS being set correctly.
*/
if (oldmss <= newmss)
return 0;
...
This is the safeguard that prevents the MSS from being increased; oldmss is the MSS contained in the TCP segment and newmss is the MSS requested to be set. If we comment out the if statement and return lines, the module will modify the MSS regardless of whether the requested value is larger or smaller than the original.
The module can be recompiled using the following Makefile contents (it may help to have the kernel headers and source installed first, though it is not necessary to recompile the entire kernel):
obj-m += xt_TCPMSS.o
all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
Note that $(shell uname -r) evaluates to the full version of the kernel - in my case it is “3.2.0-48-generic-pae” - and that the make ... line must be preceded by a single tab (not spaces). It’s also worth noting that I first copied xt_TCPMSS.c into a separate directory before modifying and recompiling it, just to be sure I didn’t overwrite anything important. Save the above lines into a file called "Makefile" in the same directory, then run make. The compilation produces several new files, including the compiled kernel module xt_TCPMSS.ko.
Compiled netfilter kernel modules live in /lib/modules/3.2.0-48-generic-pae/kernel/net/netfilter/. I first backed up the old module and then copied in my new version:
# cp /lib/modules/3.2.0-48-generic-pae/kernel/net/netfilter/xt_TCPMSS.ko xt_TCPMSS.ko.old
# cp xt_TCPMSS.ko /lib/modules/3.2.0-48-generic-pae/kernel/net/netfilter/
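If the old version of the module happened to be loaded already, it may be necessary to unload it so the new file gets picked up on next use; something like:
# rmmod xt_TCPMSS
# depmod -a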
Now we’re ready to test this out. Assuming we installed the modified TCPMSS module at B (the client for our TCP connection), the iptables rule should look something like this:
# iptables -t mangle -A POSTROUTING -p tcp -o eth0 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
Breaking this rule down (the documentation for iptables gives detailed explanations of each option):
- netfilter uses a number of “tables” for different functions; we want to use the part of netfilter that mangles or alters packet contents.
- The mangle table has hooks into different stages of receiving and sending packets; we want to mangle packets after the routing decision is made, just before they are sent to the outbound interface. This is called the POSTROUTING hook.
- We use filters so that our rule only applies to the TCP protocol, for packets leaving interface eth0.
- Further, we filter on certain flags in the TCP header. Namely, we only want to affect TCP segments with the SYN flag set and the RST (reset) flag not set, since it is during the handshake that the MSS is advertised, and we want to change it for the initial SYN segment from the client (B) to the server (A).
- The action we take is to use the TCPMSS module (netfilter automatically loads our compiled module on-demand).
- We want to change the MSS of matching segments to 1460 bytes (recall this is the A-B link-layer MTU minus IP and TCP headers).
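To double-check that the rule is installed - and, once connections are made, that it is actually matching SYN segments - the mangle table can be listed along with its packet counters:
# iptables -t mangle -L POSTROUTING -n -v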
If instead I wanted to apply this rule at A and modify the MSS on incoming TCP SYN segments (assuming the modified module was installed there first), I could change the rule to:
# iptables -t mangle -A PREROUTING -p tcp -i eth1 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1460
Repeating the test with tcpdump and nc from above, we observe the following:
<terminal 1 @ B>
...
10:27:37.904130 IP 192.168.1.5.35688 > 192.168.1.6.1234: Flags [S], seq 2575051267, win 4600, options [mss 1460,sackOK,TS val 561918974 ecr 0,nop,wscale 4], length 0
10:27:38.034540 IP 192.168.1.6.1234 > 192.168.1.5.35688: Flags [S.], seq 917538710, ack 2575051268, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 850613438 ecr 561918974,sackOK,eol], length 0
10:27:38.034598 IP 192.168.1.5.35688 > 192.168.1.6.1234: Flags [.], ack 1, win 288, options [nop,nop,TS val 561919007 ecr 850613438], length 0
10:27:38.035596 IP 192.168.1.6.1234 > 192.168.1.5.35688: Flags [.], ack 1, win 65535, options [nop,nop,TS val 850613438 ecr 561919007], length 0
...
Notice that now the MSS values advertised by both 192.168.1.5 (B) and 192.168.1.6 (A) are 1460 bytes - our iptables rule with our modified netfilter module works!
Although B internally determines an MSS of 460 bytes, we modified the MSS it shares with A to be 1460 bytes. Now A compares its calculated MSS of 1460 bytes against what it thinks B advertised (also 1460 bytes), and decides that the TCP connection MSS should be 1460 bytes. However, although A advertises a 1460 byte MSS back to B, B still has its internally calculated MSS of 460 bytes and decides independently of A that the TCP connection MSS should be 460 bytes. Each endpoint has a different idea of what the MSS for the connection is, so each side sends maximum-sized segments that don’t incur IP fragmentation!
If we start transferring some data from A to B, we can observe that this trickery does in fact work (of course, it will only be evident for some segments, since inter-segment timing, non MSS-sized data, TCP option headers, et cetera will result in varying-sized segments):
<terminal 1 @ B>
...
10:42:31.346746 IP 192.168.1.6.1234 > 192.168.1.5.35694: Flags [.], seq 1:289, ack 1, win 65535, options [nop,nop,TS val 850622362 ecr 562142334], length 288
10:42:31.346778 IP 192.168.1.5.35694 > 192.168.1.6.1234: Flags [.], ack 289, win 344, options [nop,nop,TS val 562142335 ecr 850622362], length 0
10:42:31.347863 IP 192.168.1.6.1234 > 192.168.1.5.35694: Flags [.], seq 289:1737, ack 1, win 65535, options [nop,nop,TS val 850622362 ecr 562142335], length 1448
10:42:31.347897 IP 192.168.1.5.35694 > 192.168.1.6.1234: Flags [.], ack 1737, win 525, options [nop,nop,TS val 562142335 ecr 850622362], length 0
10:42:31.347923 IP 192.168.1.6.1234 > 192.168.1.5.35694: Flags [FP.], seq 1737:2001, ack 1, win 65535, options [nop,nop,TS val 850622362 ecr 562142335], length 264
10:42:31.348118 IP 192.168.1.5.35694 > 192.168.1.6.1234: Flags [F.], seq 1, ack 2002, win 706, options [nop,nop,TS val 562142335 ecr 850622362], length 0
...
Measuring performance impact
This seems nice in theory, but how much does it matter to TCP performance in practice?
To simulate a larger transfer from A to B, I created a 200MB file at A consisting of zeroes:
<terminal 1 @ A>
$ dd if=/dev/zero of=zero.bin bs=1M count=200
I then reran the test with nc from before, but with a couple of modifications. The server side takes this file as input using an input file redirect, and -q 0 is added on the netcat server side to force it to close the TCP connection as soon as the input is fully read. On the client side, I use the bash shell built-in command time to get the “real” (wall-clock) time it takes to run the netcat command, from TCP connection establishment to connection termination.
<terminal 1 @ A>
$ nc -l -p 1234 -q 0 < zero.bin
<terminal 2 @ B>
$ time nc 192.168.1.6 1234 > zerocopy.bin
For our asymmetric link, the wall-clock time averaged about 27 seconds. However, after applying the iptables rule with the modified TCPMSS module, the time averaged about 18 seconds. This is pretty significant: a 33% decrease in transfer time! Alternatively, we can look at this in terms of average transfer rate over the duration of the connection: 200MB * 8bits/byte / 27sec = 59.3Mbps with asymmetric MTUs, versus 200MB * 8bits/byte / 18sec = 88.9Mbps with the iptables rule. For a 100Mbps link, this is doing pretty well!
Things get even more interesting if we start introducing link latency. Earlier, I mentioned that the rate at which TCP transfers ramp up is proportional to the connection MSS. I neglected to mention that this rate is also inversely proportional to the round-trip time (RTT) of the connection path. Linux can emulate a variety of network effects, including link latency, using the tc command and its netem discipline. Suppose we add a 5 millisecond (ms) delay in each direction:
<terminal 1 @ A>
# tc qdisc add dev eth1 root netem delay 5ms
<terminal 1 @ B>
# tc qdisc add dev eth0 root netem delay 5ms
The transfer time without the rule jumped to about 40 seconds (40Mbps). Again, applying our iptables rule brings the transfer times down to an average of 19 seconds (84.2Mbps). A >50% decrease in transfer time compared to letting the smaller MTU dictate the A-B MSS!
For fun, let’s increase the delay to 10ms in each direction. While the time with the iptables rule increases slightly to about 23 seconds (69.6Mbps), the time without the rule rockets to about 79 seconds (20.3Mbps)! The decrease in transfer time with our rule is now >70%!
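For completeness: the emulated delay can be adjusted or removed with variations of the same command, for example at A:
# tc qdisc change dev eth1 root netem delay 10ms
# tc qdisc del dev eth1 root
(and likewise for eth0 at B).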
In summary, the performance gain from this approach on links with asymmetric MTUs increases as the smaller MSS decreases and as the RTT (latency) increases. Note that it is only beneficial to the extent that data primarily flows across the link with the larger MTU; if the primary data transfer were from B to A, performance would still be constrained by the smaller MTU, since B still bases its outgoing segment size on that value.
Update 01/16/2014: It was pointed out to me by a colleague that when measuring network performance, it is best to remove the hard disk as a factor. The measurements above represent a real-world situation where a file is transferred from A to B. However, to strictly test the difference in transfer rates, a better alternative would be to pipe the output of dd into nc at the server (A):
dd if=/dev/zero bs=1M count=200 | nc -l -p 1234 -q 0
Likewise at the client (B), the output can go to the screen (zero bytes are ignored in most terminals) or piped to /dev/null:
time nc 192.168.1.6 1234 > /dev/null
(Note that neither /dev/zero nor /dev/null are regular files, but instead are special devices that do not access a hard disk. Hence their performance is tied only to the speeds of the processor, memory, and hardware buses.)
The code
If you’re running the same kernel version I used for this test, you can patch your xt_TCPMSS.c with the following really short diff :)
96a97
> /*
98a100
> */
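If the diff is saved to a file (say tcpmss.diff - the name is arbitrary), it should apply with:
$ patch xt_TCPMSS.c < tcpmss.diff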
Then use the Makefile contents mentioned above, install the modified kernel module, and enjoy better TCP transfer times!