The plots below show cwnd vs. time for a file transfer (FTP) of approximately 5 MB across a transatlantic path with a 64 kbps bottleneck link. The data was obtained by instrumenting the FreeBSD 2.1.6 kernel to log cwnd and other TCP state variables. The receiver window size is 32 KB and the MSS is 536 bytes. Runs were made on weekdays at about 4 PM (PDT). The source machine runs FreeBSD and the sink machine runs OSF/1. This is the output of traceroute between the two machines:
traceroute to XXXX.CAIRO.EUN.EG (193.227.10.XXX): 1-30 hops, 38 byte packets
 1  Core-gateway.Stanford.EDU (36.83.0.3)  2.49 ms  1.75 ms  1.31 ms
 2  SUNet-Gateway.Stanford.EDU (171.64.1.34)  2.18 ms  1.72 ms  1.41 ms
 3  su-pr1.bbnplanet.net (198.31.10.3)  2.42 ms  1.48 ms  1.25 ms
 4  paloalto-br1.bbnplanet.net (131.119.0.193)  2.54 ms  3.6 ms  2.15 ms
 5  chicago1-br1.bbnplanet.net (4.0.1.2)  53.8 ms  53.6 ms  53.9 ms
 6  washdc1-br1.bbnplanet.net (4.0.1.117)  69.8 ms  69.9 ms  71.0 ms
 7  vienna1-br1.bbnplanet.net (4.0.1.90)  190 ms  248 ms  315 ms
 8  maeeast-2.bbnplanet.net (4.0.1.94)  72.0 ms  72.0 ms  71.5 ms
 9  icm-mae-e-f0/0.icp.net (192.41.177.240)  74.4 ms (ttl=243!)  75.5 ms (ttl=2)
10  icm-dc-1-H1/0-T3.icp.net (198.67.131.10)  86.4 ms (ttl=244!)  76.5 ms (ttl=)
11  icm-dc-2-F2/0.icp.net (198.67.131.34)  74.2 ms  79.8 ms  77.1 ms
12  icm-paris-1-S0-1984k.icp.net (192.157.65.130)  267 ms  263 ms  189 ms
13  193.227.3.1 (193.227.3.1)  317 ms  304 ms  485 ms
14  193.227.1.10 (193.227.1.10)  380 ms  459 ms  417 ms
15  10.0.13.2 (10.0.13.2)  352 ms  372 ms  431 ms
16  193.227.13.29 (193.227.13.29)  410 ms  271 ms  251 ms
17  193.227.13.68 (193.227.13.68)  450 ms (ttl=241!)  554 ms (ttl=241!)  400 ms
18  XXXX.CAIRO.EUN.EG (193.227.10.XXX)  560 ms (ttl=240!)  315 ms (ttl=240!)  304 ms
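As a rough sanity check on this setup, a back-of-the-envelope calculation (using an approximate 0.5 s RTT read off the last few traceroute hops, which should be treated as a ballpark figure) shows how small the bottleneck pipe is compared with the 32 KB receiver window. The few lines of C below just do the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const double bottleneck_bps = 64e3;    /* bottleneck link rate               */
        const double rtt_s          = 0.5;     /* approximate RTT from the last hops */
        const double mss            = 536;     /* MSS used in our runs               */
        const double rwnd           = 32768;   /* advertised receiver window         */

        double pipe_bytes = bottleneck_bps / 8 * rtt_s;   /* bandwidth-delay product */

        printf("bandwidth-delay product ~ %.0f bytes (~%.0f segments)\n",
               pipe_bytes, pipe_bytes / mss);
        printf("receiver window         = %.0f bytes (~%.0f segments)\n",
               rwnd, rwnd / mss);
        return 0;
    }

The pipe holds only on the order of 4 KB (seven or eight segments), so the window can grow far past what the path can carry; the bottleneck queue eventually overflows, which is what forces the losses and fast retransmissions discussed below.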
The first plot shows cwnd vs. time for the unmodified TCP implementation; the second plot enlarges a portion of the first. Notice the sudden rise in cwnd after a fast retransmission, followed by the drop back down to ssthresh.
The first two plots agree with RFC 2001 (page 5, second paragraph):
"2. Each time another duplicate ACK arrives, increment cwnd by the segment size. This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd. 3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This ACK should be the acknowledgment of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the first duplicate ACK. This step is congestion avoidance, since TCP is down to one-half the rate it was at when the packet was lost."
The reasoning behind this temporary inflation of cwnd is to allow the sender to put another segment onto the network for each incoming duplicate ACK (each of which indicates that another segment has made it to the other side). This is necessary because TCP's sliding window is stuck and will not slide until the first non-duplicate ACK comes back. As soon as the first non-duplicate ACK arrives, cwnd is set back to ssthresh and the window resumes sliding in normal congestion-avoidance mode.
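For concreteness, here is a minimal sketch of these rules as a stand-alone C program. It is an illustration of the RFC 2001 steps only, not the FreeBSD 2.1.6 code, and the starting cwnd and the number of duplicate ACKs are made-up values:

    #include <stdio.h>

    #define MSS 536L   /* segment size used in our runs */

    /* Toy walk-through of the RFC 2001 fast retransmit / fast recovery steps. */
    int main(void)
    {
        long cwnd = 20 * MSS;              /* hypothetical cwnd when loss is detected */
        long ssthresh;
        int  dupacks;

        /* Step 1: third duplicate ACK -> retransmit the lost segment, set
         * ssthresh to half the window, and inflate cwnd by the three
         * duplicate ACKs already received. */
        ssthresh = cwnd / 2;
        cwnd     = ssthresh + 3 * MSS;
        printf("fast retransmit: ssthresh=%ld cwnd=%ld\n", ssthresh, cwnd);

        /* Step 2: each further duplicate ACK inflates cwnd by one MSS,
         * possibly allowing one more segment to be clocked out. */
        for (dupacks = 4; dupacks <= 10; dupacks++) {
            cwnd += MSS;
            printf("dup ACK %2d: cwnd=%ld\n", dupacks, cwnd);
        }

        /* Step 3: the first ACK for new data deflates cwnd back to ssthresh,
         * and the connection continues in congestion avoidance. */
        cwnd = ssthresh;
        printf("new ACK: cwnd deflated to %ld\n", cwnd);
        return 0;
    }

Each duplicate ACK inflates cwnd by one MSS, and the first ACK for new data deflates it back to ssthresh; that deflation is the sudden drop visible in the plots above.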
It has been known for some time that there are problems with fast recovery, and fixes have been proposed - for instance the recent Floyd and Fall CCR paper, which discusses SACK; the Mathis and Mahdavi SIGCOMM '96 paper, which discusses FACK; and the Janey Hoe SIGCOMM '96 paper.
The following plots were generated by capturing traces with tcpdump running on a separate machine on the same Ethernet segment as the source. We used Greg Minshall's tracelook tool to generate the plots and then annotated them to provide explanations. The traces were captured for the same runs shown above.
The first figure provides a general explanation of the cwnd inflation during fast recovery. It also shows fast recovery in one of its best situations, where a good number of segments are sent during the fast recovery period and only a few back-to-back segments remain to be sent when the window slides. Notice also that the source cannot send more than the new cwnd (1/2 the old cwnd) during the second half of the fast recovery period. This is because the window is still stuck at the fast-retransmitted packet but has been halved, so the first half of the duplicate ACKs do not cause any new data packets to be sent out. It also means that, counting from the fast retransmission, the number of duplicate ACKs required before any of the fast-recovery packets are sent is 1/2 old cwnd - 3; equivalently, counting from the last packet sent under the old cwnd, it is 1/2 old cwnd (which is the new cwnd). This can be clearly seen in both of the following figures. So the current fast recovery mechanism in fact gives the network a breathing period of approximately 1/2 RTT (or 1/2 old cwnd - 3 duplicate ACKs). (Thanks to Bernhard R. Suter and Curtis Villamizar for pointing this out.)
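A quick back-of-the-envelope calculation makes this counting concrete. The old cwnd of 20 segments below is a hypothetical value chosen purely for illustration:

    #include <stdio.h>

    /* Counts, in segments, how many duplicate ACKs must arrive after the
     * fast retransmission before cwnd exceeds the amount of data still
     * outstanding and a fast-recovery packet can go out. */
    int main(void)
    {
        int old_cwnd = 20;              /* hypothetical old cwnd at the time of loss */
        int new_cwnd = old_cwnd / 2;    /* ssthresh, i.e. the new cwnd               */

        /* Fast recovery starts with cwnd = new_cwnd + 3 (the three duplicate
         * ACKs that triggered the retransmission) while old_cwnd segments are
         * still outstanding; each further duplicate ACK adds one segment. */
        int dupacks_needed = old_cwnd - (new_cwnd + 3);

        printf("old cwnd = %d segments, new cwnd = %d segments\n", old_cwnd, new_cwnd);
        printf("duplicate ACKs needed before any fast-recovery packet is sent: %d "
               "(= 1/2 old cwnd - 3)\n", dupacks_needed);
        return 0;
    }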
The second figure shows another situation, in which fast recovery failed to operate successfully. Only a few segments were sent during the fast recovery period, and then a large number were sent back-to-back when the non-duplicate ACK arrived, causing the window to slide and open up. We have no explanation for this yet, and suggestions are welcome.
The first figure shows the behavior after the modification. In this case cwnd increases as in congestion avoidance, so it rarely happens that a segment gets sent during the fast recovery period; instead, essentially all of the segments are sent back-to-back when the window slides.
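The patch itself is not reproduced on this page, but the following sketch shows the spirit of the change as just described: on each additional duplicate ACK, cwnd grows by the congestion-avoidance increment (roughly MSS*MSS/cwnd) rather than by a full MSS. The helper functions and the starting cwnd are hypothetical, not the actual kernel code:

    #include <stdio.h>

    #define MSS 536L

    /* cwnd growth per additional duplicate ACK, stock vs. modified. */
    static long stock_dupack(long cwnd)
    {
        return cwnd + MSS;                  /* RFC 2001: inflate by a full segment */
    }

    static long modified_dupack(long cwnd)
    {
        return cwnd + MSS * MSS / cwnd;     /* congestion-avoidance style growth   */
    }

    int main(void)
    {
        long stock = 10 * MSS, modified = 10 * MSS;   /* hypothetical starting cwnd */
        int  i;

        for (i = 1; i <= 10; i++) {
            stock    = stock_dupack(stock);
            modified = modified_dupack(modified);
        }
        printf("after 10 dup ACKs: stock cwnd = %ld, modified cwnd = %ld\n",
               stock, modified);
        return 0;
    }

After ten duplicate ACKs the stock window has been inflated by ten full segments, while the modified window has grown by only about one segment's worth, which is why almost nothing goes out until the window finally slides.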
The second figure simply demonstrates the speed at which the burst goes out. This is true with or without the modification. Notice that during the burst the sender is sending segments at the speed of its local interface (10 Mbps Ethernet), while the receiver sits behind a 64 kbps bottleneck!
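To put rough numbers on the mismatch (the burst size of 30 segments is a hypothetical figure; the link rates are the ones from our setup):

    #include <stdio.h>

    /* Rough serialization-time arithmetic for a back-to-back burst. */
    int main(void)
    {
        const double seg_bytes      = 536 + 40;   /* MSS plus TCP/IP headers  */
        const double burst_segs     = 30;         /* hypothetical burst size  */
        const double ethernet_bps   = 10e6;       /* sender's local interface */
        const double bottleneck_bps = 64e3;       /* transatlantic bottleneck */

        double burst_bits = burst_segs * seg_bytes * 8;

        printf("time to emit the burst at 10 Mbps: %.1f ms\n",
               1000 * burst_bits / ethernet_bps);
        printf("time to drain it at 64 kbps:       %.2f s\n",
               burst_bits / bottleneck_bps);
        return 0;
    }

A burst that leaves the Ethernet in roughly 14 ms takes on the order of two seconds to drain through the bottleneck, so it presumably piles up in the bottleneck queue.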
The modification actually leads to a more aggressive and ill-behaved TCP! This is something we did not realize when we first made the modification: although the cwnd vs. time graph looks much smoother (i.e. there are fewer slow-start invocations), the tcpdump traces show that we end up sending a greater number of back-to-back packets when a new ACK is received, and there are a couple of reasons for this:
Your feedback is more than welcome; here is how to contact us:
Amr A. Awadallah, aaa@cs.stanford.edu
Chetan Rai, crai@CS.Stanford.EDU
Page maintained by Amr A. Awadallah and Chetan Rai