The plots below show cwnd vs. time for a file transfer (FTP) of approximately 5 MB across a transatlantic path with a 64 kbps bottleneck link. The data was obtained by instrumenting the FreeBSD 2.1.6 kernel to log cwnd and other TCP state variables. The receiver window size is 32 KB and the MSS is 536 bytes. Runs were made on weekdays at about 4 PM (PDT). The source machine runs FreeBSD and the sink machine runs OSF/1. This is the output of traceroute between the two machines:
traceroute to XXXX.CAIRO.EUN.EG (193.227.10.XXX): 1-30 hops, 38 byte packets
 1  Core-gateway.Stanford.EDU (36.83.0.3)  2.49 ms  1.75 ms  1.31 ms
 2  SUNet-Gateway.Stanford.EDU (171.64.1.34)  2.18 ms  1.72 ms  1.41 ms
 3  su-pr1.bbnplanet.net (198.31.10.3)  2.42 ms  1.48 ms  1.25 ms
 4  paloalto-br1.bbnplanet.net (131.119.0.193)  2.54 ms  3.6 ms  2.15 ms
 5  chicago1-br1.bbnplanet.net (4.0.1.2)  53.8 ms  53.6 ms  53.9 ms
 6  washdc1-br1.bbnplanet.net (4.0.1.117)  69.8 ms  69.9 ms  71.0 ms
 7  vienna1-br1.bbnplanet.net (4.0.1.90)  190 ms  248 ms  315 ms
 8  maeeast-2.bbnplanet.net (4.0.1.94)  72.0 ms  72.0 ms  71.5 ms
 9  icm-mae-e-f0/0.icp.net (192.41.177.240)  74.4 ms (ttl=243!)  75.5 ms (ttl=2)
10  icm-dc-1-H1/0-T3.icp.net (198.67.131.10)  86.4 ms (ttl=244!)  76.5 ms (ttl=)
11  icm-dc-2-F2/0.icp.net (198.67.131.34)  74.2 ms  79.8 ms  77.1 ms
12  icm-paris-1-S0-1984k.icp.net (192.157.65.130)  267 ms  263 ms  189 ms
13  193.227.3.1 (193.227.3.1)  317 ms  304 ms  485 ms
14  193.227.1.10 (193.227.1.10)  380 ms  459 ms  417 ms
15  10.0.13.2 (10.0.13.2)  352 ms  372 ms  431 ms
16  193.227.13.29 (193.227.13.29)  410 ms  271 ms  251 ms
17  193.227.13.68 (193.227.13.68)  450 ms (ttl=241!)  554 ms (ttl=241!)  400 ms
18  XXXX.CAIRO.EUN.EG (193.227.10.XXX)  560 ms (ttl=240!)  315 ms (ttl=240!)  304 ms
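As a rough sanity check on this setup, a back-of-the-envelope calculation (using an approximate 0.5 s RTT read off the last few traceroute hops, which should be treated as a ballpark figure) shows how small the bottleneck pipe is compared with the 32 KB receiver window. The few lines of C below just do the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const double bottleneck_bps = 64e3;    /* bottleneck link rate               */
        const double rtt_s          = 0.5;     /* approximate RTT from the last hops */
        const double mss            = 536;     /* MSS used in our runs               */
        const double rwnd           = 32768;   /* advertised receiver window         */

        double pipe_bytes = bottleneck_bps / 8 * rtt_s;   /* bandwidth-delay product */

        printf("bandwidth-delay product ~ %.0f bytes (~%.0f segments)\n",
               pipe_bytes, pipe_bytes / mss);
        printf("receiver window         = %.0f bytes (~%.0f segments)\n",
               rwnd, rwnd / mss);
        return 0;
    }

The pipe holds only on the order of 4 KB (seven or eight segments), so the window can grow far past what the path can carry; the bottleneck queue eventually overflows, which is what forces the losses and fast retransmissions discussed below.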
The first plot shows cwnd vs. time for the unmodified TCP implementation; the second plot enlarges a portion of the first. Notice the sudden rise in cwnd after a fast retransmission, followed by the drop back down to ssthresh.
The first two plots agree with RFC 2001 (page 5, second paragraph):
"2. Each time another duplicate ACK arrives, increment cwnd by the segment size. This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd. 3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This ACK should be the acknowledgment of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the first duplicate ACK. This step is congestion avoidance, since TCP is down to one-half the rate it was at when the packet was lost."
The reasoning behind this temporary inflation of cwnd is to allow the sender to put another segment onto the network for each incoming duplicate ACK (each of which indicates that another segment has made it to the other side). This is necessary because TCP's sliding window is stuck and will not slide until the first non-duplicate ACK comes back. As soon as the first non-duplicate ACK arrives, cwnd is set back to ssthresh and the window resumes sliding in normal congestion-avoidance mode.
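For concreteness, here is a minimal sketch of these rules as a stand-alone C program. It is an illustration of the RFC 2001 steps only, not the FreeBSD 2.1.6 code, and the starting cwnd and the number of duplicate ACKs are made-up values:

    #include <stdio.h>

    #define MSS 536L   /* segment size used in our runs */

    /* Toy walk-through of the RFC 2001 fast retransmit / fast recovery steps. */
    int main(void)
    {
        long cwnd = 20 * MSS;              /* hypothetical cwnd when loss is detected */
        long ssthresh;
        int  dupacks;

        /* Step 1: third duplicate ACK -> retransmit the lost segment, set
         * ssthresh to half the window, and inflate cwnd by the three
         * duplicate ACKs already received. */
        ssthresh = cwnd / 2;
        cwnd     = ssthresh + 3 * MSS;
        printf("fast retransmit: ssthresh=%ld cwnd=%ld\n", ssthresh, cwnd);

        /* Step 2: each further duplicate ACK inflates cwnd by one MSS,
         * possibly allowing one more segment to be clocked out. */
        for (dupacks = 4; dupacks <= 10; dupacks++) {
            cwnd += MSS;
            printf("dup ACK %2d: cwnd=%ld\n", dupacks, cwnd);
        }

        /* Step 3: the first ACK for new data deflates cwnd back to ssthresh,
         * and the connection continues in congestion avoidance. */
        cwnd = ssthresh;
        printf("new ACK: cwnd deflated to %ld\n", cwnd);
        return 0;
    }

Each duplicate ACK inflates cwnd by one MSS, and the first ACK for new data deflates it back to ssthresh; that deflation is the sudden drop visible in the plots above.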
It has been known for some time that there are problems with fast recovery, and fixes have been proposed - for instance the recent Floyd and Fall CCR paper, which discusses SACK; the Mathis and Mahdavi SIGCOMM '96 paper, which discusses FACK; and the Janey Hoe SIGCOMM '96 paper.
The following plots were generated by capturing traces with tcpdump running on a separate machine on the same Ethernet segment as the source. We used Greg Minshall's tracelook tool to generate the plots and then annotated them to provide explanations. The traces were captured for the same runs shown above.
The first figure provides a general explanation of the cwnd inflation during fast recovery. It also shows fast recovery in one of its best situations, where a good number of segments are sent during the fast recovery period and only a few back-to-back segments remain to be sent when the window slides. Notice also that the source cannot send more than the new cwnd (1/2 the old cwnd) during the second half of the fast recovery period. This is because the window is still stuck at the fast-retransmitted packet but has been halved, so the first half of the duplicate ACKs do not cause any new data packets to be sent out. It also means that, counting from the fast retransmission, the number of duplicate ACKs required before any of the fast-recovery packets are sent is 1/2 old cwnd - 3; equivalently, counting from the last packet sent under the old cwnd, it is 1/2 old cwnd (which is the new cwnd). This can be clearly seen in both of the following figures. So the current fast recovery mechanism in fact gives the network a breathing period of approximately 1/2 RTT (or 1/2 old cwnd - 3 duplicate ACKs). (Thanks to Bernhard R. Suter and Curtis Villamizar for pointing this out.)
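A quick back-of-the-envelope calculation makes this counting concrete. The old cwnd of 20 segments below is a hypothetical value chosen purely for illustration:

    #include <stdio.h>

    /* Counts, in segments, how many duplicate ACKs must arrive after the
     * fast retransmission before cwnd exceeds the amount of data still
     * outstanding and a fast-recovery packet can go out. */
    int main(void)
    {
        int old_cwnd = 20;              /* hypothetical old cwnd at the time of loss */
        int new_cwnd = old_cwnd / 2;    /* ssthresh, i.e. the new cwnd               */

        /* Fast recovery starts with cwnd = new_cwnd + 3 (the three duplicate
         * ACKs that triggered the retransmission) while old_cwnd segments are
         * still outstanding; each further duplicate ACK adds one segment. */
        int dupacks_needed = old_cwnd - (new_cwnd + 3);

        printf("old cwnd = %d segments, new cwnd = %d segments\n", old_cwnd, new_cwnd);
        printf("duplicate ACKs needed before any fast-recovery packet is sent: %d "
               "(= 1/2 old cwnd - 3)\n", dupacks_needed);
        return 0;
    }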
The second figure shows another situation, in which fast recovery failed to operate successfully. Only a few segments were sent during the fast recovery period, and then a large number were sent back-to-back when the non-duplicate ACK arrived, causing the window to slide and open up. We have no explanation for this yet, and suggestions are welcome.
The first figure shows the behavior after the modification. In this case cwnd increases as in congestion avoidance, so it rarely happens that a segment gets sent during the fast recovery period; instead, essentially all of the segments are sent back-to-back when the window slides.
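The patch itself is not reproduced on this page, but the following sketch shows the spirit of the change as just described: on each additional duplicate ACK, cwnd grows by the congestion-avoidance increment (roughly MSS*MSS/cwnd) rather than by a full MSS. The helper functions and the starting cwnd are hypothetical, not the actual kernel code:

    #include <stdio.h>

    #define MSS 536L

    /* cwnd growth per additional duplicate ACK, stock vs. modified. */
    static long stock_dupack(long cwnd)
    {
        return cwnd + MSS;                  /* RFC 2001: inflate by a full segment */
    }

    static long modified_dupack(long cwnd)
    {
        return cwnd + MSS * MSS / cwnd;     /* congestion-avoidance style growth   */
    }

    int main(void)
    {
        long stock = 10 * MSS, modified = 10 * MSS;   /* hypothetical starting cwnd */
        int  i;

        for (i = 1; i <= 10; i++) {
            stock    = stock_dupack(stock);
            modified = modified_dupack(modified);
        }
        printf("after 10 dup ACKs: stock cwnd = %ld, modified cwnd = %ld\n",
               stock, modified);
        return 0;
    }

After ten duplicate ACKs the stock window has been inflated by ten full segments, while the modified window has grown by only about one segment's worth, which is why almost nothing goes out until the window finally slides.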
The second figure simply demonstrates the speed at which the burst goes out. This is true with or without the modification. Notice that during the burst the sender is sending segments at the speed of its local interface (10 Mbps Ethernet), while the receiver sits behind a 64 kbps bottleneck!
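To put rough numbers on the mismatch (the burst size of 30 segments is a hypothetical figure; the link rates are the ones from our setup):

    #include <stdio.h>

    /* Rough serialization-time arithmetic for a back-to-back burst. */
    int main(void)
    {
        const double seg_bytes      = 536 + 40;   /* MSS plus TCP/IP headers  */
        const double burst_segs     = 30;         /* hypothetical burst size  */
        const double ethernet_bps   = 10e6;       /* sender's local interface */
        const double bottleneck_bps = 64e3;       /* transatlantic bottleneck */

        double burst_bits = burst_segs * seg_bytes * 8;

        printf("time to emit the burst at 10 Mbps: %.1f ms\n",
               1000 * burst_bits / ethernet_bps);
        printf("time to drain it at 64 kbps:       %.2f s\n",
               burst_bits / bottleneck_bps);
        return 0;
    }

A burst that leaves the Ethernet in roughly 14 ms takes on the order of two seconds to drain through the bottleneck, so it presumably piles up in the bottleneck queue.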
The modification actually leads to a more aggressive and ill-behaved TCP! This is something we did not realize when we first made the modification: although the cwnd vs. time graph looks much smoother (i.e. there are fewer slow-start invocations), the tcpdump traces show that we end up sending a greater number of back-to-back packets when a new ACK is received, and there are a couple of reasons for this:
Your feedback is more than welcome; here is how to contact us:
Amr A. Awadallah, aaa@cs.stanford.edu
Chetan Rai, crai@CS.Stanford.EDU
Page maintained by Amr A. Awadallah and Chetan Rai