TCP Maintenance Working Group                                    W. Wang
Internet-Draft                                               N. Cardwell
Intended status: Experimental                                   Y. Cheng
Expires: December 10, 2017                                    E. Dumazet
                                                             Google, Inc
                                                            June 8, 2017


                         TCP Low Latency Option
                   draft-wang-tcpm-low-latency-opt-00

Abstract

   This document specifies the TCP Low Latency option, which TCP
   connections can use during the connection establishment handshake to
   communicate extra parameters that can improve performance in
   low-latency environments.  With the first such parameter, a TCP data
   receiver can advertise a hint about the Maximum ACK Delay (MAD) it
   will schedule for its own delayed ACK mechanism.  This enables the
   TCP data sender to achieve lower latencies during loss recovery by
   using the Maximum ACK Delay advertised by the remote receiver to help
   compute retransmission timeouts that are potentially much lower than
   would otherwise be feasible.  The Low Latency option is extensible,
   and later versions of this draft will introduce other mechanisms,
   including TCP timestamps with a finer granularity than those
   supported by RFC 7323.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 10, 2017.

Wang, et al.            Expires December 10, 2017               [Page 1]

Internet-Draft                     LL                          June 2017

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1.  Introduction

   TCP receivers typically implement a delayed ACK algorithm, as
   specified in [RFC1122] Section 4.2.3.2; as summarized in [RFC5681]
   Section 4.2, "an ACK SHOULD be generated for at least every second
   full-sized segment, and MUST be generated within 500 ms of the
   arrival of the first unacknowledged packet."  In practice, many
   widely-deployed implementations have tended to delay ACKs by up to
   roughly 200ms.  This is probably a historical artifact inherited from
   the 200ms "fast timeout" mechanism in the BSD TCP implementation from
   the late 1980s [WS95].

   As a result, to avoid spurious timeouts due to delayed ACKs,
   widely-deployed TCP sender implementations have adapted to this
   delayed ACK behavior by constraining retransmission timeout (RTO)
   values to be at least 200ms.

   Unfortunately, this 200ms value is at least 2000x the typical RTT of
   today's commodity datacenter networks (typically below 100
   microseconds).  Senders that constrain RTOs to be at least 200ms
   therefore pay a loss-recovery latency penalty that is orders of
   magnitude larger than the RTT in such environments.

   The TCP Low Latency option enables a TCP data receiver to advertise a
   hint about the Maximum ACK Delay (MAD) it will schedule for its own
   delayed ACK mechanism.
   The receiver specifies the MAD value in the Low Latency option
   because the feasible value can differ considerably between receivers,
   depending on CPU speed, CPU and network workloads, and OS-specific
   constraints on minimum supported timer granularity.

   This Low Latency option enables the TCP data sender to achieve lower
   latencies during loss recovery by using the Maximum ACK Delay
   advertised by the remote receiver to help compute retransmission
   timeouts that are potentially much lower than would otherwise be
   feasible.

2.  Terminology

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, "MAD" refers to the Maximum ACK Delay used by the
   data receiver to delay TCP acknowledgments, and "minRTO" refers to
   the Minimum Retransmission Timeout.

3.  Detailed Protocol

3.1.  TCP Low Latency Option

   The Low Latency option is only valid in SYN or SYN/ACK packets during
   the three-way handshake.  It MUST be ignored in other cases.

   The format of the TCP Low Latency option is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |     Length    |M u|    MAD            |       |
   |               |               |A n|    Value          |  Res  |
   |               |               |D i|    (10 bits)      |       |
   |               |               |  t|                   |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                        ... Reserved ...                       ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Kind: 1 byte: value = IANA-assigned option number

   Length: 1 byte: value = 4 (or longer in later versions)

   MAD unit: 2 bits: indicates time unit for MAD value:

      0: reserved
      1: milliseconds
      2: microseconds
      3: nanoseconds

   MAD value: 10 bits: indicates MAD value set on the host:

      1 ... 1023: MAD value in the given units
      0: no MAD value is specified

   Reserved: N>=4 bits: value = 0

   In order to support future extensions, the option is variable-length.
   Bits beyond those defined so far in IETF standards should be
   considered "reserved".  TCP implementations MUST (a) set to zero any
   reserved bits they add for padding, and (b) ignore any reserved bits
   (whether they are set or not).

3.2.  Overview

   The communication, starting from the TCP connection handshake, looks
   like the following:

         TCP A (Active)                            TCP B (Passive)
         ==============                            ===============
         CLOSED                                    LISTEN

   #1    SYN-SENT     -----               ------>  SYN-RCVD
                                                   (Adjust RTO
                                                   accordingly)

   #2    ESTABLISHED  <----               -----    SYN-RCVD
         (Adjust RTO
         accordingly)

   #3    ESTABLISHED  ----------------------->     ESTABLISHED

   #4    Send()       -------------------->  -
                                             |
                                             | Delay Ack < 5ms
                                             |
                      <--------------------  -
   #5    Recv()

   #6    Send()       ----------------------->
           |  RTO >= 5ms
           |
           |          ------------>
                      <------------------------
   #7    Recv()

3.3.  Configuring maximum ACK delay

   An implementation that supports the maximum ACK delay parameter MUST
   provide a user API to configure the maximum ACK delay for a specific
   connection or for all TCP connections.

   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.

   o  If the user specifies a MAD value outside the range of ACK delay
      values supported by the implementation, then the implementation
      SHOULD allow the request to succeed, but SHOULD silently constrain
      the MAD value to be within the valid range (between the minimum
      and maximum ACK delay for the implementation).  This is intended
      to allow applications to portably request a MAD value without
      needing special logic to search for a valid value.

   o  If the specified connections are not in the CLOSED or LISTEN
      state, the API SHOULD return an error and ignore the request to
      specify a MAD value.

   o  Otherwise the implementation SHOULD use the user-specified value
      both as the maximum timeout for the delayed ACK and as the MAD
      value in the Low Latency option of the specified TCP connections.

   The exact design and implementation of such an API is intentionally
   left to the implementation.  Some examples are discussed in the
   Appendix (Section 8).

3.4.  Announcing the maximum ACK delay

   o  The maximum ACK delay is announced to the remote TCP endpoint by
      including a Low Latency option with a non-zero MAD value in the
      SYN or SYN/ACK packet.  A "MAD value" field of 0 in the Low
      Latency option indicates that the sender is not specifying a MAD
      value.

   o  If specified, then the MAD value in the Low Latency option MUST be
      set, as closely as possible, to the implementation's actual
      delayed ACK timeout for the connection.  Note that the actual
      maximum delayed ACK timeout of the connection may be larger than
      the user-specified value because of implementation constraints
      (e.g. timer granularity limitations).

   o  If the user has specified a MAD value for an active connection,
      then the active open side SHOULD include a Low Latency option with
      a MAD value in the SYN packet.

   o  If the user has specified a MAD value for a passive connection,
      and the passive side has received at least one SYN packet with a
      Low Latency option with a valid MAD value, then the passive open
      side SHOULD return its MAD value in the Low Latency option.

3.5.  Adjusting TCP retransmission timeouts

   If the MAD value advertised in a received Low Latency option is 0, or
   greater than the default maximum ACK delay of 200ms, then the option
   SHOULD be ignored and no further action is needed.

   Otherwise the (data) sender MAY use the maximum ACK delay advertised
   by the receiver to adjust the sender's RTO calculation.

   Specifically, if the sender implements an RTO calculation based on
   [RFC6298], it MAY replace the 1 second lower bound specified in step
   (2.4) of Section 2 of that document with the value of the maximum ACK
   delay advertised in the Low Latency option, so that the calculation
   becomes:

      RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)

   instead of

      RTO <- max(SRTT + max(G, K*RTTVAR), 1 second)  /* [RFC6298] */

   Here we use the notation of [RFC6298], including SRTT (smoothed
   round-trip time), RTTVAR (round-trip time variation), and G (clock
   granularity).

   In addition, if the sender implements [draft-ietf-tcpm-rack] then it
   SHOULD replace the maximum delayed ACK parameter (WCDelAckT) with the
   max_ACK_delay specified in the Low Latency option.

   Using the MAD value in the RTO calculation helps senders reduce the
   RTO significantly while still avoiding spurious retransmissions due
   to delayed ACKs.  With this new algorithm, the RTO can be drastically
   shortened in most environments where the receiver advertises a MAD.
   In particular, in data center environments the RTO can often be
   reduced from one second or more to single-digit milliseconds.

   Using the MAD to reduce the RTO can improve performance and thus
   mitigate TCP incast issues.  More details are provided in the
   following Related work section.

4.  Related work

   Several research papers have shown that reducing the minimum
   retransmission timeout (minRTO) significantly improves the
   performance of TCP in the datacenter, by mitigating the effect of TCP
   timeouts.  As a result, this can mitigate TCP incast issues.

   o  In "Attaining the Promise and Avoiding the Pitfalls of TCP in the
      Datacenter" [JS15], the authors show that reducing minRTO from
      200ms to 5ms greatly reduced the impact of TCP incast issues.

   o  In "Understanding TCP incast throughput collapse in datacenter
      networks" [CG09], the authors show significant improvement in
      goodput when reducing minRTO.

   o  In "Measurement and Analysis of TCP Throughput Collapse in
      Cluster-based Storage Systems" [PK07], the authors show that
      reducing minRTO from 200 milliseconds to 200 microseconds improved
      goodput by an order of magnitude in some data center scenarios
      they evaluated.

   o  In "Safe and Effective Fine-grained TCP Retransmissions for
      Datacenter Communication" [VP09], the authors point out that the
      imbalance between the TCP minRTO and datacenter latencies can
      result in poor performance for applications sensitive to
      millisecond-scale delays in query response times.  In simulations
      of datacenter scenarios they show that goodput drops when
      increasing minRTO above 1ms.  Moreover, in some data center
      scenarios the default minRTO of 200ms results in nearly 2 orders
      of magnitude lower throughput compared to a minRTO of 1ms.

   o  In Google data centers a TCP option mechanism equivalent to the
      Low Latency option's MAD parameter has been used since 2005, and
      the TCP minRTO has been set to 5ms by default since 2013 [CC16].

5.  Middlebox Considerations

   The new Low Latency option might expose some middlebox issues:

   o  Middleboxes could drop SYNs carrying a Low Latency option if they
      treat the Low Latency option as an unknown option.  However, this
      happens fairly rarely according to "Is it Still Possible to Extend
      TCP?" [HN11], Table 3.

   o  In case middleboxes alter the content of the Low Latency option,
      the receiver SHOULD do a sanity check on the MAD value included in
      the Low Latency option to verify that it is less than or equal to
      the default maximum ACK delay of 200ms.  As explained earlier, it
      is not practical for users to set a MAD value greater than the
      default.  It is therefore safe to treat a MAD value greater than
      the default as the result of a bad user configuration or a
      malfunctioning middlebox, and to ignore the Low Latency option
      completely in such cases.

6.  Security Considerations

   TBD

7.  IANA Considerations

   As no official option number has been issued for the new Low Latency
   option by IANA yet, experimental option Kind 254 per [RFC6994] is
   used for now, with the 16-bit Experiment ID (magic number) 0xF990.
   The option format with the Experiment ID is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |     Length    |    RFC 6994 Experiment ID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M u|    MAD            |       |
   |A n|    Value          |  Res  |           ...
   |D i|    (10 bits)      |       |
   |  t|                   |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Kind: 1 byte: value = 254

   Length: 1 byte: value = 6 (or longer in later versions)

   Experiment ID: 2 bytes: value = 0xF990

   MAD unit: 2 bits: indicates time unit for MAD value:

      0: reserved
      1: milliseconds
      2: microseconds
      3: nanoseconds

   MAD value: 10 bits: indicates MAD value set on the host:

      1 ... 1023: MAD value in the given units
      0: no MAD value is specified

   Reserved: N>=4 bits: value = 0

   We will migrate to using the official option number for the Low
   Latency option after IANA has assigned one.

8.  Appendix

8.1.  Example user API in Linux to configure maximum ACK delay

8.1.1.  Per-route MAD configuration API

   A new configuration option called "mad" will be added to the "ip"
   command line tool in the iproute2 package.  Users can use this to
   configure a per-route MAD value like the following:

   ip route add 10.1.2.0/24 dev eth0 scope link src 10.1.2.123 mad 5ms

   This configures all connections destined to 10.1.2.0/24 to have a MAD
   value of 5ms.

   When handling the new "mad" parameter, the "ip" command line tool
   will verify that the provided MAD value is less than or equal to the
   default MAD value of 200ms.  If the MAD value is invalid then the "ip
   route" command will reject the request and report an error to the
   user.

   Newly-created TCP sockets have the default 200ms MAD value.
   When a TCP connection is opened, the implementation SHOULD consult
   the IP routing table to check whether a MAD value is configured for
   the route.  If so, the implementation copies the route's MAD value to
   the connection's MAD value.

   This per-route configuration will mostly be used by network
   administrators when configuring routes on the host.

8.1.2.  MAD Socket option API

   Socket options provide per-connection configuration parameters.  To
   allow per-connection configuration of the MAD value in the Low
   Latency option, a new TCP socket option called TCP_MAD will be added
   to the TCP implementation.  This will allow applications to request a
   MAD value at a finer granularity than the per-route configuration,
   depending on the application's requirements.

   The API will look like the following example:

   int mad_val = 5 * 1000 * 1000;  /* MAD in nanoseconds: 5ms */
   int err = setsockopt(fd, SOL_TCP, TCP_MAD,
                        &mad_val, sizeof(mad_val));

   The socket option implementation will sanitize the MAD value provided
   by the user.  Per the specification above, in the "Configuring
   maximum ACK delay" section, if the user specifies a MAD value outside
   the range of ACK delay values supported by the implementation, then
   the implementation will allow the request to succeed, but will
   silently constrain the MAD value to be within the valid range
   (between the minimum and maximum ACK delay for the implementation).
   This is intended to allow applications to portably request a MAD
   value without needing special logic to search for a valid value.

   Once the implementation has sanitized the provided MAD value, it will
   record the value in the socket as the socket's own MAD value.

   Note: the MAD value set by the socket option SHOULD always override
   the per-route MAD value if there is one.

9.  References

9.1.  Normative References

   [draft-ietf-tcpm-rack]
              Cheng, Y., Cardwell, N., and N. Dukkipati, "RACK: a
              time-based fast loss detection algorithm for TCP",
              draft-ietf-tcpm-rack-02 (work in progress), March 2017.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6298]  Paxson, V., "Computing TCP's Retransmission Timer",
              RFC 6298, June 2011.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, August 2013.

9.2.  Informative References

   [CC16]     Cardwell, N., Cheng, Y., and E. Dumazet, "TCP Options for
              Low Latency: Maximum ACK Delay and Microsecond
              Timestamps", IETF 97, November 2016.

   [CG09]     Chen, Y., Griffith, R., Liu, J., and R. Katz,
              "Understanding TCP incast throughput collapse in
              datacenter networks", WREN 09, August 2009.

   [HN11]     Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it Still Possible to
              Extend TCP?", IMC 11, November 2011.

   [JS15]     Judd, G. and M. Stanley, "Attaining the Promise and
              Avoiding the Pitfalls of TCP in the Datacenter", NSDI 15,
              May 2015.

   [PK07]     Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D.,
              Ganger, G., Gibson, G., and S. Seshan, "Measurement and
              Analysis of TCP Throughput Collapse in Cluster-based
              Storage Systems", September 2007.

   [VP09]     Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E.,
              Andersen, D., Ganger, G., Gibson, G., and B. Mueller,
              "Safe and Effective Fine-grained TCP Retransmissions for
              Datacenter Communication", SIGCOMM 09, August 2009.

   [WS95]     Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
              The Implementation", 1995.

Authors' Addresses

   Wei Wang
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California  94043
   USA

   Email: weiwan@google.com


   Neal Cardwell
   Google, Inc
   76 Ninth Avenue
   New York, NY  10011
   USA

   Email: ncardwell@google.com


   Yuchung Cheng
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California  94043
   USA

   Email: ycheng@google.com


   Eric Dumazet
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California  94043

   Email: edumazet@google.com
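
   As a non-normative illustration, the experimental option format of
   Section 7 and the RTO adjustment of Section 3.5 can be sketched in C
   roughly as follows.  The helper names (ll_opt_encode,
   ll_opt_decode_ns, ll_rto_ns), the LL_* constants, and the choice of
   nanoseconds as the internal time unit are assumptions of this sketch
   and are not defined by this document; K = 4 is taken from [RFC6298].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the experimental option format (Section 7). */
#define LL_OPT_KIND        254u          /* RFC 6994 experimental Kind */
#define LL_OPT_LEN         6u            /* Kind, Length, ExID, MAD word */
#define LL_OPT_EXID        0xF990u       /* 16-bit Experiment ID */
#define LL_DEFAULT_MAD_NS  200000000ull  /* default max ACK delay: 200ms */

/* MAD unit codes from the option format; code 0 is reserved. */
enum ll_mad_unit { LL_UNIT_MS = 1, LL_UNIT_US = 2, LL_UNIT_NS = 3 };

/* Hypothetical helper: write the 6-byte option into buf. */
static size_t ll_opt_encode(uint8_t *buf, enum ll_mad_unit unit,
                            uint16_t value)
{
    unsigned word;

    assert(buf != NULL && value <= 1023);
    /* 2-bit unit, 10-bit value, 4 reserved bits that are set to zero. */
    word = (((unsigned)unit & 0x3u) << 14) |
           (((unsigned)value & 0x3FFu) << 4);
    buf[0] = (uint8_t)LL_OPT_KIND;
    buf[1] = (uint8_t)LL_OPT_LEN;
    buf[2] = (uint8_t)(LL_OPT_EXID >> 8);
    buf[3] = (uint8_t)(LL_OPT_EXID & 0xFFu);
    buf[4] = (uint8_t)(word >> 8);
    buf[5] = (uint8_t)(word & 0xFFu);
    return LL_OPT_LEN;
}

/*
 * Hypothetical helper: return the advertised MAD in nanoseconds, or 0
 * if the option is to be ignored (malformed, reserved unit, MAD value
 * of 0, or MAD greater than the 200ms default, per Section 3.5).
 */
static uint64_t ll_opt_decode_ns(const uint8_t *buf, size_t len)
{
    static const uint64_t ns_per_unit[4] = { 0, 1000000, 1000, 1 };
    unsigned word;
    uint64_t mad_ns;

    if (buf == NULL || len < LL_OPT_LEN || buf[0] != LL_OPT_KIND ||
        buf[1] < LL_OPT_LEN || buf[1] > len)
        return 0;
    if ((((unsigned)buf[2] << 8) | buf[3]) != LL_OPT_EXID)
        return 0;
    word = ((unsigned)buf[4] << 8) | buf[5];
    /* The low 4 reserved bits are ignored, whether set or not. */
    mad_ns = (uint64_t)((word >> 4) & 0x3FFu) * ns_per_unit[word >> 14];
    return mad_ns <= LL_DEFAULT_MAD_NS ? mad_ns : 0;
}

/* RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay), with K = 4. */
static uint64_t ll_rto_ns(uint64_t srtt, uint64_t rttvar, uint64_t g,
                          uint64_t mad)
{
    uint64_t var_term = 4 * rttvar > g ? 4 * rttvar : g;
    uint64_t ack_term = mad > g ? mad : g;

    return srtt + var_term + ack_term;
}
```

   For example, with an advertised MAD of 5ms, an SRTT of 100
   microseconds, an RTTVAR of 20 microseconds and a clock granularity G
   of 1 microsecond, ll_rto_ns() yields 100 + 80 + 5000 = 5180
   microseconds, compared with the 1 second lower bound of [RFC6298].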