INTERNET-DRAFT                                                    M. Sun
Intended Status:                                            B. Pithawala
Expires: December 30, 2017                           HUAWEI Technologies
                                                                  F. Gao
                                                               Baidu Inc
                                                           June 28, 2017


                  draft-sun-idr-bgp-ls-notification-00


Abstract

   This document describes the use of a Border Gateway Protocol (BGP)
   community.  This optional transitive community instructs a router to
   monitor its own ports.  With this community, the controller needs to
   send a route update message only once and receives feedback only
   when a link status changes.  In particular, this community helps the
   controller receive link status change notifications much faster than
   current methods.


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Marcus, et al.          Expires December 30, 2017               [Page 1]

INTERNET DRAFT       Fast Link Status Notification         June 28, 2017


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.


Table of Contents

   1  Introduction
     1.1  Large-scale DC Routing Solution
     1.2  BFD protocol and Hellos Protocol
   2  Another Centralized Link Detection Method Based on BGP
     2.1  Basic Principle
     2.2  Advantages and Benefits of this solution
   3  IANA Considerations
   4  References
     4.1  Normative References
     4.2  Informative References
   Authors' Addresses


1  Introduction

   With the advent of microservices application architecture and the
   continued advances in massively scaled distributed systems, the
   majority of traffic traversing the data center network stays within
   the data center (east-west).  This requires the data center network
   to have deterministic (preferably ultra-low) latency, high
   scalability, high availability, and low cost.  To meet these
   requirements, current large-scale data center networks are mostly
   based on the Clos architecture.  [RFC7938] shows a typical 3-layer
   (5-stage) Clos architecture (in Figure 1, the 3 layers are Leaf,
   Agg, and Spine).

                                   Spine
                                  +-----+
                                  |     |
                               +--|     |--+
                               |  +-----+  | +-----------------------+
                    Agg        |           | |  Agg     POD          |
                  +-----+      |  +-----+  | |+-----+                |
    +-------------| DEV |------+--|     |--+-+|     |-------------+  |
    |       +-----|  C  |------+  |     |  +-+|     |-----+       |  |
    |       |     +-----+         +-----+    |+-----+     |       |  |
    |       |                                |            |       |  |
    |       |     +-----+         +-----+    |+-----+     |       |  |
    | +-----------| DEV |------+  |     |  +-+|     |-----------+ |  |
    | |     | +---|  D  |------+--|     |--+-+|     |---+ |     | |  |
    | |     | |   +-----+      |  +-----+  | |+-----+   | |     | |  |
    | |     | |                |           | |          | |     | |  |
  +-----+ +-----+              |  +-----+  | |        +-----+ +-----+|
  | DEV | | DEV |              +--|     |--+ |        |     | |     ||
  |  A  | |  B  |Leaf             |     |    |  Leaf  |     | |     ||
  +-----+ +-----+                 +-----+    |        +-----+ +-----+|
    | |     | |                              +----------+-+-----+-+--+
    O O     O O                                         O O     O O
      Servers                                             Servers

                    Figure 1  3-Layer Clos Topology

   Note: A Leaf is a switching node connected to servers, an Agg is a
   switching node that aggregates Leaf nodes, and a Spine is a core
   switching node.

   Nowadays, this architecture can scale to support 100k servers, with
   nearly 200k links in the network.  Managing such a large number of
   switches and links in a data center from a controller is a difficult
   scaling problem.

1.1  Large-scale DC Routing Solution

   [RFC7938] introduces a link detection solution based on BGP.  It
   uses EBGP to connect switches (physical links) and IBGP to connect
   switches and the controller (logical links).  The EBGP connections
   are made using the local loopback addresses of the routers/switches.
   Since this solution has no IGP in the network to convey the local
   loopback addresses needed to form the EBGP connections, it uses a
   centralized controller to initiate the messages that convey the
   loopback address of a router to its neighbor.  It uses a combination
   of IBGP and EBGP connections and messages to achieve this, as shown
   in Figure 2.

                        +----------+
   inject Prefix  +-----+Controller+----+
   for R1 with    |     +----------+    | expect Prefix
   one-hop        |                     | for R1 from R2
   community    +-++                  +-++
                |R1+------------------+R2|
                +--+  Prefix for R1   +-++
                      relayed           | Prefix for R1
                                      +-++NOT relayed
                                      |R3|
                                      +--+

             Figure 2  One Kind of Link Detection Method

   In Figure 2, the controller periodically sends update messages to
   the source node of a link and determines the link status (the status
   of the link connecting routers/switches) according to whether it
   receives the update message back from the destination node of the
   link.  The controller periodically sends switch R1 a route message
   that carries only the one-hop community attribute.  R1 advertises
   this message to its neighbor R2 through EBGP with the NO_EXPORT
   attribute attached.  Because of the NO_EXPORT attribute, R2 sends
   the message to the controller through IBGP instead of sending it to
   R3.  If the controller receives the route message from R2 within the
   specified time, the R1->R2 link is assumed to be normal.  Otherwise,
   the R1->R2 link is considered down.

   However, when link detection packets are sent at a high frequency,
   the controller load is heavy: the controller's processing capacity
   is insufficient, and firewall devices do not accept such a large
   flow of traffic.  On the other hand, when link detection packets are
   sent at a low frequency, the network converges slowly, which leads
   to loops, network interruption, and other issues, and network
   reliability becomes unacceptable.  With a single controller running
   multi-threaded ExaBGP and Vyatta virtual routers, experimental test
   data shows that this solution can support only 1k links and 512
   servers in a non-blocking network.

1.2  BFD protocol and Hellos Protocol

   The existing mainstream distributed link monitoring methods are
   protocol hellos [RFC2328] and the BFD protocol [RFC5880].

   Protocol Hellos: Since a protocol (EBGP) is initiated over the link,
   the status of the link can be inferred from the receipt (or the
   absence) of periodic hellos.  Protocol hellos are generally regarded
   as a slow link detection mechanism.  Increasing the frequency of
   hellos only creates scale issues at many points in the network
   without really providing sub-second link detection.

   BFD: A BFD session is configured at both ends of each link to be
   monitored.  Each end sends BFD detection messages and reports a link
   failure if a detection message is not received in time.  BFD
   requires a large amount of configuration across different devices
   and ports.  With VRRP tracking, a network of 100k servers requires
   configuration of 200k links and 200k link ends.  With BFD, the same
   network requires configuration of 200k links and 400k link ends,
   which is costly and prone to unexpected errors.

2  Another Centralized Link Detection Method Based on BGP

2.1  Basic Principle

   Periodic detection causes many problems for link detection in
   current large-scale data center networks (DCNs).  When messages are
   sent and received at a high frequency, the controller load becomes
   too heavy: the controller's processing capacity is insufficient, and
   firewall devices cannot accept such a large flow of traffic.  On the
   other hand, when the frequency is low, the network converges more
   slowly, which may cause network interruption and reduce network
   reliability.

   Compared with traditional link detection methods, this solution
   proposes an efficient optimization that monitors links
   automatically.  It eliminates a great deal of manual configuration
   work and avoids the errors and cost that come with it.  Furthermore,
   it eases the collection of link status notifications by the
   controller.

   In Figure 3, if the controller needs to detect the status of the
   link from R1 to R2, the process is as follows.

                      +------------+
    +-+   +-----------+ Controller +------+  +-+
    |1|   | ibgp1     +------------+ ibgp2|  |3|
    +-+   |                               |  +-+
       +--+--+                         +--+--+
       | R1  |          ebgp           | R2  |
       | AS1 +------------------------/+ AS2 |
       +--+--+        +-+            / +--+--+
          |           |2|           /     |
          |ebgp       +-+          /      |ebgp
       +--+--+                    /    +--+--+
       |R5   |                   /     | R3  |
       |AS5  |          port is /      | AS3 |
       +-----+    automatically        +-----+
                    monitored

             Figure 3  The Principle of This Solution

   Step 1:

      a) The controller, having established a BGP peering with R1,
         sends route update message A1 to R1 (non-periodically, just
         once).  A1 contains an instruction that enables R1's port
         (link) status monitoring function.

      b) Same as a), except that the target is R2.

      c) Message A1 contains only the one-hop community attribute, and
         its prefix identifies device R1.

   Step 2:

      When R1 receives route update message A1 from the controller, it
      adds the NO_EXPORT attribute so that the route is advertised only
      to its EBGP neighbor R2.  R2 advertises this route to the
      controller through IBGP instead of to its EBGP neighbor R3.

      a) R2 determines from the community in A1 that message A1 comes
         from R1.

      b) A dedicated bit in the community needs to be defined to tell
         R2 to start monitoring its link when it receives this
         indication.  R2 therefore starts monitoring all links from R1
         to R2 in this step.

   Step 3:

      If R2 detects that the status of a port (link) monitored in
      Step 2 b) has changed, it notifies the controller through IBGP.
      If the port status changes from normal to fault, R2 sends the
      controller a withdraw message.  If the port status changes from
      fault to normal, R2 sends the controller an announce message.

   Step 4:

      When the controller receives the update message for route A1
      from R2:

      a) It finds the corresponding link based on the received A1
         update message.  The prefix identifies network device R1 and
         the source IP identifies device R2.  Together they tell the
         controller that this is the link from R1 to R2.

      b) If the message is a route announcement, the link status is
         normal.  If it is a withdrawal, the link status is fault.

   It is important to note that this document does not prefer any
   particular link detection mechanism.  The BGP implementation on a
   vendor's device is free to activate any link detection mechanism it
   chooses (examples include BFD and auto-sensing features).

2.2  Advantages and Benefits of this solution

   Generally speaking, this solution needs only a dedicated bit in the
   community that notifies R2 to start monitoring the link between R1
   and R2.  Although quite simple, it has several advantages.

   1. It needs no extra configuration and monitors the corresponding
      ports (links) automatically.  It lets the controller learn the
      status of every link using the existing BGP protocol, and it
      avoids the errors and cost caused by extensive manual
      configuration.

   2. It resolves the conflict between the network's need for fast
      convergence and the controller's capacity constraints.  With this
      solution, a network with a single controller can support 100k
      servers, whereas the other method can support only 512 servers.

   3. Real-time link failure recovery performs better.  In experiments,
      link failure report time dropped from 3 s to less than 50 ms, and
      link failure recovery time dropped from 1 s to less than 50 ms.

3  IANA Considerations

   IANA has registered Transitive Extended Community Types as described
   in [RFC7153].  This registry contains values of the high-order octet
   (the "Type" field) of a Transitive Extended Community.

   This method needs only one unassigned type value to notify a device
   to monitor the corresponding links (ports).

4  References

4.1  Normative References

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271, January
              2006.

   [RFC5880]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD)", RFC 5880, June 2010.

   [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
              Extended Communities", RFC 7153, March 2014.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              August 2016.

4.2  Informative References

   [RFC3765]  Huston, G., "NOPEER Community for Border Gateway Protocol
              (BGP) Route Scope Control", RFC 3765, April 2004.

   [RFC6286]  Chen, E. and J. Yuan, "Autonomous-System-Wide Unique BGP
              Identifier for BGP-4", RFC 6286, June 2011.

   [RFC6608]  Dong, J., Chen, M., and A. Suryanarayana, "Subcodes for
              BGP Finite State Machine Error", RFC 6608, May 2012.

   [RFC7606]  Chen, E., Ed., Scudder, J., Ed., Mohapatra, P., and K.
              Patel, "Revised Error Handling for BGP UPDATE Messages",
              RFC 7606, August 2015.

   [RFC7705]  George, W. and S. Amante, "Autonomous System Migration
              Mechanisms and Their Effects on the BGP AS_PATH
              Attribute", RFC 7705, November 2015.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A.,
              and S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP", RFC
              7752, March 2016.

Authors' Addresses

   Marcus Sun
   HUAWEI TECHNOLOGIES CO.,LTD
   12 E. Mozhou Rd.
   Nanjing, Jiangsu
   China

   EMail: marcus.sun@huawei.com


   Burjiz Pithawala
   HUAWEI TECHNOLOGIES CO.,LTD
   2330 Central Expressway,
   Santa Clara, CA 95050
   US

   EMail: burjiz.pithawala1@huawei.com


   Feng Gao
   BAIDU Inc.
   10 shangdi shijie
   Haidian, Beijing

   EMail: gaofeng04@baidu.com
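
   Illustrative note on Section 2.1, Step 4: the controller-side
   handling of a relayed update can be sketched as follows.  This is a
   minimal sketch under stated assumptions, not an implementation
   defined by this document.  The class LinkStatusTable, the method
   on_update, the two lookup tables, and the example addresses are all
   hypothetical names introduced only for illustration.  The sketch
   assumes the controller already knows which prefix it injected for
   each device and which IBGP peer address belongs to each device.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Link = Tuple[str, str]  # (source device, destination device)


@dataclass
class LinkStatusTable:
    """Controller-side view of link status (hypothetical helper)."""

    # Assumed lookup: prefix injected for a device -> device name.
    prefix_to_device: Dict[str, str]
    # Assumed lookup: source IP of the IBGP peer -> device name.
    peer_to_device: Dict[str, str]
    # Last known status per link: "normal" or "fault".
    status: Dict[Link, str] = field(default_factory=dict)

    def on_update(self, prefix: str, src_ip: str, msg_type: str) -> Link:
        """Record the status carried by one update relayed over IBGP."""
        if msg_type not in ("announce", "withdraw"):
            raise ValueError(f"unknown message type: {msg_type}")
        # Step 4 a): the prefix identifies the source end of the link,
        # and the source IP identifies the reporting (destination) end.
        link = (self.prefix_to_device[prefix], self.peer_to_device[src_ip])
        # Step 4 b): announce means normal, withdraw means fault.
        self.status[link] = "normal" if msg_type == "announce" else "fault"
        return link


# Example addresses are documentation ranges chosen for illustration.
table = LinkStatusTable(
    prefix_to_device={"192.0.2.1/32": "R1"},
    peer_to_device={"198.51.100.2": "R2"},
)
table.on_update("192.0.2.1/32", "198.51.100.2", "withdraw")
print(table.status)
```

   Keying the table on the (source, destination) pair keeps the two
   directions of a link separate, which matches the per-direction
   R1->R2 status described in Section 2.1.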