BESS Working Group                                           Ali Sajassi
Internet Draft                                             Gaurav Badoni
Category: Standards Track                                Priyanka Warade
                                                         Suresh Pasupula
                                                           Cisco Systems

Expires: January 2, 2017                                    July 2, 2017


            L3 Aliasing and Mass Withdrawal Support for EVPN
               draft-sajassi-bess-evpn-ip-aliasing-00.txt

Abstract

   This draft proposes an extension to [RFC7432] that provides Aliasing
   for Layer 3 routes, which symmetric IRB needs in order to build a
   complete IP ECMP.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
     1.1  Terminology
   2  IP Aliasing and Backup Path
     2.1  Constructing Ethernet A-D per EVPN Instance Route
   3  Fast Convergence for Routed Traffic
     3.1  Constructing Ethernet A-D per Ethernet Segment Route
       3.1.1  Ethernet A-D Route Targets
     3.2  Avoiding convergence issues by syncing IP prefixes
     3.3  Handling Silent Host
     3.4  MAC Aging
   4  Determining Reachability to Unicast IP Addresses
     4.1  Local Learning
     4.2  Remote Learning
       4.2.1  Constructing MAC/IP Address Advertisement
       4.2.2  Route Resolution
   5  Forwarding Unicast Packets
   6  Load Balancing of Unicast Packets
   7  Security Considerations
   8  IANA Considerations
   9  References
     9.1  Normative References
     9.2  Informative References
   Authors' Addresses

1  Introduction

                            +---------+
         +-------------+    |         |
         |             |    |         |
       / |     PE1     |----|         |   +-------------+
      /  |             |    |  MPLS/  |   |             |
     /   +-------------+    |  VxLAN/ |   |     PE3     |---H3
   H1---                    |  NVGRE  |   |             |
     \   +-------------+    |         |---|             |
      \  |             |    |         |   +-------------+
       \ |     PE2     |----|         |
         |             |    |         |
         +-------------+    |         |
                            |         |
                            |         |
                            +---------+

   Figure 1: Inter-subnet traffic between Multihoming PEs and Remote PE

   Consider a pair of multi-homing TORs PE1 and PE2.  Let there be a
   host H1 attached to them.  Consider another TOR PE3 and a host H3
   attached to it.

   With Asymmetric IRB, if H3 sends inter-subnet traffic to H1, routing
   will happen at PE3.  PE3 will have the destination SVI and will
   trigger ARP if it does not have an ARP adjacency to H1.  Finally, the
   routing lookup will resolve the destination MAC to H1's MAC address.
   Furthermore, H1's MAC will point to a VxLAN ECMP to PE1 and PE2,
   either due to host route advertisement or MAC Aliasing as detailed in
   [RFC7432].

   With Symmetric IRB, if H3 sends inter-subnet traffic to H1, the
   routing lookup will happen at PE3.  PE3 will do a routing lookup in
   the L3VNI-VRF context and is not expected to have the destination
   SVI.  Therefore, at PE3, an IP ECMP list (PE1/PE2) needs to be built
   for H1's IP address for proper load balancing.  If H1 is locally
   learnt at only one of the PEs (PE1 or PE2) due to port-channel
   hashing, PE3 will not be able to build the IP ECMP list, because
   Aliasing is not done for Layer 3 addresses.

   This draft proposes an extension to do Aliasing for Layer 3 routes
   that is needed for symmetric IRB to build a complete IP ECMP.

1.1  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   IRB: Integrated Routing and Bridging

   IRB Interface: A virtual interface that connects the bridging module
   and the routing module on an NVE.

   Broadcast Domain: In a bridged network, the broadcast domain
   corresponds to a Virtual LAN (VLAN), where a VLAN is typically
   represented by a single VLAN ID (VID) but can be represented by
   several VIDs where Shared VLAN Learning (SVL) is used per [802.1Q].

   Bridge Table: An instantiation of a broadcast domain on a MAC-VRF.

   CE: Customer Edge device, e.g., a host, router, or switch.

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN.

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE.

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, then that
   set of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment is called an 'Ethernet Segment
   Identifier'.

   LACP: Link Aggregation Control Protocol.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, then the Ethernet
   segment is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, then the Ethernet segment is
   defined to be operating in All-Active redundancy mode.

2  IP Aliasing and Backup Path

   Host IP and MAC routes are learnt by PEs on the access side via a
   control plane protocol like ARP.
   In the case where a CE is multihomed to multiple PE nodes using a LAG
   and is running in All-Active Redundancy Mode, the Host IP will be
   learnt and advertised in the MAC/IP Advertisement route only by the
   PE that receives the ARP packet.  As a result, the remote PE sees
   only one next hop for the Host IP and forwards traffic to that
   advertising PE.  Hence, the remote PE is not able to effectively load
   balance the traffic towards the multihomed Ethernet Segment.

   To address this issue, the concept of Aliasing introduced in
   [RFC7432] can be extended to Layer 3 routes as well.  The PE SHOULD
   advertise reachability to an L3 VRF instance on a given ES for IP
   addresses using the existing EAD/EVI route.  In this case, the EVPN
   instance is the VRF table to which the host IP address belongs.  This
   will henceforth be referred to as the IP-EAD/EVI route.

   A remote PE that receives an IP route with a non-reserved ESI SHOULD
   consider it reachable by all PEs that have advertised both the
   IP-EAD/EVI route and the EAD/ES route containing the VRF
   Route-Targets for that ES.  The EAD/ES route must have the
   Single-Active bit in the flags of the ESI Label extended community
   set to 0 for Aliasing to take effect.  The IP-EAD/EVI route cannot be
   used for forwarding until the associated Ethernet A-D per ES route is
   received.

   In case of Single-Active redundancy mode, the remote PE SHOULD use
   the EVPN Layer 2 attribute extended community carried in the
   IP-EAD/EVI route, as described in draft-ietf-bess-evpn-vpws-07, in
   combination with the EAD/ES route to determine the Backup Path for
   the IP addresses in the given IP VRF context.  This alternate path
   SHOULD be installed as a backup path for the IP address.

2.1  Constructing Ethernet A-D per EVPN Instance Route

   This draft proposes the advertisement of the per-EVI Ethernet A-D
   route for IP VRFs to enable Aliasing for IP addresses.
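
   The remote-PE behavior described in Section 2 can be illustrated
   with a minimal Python sketch.  It is not part of the specification.
   The class names, field names, and the resolve_ip_paths() helper are
   hypothetical, and the sketch assumes that the PE advertising the
   MAC/IP route needs only its own EAD/ES route, while any other PE
   needs both an EAD/ES route and an IP-EAD/EVI route.  The P/B flags
   of the Layer 2 attribute extended community are not modeled: any
   eligible PE whose EAD/ES route has the Single-Active bit set is
   simply treated as a backup path.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple

# Reserved ESI values: all zeros and MAX-ESI (all ones), 10 octets each.
ZERO_ESI = bytes(10)
MAX_ESI = b"\xff" * 10


@dataclass(frozen=True)
class HostIpRoute:
    """MAC/IP Advertisement route for a host IP (hypothetical model)."""
    pe: str
    esi: bytes
    vrf_rt: str


@dataclass(frozen=True)
class EadEsRoute:
    """Ethernet A-D per ES route (hypothetical model)."""
    pe: str
    esi: bytes
    route_targets: FrozenSet[str]
    single_active: bool


@dataclass(frozen=True)
class IpEadEviRoute:
    """Ethernet A-D per EVI route advertised for an IP VRF."""
    pe: str
    esi: bytes
    vrf_rt: str


def resolve_ip_paths(
    host: HostIpRoute,
    ead_es: Iterable[EadEsRoute],
    ip_ead_evi: Iterable[IpEadEviRoute],
) -> Tuple[List[str], List[str]]:
    """Return (active next hops, backup next hops) for the host IP."""
    # Reserved ESI: resolve on the MAC/IP route alone, no aliasing.
    if host.esi in (ZERO_ESI, MAX_ESI):
        return [host.pe], []

    # PEs whose EAD/ES route for this ES carries the host's IP-VRF RT.
    es_by_pe = {
        r.pe: r
        for r in ead_es
        if r.esi == host.esi and host.vrf_rt in r.route_targets
    }
    # PEs that advertised an IP-EAD/EVI route for this ES and IP VRF.
    evi_pes = {
        r.pe
        for r in ip_ead_evi
        if r.esi == host.esi and r.vrf_rt == host.vrf_rt
    }

    active: List[str] = []
    backup: List[str] = []
    # The advertising PE is usable only once its EAD/ES route is present.
    if host.pe in es_by_pe:
        active.append(host.pe)
    for pe in sorted(es_by_pe):
        if pe == host.pe or pe not in evi_pes:
            continue
        if es_by_pe[pe].single_active:
            backup.append(pe)   # Single-Active: install as backup path
        else:
            active.append(pe)   # All-Active: alias into the ECMP list
    return active, backup


ES1 = bytes.fromhex("00112233445566778899")
host = HostIpRoute("PE1", ES1, "rt-vrf-a")
ead_es = [
    EadEsRoute("PE1", ES1, frozenset({"rt-vrf-a"}), False),
    EadEsRoute("PE2", ES1, frozenset({"rt-vrf-a"}), False),
]
ip_ead_evi = [
    IpEadEviRoute("PE1", ES1, "rt-vrf-a"),
    IpEadEviRoute("PE2", ES1, "rt-vrf-a"),
]
print(resolve_ip_paths(host, ead_es, ip_ead_evi))
# prints (['PE1', 'PE2'], [])
```

   In this example H1 is learnt only on PE1, yet PE2 enters the ECMP
   list because it advertised both routes for the same ES and IP VRF.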
   The usage/construction of this route remains similar to that
   described in RFC 7432, with a few notable exceptions as listed below.

   * The Route-Distinguisher should be set to the corresponding L3VPN
     context.

   * The Ethernet Tag should be set to 0.

   * The L3 EAD/EVI SHOULD carry one or more IP VRF Route-Target (RT)
     attributes.

   * The L3 EAD/EVI SHOULD carry the RMAC Extended Community attribute.

   * The MPLS Label usage should be as described in RFC 7432.

   It is important to note that the prefix for an IP-EAD/EVI route and
   an L2-EAD/EVI route may be identical.  However, since the RD of the
   IP-EAD/EVI is set to the corresponding L3VPN context and the RD of
   the L2-EAD/EVI is set to the corresponding MAC-VRF context, the
   import will happen in the respective IP-VRFs and MAC-VRFs and hence
   the prefix will not be overwritten.

3  Fast Convergence for Routed Traffic

   In EVPN, Host IP reachability is learned via the BGP control plane
   over the MPLS network.  All the hosts that are dually connected
   behind an ES are advertised by the PEs belonging to the redundancy
   group.  A remote TOR receiving these host routes can lose
   reachability through any of the PEs due to a box reload, a core
   failure, or an access failure on that PE.  BGP PIC functionality is
   the existing mechanism for fast convergence, as described in
   https://tools.ietf.org/html/draft-rtgwg-bgp-pic-02.  The PIC feature
   does not solve the convergence issue for the access failure cases,
   as the PEs are still reachable from the remote TOR.

   To alleviate this, EVPN defines a mechanism to efficiently and
   quickly signal, to remote PE nodes, the need to update their
   forwarding tables upon the occurrence of a failure in connectivity
   to an Ethernet segment.  This is done by having each PE advertise a
   set of one or more Ethernet A-D per ES routes for each locally
   attached Ethernet segment (refer to Section 3.1 below for details on
   how these routes are constructed).
   A PE may need to advertise more than one Ethernet A-D per ES route
   for a given ES because the ES may be in a multiplicity of EVIs and
   the RTs for all of these EVIs may not fit into a single route.
   Advertising a set of Ethernet A-D per ES routes for the ES allows
   each route to contain a subset of the complete set of RTs.  Each
   Ethernet A-D per ES route is differentiated from the other routes in
   the set by a different Route Distinguisher (RD).

   Upon failure in connectivity to the attached ES, the PE withdraws
   the corresponding set of Ethernet A-D per ES routes.  This triggers
   all PEs that receive the withdrawal to update their next-hop
   adjacencies for all IP addresses across IP VRFs associated with the
   Ethernet segment in question.  If no other PE has advertised an
   Ethernet A-D route for the same segment, then the PE that received
   the withdrawal simply invalidates the IP entries for that segment.
   Otherwise, the PE updates its next-hop adjacencies accordingly.

   These routes should be processed with higher priority than other MAC
   or MAC-IP withdrawals upon failure.  Similar priority processing is
   needed even on the intermediate RRs.

   This draft addresses the mass withdrawal behavior for routed
   traffic.  For Layer 2, please refer to Section 8.2 of RFC 7432.

3.1  Constructing Ethernet A-D per Ethernet Segment Route

   This section describes the procedures used to construct the Ethernet
   A-D per ES route, which is used for fast convergence (as discussed
   above).  The usage/construction of this route remains similar to
   that described in Section 8.2.1 of RFC 7432, with a few notable
   exceptions as explained in the following sections.

3.1.1  Ethernet A-D Route Targets

   Each Ethernet A-D per ES route MUST carry one or more Route Target
   (RT) attributes.
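
   As an illustration of how a set of Ethernet A-D per ES routes can
   carry RTs that do not fit into one route, the following minimal
   Python sketch splits the combined IP-VRF and MAC-VRF RTs of one ES
   across several routes, each with its own RD.  The function name
   build_ead_es_routes(), the dictionary layout, and the
   max_rts_per_route limit are hypothetical; a real implementation is
   bounded by the BGP message size rather than by a fixed count.  The
   "<PE IP address>:<number>" RD layout is assumed here only to make
   each RD distinct.

```python
from typing import Dict, Iterable, List


def build_ead_es_routes(
    pe_ip: str,
    esi: str,
    ip_vrf_rts: Iterable[str],
    mac_vrf_rts: Iterable[str],
    max_rts_per_route: int = 2,
) -> List[Dict[str, object]]:
    """Split all RTs of an ES across one or more A-D per ES routes."""
    if max_rts_per_route < 1:
        raise ValueError("max_rts_per_route must be at least 1")

    # The route set must cover both the IP-VRF and the MAC-VRF RTs.
    all_rts = sorted(set(ip_vrf_rts) | set(mac_vrf_rts))

    routes: List[Dict[str, object]] = []
    for start in range(0, len(all_rts), max_rts_per_route):
        routes.append({
            # A distinct RD differentiates the routes in the set.
            "rd": f"{pe_ip}:{len(routes) + 1}",
            "esi": esi,
            # Every route carries at least one RT.
            "route_targets": all_rts[start:start + max_rts_per_route],
        })
    return routes


routes = build_ead_es_routes(
    "192.0.2.1",
    "ES1",
    ip_vrf_rts=["rt-vrf-a", "rt-vrf-b"],
    mac_vrf_rts=["rt-evi-10", "rt-evi-20", "rt-evi-30"],
    max_rts_per_route=2,
)
for route in routes:
    print(route["rd"], route["route_targets"])
```

   Withdrawing every route in the returned set corresponds to the mass
   withdrawal for that ES; a receiver that imports any of the listed
   RTs sees at least one of the withdrawals.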
   The set of Ethernet A-D routes per ES MUST carry the entire set of
   IP VRF RTs for all the IP VRFs, in addition to the MAC-VRF RTs for
   all the EVPN instances, to which the Ethernet segment belongs.

3.2  Avoiding convergence issues by syncing IP prefixes

   Consider a pair of multi-homing TORs PE1 and PE2.  Let there be a
   host H1 attached to them.  Consider another TOR PE3 and a host H3
   attached to it.

   If the host H1 is learnt on both the PEs, an ECMP path list pointing
   to (PE1/PE2) is formed on PE3.  Traffic from H3 to H1 is not
   impacted even if one of the TORs becomes unreachable, as the path
   list gets corrected upon receiving the mass withdrawal route
   (Ethernet A-D per ES).

   Let us consider a case where H1 is locally learnt only on PE1 due to
   port-channel hashing.  At PE3, H1 has the ECMP path list (PE1/PE2)
   using Aliasing as described in Section 2 of this draft.  Traffic
   from H3 can reach either of the TORs PE1 or PE2.  On PE2, all the
   remote MAC-IP routes belonging to the same Ethernet Segment that are
   advertised by its respective peers (PE1 in our example) should be
   synced and installed locally on PE2, but not advertised as local
   routes by BGP.  When the traffic from H3 reaches PE2, PE2 will be
   able to forward the traffic to H1 without any convergence delay
   caused by triggering ARP/ND.  In a scaled setup, the convergence
   delay can be significant, as ARP and ND resolution can take a lot of
   time.  So syncing the IPv4/IPv6 prefixes that belong to the same
   Ethernet Segment helps in solving convergence issues.

3.3  Handling Silent Host

   In continuation of the discussion above, if the reachability of PE1
   is lost, PE3 will update the ECMP list for H1 to PE2 upon receiving
   the mass withdrawal from PE1.  If the route for host H1 is also
   withdrawn by PE1, then the same route is removed on PE2 and PE3.
   Hence traffic from H3 to H1 is black-holed till H1 is re-learnt on
   PE2.
   This black-holing can be much worse if H1 behaves like a silent
   host.  The IP address of H1 will not be re-learnt on PE2 till H1
   re-ARPs or some traffic triggers ARP for H1.

   PE2 can detect the loss of reachability to PE1 in the following
   ways:

   a) When a core failure or box reload happens on PE1, the loss of
      next-hop reachability to PE1 can be detected by the underlay
      routing protocols.

   b) Upon access failure, PE1 withdraws the EAD/ES route and PE2 can
      use this as a trigger to detect the failure.

   Thus, to avoid the black-holing, when PE2 detects loss of
   reachability to PE1, it should trigger ARP/ND for all remote IP
   prefixes received from its ES peers (i.e., PE1) belonging to the
   same Ethernet Segment across IP-VRF contexts.  This will force host
   H1 to reply to the solicited ARP/ND from PE2 and refresh both MAC
   and IP for the corresponding host in its tables.

   Even in the core failure scenario on PE1, PE1 must withdraw all its
   local L2 connectivity, as L2 traffic should not be received by PE1.
   So when ARP/ND is triggered from PE2, the replies from host H1 can
   only be received by PE2.  Thus H1 will be learnt as a local route
   and also advertised from PE2.

   It is recommended to have a staggered or delayed deletion of the IP
   routes from PE1, so that the ARP/ND refresh can happen on PE2 before
   the deletion.

3.4  MAC Aging

   PE1 would do an ARP/ND refresh for H1 before it ages out.  During
   this process, H1 on PE1 can age out genuinely or due to the ARP/ND
   reply landing on PE2.  PE1 must withdraw the local entry from BGP
   when the H1 entry ages out.  PE1 deletes the entry from the local
   forwarding only when there are no remote synced entries.

4  Determining Reachability to Unicast IP Addresses

4.1  Local Learning

   The procedures for local learning do not change from [RFC7432].

4.2  Remote Learning

   The procedures for remote learning do not change from [RFC7432].

4.2.1  Constructing MAC/IP Address Advertisement

   The procedures for constructing the MAC/IP Address Advertisement do
   not change from [RFC7432].

4.2.2  Route Resolution

   If the ESI field is set to a reserved value (0 or MAX-ESI), the IP
   route resolution MUST be based on the MAC-IP route alone.

   If the ESI field is set to a non-reserved ESI, the IP route
   resolution MUST happen only when both the MAC-IP route and the
   associated set of Ethernet A-D per ES routes have been received.

   To illustrate this with an example, consider a pair of multi-homed
   TORs PE1 and PE2 connected to an Ethernet Segment ES1 in all-active
   redundancy mode.  A given host with IP address H1 is learnt by PE1
   but not by PE2.

   (1) When the MAC-IP advertisement route from PE1 and a set of EAD/ES
       and Layer 3 EAD/EVI routes from PE1 and PE2 are received, PE3
       can forward traffic destined to H1 to both PE1 and PE2.

   If after (1) PE1 withdraws EAD/ES, then PE3 will forward the said
   traffic to PE2 only.

   If after (1) PE2 withdraws EAD/ES, then PE3 will forward the said
   traffic to PE1 only.

   If after (1) PE1 withdraws the MAC-IP route, then PE3 will do
   delayed deletion of H1, as described in Section 3.3.

   If after (1) PE2 advertises the MAC-IP route but PE1 withdraws it,
   PE3 will continue forwarding to both PE1 and PE2 as long as it has
   the EAD/ES and the Layer 3 EAD/EVI routes from both.

5  Forwarding Unicast Packets

   Please refer to Section 5 of
   draft-ietf-bess-evpn-inter-subnet-forwarding-01.

6  Load Balancing of Unicast Packets

   The procedures for load balancing of Unicast Packets do not change
   from [RFC7432].

7  Security Considerations

   The mechanisms in this document use the EVPN control plane as
   defined in [RFC7432].  Security considerations described in
   [RFC7432] are equally applicable.

   This document uses MPLS and IP-based tunnel technologies to support
   data plane transport.
   Security considerations described in [RFC7432] and in
   [ietf-evpn-overlay] are equally applicable.

8  IANA Considerations

9  References

9.1  Normative References

   [KEYWORDS]  Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC1776]   Crocker, S., "The Address is the Message", RFC 1776,
               April 1 1995.

   [TRUTHS]    Callon, R., "The Twelve Networking Truths", RFC 1925,
               April 1 1996.

9.2  Informative References

   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Suresh Pasupula
   Cisco
   Email: spasupula@cisco.com

   Gaurav Badoni
   Cisco
   Email: gbadoni@cisco.com

   Priyanka Warade
   Cisco
   Email: pwarade@cisco.com