Cisco Support Community

ASR9000/XR: Load-balancing architecture and characteristics

Introduction

This document discusses how the ASR9000 chooses among multiple paths when it can load-balance. It covers IPv4, IPv6, and both ECMP and Bundle/LAG/Etherchannel scenarios in L2 and L3 environments.

Core Issue

The load-balancing architecture of the ASR9000 can be a bit complex because of the platform's 2-stage forwarding. The scenarios in this article explain how a load-balancing decision is made so you can architect your network around it.

This document assumes that you are running XR 4.1 at minimum (XR 3.9.X is not discussed); XR 4.2 enhancements are called out where applicable.

Load-balancing Architecture and Characteristics

Characteristics

ASR9000 has the following load-balancing characteristics:

  • ECMP:
  1. Non recursive or IGP paths : 32-way
  2. Recursive or BGP paths:
    1. 8-way for Trident
    2. 32 way for Typhoon
    3. 64 way Typhoon in XR 5.1+
    4. 64 way Tomahawk XR 5.3+ (Tomahawk only supported in XR 5.3.0 onwards)

  • Bundle:
  1. 64 members per bundle

The way they tie together is shown in this simplified L3 forwarding model:

Screen Shot 2012-08-28 at 12.12.25 PM.png

NRLDI = Non Recursive Load Distribution Index

RLDI = Recursive Load Distribution Index

ADJ = Adjacency (forwarding information)

LAG = Link Aggregation, eg Etherchannel or Bundle-Ether interface

OIF = Outgoing InterFace, eg a physical interface like G0/0/0/0 or Te0/1/0/3

What this picture shows you is that a Recursive BGP route can have 8 different paths, pointing to 32 potential IGP ways to get to that BGP next hop, and EACH of those 32 IGP paths can be a bundle which could consist of 64 members each!

Architecture

The architecture of the ASR9000 load-balancing implementation revolves around the fact that the load-balancing decision is made on the INGRESS linecard.

This ensures that we ONLY send the traffic to that LC, path or member that is actually going to forward the traffic.

The following picture shows that:

Screen Shot 2012-08-28 at 1.58.42 PM.png

In this diagram, let's assume there are 2 paths: one via PATH-1 on LC2, and a second via a bundle with 2 members on different linecards.

(note this is a bit extraordinary considering that equal cost paths can't be mathematically created by a 2 member bundle and a single physical interface)

The Ingress NPU on the LC1 determines based on the hash computation that PATH1 is going to forward the traffic, then traffic is sent to LC2 only.

If the ingress NPU instead selects PATH2, the bundle-ether, then the LAG (link aggregation) selector points directly to the member, and traffic is sent only to the NP on the linecard hosting the member that is going to forward the traffic.

Based on the forwarding architecture you can see that the adj points to a bundle, which can have multiple members.

With this model, LAG table updates (members appearing/disappearing) do NOT require a FIB update at all!
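The indirection described above can be pictured with a small sketch. This is a hypothetical model in Python, not the platform's actual data structures: the names `fib`, `lag_table` and `select_egress` are invented for illustration, and the modulo is a stand-in for the real bit selection.

```python
# Hypothetical model: the FIB adjacency names a bundle, and a separate
# LAG table maps the bundle to its current members.
fib = {"49.1.1.0/24": "Bundle-Ether200"}                      # prefix -> adjacency
lag_table = {"Bundle-Ether200": ["Te0/1/0/3", "Te0/2/0/3"]}   # bundle -> members


def select_egress(prefix, hash_value):
    """Resolve a prefix to one physical member, as the ingress side would."""
    bundle = fib[prefix]
    members = lag_table[bundle]
    return members[hash_value % len(members)]  # modulo is an assumption


fib_before = dict(fib)
first = select_egress("49.1.1.0/24", 1)   # second member while both are up

# A member disappears: only the LAG table changes, the FIB entry is untouched.
lag_table["Bundle-Ether200"].remove("Te0/2/0/3")
after = select_egress("49.1.1.0/24", 1)
fib_unchanged = fib == fib_before
```

The point of the sketch is only that the member list lives behind the adjacency, so a member flap never has to touch the prefix entries.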

What is a HASH and how is it computed

In order to determine which path (ECMP) or member (LAG) to choose, the system computes a hash. Certain bits out of this hash are used to identify member or path to be taken.

  • Pre 4.0.x Trident used a folded XOR methodology resulting in an 8 bit hash from which bits were selected
  • Post 4.0.x Trident uses a checksum based calculation resulting in a 16 bit hash value
  • Post 4.2.x Trident uses a checksum based calculation resulting in a 32 bit hash value
  • Typhoon 4.2.0 uses a CRC based calculation of the L3/L4 info and computes a 32 bit hash

8-way recursive means that we are using 3 bits out of that hash result

32-way non recursive means that we are using 5 bits

64 members means that we are looking at 6 bits out of that hash result

It is system defined, by load-balancing type (recursive, non-recursive or  bundle member selection) which bits we are looking at for the  load-balancing decision.
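The bit counts above follow directly from the number of buckets. A minimal sketch of that arithmetic, assuming for illustration that the bits are taken from the low end of the hash (the real bit positions are system defined and not stated here; `bits_needed` and `pick_bucket` are hypothetical helper names):

```python
def bits_needed(ways):
    """Number of hash bits required to address `ways` buckets."""
    return (ways - 1).bit_length()


def pick_bucket(hash_value, nbits, offset=0):
    """Extract `nbits` bits starting at bit `offset` (offset is an assumption)."""
    return (hash_value >> offset) & ((1 << nbits) - 1)


for ways in (8, 32, 64):
    print(ways, "ways ->", bits_needed(ways), "bits")

# Low 5 bits of an example 32 bit hash select one of 32 non-recursive paths.
bucket = pick_bucket(0xDEADBEEF, bits_needed(32))
```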

Fields used in ECMP HASH

What is fed into the HASH depends on the scenario:

Incoming Traffic Type               | Load-balancing Parameters
IPv4                                | Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
IPv6                                | Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
MPLS - IP Payload, with < 4 labels  | Source IP, Destination IP, Source port (TCP/UDP only), Destination port (TCP/UDP only), Router ID
MPLS - IP Payload, with > 4 labels  | 4th MPLS Label (or Inner most) and Router ID
MPLS - Non-IP Payload               | Inner most MPLS Label and Router ID

* Non IP Payload includes an Ethernet interworking, generally seen on Ethernet Attachment Circuits running VPLS/VPWS.

These have a construction of

EtherHeader-Mpls(next hop label)-Mpls(pseudowire label)-etherheader-InnerIP

In those scenarios the system will use the MPLS based case with non ip payload.

IP Payload in MPLS is a common case for IP based MPLS switching on LSR's whereby after the inner label an IP header is found directly.

Router ID

The router ID is a value taken from an interface address in the system, in an attempt to provide some per-node variation.

This value is determined at boot time only. What the system uses can be checked with:

sh arm router-ids

 

Example:

 

RP/0/RSP0/CPU0:A9K-BNG#show arm router-id

Tue Aug 28 11:51:50.291 EDT

Router-ID         Interface

 

8.8.8.8           Loopback0      

RP/0/RSP0/CPU0:A9K-BNG#

Bundle in L2 vs L3 scenarios

This section is specific to bundles. A bundle can either be an AC or attachment circuit, or it can be used to route over.

Depending on how the bundle ether is used, different hash field calculations may apply.

When the bundle ether interface has an IP address configured, then we follow the ECMP load-balancing scheme provided above.

When the bundle ether is used as an attachment circuit, meaning it has the "l2transport" keyword associated with it and is used in an xconnect or bridge-domain configuration, L2 based balancing is used by default. That is Source and Destination MAC with Router ID.

If there is a router on each end of the AC, the MACs vary very little, or not at all. In that case you may want to switch to L3 based balancing, which can be configured under the l2vpn configuration:

 

RP/0/RSP0/CPU0:A9K-BNG#configure

RP/0/RSP0/CPU0:A9K-BNG(config)#l2vpn

RP/0/RSP0/CPU0:A9K-BNG(config-l2vpn)#load-balancing flow ?

  src-dst-ip   Use source and destination IP addresses for hashing

  src-dst-mac  Use source and destination MAC addresses for hashing

 

Use case scenarios

Screen Shot 2012-08-28 at 1.11.11 PM.png

Case 1 Bundle Ether Attachment circuit (downstream)

In this case the bundle ether has a configuration similar to

 

interface bundle-ether 100.2 l2transport

  encap dot1q 2

  rewrite ingress tag pop 1 symmetric

 

And the associated L2VPN configuration such as:

 

l2vpn

  bridge group BG

  bridge-domain BD

   interface bundle-e100.2

 

In the downstream direction we load-balance on the L2 information by default, unless "load-balancing flow src-dst-ip" is configured.

Case 2 Pseudowire over Bundle Ether interface (upstream)

The attachment circuit in this case doesn't really matter, whether it is bundle or single interface.

The associated configuration for this in the L2VPN is:

 

l2vpn

  bridge group BG

   bridge-domain BD

    interface bundle-e100.2

    vfi MY_VFI

    neighbor 1.1.1.1 pw-id 2

 

interface bundle-ether 200

  ipv4 add 192.168.1.1 255.255.255.0

 

router static

  address-family ipv4 unicast

    1.1.1.1/32 192.168.1.2

 

In this case neighbor 1.1.1.1 is found via routing, which happens to egress out of our bundle Ethernet interface.

This is MPLS encapped (PW) and therefore we will use MPLS based load-balancing.

Case 3 Routing through a Bundle Ether interface

In this scenario we are just routing out the bundle Ethernet interface because our ADJ tells us so (as defined by the routing).

Config:

interface bundle-ether 200

ipv4 add 200.200.1.1 255.255.255.0

 

show route (OSPF inter area route)

O IA 49.1.1.0/24 [110/2] via 200.200.1.2, 2w4d, Bundle-Ether200

Even if this bundle-ether is MPLS enabled and we assign a label to get to the next hop or do label swapping, in this case the Ether header followed by the MPLS header has IP directly behind it.

We will be able to do L3 load-balancing in that case, as per the chart above.

(Layer 3) Load-balancing in MPLS scenarios

As highlighted throughout this technote, load-balancing in MPLS scenarios, whether based on MPLS label or on IP, depends on the inner encapsulation.

Depicted in the diagram below, we have an Ethernet frame with IP going into a pseudo wire switched through the LSR (P router) down to the remote PE.

Screen Shot 2012-08-28 at 1.22.26 PM.png

In this case the pseudowire encapsulates the complete frame (with ether header) into MPLS, with an ether header for the next hop from the left PE router to the LSR in the middle.

Although the number of labels is LESS than 4 AND there is IP available, the system can't skip beyond the ether header to read the IP, and therefore falls back to MPLS label based load-balancing.

How the system differentiates between an IP header after the inner most label and a non-IP payload is explained here:

Just to recap, the MPLS header looks like this:

Screen Shot 2012-08-28 at 1.28.12 PM.png

Now the important part of this picture is that it shows MPLS-IP. In the VPLS/VPWS case this "GREEN" field is likely to start with Ethernet headers.

Because hardware forwarding devices are limited in the number of PPS they can handle, and this is a direct equivalent to the number of instructions that are needed to process a packet, we want to make sure we can work with a packet in the LEAST number of instructions possible.

In order to comply with that thought process, we check the first nibble following the MPLS header and if that starts with a 4 (ipv4) or a 6 (ipv6) we ASSUME that this is an IP header and we'll interpret the data following as an IP header deriving the L3 source and destination.

 

Now this works great in the majority of scenarios, because hey let's be honest, MAC addresses for the longest time started with 00-0......, in other words not a 4 or 6, and we'd default to MPLS based balancing, something that we wanted for VPLS/VPWS.

However, these days we see MAC addresses that no longer start with zeros, and in fact 4's or 6's are seen!

This fools the system into believing that the inner packet is IP, while in reality it is an Ether header.

There is no good way to classify an ip header with a limited number of instruction cycles that would not affect performance.

In an ideal world you'd want to use an MD5 hash and all the checks possible to make the perfect decision.

Reality is different, and no one wants to pay the price of designing ASICs that can run a very comprehensive set of checks without affecting the PPS rate.

Bottom line is that if your DMAC starts with a 4 or 6 you have a situation.

Solution

Use the MPLS control word.

The control word is negotiated end to end and inserts 4 special bytes starting with zeros, specifically to accommodate this purpose.

The system will now read a 0 instead of a 4 or 6 and default to MPLS based balancing.
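To make the first-nibble check and the effect of the control word concrete, here is a minimal sketch. It is an illustration only, not the NP microcode: `classify_after_label_stack` is a hypothetical name, the MAC addresses are made-up examples, and the control word is simplified to four zero bytes.

```python
def classify_after_label_stack(payload: bytes) -> str:
    """Mimic the first-nibble check made after the bottom-of-stack label."""
    first_nibble = payload[0] >> 4
    if first_nibble == 4:
        return "ipv4"
    if first_nibble == 6:
        return "ipv6"
    return "label-based"


ipv4_header = bytes([0x45, 0x00])                # version 4, IHL 5
old_style_dmac = bytes.fromhex("000c29aabbcc")   # example DMAC starting with 0
new_style_dmac = bytes.fromhex("4c0082aabbcc")   # example DMAC starting with 4
control_word = bytes(4)                          # simplified: four zero bytes

print(classify_after_label_stack(ipv4_header))                    # ipv4
print(classify_after_label_stack(old_style_dmac))                 # label-based
print(classify_after_label_stack(new_style_dmac))                 # ipv4 (misread)
print(classify_after_label_stack(control_word + new_style_dmac))  # label-based
```

The third case is the problem scenario: an Ethernet frame is mistaken for IPv4. With the control word in front, the first nibble is 0 again and the label based method is used.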

Configuration

To enable the control word, use the following template:

 

l2vpn

pw-class CW

  encapsulation mpls

   control-word

  !

!

xconnect group TEST

  p2p TEST_PW

   interface GigabitEthernet0/0/0/0

   neighbor 1.1.1.1 pw-id 100

    pw-class CW

   !

  !

!

!

Alternative solutions: Fat Pseudowire

You might have little control over the inner label (the PW label), yet you probably want to ensure some sort of load-balancing, especially on P routers that have no knowledge of the offered service or of the MPLS packets they transport. For this, another solution is available, known as FAT Pseudowire.

FAT PW inserts a "flow label" whose value is computed like a hash, to provide some hop by hop variation and more granular load-balancing. Special care is taken that there is variation (based on the l2vpn command, see below), that no reserved values are generated, and that the values don't collide with allocated label values.

Fat PW is supported starting XR 4.2.1 on both Trident and Typhoon based linecards.

One note to make is that with the relatively newer functionality of PseudoWire Headend (aka PWHE), FAT labels cannot be used. You can configure it, but the forwarding, especially downstream, entering the PWHE interface will break.

PWHE requires a pindown list to identify the possible ingress/egress interfaces for the PW to come in on (and all features are programmed there). The system will load-balance based on PW ID, both on the a9k starting the PWHE and obviously also on interim P nodes. The reason is that the egress interface selection is already made at the time the PW encap is applied. It isn't a simple change in the current forwarding architecture to "rehash" based on something else without incurring heavy PPS reduction (and possible recirculation).

Packet transformation with a Flow Label

Screen Shot 2012-08-29 at 1.19.42 PM.png

Configuration of FAT Pseudowire

The following is a configuration example:

 

l2vpn

load-balancing flow src-dst-ip

pw-class test

encapsulation mpls

   load-balancing

   flow-label both static

   !

!

 

You can also affect the way that the flow label is computed:

Under L2VPN configuration, use the “load-balancing flow” configuration command to determine how the flow label is generated:

l2vpn

    load-balancing flow src-dst-mac

This is the default configuration, and will cause the NP to build the flow label from the source and destination MAC addresses in each frame.

l2vpn

    load-balancing flow src-dst-ip

 

This is the recommended configuration, and will cause the NP to build the flow label from the source and destination IP addresses in each frame.

• Note that IPv6 hashing is not supported in the first release.
 
FAT Pseudowire TLV

The Flow Aware Transport (FAT) PW signalled sub-TLV id originally carried value 0x11, as specified in draft-ietf-pwe3-fat-pw. This value has since been corrected in the draft to 0x17, the flow label sub-TLV identifier assigned by IANA.

This matters when interoperating between XR versions 4.3.1 and earlier and XR versions 4.3.2 and later: all XR releases 4.3.1 and prior that support FAT PW default to value 0x11, while all XR releases 4.3.2 and later default to value 0x17.

Solution:

Use the following config on XR version 4.3.2 and later to configure the sub-tlv id

pw-class <pw-name>

   encapsulation mpls

   load-balancing

   flow-label both

  flow-label code 17

NOTE: There have been a lot of questions about the 0x11 to 0x17 change (as driven by IANA) and the config requirement for number 17 in this example.

The crux is that the flow label code is configured in DECIMAL, while the IANA/draft numbers mentioned are HEX.

So 0x11, the old value, is 17 decimal, which indeed looks very similar to 0x17, the new IANA assigned number. Very annoying, thank IANA (or we could have made the knob in hex I guess).
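The hex versus decimal confusion is easy to check. The snippet below shows only the arithmetic, nothing about which codes a given release accepts:

```python
old_draft_value = 0x11   # sub-TLV id from the original draft
iana_value = 0x17        # sub-TLV id assigned by IANA

print(old_draft_value)   # 17
print(iana_value)        # 23
print(hex(17))           # 0x11
```

So the decimal 17 in the configuration corresponds to the old 0x11, not to the IANA value 0x17.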

Loadbalancing and priority configurations

In the case of VPWS or VPLS, at the ingress PE side, it’s possible to change the load-balance upstream to MPLS Core in three different ways:

1. At the L2VPN sub-configuration mode with “load-balancing flow” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-l2vpn)# load-balancing flow ?

  src-dst-ip

  src-dst-mac [default]

2. At the pw-class sub-configuration mode with “load-balancing” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-l2vpn-pwc-mpls-load-bal)#?

  flow-label [see FAT Pseudowire section]

  pw-label [per-VC load balance]

3. At the Bundle interface sub-configuration mode with “bundle load-balancing hash” command with the following options:

RP/0/RSP1/CPU0:ASR9000(config-if)#bundle load-balancing hash ? [For default, see previous sections]

  dst-ip 

  src-ip

It’s important to not only understand these commands but also that: 1 is weaker than 2 which is weaker than 3.

Example:

l2vpn
 load-balancing flow src-dst-ip
 pw-class FAT
  encapsulation mpls
   control-word
   transport-mode ethernet
   load-balancing
    pw-label
    flow-label both static

interface Bundle-Ether1
 (...)
 bundle load-balancing hash dst-ip

Because of the priorities, on the egress side of the ingress PE (to the MPLS Core), we will do per-dst-ip load-balance (3).

If the bundle-specific configuration is removed, we will do per-VC load-balance (2).

If the pw-class load-balance configuration is removed, we will do per-src-dst-ip load-balance (1).
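The precedence described above can be summarized in a few lines. This is a sketch of the stated ordering only, with a hypothetical function name and string values taken from the CLI options shown earlier; it is not how the router resolves the configuration internally.

```python
def effective_load_balancing(l2vpn_flow=None, pw_class=None, bundle_hash=None):
    """Return the setting that wins: bundle (3) over pw-class (2) over l2vpn (1)."""
    for setting in (bundle_hash, pw_class, l2vpn_flow):
        if setting is not None:
            return setting
    return "src-dst-mac"  # l2vpn default according to the option list


all_three = effective_load_balancing("src-dst-ip", "pw-label", "dst-ip")
no_bundle = effective_load_balancing("src-dst-ip", "pw-label")
only_l2vpn = effective_load_balancing("src-dst-ip")
print(all_three, no_bundle, only_l2vpn)
```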

with thanks to Bruno Oliveira for this priority section

P2MP MPLS TE Tunnels

Only one bundle member will be selected to forward traffic on the P2MP MPLS TE mid-point node.

Possible alternatives that would achieve better load balancing are: a) increase the number of tunnels or b) switch to mLDP.

IPv6

In pre-4.2.0 releases, the IPv6 hash calculation uses only the last 64 bits of the address, folded and fed into the hash together with the regular router ID and L4 info.

In 4.2.0 we made a further enhancement so that the full IPv6 address is taken into consideration, along with L4 info and router ID.

Determining load-balancing

You can determine the load-balancing on the router by using the following commands

L3/ECMP

For IP :

RP/0/RSP0/CPU0:A9K-BNG#show cef exact-route 1.1.1.1 2.2.2.2 protocol udp ?

  source-port  Set source port

You have the ability to only specify L3 info, or include L4 info by protocol with source and destination ports.

It is important to understand that the 9k does FLOW based hashing, that is, all packets belonging to the same flow will take the same path.

If one flow is more active or requires more bandwidth than another flow, path utilization may not be a perfectly equal spread.

Unless you provide enough variation in the L3/L4 fields, this problem can't be alleviated; it is generally seen in lab tests due to the limited number of flows.
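Flow based hashing can be illustrated with a short sketch. CRC32 from Python's `zlib` is used purely as a stand-in hash and `path_for_flow` is a hypothetical name; the actual NP algorithm and field encoding are not shown in this document.

```python
import zlib
from collections import Counter


def path_for_flow(src, dst, proto, sport, dport, n_paths):
    """Pick a path from the 5-tuple; CRC32 is a stand-in, not the NP algorithm."""
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    return zlib.crc32(key) % n_paths


# Every packet of one flow hashes identically, so it always takes the same path.
one_flow = {
    path_for_flow("1.1.1.1", "2.2.2.2", "udp", 5000, 53, 4) for _ in range(100)
}

# Many flows (varying source port here) are spread over the available paths.
spread = Counter(
    path_for_flow("1.1.1.1", "2.2.2.2", "udp", sport, 53, 4)
    for sport in range(1024, 2048)
)
print(len(one_flow), dict(spread))
```

With a single flow, or a handful of flows as in many lab setups, there is simply nothing for the hash to spread.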

For MPLS based hashing :

RP/0/RSP0/CPU0:A9K-BNG#sh mpls forwarding exact-route label 1234 bottom-label 16000 ... location 0/1/cpu0

This command gives us the output interface chosen as a result of hashing with mpls label 16000. The bottom-label (in this case '16000') is either the VC label (in case of PW L2 traffic) or the bottom label of mpls stack (in case of mpls encapped L3 traffic with more than 4 labels). Please note that for regular mpls packets (with <= 4 labels) encapsulating an L3 packet, only IP based hashing is performed on the underlying IP packet.

Also note that the mpls hash algorithm is different for Trident and Typhoon. The more varied the labels are, the better the distribution. However, on Trident there is a known behavior of the mpls hash on bundle interfaces: if a bundle interface has an even number of member links, the mpls hash causes only half of these links to be utilized. To get around this, you may have to configure the "cef load-balancing algorithm adjust 3" command on the router, or use an odd number of member links within the bundle interface. Note that this limitation applies only to Trident line cards and not Typhoon.

Bundle member selection

RP/0/RSP0/CPU0:A9K-BNG#bundle-hash bundle-e 100 loc 0/0/cPU0

Calculate Bundle-Hash for L2 or L3 or sub-int based: 2/3/4 [3]: 3

Enter traffic type (1.IPv4-inbound, 2.MPLS-inbound, 3:IPv6-inbound): [1]: 1

Single SA/DA pair or range: S/R [S]:

Enter source IPv4 address [255.255.255.255]:

Enter destination IPv4 address [255.255.255.255]:

Compute destination address set for all members? [y/n]: y

Enter subnet prefix for destination address set: [32]:

Enter bundle IPv4 address [255.255.255.255]:

Enter L4 protocol ID. (Enter 0 to skip L4 data) [0]:

Invalid protocol. L4 data skipped.

Link hashed [hash_val:1] to is GigabitEthernet0/0/0/19 LON 1 ifh 0x4000580

The hash type L2 or L3 depends on whether you are using the bundle Ethernet interface as an Attachment Circuit in a Bridgedomain or VPWS crossconnect, or whether the bundle ether is used to route over (eg has an IP address configured).

Polarization

Polarization pertains mostly to ECMP scenarios and is the effect of routers in a chain making the same load-balancing decision.

The following picture tries to explain that.

Screen Shot 2012-08-28 at 11.11.32 AM.png

In this scenario we assume 2 buckets, selected by 1 bit of a 7 bit hash result. Let's say that in this case we only look at bit-0, so it becomes an "EVEN" or "ODD" type decision. The routers in the chain have access to the same L3 and L4 fields; the only varying factor between them is the router ID.

In the case that we have RIDs that are similar or close (which is not uncommon), the system may not provide enough variation in the hash result, which eventually leads subsequent routers to compute the same hash and therefore polarize to a "Southern" (in the example above) or "Northern" path.

In XR 4.2.1 via a SMU or in XR 4.2.3 in the baseline code, we provide a knob that allows for shifting the hash result. By choosing a different "shift" value per node, we can make the system look at a different bit (for this example), or bits.

Screen Shot 2012-08-28 at 11.57.20 AM.png

In this example the first line shifts the hash by 1, the second one shifts it by 2.

Considering that we have more buckets in the real implementation and more bits that we look at, the member or path selection can alter significantly based on the same hash but with the shifting, which is what we ultimately want.
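Polarization and the effect of the shift can be reproduced with a toy model. Everything here is an assumption made for illustration: CRC32 stands in for the real hash, both routers feed the same router ID into it (the worst case), and the shift is modelled as a right shift before the bit is read.

```python
import zlib


def flow_hash(flow, router_id):
    """Stand-in hash over flow fields plus router ID (not the NP algorithm)."""
    return zlib.crc32(f"{flow}|{router_id}".encode())


def choose_path(hash_value, shift=0):
    """Two buckets: look at a single bit, moved by the configured shift."""
    return (hash_value >> shift) & 1


flows = [f"10.0.0.{i}->20.0.0.1" for i in range(256)]
rid = "1.1.1.1"  # worst case: both routers feed the same value into the hash

# Router 1 sends only the flows whose selected bit is 0 towards router 2.
to_router2 = [f for f in flows if choose_path(flow_hash(f, rid)) == 0]

# Router 2 without a shift repeats the decision: everything lands on path 0.
polarized = {choose_path(flow_hash(f, rid)) for f in to_router2}

# Router 2 with a shift of 1 looks at a different bit of the same hash.
shifted = {choose_path(flow_hash(f, rid), shift=1) for f in to_router2}
print(polarized, shifted)
```

Without the shift, router 2 can only ever use one path for the traffic it receives from router 1. With the shift it reads a bit that router 1 did not use for its decision, so both paths become reachable again.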

HASH result Shifting

  • Trident allows for a shift of maximum of 4 (performance reasons)
  • Typhoon allows for a shift of maximum of 32.

Command

cef load-balancing algorithm adjust <value>

The command accepts values larger than 4 on Trident. If you configure a value larger than 4 on Trident, it is effectively applied as a modulo, so a shift of 5 is the same as a shift of 1.

Fragmentation and Load-balancing

When the system detects fragmented packets, it will no longer use L4 information. The reason is that if L4 info were used, and subsequent fragments don't contain the L4 info anymore (they have an L3 header only!), the initial fragment and subsequent fragments would produce different hash results and could take different paths, resulting in out-of-order delivery.

Regardless of release, regardless of hardware (ASR9K or CRS), when fragmentation is detected we only use L3 information for the hash computation.
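The reasoning can be sketched as follows. The function names are hypothetical and CRC32 is again only a stand-in hash; the point is that dropping L4 for every fragment makes the first and later fragments hash on the same input.

```python
import zlib


def hash_key(src, dst, sport=None, dport=None, fragmented=False):
    """Build the hash input; L4 ports are dropped for any fragmented packet."""
    if fragmented or sport is None or dport is None:
        return f"{src}|{dst}"
    return f"{src}|{dst}|{sport}|{dport}"


def pick(key, n_paths=4):
    return zlib.crc32(key.encode()) % n_paths  # CRC32 is only a stand-in


# The first fragment carries the L4 header, later fragments do not.
first_frag = hash_key("1.1.1.1", "2.2.2.2", 5000, 80, fragmented=True)
later_frag = hash_key("1.1.1.1", "2.2.2.2", fragmented=True)
same_path = pick(first_frag) == pick(later_frag)
print(first_frag, later_frag, same_path)
```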

Related Information

Xander Thuijs, CCIE #6775

Sr Tech Lead ASR9000

Version history
Revision #: 1 of 1
Last update: 08-28-2012 07:08 AM
Comments
Anonymous
N/A

Hi Xander! how are you...

Let me tell you about a test we did with Horacio last week in connection with OSPF areas and ASR9K. The topology is exactly as in the layout. MPLS is configured on all the interfaces. One H-VPLS between NPEs. OSPF as IGP, ABR running on AGGs, all links in the layout with the same cost, BGP neighbors ABRs-uPEs. High availability, same parameters on each node, graceful restart etc. We built active and backup pseudowires between loopbacks so that uPE7 reaches uPE8.

When we shut down AGG-2 we had loss of traffic for about 150 seconds (see the log), very close to the MPLS hold time set to 180s. The time for the backup pseudowire to become active improved when we removed graceful restart and set the MPLS hold time to low values. Although it was then about 80 seconds, we expected this to be below one second.

In some example embodiments, when the active PW path goes down (e.g., is detected using a PW failure detection mechanism), the U-PE can immediately enable one of the backup PW paths (which are selected during configuration) and start forwarding traffic over the backup PW path with little or no signalling between the U-PE and N-PE

So, I wonder what we should do in order to improve this to sub-second. Maybe BFD on PWE? Does release 4.2.1 or 4.3 support FFD (fast failure detection)? I mean, will it be possible to configure what is supported in 15.2S as described in this link, or something similar, on the 9K?

http://www.cisco.com/en/US/docs/ios-xml/ios/mp_l2_vpns/configuration/15-2s/wan-l2vpn-pw-red.html

"The configuration of a trigger for redundant pseudowire switchover reduces the time that it takes a large number of pseudowires to failover. A fundamental component of bidirectional forwarding detection (BFD) capability is enabled by fast-failure detection (FFD).

The configuration of this feature refers to a BFD configuration, such as the following (the second URL in the bfd map command is the loopback URL in the monitor peer bfd command):

bfd-template multi-hop mh

interval min-tx 200 min-rx 200 multiplier 3 !

bfd map ipv4  10.1.1.0/24  10.1.1.1/32 mh

"

Our test layout

maqueta_ospf.jpg

uPE8 log during AGG-2 shutting down

UPE-8#

*Oct 25 21:55:53.810: %BGP-5-ADJCHANGE: neighbor 190.225.251.152 Down BGP Notification sent

*Oct 25 21:55:53.810: %BGP-3-NOTIFICATION: sent to neighbor 190.225.251.152 4/0 (hold time expired) 0 bytes

UPE-8#

*Oct 25 21:57:05.858: %LDP-5-GR: GR session 190.225.251.152:0 (inst. 1): interrupted--recovery pending  

*Oct 25 21:57:05.858: %LDP-5-NBRCHG: LDP Neighbor 190.225.251.152:0 (0) is DOWN (Discovery Hello Hold Timer expired)

*Oct 25 21:57:05.862: MPLS peer 190.225.251.152 vcid 12, VC DOWN, VC state DOWN

*Oct 25 21:57:05.862: XC L2TP: Failed to find session for peer 190.225.251.152, vcid 12

*Oct 25 21:58:15.238: XC EVT[190.225.251.152:12]: Received status indication DOWN, alarm 0

*Oct 25 21:58:15.242: XC EVT[190.225.251.152:12]: FSM EV: Pri Dn, CHG from PUp,SAv to PDn,SAv,Etmr

*Oct 25 21:58:15.242: XC EVT[190.225.251.152:12]: FSM: Executing action Start Backup Enable Timer

*Oct 25 21:58:15.242: XC EVT[190.225.251.152:12]: FSM EV: Enable Tmr, CHG from PDn,SAv,Etmr to PDn,SUp

*Oct 25 21:58:15.242: XC EVT[190.225.251.152:12]: FSM: Executing action Activate Secondary Member

*Oct 25 21:58:15.242: Activating secondary member 190.225.251.151:23

*Oct 25 21:58:15.242: XC EVT[190.225.251.152:12]: Send event [Request Hot-Standby] to server

*Oct 25 21:58:15.246: XC EVT[190.225.251.152:12]: Received hot-standby complete event

*Oct 25 21:58:15.246: XC EVT[190.225.251.151:23]: Send event [Peer Status UP] to server

*Oct 25 21:58:15.250: XC EVT[190.225.251.151:23]: Received status indication UP, alarm 0

*Oct 25 21:58:15.250: XC EVT[190.225.251.151:23]: FSM EV: Sec Up, CHG from PDn,SUp to PDn,SUp

*Oct 25 21:58:15.250: XC EVT[190.225.251.151:23]: FSM: Executing action Ignore event, no action

*Oct 25 21:58:15.262: MPLS peer 190.225.251.151 vcid 23, VC UP, VC state UP

And without GR it lasts 87 seconds:

*Oct 30 16:19:38.038: %BGP-5-ADJCHANGE: neighbor 190.225.251.152 Down BGP Notification sent

*Oct 30 16:19:38.038: %BGP-3-NOTIFICATION: sent to neighbor 190.225.251.152 4/0 (hold time expired) 0 bytes

*Oct 30 16:20:50.378: %LDP-5-NBRCHG: LDP Neighbor 190.225.251.152:0 (3) is DOWN (Discovery Hello Hold Timer expired)

*Oct 30 16:20:50.390: MPLS peer 190.225.251.152 vcid 12, VC DOWN, VC state DOWN

*Oct 30 16:20:50.390: XC L2TP: Failed to find session for peer 190.225.251.152, vcid 12

*Oct 30 16:20:50.394: XC EVT[190.225.251.152:12]: Received status indication DOWN, alarm 0

*Oct 30 16:20:50.394: XC EVT[190.225.251.152:12]: FSM EV: Pri Dn, CHG from PUp,SAv to PDn,SAv,Etmr

*Oct 30 16:20:50.394: XC EVT[190.225.251.152:12]: FSM: Executing action Start Backup Enable Timer

*Oct 30 16:20:50.394: XC EVT[190.225.251.152:12]: FSM EV: Enable Tmr, CHG from PDn,SAv,Etmr to PDn,SUp

*Oct 30 16:20:50.394: XC EVT[190.225.251.152:12]: FSM: Executing action Activate Secondary Member

*Oct 30 16:20:50.398: Activating secondary member 190.225.251.151:23

*Oct 30 16:20:50.398: XC EVT[190.225.251.152:12]: Send event [Request Hot-Standby] to server

*Oct 30 16:20:50.402: XC EVT[190.225.251.152:12]: Received hot-standby complete event

*Oct 30 16:20:50.402: XC EVT[190.225.251.151:23]: Send event [Peer Status UP] to server

*Oct 30 16:20:50.406: XC EVT[190.225.251.151:23]: Received status indication UP, alarm 0

*Oct 30 16:20:50.406: XC EVT[190.225.251.151:23]: FSM EV: Sec Up, CHG from PDn,SUp to PDn,SUp

*Oct 30 16:20:50.406: XC EVT[190.225.251.151:23]: FSM: Executing action Ignore event, no action

*Oct 30 16:20:50.418: MPLS peer 190.225.251.151 vcid 23, VC UP, VC state UP

*Oct 30 16:20:50.418: XC L2TP: Failed to find session for peer 190.225.251.151, vcid 23

*Oct 30 16:20:50.418: XC EVT[190.225.251.151:23]: Received status indication UP, alarm 0

*Oct 30 16:20:50.418: XC EVT[190.225.251.151:23]: FSM EV: Sec Up, CHG from PDn,SUp to PDn,SUp

*Oct 30 16:20:50.418: XC EVT[190.225.251.151:23]: FSM: Executing action Ignore event, no action

Best regards,

Javier

@testzone_

Cisco Employee

Hi Javier, good to hear from you!

Yes, this convergence seems very much related to the MPLS hold time. So basically in this scenario you want PW convergence to happen and not PW re-routing, if I understand correctly. BFD would help here, but we don't run BFD on targeted LDP sessions. There are some alternatives in terms of BFD running against the IGP and BGP to trigger failover scenarios a bit faster.

In any case this particular situation is a bit hard to resolve via the support forums, and I would recommend opening a TAC case so we can investigate this and advise on the best approach.

regards

xander

New Member

Hi Xander!!
Continuing with the same test, when shutting down AGG-2, with GR disabled and some LDP parameters changed, the backup PW becomes active in approx. 1.5 sec.
The different LDP parameters are:

Session:
    Hold time: 15 sec
    Keepalive interval: 5 sec  (not configured)

Discovery:
    Link Hellos:     Holdtime:2 sec, Interval:1 sec
    Targeted Hellos: Holdtime:2 sec, Interval:1 sec
Graceful Restart:
    Disabled

I think it's difficult to reduce this time (1.5 sec) without BFD on targeted LDP sessions.

Regards.

Horacio Falcon

Cisco Employee

Hi Adrian,

yup this confirms that your convergence is bound to the hold time. BFD for targeted LDP I don't have a definitive roadmap on. I would recommend working with the TAC to see if there are other options for your design that you can leverage, and potentially to work on a more definitive timeline for BFD over T-LDP if needed.

regards!!

xander

New Member

Hi Xander,

I was wondering how does the FAT PW impact the scalability of the system based on the label allocation per flow? How do these labels consume internal resources? Could I face problems in regards to the number of labels available on the router for other L2/L3 VPN services?

Thanks in Advance!

Luis

Cisco Employee

Hi Luis,

great question, but no worries there! The flow label is merely a "dummy" label with a value computed based on inner packet info at the PE ends.

The scale limitation that many platforms have (including asr9k) is for the locally assigned labels, that is, the labels they can assign based on the locally determined paths (see the asr9000 route scale document for more detail).

This "dummy" label, or flow label stated in a more sophisticated manner, is given to the P routers so they can use it, as most platforms use the inner most label for the load-balancing determination. Since we provide a label on a per flow basis, we can "hope" that the P router makes different decisions based on the flow label that the Edge devices provide as part of that FAT PW.

The only "overhead" is on the edge devices who have to compute and insert that new label. For P routers it is a standard operation without any overhead.

xander

New Member

Hi Xander,

Excellent!

FAT PW is working pretty well at my customer, and we are now getting more efficient transport of both residential and corporate MPLS services across the MPLS domain.

Thanks for your explanation!

Best regards,

Luis Anzola

Hello,

Thanks for the great doc!

A few questions..

In all cases you describe the capabilities of the platform as a PE router

What happens in the case of ASR9K as P router?

I understand that packets of a single PW will be forwarded based on the bottom label, and that's OK if the packets already contain a flow label.

What happens if there is no flow label? Is there some kind of functionality like the "P Router Internal Load Balancing" on 7600?

Will you support different load balancing methods per bridge domain?

Currently, we can set the load balancing method (mac/ip) only under the L2VPN config, which affects all bridge domains.


Cisco Employee

P routers follow pretty much the same model:

If it can find IP info, it will use that balancing method. If there are more than 4 labels, or the inner payload is not IP, then we follow the label-based method.

If there is no flow label (as in FAT PW), then we effectively use the PW ID label (which is the innermost), or we use IP if the first nibble after the inner label is a 4 or 6.
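
The decision just described can be sketched in a few lines of Python. This is a simplified model, assuming only the two rules stated above (label count and first nibble of the payload); the function name and the sample bytes are made up for illustration.

```python
def hash_fields(labels, payload: bytes):
    """Return which fields a P router would hash on under the rules above."""
    first_nibble = payload[0] >> 4 if payload else 0
    if len(labels) <= 4 and first_nibble in (4, 6):
        return ("ip", first_nibble)   # payload is treated as IPv4 / IPv6
    return ("label", labels[-1])      # fall back to the bottom (innermost) label


stack = [16060, 206]                       # transport label, PW label
ipv4_packet = bytes([0x45, 0x00])          # a real IPv4 header starts with 0x45
eth_frame = bytes.fromhex("40a6e8000001")  # hypothetical dest MAC starting with 4
with_cw = bytes(4) + eth_frame             # control word: first nibble is 0

print(hash_fields(stack, ipv4_packet))  # ('ip', 4)
print(hash_fields(stack, eth_frame))    # ('ip', 4): an Ethernet frame misread as IP
print(hash_fields(stack, with_cw))      # ('label', 206)
```

The second case is the one to watch for: an Ethernet payload whose destination MAC begins with 4 or 6 looks like IP to this check unless a control word precedes it.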

xander

In the case of no flow label and PPPoE traffic in the payload, are you going to load balance based on mac addresses or based on the PW ID label?

Thanks

George

Cisco Employee

With PPPoE, there is first an Ethernet header and then IP.

We don't look beyond the first header after the inner label.

However, if the destination MAC starts with a 4 or a 6, you again have the issue described above, so if you're carrying PPPoE traffic you will want to use the control word and label-based balancing.

A PE node may insert a flow label to hash traffic from PPPoE sessions more intelligently, but that depends on the PE's capabilities.

xander

New Member

Hi, Xander

Does ASR9K nV load balancing behave differently, or is it the same as on a standalone ASR9K?

Thanks

Cisco Employee

Qingyan,

Load-balancing between 2 paths (or members of a bundle) that are on a satellite works the same way as described above. That's easy, right?

xander

New Member

Thank you,  Xander

That helps a lot !

If I have two Cisco ASR 9001s connected by a Bundle-Ether interface (3 x 1GE links), is it also possible to enable PIM, multicast, OSPF, MPLS and LDP on it? Is that like Case 3, routing through a bundle?

Thanks

regards

Cisco Employee

hi fernando,

Yup, no problem doing that at all.

The PIM and LDP sessions may each take any one member, but the actual traffic, MPLS-encapsulated or multicast, will be balanced according to the algorithms above.

xander

New Member

Xander, can you shed some light on the following?

I have the following connections, with a L2VC between PEs in order to pass IP over Ethernet traffic from CORE1 to CORE2.

CORE1 <=> PE (ASR9k) <=ECMP=> PE (7600) <=> CORE2

So it's like the following in terms of headers:

EtherHeader-Mpls(next hop label)-Mpls(pseudowire label)-etherheader-InnerIP


The l2vpn config on the ASR9k is the following:

l2vpn
load-balancing flow src-dst-ip
!
bridge group CORE
  bridge-domain CORE
   interface TenGigE0/1/0/0.2816                 <= AC
   !
   neighbor 10.201.201.9 pw-id 2816100002        <= PW
   !
  !

RP/0/RSP0/CPU0:ASR9k#sh l2vpn bridge-domain bd-name CORE det | i "Bala|PW:"

  Load Balance Hashing: src-dst-ip
    PW: neighbor 10.201.201.9, PW ID 2816100002, state is up ( established )
      Load Balance Hashing: src-dst-ip


RP/0/RSP0/CPU0:ASR9k#sh l2vpn bridge-domain bd-name CORE det | b "List of Access PWs:"

  List of Access PWs:
    PW: neighbor 10.201.201.9, PW ID 2816100002, state is up ( established )
      PW class not set, XC ID 0xc000002b
      Encapsulation MPLS, protocol LDP
      Source address 10.201.201.240
      PW type Ethernet, control word disabled, interworking none
      PW backup disable delay 0 sec
      Sequencing not set
      Load Balance Hashing: src-dst-ip


      PW Status TLV in use
        MPLS         Local                          Remote
        ------------ ------------------------------ ---------------------------
        Label        16042                          206
        Group ID     0x25                           0x0
        Interface    Access PW                      ** CORE **
        MTU          9200                           9200
        Control word disabled                       disabled
        PW type      Ethernet                       Ethernet
        VCCV CV type 0x2                            0x12
                     (LSP ping verification)        (LSP ping verification)
        VCCV CC type 0x6                            0x6
                     (router alert label)           (router alert label)
                     (TTL expiry)                   (TTL expiry)
        ------------ ------------------------------ ---------------------------

RP/0/RSP0/CPU0:ASR9k#sh cef 10.201.201.9
10.201.201.9/32, version 735, internal 0x4004001 (ptr 0xadab59b0) [1], 0x0 (0xad01834c), 0x440 (0xae47e050)
Updated Oct  3 02:54:59.320
remote adjacency to TenGigE0/1/0/3
Prefix Len 32, traffic index 0, precedence routine (0), priority 1
   via 10.201.10.98, TenGigE0/1/0/3, 12 dependencies, weight 0, class 0 [flags 0x0]
    path-idx 0 [0xae1f2504 0xae3e8110]
    next hop 10.201.10.98
    remote adjacency
     local label 16060      labels imposed {ImplNull}
   via 10.201.10.250, TenGigE0/2/0/2, 12 dependencies, weight 0, class 0 [flags 0x0]
    path-idx 1 [0xae1f30e0 0xae3e816c]
    next hop 10.201.10.250
    remote adjacency
     local label 16060      labels imposed {ImplNull}


RP/0/RSP0/CPU0:ASR9k#sh mpls forwarding prefix 10.201.201.9/32
Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes
Label  Label       or ID              Interface                    Switched
------ ----------- ------------------ ------------ --------------- ------------
16060  Pop         10.201.201.9/32    Te0/1/0/3    10.201.10.98    18255747189680
       Pop         10.201.201.9/32    Te0/2/0/2    10.201.10.250   1072700375625

I can see the traffic being load-balanced in the ASR9k => 7600 direction, but I cannot find the reason based on the above doc.

It looks as if load-balancing is happening based on InnerIP, but that's not supposed to work if I understand the above doc correctly.

Thanks,

Tassos

Cisco Employee

Hi Tassos, the router makes LB decisions on the ingress LC. On the PE you still have access to the IP fields on the AC; the hash is derived there, and when an LB decision is to be made, that pre-computed hash is used.

Also see Cisco Live presentation 2904, which has some good detail on load-balancing and some more use cases.

cheers

xander

New Member

Hi

Alexander Thuijs

Any idea on how to troubleshoot the FAT PW flow label?

I have a chain of 5 ASR9k all of them linked with Bundle interfaces each of 2xTenG.

PE(ASR9k)<=B=> P (ASR9k) <=B=> P (MX240) <=B=>P (ASR9k)<=B=>PE(ASR9k)

I have set up a FAT PW end to end, loaded with a mix of 200 source and destination IPs, and one of the P routers is not balancing in any direction.

All the rest are balancing as expected (2 PEs and 2 Ps). XR version 4.3.2, a mix of Trident and Typhoon cards and RSP8G, RSP440TR and RSP440SE.

I noticed that bundle-hash doesn't have an option for the flow label?

BR

Bozhidar

Cisco Employee

Hi Bozhidar,

which device in this chain is not loadbalancing as expected?

Remember that the PE that is imposing the FAT label will NOT use it for its load-balancing decision. So I think that PE-left, if traffic is left to right, may be the one that is not balancing it correctly.

This is because the PE imposition path computes the hash BEFORE the FAT label is inserted.

The Cisco Live preso 2904 referenced above discusses this in a bit more detail in its LB section.

regards

xander

New Member

Hi Alexander Thuijs

Thank you for the quick reply-

Actually it is the first P from left to right. What is more strange is that the first PE (left) is balancing, and the Ps after the first one (left to right) are balancing as well, so the flow label must be there...

This is for the traffic flow left to right. For the opposite direction I have the same situation: all devices along the path are balancing, and just this P again is not balancing when transmitting to the last PE.

PE(ASR9k)<=B=> P (NOT BALANCING in any direction) <=B=> P (MX240) <=B=>P (ASR9k)<=B=>PE(ASR9k)

Cisco Employee

That is interesting and I can't explain it!

Can you let us know which version is running on that device and which SMUs are installed? Also send me the (bundle) interface config to the left and right, and the MPLS config.

I also need the CEF outputs for the next hops out of those bundle interfaces to the connected devices, because maybe something is not right there.

Depending on that, we may have a config issue or a bug, in which case we might need a TAC case to continue down the bug path, but let's check those outputs first.

regards

xander

New Member

OK let's see what i have -

RP/0/RSP0/CPU0:ASR9K_P1-2#show install active

Fri Oct  4 19:16:18.667 EEST

Secure Domain Router: Owner

  Node 0/RSP0/CPU0 [RP] [SDR: Owner]

    Boot Device: disk0:

    Boot Image: /disk0/asr9k-os-mbi-4.3.2/0x100305/mbiasr9k-rsp3.vm

    Active Packages:

      disk0:asr9k-doc-px-4.3.2

      disk0:asr9k-fpd-px-4.3.2

      disk0:asr9k-k9sec-px-4.3.2

      disk0:asr9k-mcast-px-4.3.2

      disk0:asr9k-mgbl-px-4.3.2

      disk0:asr9k-mini-px-4.3.2

      disk0:asr9k-mpls-px-4.3.2

      disk0:asr9k-optic-px-4.3.2

      disk0:asr9k-services-px-4.3.2

  Node 0/0/CPU0 [LC] [SDR: Owner]

    Boot Device: mem:

    Boot Image: /disk0/asr9k-os-mbi-4.3.2/lc/mbiasr9k-lc.vm

    Active Packages:

      disk0:asr9k-mcast-px-4.3.2

      disk0:asr9k-mini-px-4.3.2

      disk0:asr9k-mpls-px-4.3.2

      disk0:asr9k-optic-px-4.3.2

      disk0:asr9k-services-px-4.3.2

RP/0/RSP0/CPU0:ASR9K_P1-2#show platform 

Fri Oct  4 19:16:35.682 EEST

Node            Type                      State            Config State

-----------------------------------------------------------------------------

0/RSP0/CPU0     A9K-RSP440-SE(Active)     IOS XR RUN       PWR,NSHUT,MON

0/0/CPU0        A9K-8T-L                  IOS XR RUN       PWR,NSHUT,MON

RP/0/RSP0/CPU0:ASR9K_P1-2#

interface Bundle-Ether1

description Bundle to Right

mtu 9192

ipv4 address 10.30.0.5 255.255.255.252

load-interval 30

!

RP/0/RSP0/CPU0:ASR9K_P1-2#sh run int bundle-ether 2

Fri Oct  4 19:16:55.716 EEST

interface Bundle-Ether2

description Bundle to Left

mtu 9192

ipv4 address 10.30.0.1 255.255.255.252

load-interval 30

Fri Oct  4 19:17:26.309 EEST

mpls ldp

router-id 10.11.0.2

nsr

graceful-restart

graceful-restart reconnect-timeout 60

graceful-restart forwarding-state-holdtime 180

session protection

neighbor password encrypted 05061603320142081B

igp sync delay 10

log

  neighbor

  session-protection

  nsr

!

mldp

!

interface Bundle-Ether1

!

interface Bundle-Ether2

!

interface TenGigE0/0/0/5

!

!

mpls oam

!

RP/0/RSP0/CPU0:ASR9K_P1-2#show mpls forwarding

Fri Oct  4 19:17:51.201 EEST

Local  Outgoing    Prefix             Outgoing     Next Hop        Bytes      

Label  Label       or ID              Interface                    Switched   

------ ----------- ------------------ ------------ --------------- ------------

16000  16017       MLDP LSM ID: 0x1   BE2          10.30.0.2       976675330  

       300304      MLDP LSM ID: 0x1   BE1          10.30.0.6       360907262  

16001  16006       10.11.0.1/32       BE2          10.30.0.2       0          

16002  Pop         10.21.0.4/32       BE2          10.30.0.2       404236741030

16003  16022       10.21.0.10/32      BE2          10.30.0.2       1181761    

16004  Pop         10.30.0.8/30       BE2          10.30.0.2       0          

16005  Pop         10.30.0.100/31     BE2          10.30.0.2       0          

16006  Pop         40.0.0.0/22        BE2          10.30.0.2       0          

16007  300240      10.11.0.3/32       BE1          10.30.0.6       0          

16008  Pop         10.21.0.2/32       BE1          10.30.0.6       541225     

16009  300288      10.21.0.3/32       BE1          10.30.0.6       854473     

16010  300256      10.21.0.5/32       BE1          10.30.0.6       277329194115

16011  300272      100.2.0.0/24       BE1          10.30.0.6       0          

16012  Unlabelled  10.30.0.16/30      BE1          10.30.0.6       0          

16013  300240      10.30.0.24/30      BE1          10.30.0.6       0          

16014  300240      10.30.0.200/31     BE1          10.30.0.6       0          

16015  16019       MLDP LSM ID: 0x2   BE2          10.30.0.2       16434      

16016  300320      MLDP LSM ID: 0x3   BE1          10.30.0.6       3168541638 

16017  Aggregate   ZTE: Per-VRF Aggr[V]   \

                                      ZTE                          510300     

RP/0/RSP0/CPU0:ASR9K_P1-2#show cef 10.30.0.2      

Fri Oct  4 19:18:09.701 EEST

10.30.0.0/30, version 5, attached, connected, glean adjacency, internal 0xc0000c1 (ptr 0x71bf03c0) [1], 0x0 (0x7140c690), 0x0 (0x0)

Updated Oct  4 14:03:34.289

Prefix Len 30, traffic index 0, precedence n/a, priority 0

   via Bundle-Ether2, 2 dependencies, weight 0, class 0 [flags 0x8]

    path-idx 0 [0x70f143ec 0x0]

     glean adjacency

RP/0/RSP0/CPU0:ASR9K_P1-2#show cef 10.30.0.6

Fri Oct  4 19:18:11.344 EEST

10.30.0.4/30, version 7, attached, connected, glean adjacency, internal 0xc0000c1 (ptr 0x71bf0570) [1], 0x0 (0x7140c730), 0x0 (0x0)

Updated Oct  4 14:03:35.593

Prefix Len 30, traffic index 0, precedence n/a, priority 0

   via Bundle-Ether1, 2 dependencies, weight 0, class 0 [flags 0x8]

    path-idx 0 [0x70f14440 0x0]

     glean adjacency

RP/0/RSP0/CPU0:ASR9K_P1-2#

Quite an output. Everything looks OK to me, but look at the traffic:

Bundle2 left

RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/0 | i rate

Fri Oct  4 19:18:51.712 EEST

  30 second input rate 260301000 bits/sec, 37360 packets/sec

  30 second output rate 508151000 bits/sec, 73532 packets/sec

RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/1 | i rate

Fri Oct  4 19:18:54.396 EEST

  30 second input rate 252685000 bits/sec, 36400 packets/sec

  30 second output rate 1000 bits/sec, 1 packets/sec

Bundle1 right

RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/3 | i rate

Fri Oct  4 19:19:03.392 EEST

  30 second input rate 247453000 bits/sec, 35663 packets/sec

  30 second output rate 512793000 bits/sec, 73745 packets/sec

  RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/4 | i rate

Fri Oct  4 19:19:05.811 EEST

  30 second input rate 263007000 bits/sec, 37870 packets/sec

  30 second output rate 1000 bits/sec, 2 packets/sec

RP/0/RSP0/CPU0:ASR9K_P1-2#

Cisco Employee

Thanks for that detail. I am thinking of something: do you happen to have a MAC address that starts with a 4 or 6, by any chance? You may want to try adding the control word to the PW to make sure we are looking at the FAT label instead of the (perceived) IP info in the payload.

Another thing: is this the only device you have with a Trident bundle, or do the other devices have a Trident card also?

I am most interested in the hardware config of the asr9k-p on the right.

regards

xander

New Member

I was thinking about this, but no, my IXIA is pushing only MACs starting with 23.23.x.x.x.x.

I will try the control word trick and let you know anyhow. Now that I double-check, the other one with a Trident bundle is the far-right PE, with the same card but a different RSP (RSP8G). The device to the right of the problematic P is actually a Juniper MX240.

Because it's working on all other nodes, I was sure that I was missing some configuration on this problematic P, but there is actually nothing special to configure, which is why I am so frustrated.

New Member

Confirmed CW didn't change behaviour

Cisco Employee

Yeah, if the MAC doesn't start with 4 or 6, then the CW won't help. There is obviously incorrect balancing happening on your PE left. The RP version or type should have no bearing on it, as it is the hardware that computes the hash and that is the same for both PEs.

Also, it is the outbound LB that is incorrect, so the problem is local; if it were inbound, I could have deflected it to the Juniper.

Although the RP has no direct bearing on the hardware forwarding, it could be a programming issue, but that sounds odd also.

Can you do this for me please:

RP/0/RSP0/CPU0:A9K-BNG#bundle-hash bundle-e100 loc 0/0/cPU0

Calculate Bundle-Hash for L2 or L3 or sub-int based: 2/3/4 [3]:

Enter traffic type (1.IPv4-inbound, 2.MPLS-inbound, 3:IPv6-inbound): [1]: 2

Number of ingress MPLS labels is 4 or less: y/n [y]: y

Enter MPLS payload type (1.IPv4, 2:IPv6, 3:other): [1]: 3

Enter the bottom label in decimal (20-bit value) :2

Link hashed [hash:199] to is GigabitEthernet0/0/0/19 ICL () LON 1 ifh 0x4000700

Another? [y]:

Enter the bottom label in decimal (20-bit value) :3

Link hashed [hash:200] to is GigabitEthernet0/0/0/9 ICL () LON 0 ifh 0x4000480

This command is currently broken in some ways (the member displayed is not necessarily the member actually chosen), but it should give us an impression of whether it *can* balance on the label or not.

With all this detail captured, I would recommend filing a TAC case, as this needs to be fixed.

If you happen to have a spare Typhoon card, it would be great if you could swap it in and see if that makes a difference, and likewise for the RSP type. I realize I am asking a lot, but if it is easy to do, it would be great additional detail that we can use to narrow down the precise issue and complete the DDTS filing.

regards

xander

New Member

Hey,

10x I hope I got you right -

RP/0/RSP0/CPU0:ASR9K_P1-2#bundle-hash bundle-ether 1 location 0/0/CPU0

Fri Oct  4 21:01:35.116 EEST

Calculate Bundle-Hash for L2 or L3 or sub-int based: 2/3/4 [3]: 3

Enter traffic type (1.IPv4-inbound, 2.MPLS-inbound, 3:IPv6-inbound): [1]: 2

Number of ingress MPLS labels is 4 or less: y/n [y]: y

Enter MPLS payload type (1.IPv4, 2:IPv6, 3:other): [1]: 3

Enter the bottom label in decimal (20-bit value) :2

Link hashed [hash:57] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :3

Link hashed [hash:59] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Cisco Employee

yup you understood the concept of what I was after, but do this for a variety of numbers and see if the control plane also picks the other member

xander

New Member

Yep I tried it many times i always see one interface -

RP/0/RSP0/CPU0:ASR9K_P1-2#bundle-hash bundle-ether 1 location 0/0/CPU0  DIRECTION RIGHT

Sat Oct  5 15:10:30.427 EEST

Calculate Bundle-Hash for L2 or L3 or sub-int based: 2/3/4 [3]:

Enter traffic type (1.IPv4-inbound, 2.MPLS-inbound, 3:IPv6-inbound): [1]: 2

Number of ingress MPLS labels is 4 or less: y/n [y]:

Enter MPLS payload type (1.IPv4, 2:IPv6, 3:other): [1]: 3

Enter the bottom label in decimal (20-bit value) :5

Link hashed [hash:63] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :6

Link hashed [hash:65] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :7

Link hashed [hash:67] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :88

Link hashed [hash:229] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :34

Link hashed [hash:121] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :2345

Link hashed [hash:135] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :2345

Link hashed [hash:135] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]:

Enter the bottom label in decimal (20-bit value) :3242356

Invalid label. Label value range is 0-1048575

Another? [y]:

Enter the bottom label in decimal (20-bit value) :4324

Link hashed [hash:253] to is TenGigE0/0/0/4 ICL () LON 1 ifh 0x240

Another? [y]: 345

What is even more strange is that actually the traffic goes via the other interface of the bundle 0/0/0/3

RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/3 | i rate

Sat Oct  5 15:15:56.810 EEST

  30 second input rate 1000 bits/sec, 1 packets/sec

  30 second output rate 515656000 bits/sec, 73763 packets/sec

RP/0/RSP0/CPU0:ASR9K_P1-2#sh int tenGigE 0/0/0/4 | i rate

Sat Oct  5 15:16:00.493 EEST

  30 second input rate 2000 bits/sec, 3 packets/sec

  30 second output rate 1000 bits/sec, 2 packets/sec

Same for the other direction, left towards the PE. Traffic goes out of 0/0/0/0, but the script always gives me 0/0/0/1.

What is the best way to proceed? Is this enough information to open a TAC case and file a defect?

Off this subject but still within this topic: do you have more information on the logic behind the hash algorithm used to calculate the flow label, and on the function the P routers then use to make the switching decision on the bundle? I am asking because I have played a lot with the IXIA and noticed that the fewer flows I have (SRC and DST pairs), the less evenly the traffic is divided between the ports. Mainly I am interested in whether the flow size (bandwidth) is somehow included in the calculation. My guess is that it is not, but...

Cisco Employee

It almost makes me believe that there is something wrong with the member! Is it distributing?

Check show bundle bundle-e <number> and show lacp bundle-e <number> to see if it is properly enabled in the bundle, because it looks like the control plane doesn't see that other member as active...

The hash algorithm is a CRC calculation based on L3 and L4 info (if the protocol is TCP or UDP), with the router ID added (to get some per-node variation). The different schemes (ECMP, bundle, etc.) look at different bits for the bucket assignments.

The available buckets are divided over the members (bundle) or paths (ECMP).

Actual bandwidth is not included (we currently require equal-cost paths or members), although UCMP is also available (see the related article on that).
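
As a rough illustration of that description, here is a minimal Python sketch, not the actual hardware algorithm: CRC32 stands in for the platform's CRC, the bucket count of 256 is an assumption, and the function and parameter names are made up.

```python
import zlib


def pick_member(members, src_ip, dst_ip, router_id, ports=None, buckets=256):
    """Hash L3 (and optionally L4) info plus the router ID, then map to a member."""
    key = f"{src_ip}|{dst_ip}|{router_id}"
    if ports is not None:            # L4 ports only when the protocol is TCP or UDP
        key += f"|{ports[0]}|{ports[1]}"
    bucket = zlib.crc32(key.encode()) % buckets
    # Buckets are spread evenly over the members; flow bandwidth plays no part.
    return members[bucket % len(members)]


members = ["Te0/0/0/3", "Te0/0/0/4"]
chosen = {
    pick_member(members, "10.0.0.1", f"10.0.1.{i}", "10.11.0.2")
    for i in range(200)
}
```

A given flow always maps to the same member, and with only a handful of flows the split can be quite uneven, since each flow lands wholly on one member regardless of its rate.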

regards

xander

New Member

Yep it is!  If I shut down the interface used now the traffic goes over the other one with no problems -

RP/0/RSP0/CPU0:ASR9K_P1-2#show bundle bundle-ether 2

Sat Oct  5 18:29:57.757 EEST

Bundle-Ether2

  Status:                                    Up

  Local links <active/standby/configured>:   2 / 0 / 2

  Local bandwidth <effective/available>:     20000000 (20000000) kbps

  MAC address (source):                      d867.d95f.5a6a (Chassis pool)

  Inter-chassis link:                        No

  Minimum active links / bandwidth:          1 / 1 kbps

  Maximum active links:                      64

  Wait while timer:                          2000 ms

  Load balancing:                            Default

  LACP:                                      Operational

    Flap suppression timer:                  Off

    Cisco extensions:                        Disabled

  mLACP:                                     Not configured

  IPv4 BFD:                                  Not configured

  Port                  Device           State        Port ID         B/W, kbps

  --------------------  ---------------  -----------  --------------  ----------

  Te0/0/0/0             Local            Active       0x8000, 0x0001    10000000

      Link is Active

  Te0/0/0/1             Local            Active       0x8000, 0x0002    10000000

      Link is Active

RP/0/RSP0/CPU0:ASR9K_P1-2#show lacp bundle-ether 2 

Sat Oct  5 18:30:06.267 EEST

State: a - Port is marked as Aggregatable.

       s - Port is Synchronized with peer.

       c - Port is marked as Collecting.

       d - Port is marked as Distributing.

       A - Device is in Active mode.

       F - Device requests PDUs from the peer at fast rate.

       D - Port is using default values for partner information.

       E - Information about partner has expired.

Bundle-Ether2

  Port          (rate)  State    Port ID       Key    System ID

  --------------------  -------- ------------- ------ ------------------------

Local

  Te0/0/0/0         1s  ascdAF-- 0x8000,0x0001 0x0002 0x8000,d8-67-d9-5f-5a-6c

   Partner          1s  ascdAF-- 0x8000,0x0001 0x0001 0x8000,b4-a4-e3-93-44-5c

  Te0/0/0/1         1s  ascdAF-- 0x8000,0x0002 0x0002 0x8000,d8-67-d9-5f-5a-6c

   Partner          1s  ascdAF-- 0x8000,0x0002 0x0001 0x8000,b4-a4-e3-93-44-5c

  Port                  Receive    Period Selection  Mux       A Churn P Churn

  --------------------  ---------- ------ ---------- --------- ------- -------

Local

  Te0/0/0/0             Current    Fast   Selected   Distrib   None    None  

  Te0/0/0/1             Current    Fast   Selected   Distrib   None    None  

RP/0/RSP0/CPU0:ASR9K_P1-2#

Cisco Employee

thanks again for all these tests and verifications.

OK, please file a TAC case and collect a show tech bundle along with the info that we have gathered so far.

To see if we can restore the expected behavior, you can try a process restart of the bundle manager, and removing and reconfiguring the Bundle-Ether that is currently at fault.

There is a bug, that is for sure. This detail and the show tech will help with a post-mortem in case either of the 2 approaches "fixes" the current issue.

regards

xander

New Member

Hi Alexander,

Great tech session (BRKSPG2904). I watched the complete video and learned many new things. Keep up the good work!

Now back to my topic of Trident not balancing when we have MPLS + non-IP. As advised by TAC, we tried:

cef load-balancing algorithm adjust <value>

I used 1 and, amazingly, this made it work!

You can imagine my frustration when the TAC engineer said that this is normal and the issue was polarization.

I have thought a lot about this and my conclusion is as follows; please correct me if I am wrong.

Hash result shifting should affect the balancing when many nodes in a chain make the same decision, i.e. all of them taking the left link, for example.

My node was not balancing at all, so shifting the result by one should not make a difference.

The only thing that could explain it is if the hash shifting were changing in a round-robin fashion (for example, shift for the first flow, don't shift for the second, etc.), but then we can't guarantee that all packets of a specific flow take the same outgoing interface, which is not a good idea. The only other possible reason for the node not balancing is that the platform by default was looking at a bit or bits that do not change across all the flows (for example, the UDP port number), so it was not balancing until we shifted the hash result, forcing it to use bits of the hash result that actually change. If this is the case, then maybe this should be fixed?

BR

B.

Cisco Employee

I didn't realize that recording was posted also. Good to hear you enjoyed it!

And also nice to see that your issue is resolved.

The general use case for the hash shift command is to prevent polarization when you have chains of devices that use the same hash calculation approach. The hash shift makes the load-balancer look at different bits (still the same position, but since they are shifted the bit values are "hopefully" different). If we use "random shifts" on different devices, we can try to balance more effectively.

It seems that your flow distribution did not have enough entropy variation to produce sufficient randomness, and that resulted in polarization. I can see why the hash shift resolves it, but at the same time I am surprised it was already necessary in this scenario.
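
Polarization and the effect of a shift can be shown with a small Python model. This is a sketch under stated assumptions: CRC32 stands in for the platform hash, the shift is modeled as a bit rotation, and both nodes have 2 links; none of this is taken from the actual implementation.

```python
import zlib


def select(flow: str, n_links: int, shift: int = 0) -> int:
    """Pick a link from a flow hash, optionally rotating the hash right first."""
    h = zlib.crc32(flow.encode())
    h = ((h >> shift) | (h << (32 - shift))) & 0xFFFFFFFF
    return h % n_links


flows = [f"flow-{i}" for i in range(200)]

# Node 1 sends these flows out of its link 0 towards node 2.
to_node2 = [f for f in flows if select(f, 2) == 0]

# Node 2 uses the same hash and the same bits: every flow picks link 0 again.
polarized = {select(f, 2) for f in to_node2}

# With a shift of 1, node 2 looks at a different bit of the same hash.
shifted = {select(f, 2, shift=1) for f in to_node2}

print(polarized)  # {0}
```

The first node has already filtered the flows on the bit that the second node would otherwise use, which is why the unshifted second node has nothing left to split on.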

xander

New Member

Hi Alexander,

I see that you updated the document, but unfortunately the document comparison says: "The document body was too large to do a version comparison".

Can you briefly note what the change was? The document is really too big to spot it by eye.

BR

Bozhidar

Cisco Employee

hi bhozidar, ah thanks for noticing! that is correct, I added the section on:

FAT Pseudowire TLV

cheers

xander

New Member

10x

cheers

b.

New Member

Hi Alexander,

I have another tricky setup with loadbalancing.

I have an NG-MVPN with P2MP-TE, BGP-AD and BGP c-mcast routing, used for IPTV. What I see is that on the edge router where the multicast traffic is sourced (the MTE tunnel headend), the mcast traffic is balanced on the bundle interface towards the core. The problem is that on the next hop, i.e. the core-to-core bundle interface, the traffic is not balanced anymore. So the question is: is this normal and expected behaviour, or do I need to add something to my configuration to start balancing the multicast traffic?

I have tried the bundle-hash command, but it looks like it does not accept an mcast address as a destination:

Enter destination IPv4 address [255.255.255.255]:  236.5.14.24

Invalid destination address

Another? [y]: n


BR

Bozhidar

Cisco Employee

hi Bozhidar,

Have you configured the right multicast load-balancing option on the router, so that it uses the source/next-hop and/or group?

It may also help to use the hash shift command, which affects the bundle hash selection as well.

cheers!

xander

New Member

Guys you are getting faster than TAC ;o)

I received your update almost 12 hours ahead of the TAC reply to my case regarding exactly the same question:

"

P2MP MPLS TE Tunnels

Only one bundle member will be selected to forward traffic on the P2MP MPLS TE mid-point node.

Possible alternatives that would achieve better load balancing are: a) increase the number of tunnels or b) switch to mLDP.

"

Keep up the good work.

BR

B.

New Member

Hello,

Great article, very useful. Thanks for writing it.

Just one question. If I wanted to trace the actual path of a pseudowire over MPLS using flow labels, how would I go about it? On the PE, do I just use sh mpls forwarding exact-route label ... and include the payload IPv4 fields (src/dst)? Will that command show me the correct outgoing interface by calculating the flow label?

How about on the P routers? Wouldn't I need to know the flow label for that as well? How could I generate it in the CLI?

Thank you.

Cisco Employee

Hi tomek0001, thanks for the comment!

The flow label is computed at the PE edge devices, so P routers just look at the inner label (if what follows the inner label is a 0, or anything other than a 4 or a 6). You could therefore use the show mpls forwarding command, providing the next-hop label and the bottom label that is used for load-balancing.

The flow label is computed from L3 and potentially L4 info, with the RID fed in as well. On the PE edge device, show mpls forwarding exact-route can also be used by providing a "self-computed" flow label, or you can provide IP info, but that doesn't necessarily reflect the computed flow label.

A traceroute for MPLS would not take the flow label into consideration as far as I am aware, so other than an actual traffic stream there is little that would reflect the "reality".

You can do a multipath traceroute to verify and identify multipaths in MPLS, but there are no easy tools available that would allow you to follow a path based on flow labels.

regards

xander

New Member

Thank you for your response. I just wanted to clarify my questions little more and expand on it.

1. One thing I found confusing in the article is the requirement of the control word. Do you still require it when using flow labels? When you enable flow labels, is the detection of IPv4/IPv6 header disabled? So even if you are using flow labels with mac addresses that start with 4 or 6, routers will only look at the bottom of the stack label and won't try to detect if it's an IPv4 header? If that's not the case do you still have to enable control word when using flow labels?

2. I'm currently troubleshooting a situation with ECMP and Bundle interfaces that are causing out of order delivery. Outside of the issue of not using the control word and having MAC addresses that start with 4/6, have you seen anything that could cause out of order delivery with pseudowires, IP load sharing or bundle hashing?

3. I understand the traceroute would not work, but I was looking for a manual hop by hop way of tracing one particular flow encapsulated in a pseudowire. This is what my original question was mostly about. I was trying to use the command "sh mpls forwarding exact-route label 16004 bottom-label ?" on both Ps and PEs, but in this case how do I know what the bottom label is, i.e. the flow label value that's calculated? Once I know it, I should be able to go manually through each MPLS device and find out the exact path for a particular pseudowire encapsulating a particular src/dst IPv4 packet.

Thank you again for your help.

Tom

Cisco Employee

hi tom,

CW and flow label solve different issues:

CW is used to prevent a device from interpreting a 4 or 6 after the inner label as IP.

Flow label is used to provide a new entropy label to do load-balancing on, as opposed to the next inner one (which is usually the PE next hop or PW label and rather static for a point to point connection).

If the inner payload is truly IP, then the CW makes the device no longer interpret the payload as IP, hence it starts to LB on the PW label. Flow label then provides the granularity of a more per flow based balancing.
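To make the difference between the CW and the flow label concrete, here is a minimal Python sketch of the choice a transit router makes as described above. It is only an illustration, not the actual microcode: the function name `hash_input`, the PW label 24001, the flow label 1048000 and the sample MAC address are made up for the example.

```python
def hash_input(label_stack, payload):
    """Return what a transit (P) router would load-balance on.

    label_stack: MPLS label values, outermost first (last one is bottom of stack).
    payload: the bytes that follow the bottom label.
    """
    first_nibble = payload[0] >> 4
    if first_nibble == 4:
        return "ipv4"   # payload is treated as an IPv4 header
    if first_nibble == 6:
        return "ipv6"   # payload is treated as an IPv6 header
    # Anything else (for example the 0 nibble of a control word):
    # fall back to the bottom label.
    return f"label:{label_stack[-1]}"


CONTROL_WORD = bytes(4)  # first nibble is 0
# Ethernet frame whose DMAC happens to start with 4 (hypothetical address)
frame = bytes.fromhex("4c0000000001") + bytes.fromhex("00aabbccddee") + b"\x08\x00"

TRANSPORT, PW_LABEL, FLOW_LABEL = 16004, 24001, 1048000  # example values

no_cw = hash_input([TRANSPORT, PW_LABEL], frame)
with_cw = hash_input([TRANSPORT, PW_LABEL], CONTROL_WORD + frame)
with_cw_and_fl = hash_input([TRANSPORT, PW_LABEL, FLOW_LABEL], CONTROL_WORD + frame)

print(no_cw)           # ipv4 (L2 payload mistaken for IP)
print(with_cw)         # label:24001 (static PW label, one path per PW)
print(with_cw_and_fl)  # label:1048000 (per-flow entropy)
```

Without the CW the frame is mistaken for IPv4, with the CW the router falls back to the static PW label, and with the flow label on top of that it gets a per-flow value to balance on.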

The out of order delivery I have seen happening was mainly because of the DMAC starting with 4 or 6 in PWs.

The bottom label is the PW label. It is visible from the PW connection or the CEF rewrite string.

regards

xander

New Member

Hi alexander.

If I have 2 ASR 9010 in an nV Edge topology with a Bundle-Ether to the core and to the access (one TenGig interface to each ASR chassis from the CRS in core and access), how does the load-balancing work? Is the locality feature useful in this scenario?

CRS_core <==bundle_ether_ipv4==> asr nv Edge <==bundle_ether_ipv4==> CRS_access

Cisco Employee

hi guillermo,

In a cluster you will always benefit from the rack locality feature for bundles. This is because traffic received from the core link is then forwarded to the local member of the bundle, which prevents it from being sent over the IRL and hence limits traffic over that IRL.

It then relies on "proper" load-balancing from the upstream, which you can control via IGP or BGP if necessary.

In short, you always want to limit the IRL usage.

So, short answer: yes, rack locality will be a benefit in your setup.
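As a rough illustration of the rack locality idea, here is a small Python sketch, assuming a two-rack cluster with one bundle member per rack. The function `pick_member`, the interface names and the use of CRC32 as the hash are placeholders for the example, not the platform's real implementation.

```python
import zlib


def pick_member(members, flow_key, ingress_rack, rack_locality=True):
    """Pick a bundle member for a flow.

    members: list of (interface_name, rack) tuples.
    With rack locality, members on the ingress rack are preferred,
    so the packet does not have to cross the IRL.
    """
    candidates = members
    if rack_locality:
        local = [m for m in members if m[1] == ingress_rack]
        if local:  # fall back to all members if no local one exists
            candidates = local
    return candidates[zlib.crc32(flow_key) % len(candidates)]


# Hypothetical bundle: one TenGig member on each rack
bundle = [("Te0/0/0/0", 0), ("Te1/0/0/0", 1)]
flows = [f"flow-{i}".encode() for i in range(10)]

with_locality = {pick_member(bundle, f, ingress_rack=0)[0] for f in flows}
without_locality = {
    pick_member(bundle, f, ingress_rack=0, rack_locality=False)[0] for f in flows
}

print(with_locality)     # only the rack 0 member is used
print(without_locality)  # both members are used, so some flows cross the IRL
```

With locality enabled, the spread across the two racks depends entirely on how the upstream device balances over its own bundle members.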

xander

Cisco Employee

Hey Xander,

 

A quick query, as simple and stupid as it may sound. So the "load-balancing flow" CLI (and flow-label of course) tweaks the flow label creation, which will be used by P routers. What are the ways to load-balance at the PE routers, say if we have 2 PWs between back to back routers and there's no P router in between?

Cisco Employee

hey rajat! this is a good question.

So here is how it works: whenever a packet comes in, one of the first things that happens is the calculation of the hash. Because we know that the PW is the egress, we are by default going to use the label. The flow label is only added later on, after the initial hash is computed, so it can't be used for LB on the PE device for ECMP or bundle as an egress core interface.

It is something I am looking at enhancing, but the ingress path is already heavily loaded and the more clauses are added the more pps it costs, so we need to do some work unfortunately.

If you're interested in the LB stuff in more detail, besides the above, also check Cisco Live 2904 from this year in San Francisco. I have a few slides on the hashing.
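The ordering can be sketched in a few lines of Python. This is only a simplified model under stated assumptions: `pe_ingress` is a made-up name, CRC32 stands in for the real hash, the label and addresses are example values, and reserved label values are not handled.

```python
import zlib


def pe_ingress(pw_label, src_ip, dst_ip, n_core_paths):
    """Model the order of operations on the ingress PE."""
    # Step 1: the hash is computed early; since the egress is a PW,
    # it is based on the PW label only.
    path = zlib.crc32(pw_label.to_bytes(4, "big")) % n_core_paths

    # Step 2: the flow label is derived later from the payload fields and
    # pushed at the bottom of the stack. It did not exist in step 1, so it
    # cannot influence the path chosen on this PE.
    flow_label = zlib.crc32(f"{src_ip}>{dst_ip}".encode()) & 0xFFFFF  # 20 bits

    return path, flow_label


path_a, fl_a = pe_ingress(24001, "10.0.0.1", "10.0.0.2", n_core_paths=4)
path_b, fl_b = pe_ingress(24001, "10.0.0.1", "10.0.0.3", n_core_paths=4)

print(path_a == path_b)  # True: same PW, same egress path on the PE
print(fl_a != fl_b)      # True: different flow labels for downstream P routers
```

Two flows inside the same PW leave the PE on the same core path, while downstream P routers can still spread them using the differing flow labels.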

regards

xander

New Member

Hello Xander, 

>>"The flow label is only added later on after the initial hash is computed so it cant be used for LB on the PE device for ECMP or bundle as an egress core interface."

This is how it works with any other router (computing the hash, then assigning flow labels accordingly), so I can't see why it won't work here. It worked for me on other platforms (ASR1K and Juniper MX). Your elaboration would be much appreciated.

One last question: is this applicable to all XR platforms? I mean, will the CRS behave in the same way?

Thanks in advance.

Ramy