Re: Fastest possible convergence for ibgp

John Blakley · 10-18-2010

All,

What's the fastest convergence time that you've seen using iBGP? I have an L3 switch with two routers connected to it. The routers are configured with HSRP, and the primary automatically fails over if the serial side goes down. The problem is that the inside interface is still responding. I've used conditional advertisement to stop advertising the routes from the primary if the serial-side subnet is not in the routing table. This works, but it seems to take a long time to fail over. I've made a few changes that I haven't been able to test yet, so I may have solved my problem, but I'm wondering what the absolute fastest time could be.

Thanks,

John

HTH, John *** Please rate all useful posts ***

Calin C. · 10-26-2010

Hello John,

What timers do you have set for BGP? I'm asking mainly about the hold time (the dead interval), as you don't want your BGP router to wait too long for a reply from its neighbor.

Second, with HSRP, I'd suggest using interface tracking. That way HSRP reacts immediately when your serial interface goes down.

Here is a good example about HSRP together with interface track:

http://www.cisco.com/en/US/tech/tk648/tk362/technologies_tech_note09186a0080094e8c.shtml
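A minimal sketch of HSRP with interface tracking, as suggested above. The interface names, addresses, group number, and priority values are assumptions for illustration only and do not come from John's setup:

```
! Primary router (hypothetical interfaces and addresses)
interface FastEthernet0/0
 ip address 192.168.1.2 255.255.255.0
 standby 1 ip 192.168.1.1
 standby 1 priority 110
 standby 1 preempt
 ! Drop the priority by 20 (110 -> 90) when Serial0/0 goes down
 standby 1 track Serial0/0 20
```

For the backup to take over, it would need `standby 1 preempt` configured and a priority between the decremented and normal values (the default of 100 works with the numbers above).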

Let me know if this is useful!

Cheers,

Calin

John Blakley · 11-08-2010

Okay, I did my test this weekend, so I'd like to revisit this in more detail. First of all, I'm using SLA to track my peer by pinging. That's not even the real issue though. Here's what I have:

RouterA     RouterB
     \       /
    Core Switch

RouterA:

router bgp 1
 network 192.168.1.0
 network 192.168.2.0
 neighbor GO-Internal peer-group
 neighbor GO-Internal remote-as 65101
 neighbor GO-Internal timers 1 3
 neighbor GO-Internal fall-over
 neighbor GO-Internal next-hop-self
 neighbor GO-Internal advertise-map PERMITROUTES exist-map MustExist
 neighbor Router-B peer-group GO-Internal
 neighbor CoreSwitch peer-group GO-Internal

RouterB:

router bgp 1
 network 192.168.1.0
 network 192.168.2.0
 neighbor GO-Internal peer-group
 neighbor GO-Internal remote-as 65101
 neighbor GO-Internal timers 1 3
 neighbor GO-Internal fall-over
 neighbor GO-Internal next-hop-self
 neighbor Router-A remote-as 65101
 neighbor Router-A timers 1 3
 neighbor Router-A fall-over
 neighbor Router-A next-hop-self
 neighbor CoreSwitch peer-group GO-Internal

Core Switch:

router bgp 1
 neighbor Router-A remote-as 65101
 neighbor Router-A transport path-mtu-discovery
 neighbor Router-A timers 1 3
 neighbor Router-A fall-over
 neighbor Router-A weight 45000
 neighbor Router-B remote-as 65101
 neighbor Router-B transport path-mtu-discovery
 neighbor Router-B timers 1 3
 neighbor Router-B fall-over
 neighbor Router-B weight 43000

Core Switch bgp table:

*>i10.125.6.0/24    Router-A            0    100  45000 13979 65006 i
* i                 Router-B            0    100  43000 65027 65001 13979 65006 i
*>i10.125.7.0/24    Router-A            0    100  45000 13979 65007 i
* i                 Router-B            0    100  43000 65027 65001 13979 65007 i

As you can see, I have Router-A (.2) as the primary and Router-B (.3) as the backup. When I completely power off .2, it takes about 30 to 45 seconds for .3 to pick up. Can I speed this up? My timers are set as low as they'll go and I have each peer set to fall-over, so I'm not sure what else I can do.

Thanks,

John


hbruyere · 11-08-2010

Hello!

Since you don't use update-source loopback, the switch probably sees the BGP peers as directly connected on a VLAN interface. That's why fall-over does nothing: the VLAN interface stays up when the router is powered off, so the connected route to the peer IP never leaves the routing table.

For fall-over to be useful, you would need to use update-source loopback and advertise the loopbacks with a dynamic routing protocol that has short timers.

With this the switch would lose quickly the route to the peer ip (the loopback) and fall-over would bring down the BGP peer.
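A minimal sketch of that loopback-based peering, seen from the core switch. Everything specific here is an assumption for illustration: the addresses, the VLAN number, the choice of OSPF as the IGP, and the use of AS 65101 on both sides (the sanitized configs in this thread show `router bgp 1`, but for iBGP the local and remote AS must match):

```
interface Loopback0
 ip address 10.255.0.1 255.255.255.255
!
interface Vlan10
 ip address 192.168.1.4 255.255.255.0
 ! OSPF fast hellos: 1-second dead interval, 4 hellos per second
 ip ospf dead-interval minimal hello-multiplier 4
!
router ospf 1
 network 10.255.0.1 0.0.0.0 area 0
 network 192.168.1.0 0.0.0.255 area 0
!
router bgp 65101
 neighbor 10.255.0.2 remote-as 65101
 neighbor 10.255.0.2 update-source Loopback0
 neighbor 10.255.0.2 timers 1 3
 ! Tear the session down as soon as the route to 10.255.0.2 disappears
 neighbor 10.255.0.2 fall-over
```

The IGP dead interval has to be shorter than the BGP hold time for this to beat the hold timer, which is why the sketch uses fast hellos rather than a 3-second dead interval.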

But now there is also something weird: as you have configured the BGP peers with a 1-second keepalive and a 3-second hold timer, the BGP peering to the router that you turn off must go down within 3 seconds at most. So I don't know where these 30-45 seconds come from. Probably you are not telling us everything.... or you configured the BGP timers without clearing the BGP sessions...
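A quick way to check whether the configured timers are actually in effect on a session, sketched with a hypothetical peer address (192.168.1.2 is an assumption, not an address from this thread):

```
! Show the hold time and keepalive interval currently used by the session
show ip bgp neighbors 192.168.1.2 | include hold time
!
! Timers changed with "neighbor ... timers" only apply after the session is reset
clear ip bgp 192.168.1.2
```

Note that `clear ip bgp` is a hard reset and drops the session, so it belongs in a maintenance window.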

Regards,

Herve

John Blakley · 11-08-2010

"Probably you are not telling us everything...."

Seriously?

Anyway, I can create loopbacks to use for a test. But, since I'm shutting the router off completely, peering would die anyway. Both RouterB and the switch would see that RouterA dropped out and the switch should start passing traffic to RouterB.

I lose about 30 packets before it starts to pass traffic again. It does fail over, just not fast enough, and no, I'm not leaving anything out....

Another question:

"With this the switch would lose quickly the route to the peer ip (the loopback) and fall-over would bring down the BGP peer."

What would be the difference between using a loopback and using the physical address of the peer? Either way, the peer would fall out, so why wouldn't using the physical address give me the same result?

And according to Cisco, fall-over works over any peering address, not just loopbacks:

BGP Fast Peering Session Deactivation

BGP fast peering session deactivation improves BGP convergence and response time to adjacency changes with BGP neighbors. This feature is event driven and configured on a per-neighbor basis. When this feature is enabled, BGP will monitor the peering session with the specified neighbor. Adjacency changes are detected and terminated peering sessions are deactivated in between the default or configured BGP scanning interval.

John


hbruyere · 11-08-2010

Hello!

BGP fall-over brings down a BGP peer when the route to the peer IP is lost. But here, if the peer is on a directly connected VLAN, the peer subnet still exists in the routing table when you power off the router, correct? So BGP fall-over does not kick in.

On the other hand, if the peer IP is a loopback (or anything else) that you learn from a dynamic routing protocol, the loss of this route would make BGP fall-over bring the BGP neighbor down before the hold timer expires.

But please don't lose too much time configuring loopbacks. As you have configured a BGP hold timer of 3 seconds, the BGP neighborship must go down within 3 seconds and you must see a log message about it. Now during these 30 seconds of packet loss, look at 'show ip bgp', 'show ip route' and 'show ip cef' on the switch for the destination that you test.

The path pointing to the router that you power off must also be removed within 3 seconds, otherwise we have a weird bug.
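As a sketch, the checks above could look like this on the switch while the packet loss is happening. The prefix 10.125.6.0/24 is taken from the BGP table posted earlier; substitute whichever destination is actually being tested:

```
! Is the session to the powered-off router already down?
show ip bgp summary
!
! Which path does BGP consider best for the test destination?
show ip bgp 10.125.6.0 255.255.255.0
!
! Do the routing table and the CEF forwarding table agree with BGP?
show ip route 10.125.6.0
show ip cef 10.125.6.0 255.255.255.0
```

If all three already point at Router-B a few seconds after the power-off, the forward path has converged and the remaining loss is more likely on the return path.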

Note that when troubleshooting this kind of convergence issue, it is very important to check in which direction the packets are lost. The forward path might be OK, and the packets lost on the return path! So you may also need to investigate what happens with the routing for the return path.

Regards,

Herve

John Blakley · 11-08-2010

Herve,

I don't have a way of testing this again for a while; Saturday was my one window. So, I did look at my routing table and the switch had already converged to RouterB, and RouterB always has the other path (not Router A) as its failover:

RouterB bgp table:

* i10.125.8.0/24    Router A            0    100      0 13979 65008 i
*>                  172.27.1.1                         45000 65027 65001 xxxx 65008 i
* i10.125.10.0/24   Router A             0    100      0 13979 65010 i
*>                  172.27.1.1                         45000 65027 65001 xxxx 65010 i
* i10.125.11.0/24   Router A             0    100      0 ?

Notice that Router A is not the preferred path on Router B, nor should it be in this scenario. The switch should fail over to Router B if Router A goes down. Router B *should* immediately start passing traffic when Router A goes down. Router B's routing table doesn't change because everything points to its outside peer. So, when Router A goes down, the switch should drop Router A's peer address out of the table and start forwarding everything to Router B instead.

Right now, I have a full mesh between Router A, Router B, and the Core Switch. I've thought about removing the peering between Router A and B to see if that would help the convergence time. Also, are you saying that fall-over wouldn't work if you're using iBGP on the same subnet between 2 peers? That's not the way I'm reading it. It says it acts if the peer goes down, not only if that peer's subnet drops out of the table. If that's the case, can you provide me some documentation aside from what I found stating this?

Thanks,

John


milan.kulik · 11-09-2010

Hi John,

I'd agree with Herve here: the problem might be the routing in the opposite direction.

I suppose that router A is the preferred path to your subnets from the backbone's point of view.

When you power off your router A, you need the backbone to take router B as the path to your site.

And it might take some time (30 seconds?) for the backbone to converge.

So it's possible everything is OK on your site: the core switch is forwarding the outgoing traffic to router B, and router B is forwarding it to the backbone, but the packets are not coming back because the backbone has not yet removed the return route through router A!

BR,

Milan

John Blakley · 11-09-2010

Now that you put it like that, yes, I could see that and that would make sense. The routes in my other routers (all eBGP) haven't fallen out yet. So, aside from jacking with timers on my other routers, is there a way to speed up convergence through eBGP?

Thanks!

John


milan.kulik · 11-09-2010

Hi,

quite a challenge, I'm afraid.

bgp fast-external-fallover might help router A's eBGP neighbors detect its failure.

But to speed up the backbone convergence, I don't see any easy way.

What does your backbone look like?

Pure BGP managed by you? Or MPLS from some provider?

How many hops away was the target you were testing?

BR,

Milan

John Blakley · 11-09-2010

Yeah, it's an MPLS provider. The problem is that my remote sites have one egress point to the provider. I, however, have 2 paths that can go out, so that's not going to help me much. :-) All that will ever be in the bgp table at the remote site is their 1 peer which is the ISP.

Thanks!

John


thiru.vel10 · 01-18-2018

Enabling BFD between the peers, with the default echo mode, will help bring down the BGP peer within 1 second. You can also set the interval (in ms) for the BFD hello packets. Hope this helps.
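A minimal sketch of BFD-backed BGP fall-over. The interface, peer address, AS number, and timer values are assumptions for illustration, and BFD support on a given interface type depends on platform and software version:

```
interface Vlan10
 ! Send and expect BFD packets every 300 ms; declare the peer down
 ! after 3 missed packets (300 ms x 3 = 900 ms detection time)
 bfd interval 300 min_rx 300 multiplier 3
!
router bgp 65101
 neighbor 192.168.1.2 remote-as 65101
 ! Tear down the BGP session as soon as BFD reports the peer down
 neighbor 192.168.1.2 fall-over bfd
```

The same BFD interval and `fall-over bfd` configuration would be needed on the router at the other end of the session.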