BGP next hop does not change when IGP route fails

eagles-nest · ‎09-01-2013

Hi

My BGP is a bit rusty and I would be interested in any input on a problem I saw. I have resolved the issue but in investigating it I found what I thought was an unusual event.

I have 2 MPLS-attached routers at a site. Each uses a different provider and each has been allocated a different AS number. So when I peer them with each other it is an eBGP connection.

So my routes come in from all sites and all is ok. I then have a "backdoor" link between 2 sites and have run an IGP on this. I want these 2 sites to pass traffic to each other over this link. I initially did a backdoor config on the BGP setup, but when I tested it by failing the backdoor link at the far site I lost some connectivity to that site. When I looked in the BGP table of my 2 MPLS routers at one side, the next hop of the prefix was pointing to the internal IGP router that links the backdoor connection to the other site. So traffic was being sent there and black holed because its IGP backdoor link had failed. The IGP router had withdrawn the IGP routes due to the failure but the BGP router still had it as the next hop. I thought that with the IGP routes failing the BGP router would update its next hop, since the one it used previously was now invalid.

Anyway, I resolved the issue but I didn't expect this to happen and am curious about any input on this. Should the BGP next hop have been updated because that next hop could no longer reach the subnet?

As I mentioned, I'm not particularly looking for ways to resolve the issue. That has been done. I'm just looking for input on the function of BGP in this instance and whether it should have done what it did.

Many thanks, St.

Aninda Chatterjee · ‎09-01-2013

Hello,

I'm no expert at BGP having just started my studies on it, but I'd like to try and add some input here, if possible.

There are a few things that I'd like to confirm first:

1. You said you have an eBGP peer between the two. Are you directly peering between the two sites or do you peer with your service provider?

2. If you are directly peering between the two sites, are you peering using the physical interfaces themselves (and if so which one - the one that goes to your service provider or the one that forms the backdoor?) or are you peering using loopbacks on each end?

You stated that after configuring the backdoor connection on the BGP setup and failing the link, the BGP table showed you that the next hop was the backdoor link itself. Did you check the BGP table before you failed the link? What was the next hop at that point?

Basically, what BGP backdoor does is increase the administrative distance of a BGP prefix on the local router to 200 (assuming that the same prefix has been learned via both BGP and an IGP on that router). This means that instead of the BGP route being installed in the routing table, you'd have the IGP route installed: because the prefix is the same, the tie breaker between two different routing protocols is the administrative distance, and the lower one gets the preference.

So a BGP backdoor really makes no other change to a BGP learned prefix apart from increasing its AD to 200, as far as I know.
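To make the AD comparison concrete, here is a minimal Python sketch of the tie break described above. It assumes the default Cisco distances (eBGP 20, OSPF 110) and the 200 that backdoor applies; the next-hop labels and function names are hypothetical and only there for illustration.

```python
# Default administrative distances (lower wins). Assumed defaults, not read from a router.
AD_EBGP = 20
AD_OSPF = 110
AD_BACKDOOR = 200  # AD given to the eBGP route when the prefix is marked backdoor


def pick_rib_route(candidates):
    """Return the candidate with the lowest administrative distance.

    candidates: list of (protocol, admin_distance, next_hop) tuples for ONE prefix.
    """
    return min(candidates, key=lambda route: route[1])


def candidates_for(prefix_is_backdoor):
    """Build the two competing routes for a single prefix (hypothetical next hops)."""
    bgp_ad = AD_BACKDOOR if prefix_is_backdoor else AD_EBGP
    return [
        ("bgp", bgp_ad, "provider-a"),
        ("ospf", AD_OSPF, "site-link"),
    ]


without_backdoor = pick_rib_route(candidates_for(False))
with_backdoor = pick_rib_route(candidates_for(True))
print(without_backdoor[0], with_backdoor[0])
```

Note that only the distance changes in this model. The BGP entry itself, including its next hop, is left exactly as it was.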

Regards,

Aninda

eagles-nest · ‎09-01-2013

Thanks Aninda

Perhaps I explained it badly. There are 2 sites, each of which has 2 x MPLS connections to 2 providers. So at Site A I have 2 x eBGP links, one to Provider A and one to Provider B. The same at the other site. The providers have allocated me different AS numbers, so I have a router at site A running BGP 65000 and another at site A running BGP 65001.

So if we take a single site for simplicity.

I peer with Provider A and my AS number is 65000

I peer with Provider B and my AS number is 65001

For resilience, in case I lose one link, I peer my routers with each other so we peer between AS 65000 and AS 65001 on routers at my site.

I do the same at the other site.

Now I introduce a site to site direct link and want site to site traffic to use it but if it fails for them to divert over the MPLS links. So I run OSPF across the link and only advertise the routes connected at each site.

On my 1st attempt I used the bgp backdoor command on specific prefixes and, as expected, it raised the AD to 200 and caused the OSPF route to be preferred. All as expected.

So when I look at the BGP tables at either end I have my best route with next hop the OSPF intersite router and a RIB failure, as expected, because OSPF has a better AD. That's exactly what I expect.

However, when I fail the OSPF link somewhere in the chain I expect the OSPF routes to be withdrawn and BGP to use an MPLS route. This does not happen. OSPF does withdraw the routes as expected, but in my BGP table I still see that the route to a prefix at the other site has the OSPF router as its next hop. I thought that since OSPF had withdrawn the route, BGP would update the next hop because there is no longer a route to the prefix via that next hop. So because the BGP table maintains the old next hop, traffic is sent there. That router has no route since the site to site link has failed, and the traffic actually ends up going into a loop where it is sent back to the original BGP router via a default route.

So I know how bgp backdoor works and it did seem to do the job. Until the OSPF route failed and BGP did not seem to update the next hop to the prefixes at the other site.

I then did a config where I reduced the OSPF AD to 15 and this did exactly the same as the backdoor config when the site to site link failed. Next hop stayed the same.

I resolved the issue with a next-hop-self command between my routers at each site but I am wondering why the next hop did not update on the other occasions when I lost the site to site link and OSPF had withdrawn its reachability to the prefix.

St.

Aninda Chatterjee · ‎09-01-2013

Hello,

Perhaps my lack of experience with BGP will creep up now or I'm just having a really bad day. Either way, I'm having a hard time understanding the following:

"OSPF does withdraw the routes as expected but in my BGP table I still see that the route to a prefix at the other site still has the OSPF router as its next hop."

When you say that the BGP table shows you the OSPF router as the next hop, I'm assuming you mean that the next-hop IP is the IP address of the other site router for the directly connected, site-to-site link which is being used as the backdoor, correct?

If the above is true, why should the next-hop ever be that in the first place? That is what throws me off. I mean, let's say you advertise the prefix X.X.X.X/24 on site B into BGP. Site A eventually learns this through its providers and the next-hop for the router in Site-A is Provider-A. Now you introduce the site-to-site connection, run OSPF on it and advertise the same prefix into OSPF. Site-A learns the prefix in OSPF but it doesn't get installed into the routing table because the eBGP route has better AD.

At this stage, you still have the prefix X.X.X.X/24 with a next-hop of Provider-A installed in the BGP and routing table.

Now you configure the backdoor keyword for the X.X.X.X/24 prefix. The AD for this prefix jumps to 200, the OSPF route gets installed into the routing table with a next-hop of Site-B (the direct link's IP). The BGP learned prefix shows a RIB-failure but the next-hop is still Provider-A.

Through all the steps above, at no stage would we have the IP of the other site's direct connection as the next-hop for the prefix in the BGP table. Wouldn't you consider seeing it as a next-hop in the first place to be a problem?

This is quite an interesting problem. Perhaps a few experts could chime in as well. It will be good to see what they make of this situation.

Regards,

Aninda

Peter Paluch · ‎09-01-2013

Aninda, St.

I am by no means an expert on BGP but this issue intrigues me as well.

Aninda, you have very nicely spotted and highlighted a very important issue to consider: why is the BGP NEXT_HOP attribute set to some internal OSPF router's IP address, rather than being set to the IP address of Provider-A or Provider-B through which the routes from the other location should be learned in BGP?

St., you have to keep in mind that BGP does not update the NEXT_HOP attribute just because the IGP routing table changes. The NEXT_HOP attribute is set/modified:

Initially by the router injecting the route into BGP, possibly retaking the next hop value from its own routing table
Later by the latest eBGP neighbor that advertised the route to you
Or by your iBGP neighbor from which you learned that route if next-hop-self is configured on that neighbor

The fact that the value of the NEXT_HOP BGP attribute is set to an internal OSPF speaker suggests that either it is that OSPF speaker that advertises the route into BGP, or that one of its neighbors redistributes the route from OSPF to BGP and retakes the next hop.

I'd say that a network diagram showing the routers, individual ASes, BGP sessions, direct link, and OSPF/BGP redistributions would be helpful. It seems that the OSPF routes are somehow being redistributed into BGP. The interaction between AS 65000 and 65001 may also be the culprit and deserves closer attention.
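The three rules listed above can be sketched as a small Python model. This is a simplification under stated assumptions: it ignores special cases (for example eBGP peers that share a subnet with the existing next hop), and the class, function names and addresses are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BgpRoute:
    prefix: str
    next_hop: str


def originate(prefix, rib_next_hop):
    """Inject a prefix into BGP, copying the next hop from the local routing table."""
    return BgpRoute(prefix, rib_next_hop)


def advertise(route, sender_ip, session, next_hop_self=False):
    """Return the route as the receiver sees it.

    Simplified: an eBGP sender, or any sender with next-hop-self, rewrites
    NEXT_HOP to its own address; otherwise the attribute is passed unchanged.
    """
    if session == "ebgp" or next_hop_self:
        return replace(route, next_hop=sender_ip)
    return route


# Hypothetical addresses for illustration only.
injected = originate("10.2.0.0/16", rib_next_hop="10.1.1.3")
via_ibgp = advertise(injected, sender_ip="10.1.1.1", session="ibgp")
via_ibgp_nhs = advertise(injected, sender_ip="10.1.1.1", session="ibgp", next_hop_self=True)
via_ebgp = advertise(injected, sender_ip="192.0.2.1", session="ebgp")

print(via_ibgp.next_hop, via_ibgp_nhs.next_hop, via_ebgp.next_hop)
```

Neither function takes the IGP state as an input, which is the point: in this model nothing that OSPF does afterwards can rewrite the attribute on a route that is already in the BGP table.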

Best regards,

Peter

eagles-nest · ‎09-01-2013

Thanks Peter

It's not easy for me to upload a diagram but the points you make help.

The route is indeed injected to BGP via the network backdoor command. And since I have a point to point OSPF link from which I learn the route, BGP takes the next hop as that of the local OSPF router, not the remote one. That's when the following point comes into effect.

"Initially by the router injecting the route into BGP, possibly retaking the next hop value from its own routing table"

You are correct that the NEXT_HOP attribute is that of an OSPF speaker but it is a local one not one at the remote site.

So correctly in this case BGP injects the route into its table via the network backdoor command and makes the AD 200. OSPF beats this with AD 110 and the OSPF route is chosen.

What then happens is I fail the site to site link. So the route is no longer known from the local OSPF router that is NEXT_HOP in BGP. But the NEXT_HOP attribute is maintained even though the OSPF router can no longer reach the subnet.

I think Aninda's confusion is that he thinks the next-hop attribute is that of a router at the remote site. It's not, it's a local router on the same LAN as the BGP WAN routers.

I think the following from Cisco helps too:

Default BGP Scanner Behavior

BGP monitors the next hop of installed routes to verify next-hop reachability and to select, install, and validate the BGP best path. By default, the BGP scanner is used to poll the RIB for this information every 60 seconds. During the 60 second time period between scan cycles, Interior Gateway Protocol (IGP) instability or other network failures can cause black holes and routing loops to temporarily form.

This suggests once BGP has a route it scans every 60 seconds to see if the next hop is still reachable. In my case it is even though that next hop cannot reach the subnet after the site to site link failure. So BGP THINKS the next hop is still valid since it's reachable even though that next hop can no longer reach the remote subnet.
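A minimal Python sketch of that distinction, assuming the check only asks "is there a route to the next hop address?" and never "can the next hop reach the prefix?". The addresses, the set-based RIB and the function names are hypothetical, made up for illustration.

```python
import ipaddress


def lookup(rib, address):
    """Longest-prefix match; returns the matching network or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in rib if addr in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None


def next_hop_is_valid(rib, next_hop):
    """Model of the reachability check: is the next hop ADDRESS itself routable?"""
    return lookup(rib, next_hop) is not None


# Hypothetical addressing: the local LAN, and the remote site's prefix learned via OSPF.
lan = ipaddress.ip_network("10.1.1.0/24")
remote = ipaddress.ip_network("10.2.0.0/16")
ospf_router = "10.1.1.3"  # local OSPF router on the same LAN as the BGP routers

rib_before = {lan, remote}  # site to site link up
rib_after = {lan}           # link failed, OSPF withdrew the remote prefix

valid_before = next_hop_is_valid(rib_before, ospf_router)
valid_after = next_hop_is_valid(rib_after, ospf_router)
remote_known_after = lookup(rib_after, "10.2.0.1") is not None

print(valid_before, valid_after, remote_known_after)
```

In this sketch the next hop stays valid after the failure because the LAN is still connected, even though the remote prefix itself has gone from the table.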

Sorry I can't upload a diagram easily from this computer. It's very tied down and I don't have access to my own computer at present.

Many thanks, Stuart.