Detecting BGP route failures while eBGP neighbors are still active

Mar 29th, 2008

This is the situation. We have two carriers' MPLS clouds at two locations for redundancy. Each carrier provides us with a managed (CE) router to connect to its MPLS cloud. At each location we connect two of our routers to the two carriers' CE routers. We then configure eBGP neighbors between our routers and the two carriers' CE routers, and iBGP neighbors between our two routers at each location, using local preference. Currently, one MPLS cloud is primary for all our network traffic (call that cloud A) and the other cloud is a backup (call that cloud B).

If we lose the eBGP neighbors to MPLS cloud A's CE router, all the traffic fails over perfectly to backup MPLS cloud B. The issue we are having is when the eBGP neighbors stay up between MPLS cloud A's CE router and our router, but all our BGP routes are lost somewhere within the MPLS cloud. Because we still have eBGP neighbors with MPLS cloud A's CE router, our router continues to send the network traffic to that CE router, so everything goes into a black hole. The only way to fix the problem is to manually shut down the interface to MPLS cloud A's CE router, so that all the traffic fails over to MPLS cloud B's CE router.

How can we detect that we are no longer receiving BGP routes from MPLS cloud A's CE router and automatically fail over to MPLS cloud B's CE router?

Edison Ortiz Sat, 03/29/2008 - 15:08

My first suggestion, if that's a common occurrence, is to change the provider :)

You can configure IP SLA on the BGP-speaking router to test reachability out of that link and, together with EEM, automatically shut down the interface so BGP can fail over to the other MPLS provider.

Information about EEM can be found at:

http://www.cisco.com/en/US/prod/collateral/iosswrel/ps6537/ps6550/prod_white_paper0900aecd803a4dad_ps6815_Products_White_Paper.html

The problem is, how do you know when the problem is solved? That's the reason my first suggestion still stands. If that's a common occurrence, you must complain to the provider. Indirect failures are very hard to troubleshoot.
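
To make the "when is it solved?" question concrete, here is a rough Python sketch of the decision logic only, not router configuration. It assumes something keeps probing a target through cloud A even after traffic has moved to cloud B, which would not be possible if the interface itself is shut down. The class name `PathMonitor` and the thresholds are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathMonitor:
    """Tracks probe results for one path and applies hysteresis (illustrative only)."""
    down_after: int = 3    # consecutive failed probes before declaring the path down (assumed value)
    up_after: int = 10     # consecutive good probes before trusting the path again (assumed value)
    healthy: bool = True
    _streak: int = 0       # length of the current run of results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the (possibly updated) health state."""
        if probe_ok == self.healthy:
            # The result agrees with the current state, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.down_after if self.healthy else self.up_after
        if self._streak >= needed:
            # Enough consecutive disagreeing probes: flip the state.
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

Requiring more good probes to recover than bad probes to fail is one way to avoid flapping back onto a path that is only intermittently working.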

HTH,

___

Edison.

Richard Burts Sat, 03/29/2008 - 15:34

John

I agree with Edison that it sounds more like a problem in your provider setup than any problem on your end. And it would be better to get the provider to fix it than to have to make adjustments on your end. Is the provider advertising routes to you or just advertising a default route? I would guess from the symptoms that they may be advertising only a default route and that they continue to advertise it even if they have lost upstream functionality. Ask the provider if they would change their CE router config so that it only advertises the route when it has upstream functionality.

If they do not fix the issue, then I believe that the suggestion from Edison is a pretty good fix for your end.

HTH

Rick

jrtuckiii Sat, 03/29/2008 - 17:14

Rick, Edison,

Thank you for the information. This really helps. This has been a problem for us with both AT&T's and Verizon's MPLS networks. Unfortunately, I simplified the example. In reality, it is more like 54 core sites with dual MPLS links and another 150 smaller sites with single MPLS links. As for the advertisements, we have about 700 routes, including default gateway routes. I appreciate you taking the time to answer my question.

Thank you,

John

sundar.palaniappan Sat, 03/29/2008 - 20:51

John,

Did you have Verizon and AT&T investigate why their CE was advertising the routes when communication to the PE was lost? As you are probably aware, BGP advertises a route only if that route exists in the routing table. The fact that the CEs were advertising routes to your routers clearly shows they had matching routes in their routing tables at the time of the problem.

If the CEs are set up to learn the routes dynamically from the PE, very likely via BGP, and advertise them to your routers, then the problem has to be somewhere beyond the PE. The second possibility is that they may have been learning those networks from another source. Could those two CEs be set up to peer with each other? In that case the primary CE might have been advertising those networks even after it lost communication with the PE. The third possibility, though I highly doubt it, is that the CEs are set up with static routes.

In any case, you would have to engage both service providers to get to the root of the problem, as you may not have access to the CE configuration.

HTH

Sundar

ruwhite Sun, 03/30/2008 - 06:25

You don't say how long this condition persists (is it permanent?), etc., so there's a wide array of possible problems, from the CEs being configured with static routes (providers prefer not to learn routing information dynamically from their customers) to BGP convergence issues in the provider's core.

Anyway, since you're seeing it with both networks, I'd say this is going to be something you're not going to get any traction on.

Your best bet is probably going to be running IP SLA across the cloud, or running a tunnel between the L3VPN endpoints and running an IGP across it, to get around this problem. We actually have some large customers who take both of these approaches to resolve the sorts of problems you're seeing.
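
With that many sites, one remote site failing to answer a probe probably shouldn't be read as the whole cloud black-holing traffic. A minimal Python sketch of that idea follows, assuming probe results are collected per remote site through one cloud. The function name, the site names in the usage example, and the 0.8 threshold are all hypothetical.

```python
from typing import Mapping


def cloud_blackholed(probe_results: Mapping[str, bool], fail_fraction: float = 0.8) -> bool:
    """Return True when enough remote sites are unreachable through one cloud.

    probe_results maps a remote site name to whether its last probe succeeded.
    fail_fraction is the share of failed sites that counts as a cloud-wide
    failure (an assumed value, to be tuned per network).
    """
    if not probe_results:
        # No data is treated as "no evidence of failure" in this sketch.
        return False
    failed = sum(1 for ok in probe_results.values() if not ok)
    return failed / len(probe_results) >= fail_fraction
```

For example, `cloud_blackholed({"site-1": False, "site-2": False, "site-3": True}, fail_fraction=0.6)` returns True, because two of the three probed sites are unreachable.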

:-)

Russ
