Routing problems after HSRP failover between two BGP routers

gchevalley
Level 1

We have two BGP routers, each connected to a separate ISP. Each router is connected to its own switch, and the two switches are connected to each other via a trunk. Our firewalls connect to these switches, which are the aggregation point between our network and the external routers. We are currently running HSRP between the two routers, with IP SLA on the primary router tracking the status of the external interface and a ping to Google DNS.

While testing HSRP failover, we noticed traffic being lost for a few minutes. The test was performed by shutting down the external-facing interface on the primary router. HSRP worked as intended and the secondary router became the active router. From the secondary router we were able to ping several public DNS servers (4.2.2.2 and 8.8.8.8) without problems. However, the continuous ping to 8.8.8.8 I was running from my desktop died. The firewall was also unable to ping, returning only question marks:

FW-01-1# ping 8.8.8.8
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
?????
Success rate is 0 percent (0/5) 

This was odd given that it was working fine just before failover. On an ASA, the question marks simply mean that no reply was received within the timeout; it is on IOS routers that "?" indicates an unknown packet type. The routing and ARP tables all looked good:

Gateway of last resort is 1x.xxx.xxx.xx1 to network 0.0.0.0

S*   0.0.0.0 0.0.0.0 [1/0] via 1x.xxx.xxx.xx1, outside
FW-01-1# sh arp
        outside 1x.xxx.xxx.xx3 d0d0.fdbb.4771 0
        outside 1x.xxx.xxx.xx1 0000.0c07.ac01 146
        outside 1x.xxx.xxx.xx2 e8b7.48d7.cd51 150

 

I checked the MAC address tables on both switches and verified that the virtual MAC address of the HSRP group moved correctly from the trunk to the access interface in response to the failover. Everything looked good on the secondary router as well: its outside-facing interface was now showing both incoming and outgoing packets, and we even verified with the secondary ISP that they were seeing the traffic. The issue continues for 2 to 2.5 minutes, then all of a sudden corrects itself and all is good. The interesting thing is that this only occurs when we fail over from the primary router to the secondary. When we re-enable the outside interface on the primary router, HSRP switches the vIP back to the primary router without a single lost packet. If we fail over to the secondary again, we encounter the same two-minute wait.

 

Has anyone else encountered an issue like this?

 

I've included the router and switch configs along with a diagram.
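
Since the attachments are not reproduced here, the following is only a minimal sketch of the kind of HSRP and IP SLA tracking setup described above, not the actual configuration. The interface names, the 192.0.2.x documentation addresses, the priority and decrement values, and the probe frequency are all assumptions; only HSRP group 1 (which matches the virtual MAC 0000.0c07.ac01 in the ARP table) and the 8.8.8.8 probe target come from the post.

```
! Hypothetical values throughout; adjust to the real topology.
! Probe Google DNS through the ISP-facing interface.
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0
 frequency 5
ip sla schedule 1 life forever start-time now
!
! Track object 1 follows the probe, object 2 the outside link itself.
track 1 ip sla 1 reachability
track 2 interface GigabitEthernet0/0 line-protocol
!
! Inside interface facing the switches and firewalls.
interface GigabitEthernet0/1
 ip address 192.0.2.2 255.255.255.248
 standby 1 ip 192.0.2.1
 standby 1 priority 110
 standby 1 preempt
 standby 1 track 1 decrement 20
 standby 1 track 2 decrement 20
```

With the secondary router left at the default priority of 100 and also configured with `standby 1 preempt`, either tracked object going down drops the primary to 90 and hands the virtual IP to the secondary. This only moves outbound traffic; it has no influence on how quickly the return path from the Internet changes.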

1 Accepted Solution

Accepted Solutions

milan.kulik
Level 10

Hi,


 

IMHO, the reason could be BGP convergence time.

When you shut down your primary line, it takes some time (2-3 minutes, as you say) until the old routing information for your LAN subnet(s) is replaced, from the 8.8.8.8 point of view, by the new path through the secondary ISP.

This means that during that time the returning packets may still be routed towards your primary ISP and not delivered (as your primary line is down).

When you enable the primary line again, BGP again takes some time to converge.

But as the secondary line is still up, the returning packets that are routed based on the old routing information (to your secondary line in this case) can still be delivered to your LAN. So possibly no packets are lost (or fewer than in the former case).

 

Best regards,

Milan


 



If it is a BGP convergence issue, is there anything that either we or the secondary ISP can do to reduce the convergence time?

Hi,

 

You could try to tune the BGP timers with your primary ISP.

See http://networkgeekstuff.com/networking/cisco-bgp-timers-re-explained/ for some details.

But as there are additional providers involved along the path to the server, I'm quite sceptical that it would bring a considerable improvement in convergence time.
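
As a rough illustration of what tuning the timers looks like on the customer side, here is a minimal sketch. The AS numbers and the neighbor address are placeholders, and the 10/30 second values are only an example. The hold time actually used is the lower of the two peers' values, so the ISP has to accept such short timers for this to be practical.

```
! Hypothetical AS numbers and neighbor address.
! Keepalive 10 s, hold time 30 s (the IOS defaults are 60 s and 180 s).
router bgp 64512
 neighbor 198.51.100.1 remote-as 64496
 neighbor 198.51.100.1 timers 10 30
```

These timers only govern how quickly a dead session is detected. If the ISP side does not see the link go down (for example because there is a carrier device in between), it may keep the session and the route until its hold timer expires, which with the defaults can take up to 180 seconds. Shorter timers reduce that window, but they do not speed up propagation through the networks further upstream.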

 

Best regards,

Milan

With some further testing I've found that shutting down the BGP neighbor for the primary ISP on router 1 works very well.  There is no loss of traffic while waiting a few minutes for the internet to converge on the new route.  We can then shut down the outside interface with no loss of traffic.
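
The sequence that worked can be sketched as follows. The AS number, neighbor address and interface name are placeholders rather than values from the actual configs.

```
! Step 1: administratively shut the BGP session to the primary ISP.
! Hypothetical AS number and neighbor address.
router bgp 64512
 neighbor 198.51.100.1 shutdown
!
! Step 2: wait a few minutes for the new path to propagate, then
! take the outside interface down (hypothetical interface name).
interface GigabitEthernet0/0
 shutdown
```

A plausible reason this works is that shutting the session notifies the primary ISP explicitly, so it withdraws the route right away instead of having to discover a dead link on its own.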
