Re: Dual MPLS WAN Failover Issue

dtran · ‎08-09-2006

I am running a single-AS EIGRP WAN with dual routers at each remote site. Please see the attachment for an overview of the WAN. Both links are active at all times, and the issue I am having is that if one link fails it takes approximately 30 minutes for traffic to fail over to the other link.

Has anyone run into this issue before with a similar WAN setup?

Can someone tell me if this is normal behavior and what can be done to reduce convergence time or speed up the failover process?

Thanks in advance !!!

Danny

jackyoung · ‎08-09-2006

Do your LAN switches (6509 & 3560) have L3 routing enabled? Can you provide the switch and router configurations along with the output of sh ip route and sh ip eigrp top? Please also describe the failover symptom in detail, e.g. which link/device went down and where the failover takes a long time.

A 30-minute failover is definitely a problem. Do you mean 30 seconds?

If both WAN networks are up all the time, the problem should be on your router/switch side. I assume the two providers do not talk to each other, so it should not be an MPLS issue.

I will wait for your info.

dtran · ‎08-09-2006

Hi Jack ! I will provide the configs and show command output tomorrow when I am back in the office. Can I email them to Jackyoung@hotmail.com ?

I do have MSFCs on both 6509 switches; the 3560 is only an L2 switch.

The problem occurs when one of the links fails. For example, today the SBC/AT&T T1 failed at the remote site and it took 30 minutes for traffic to fail over to the Sprint T1. While the SBC/AT&T T1 was down at the remote site, I ran a traceroute from the Corporate HQ 3662 router to the remote site and saw a routing loop between the two WAN networks. Traffic did eventually fail over to the Sprint T1 after about 30 minutes.

And yes, both WAN networks are up all the time.

We just recently migrated from a hub-and-spoke Frame Relay network to a full-mesh MPLS network, and I find MPLS a lot more complex and difficult to troubleshoot.

I really appreciate your help on this issue !!!!

Danny Tran

jackyoung · ‎08-09-2006

Thanks, Danny. Would you mind uploading the information to this forum? That way more NetPro members can help and share the knowledge.

In this case, I think it can be fixed by fine-tuning the routing. If it is difficult to provide the full config, please at least provide the routing section and sh ip route, and then let's check whether that is sufficient.

Changing from FR to MPLS should not be difficult; you can treat it as just a change of media.

I will wait for your update.

dtran · ‎08-10-2006

Thanks Jack !!! I have attached the config files and the show command outputs you needed. Please let me know what you find, or if you need more information from me !

Thank you for your time and I really appreciate your help !!!

Danny

jackyoung · ‎08-10-2006

Thanks a lot for the detailed info. According to the sh ip route output, there appear to be multiple paths to reach a single subnet. Could you please confirm that this is your intended design? I suggest running the links in active-standby mode. Because the two paths are not equal, automatic load sharing will not give you real load balancing across the two MPLS links.

Moreover, there are many backdoors through which a packet can be routed toward the remote side, yet it may finally reach the same exit point (the other MPLS provider) to the remote. Please design the paths the way you prefer and minimize the backdoors. I suggest adding the "bandwidth xxxx" command on the interface of each provider-facing router that connects to the MPLS network. EIGRP uses this figure to calculate its routing metric; it is used for calculation only and does not affect real throughput, so don't worry that it will limit the bandwidth.

Also, the HSRP priority is not doing anything useful in the current configuration. Because both routers are configured with the same priority, the routers themselves end up selecting the active path to the remote side. I suggest choosing one ISP as the preferred path and giving that provider's router the higher priority. However, the priority difference between two routers in the same HSRP group should not be larger than 10.
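As a rough sketch of what that could look like on a pair of routers sharing one LAN (the interface names, HSRP group number 1, and the 192.0.2.x addresses are all placeholders, not taken from your configs):

```
! Router facing the preferred provider (normally HSRP active).
! "standby 1 track" lowers the priority by 10 (the IOS default
! decrement) when Serial0/0 goes down: 105 - 10 = 95.
interface FastEthernet0/0
 ip address 192.0.2.2 255.255.255.0
 standby 1 ip 192.0.2.1
 standby 1 priority 105
 standby 1 preempt
 standby 1 track Serial0/0
!
! Router facing the other provider (normally HSRP standby).
! Its priority of 100 beats 95, and preempt lets it take over.
interface FastEthernet0/0
 ip address 192.0.2.3 255.255.255.0
 standby 1 ip 192.0.2.1
 standby 1 priority 100
 standby 1 preempt
```

One likely reason for keeping the gap at 10 or less: with the default tracking decrement of 10, a larger gap would leave the failed router with the higher priority, so the standby would never take over.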

Below are some links for your reference; you can find useful info there. Also, if you can, please provide the traceroute result captured while a link is down but traffic has not yet failed over to the other link.

EIGRP

http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080093f07.shtml

http://www.cisco.com/en/US/products/sw/iosswrel/ps1831/products_configuration_guide_chapter09186a00800d97f8.html

HSRP

http://www.cisco.com/en/US/tech/tk648/tk362/technologies_q_and_a_item09186a00800a9679.shtml

http://www.cisco.com/en/US/tech/tk1330/technologies_design_guide_chapter09186a008066670b.html

Hope this helps.

dtran · ‎08-10-2006

Hi Jack ! Please follow the link below to see the traceroute that I captured when the link at the remote site was down.

http://forum.cisco.com/eforum/servlet/NetProf;jsessionid=232A6F3966B3A064B7AA6744A041CD24.SJ5B?page=netprof&forum=Network%20Infrastructure&topic=WAN%2C%20Routing%20and%20Switching&CommCmd=MB%3Fcmd%3Ddisplay_location%26location%3D.1ddbe155

Thanks Jack !!

Danny

dtran · ‎08-10-2006

Hi Jack ! Here is the traceroute output that I captured when the SBC T1 was down at RemoteSite.

Type escape sequence to abort.
Tracing the route to 172.16.48.15

  1 71.137.174.217 4 msec 4 msec 0 msec - HQ SBC PE router
  2 70.250.121.98 44 msec 40 msec 40 msec - DR SBC CE router
  3 172.16.142.2 40 msec 40 msec 40 msec - DR 3825 router g0/0
  4 172.20.131.185 40 msec 40 msec 40 msec - DR Sprint PE router
  5 172.20.46.29 40 msec 40 msec 40 msec - HQ Sprint PE router
  6 172.20.46.30 36 msec 40 msec 36 msec - HQ Sprint CE router
  7 10.40.8.20 40 msec 40 msec 44 msec - HQ 3662 router Fa0/0
Loop repeats here
  8 71.137.174.217 44 msec 40 msec 40 msec
  9 70.250.121.98 80 msec 80 msec 80 msec
 10 172.16.142.2 80 msec 84 msec 80 msec
 11 172.20.131.185 80 msec 80 msec 76 msec
 12 172.20.46.29 80 msec 80 msec 80 msec
 13 172.20.46.30 84 msec 80 msec 80 msec
 14 10.40.8.20 84 msec 84 msec 80 msec
Loop repeats here
 15 71.137.174.217 84 msec 88 msec 84 msec
 16 70.250.121.98 120 msec 124 msec 128 msec
 17 172.16.142.2 124 msec 120 msec 120 msec
 18 172.20.131.185 116 msec 120 msec 120 msec
 19 172.20.46.29 116 msec 120 msec 120 msec
 20 172.20.46.30 116 msec 120 msec 116 msec
 21 10.40.8.20 120 msec 120 msec 120 msec
Loop repeats here
 22 71.137.174.217 120 msec 120 msec 120 msec
 23 70.250.121.98 160 msec 156 msec 156 msec
 24 172.16.142.2 160 msec 156 msec 160 msec
 25 172.20.131.185 156 msec 156 msec 160 msec
 26 172.20.46.29 156 msec 156 msec 160 msec
 27 172.20.46.30 156 msec 156 msec 160 msec
 28 10.40.8.20 156 msec 164 msec 156 msec
Loop repeats here
 29 71.137.174.217 160 msec 164 msec 160 msec
 30 70.250.121.98 200 msec 196 msec 200 msec

Thanks again Jack !!!!

Danny

jackyoung · ‎08-10-2006

Thanks, Danny, for the quick reply.

Yes, you have found the loop, and you need to design the routing carefully to avoid it.

The correct path should be :

Outgoing path

HQ 6509 --> HQ SBC --> HQ Sprint --> Remote Sprint --> Remote 3560

Return path

Remote 3560 --> Remote SBC --> Remote Sprint --> HQ Sprint --> HQ 6509

I think this is due to too many backdoors and is not related to HSRP. HSRP only serves the local LAN users; the routers talk to each other using their real IP addresses.

In this case, try adding the bandwidth command on the WAN interfaces first and fine-tune the DR site to be less preferred; otherwise, packets may be routed via the DR site and then on to the remote site. Your old network was FR, so it was probably made up of point-to-point connections. The current design looks point-to-multipoint, which is how the loop is created. The path above is only my suggestion; you can design the active and standby paths and how they route during a failure. Check the links I provided earlier for fine-tuning EIGRP.

If there is still looping or an unwanted multiple path, you may need to apply an access control list at the points that should not have those routes.

Moreover, I suspect there may be a static route somewhere that is redistributed into EIGRP. Such a static route would keep EIGRP treating the destination as reachable, so the route is not removed even when the WAN link goes down.
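A quick way to check for this, as a sketch using standard IOS show commands (172.16.48.15 is simply the traceroute target from your output; run these on each WAN-facing router):

```
! Any static routes configured on this router?
show running-config | include ^ip route
show ip route static

! Is "redistribute static" present under the EIGRP process?
show running-config | begin router eigrp

! How is the remote destination currently learned?
! A redistributed static shows up on other routers as "D EX".
show ip route 172.16.48.15
```

If a route toward the remote site still appears as D EX while the T1 is down, the router originating it would be the first place I would look.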

Because there is a DR site and dual links, you have to decide which one is the preferred destination when the active link goes down, and in what sequence.

Hope this helps.

dtran · ‎08-10-2006

Hi Jack ! Can you tell me which router I should apply the bandwidth command to?

Thanks Jack !!!

Danny

jackyoung · ‎08-10-2006

You're welcome. You can apply the bandwidth command on the routers that connect to the WAN networks, i.e. on the serial interfaces of the provider-facing routers (SBC and Sprint) at all three sites.
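For example, something along these lines (a sketch only; Serial0/0 is a placeholder interface name, and the values need to match your real circuits):

```
! On a router whose WAN link is a T1. The bandwidth value is in kbps.
interface Serial0/0
 bandwidth 1544
!
! Optional, on the DR-site WAN interface: raise the delay so that
! paths through the DR site look worse to EIGRP. The unit is tens
! of microseconds, so 4000 means 40000 usec. The value is arbitrary.
interface Serial0/0
 delay 4000
```

With the default K values, EIGRP builds its metric from the lowest bandwidth along the path plus the sum of the interface delays, so either knob changes which path is preferred. Changing delay has fewer side effects, since bandwidth is also used by other features such as QoS.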

The most important thing in this case is to define the sequence in which paths will be selected if the primary link goes down, i.e. if the primary link at HQ goes down, should the traffic go via the DR site or directly to the remote site?

Your case is quite complicated and needs to be designed carefully. ;)

olorunloba · ‎08-11-2006

I must say you have a complex but very interesting setup here. A number of issues have been raised, and I think there are still more.

Going back to the original question, the 30-minute convergence: it will be hard to find the reason for such a long convergence without outputs from your providers as well, from before and after the convergence. Bear in mind that in an MPLS VPN your provider participates in your routing, so their routing setup is key to your operation as well.

If you can get your VRF routing table from the providers, as well as your BGP table and EIGRP topology table, that will help.
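Roughly, these are the outputs I have in mind (a sketch; CUSTOMER is a placeholder for whatever VRF name each provider uses for you, and the PE commands would have to be run by SBC and Sprint):

```
! On your own CE routers, before and after the failure
show ip eigrp neighbors
show ip eigrp topology
show ip eigrp topology all-links
show ip route 172.16.48.15

! On the provider PE routers
show ip route vrf CUSTOMER
show ip bgp vpnv4 vrf CUSTOMER
show ip eigrp vrf CUSTOMER topology
```

Comparing the before and after captures should show which router keeps a stale path to the remote site during the 30 minutes.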

Having said all of that, I also noticed the following:

1. The Sprint network was preferring to reach the remote site through the 45M link to HQ. This should not be the case, as they should prefer the T1 connection. Talk to Sprint and have them clarify why this is so.

2. The remote site still had routes from SBC, although I thought that link was supposed to be down.

If you are willing to configure your sites to be non-transit, your scenario could become less complex. In that case, use a distribute list to stop the prefixes of other sites from being advertised out. A sample config at HQ would be:

router eigrp 1
 network 10.0.0.0
 network 172.16.0.0
 network 172.20.0.0
 distribute-list 99 out
 no auto-summary

access-list 99 deny
access-list 99 permit any
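The deny line above still needs the other sites' prefixes filled in. Purely as an illustration, assuming the remote site LAN is 172.16.48.0/24 (the traceroute target was 172.16.48.15, but the mask is my guess) and that Serial0/0 is the WAN-facing interface (a placeholder name), the filter could look something like this:

```
access-list 99 remark keep other sites' prefixes from being re-advertised by HQ
access-list 99 deny   172.16.48.0 0.0.0.255
access-list 99 permit any
!
router eigrp 1
 distribute-list 99 out Serial0/0
```

You would add one deny line per remote or DR prefix. Binding the list to the WAN interface, instead of applying it to all interfaces, keeps the filter from affecting what is advertised toward the LAN side.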

Sorry for the long post and please get back, as this is an interesting case.

jackyoung · ‎08-11-2006

Olorunloba has suggested a good solution to fix the problem, i.e. filter out the unwanted routes and apply a control policy to the route advertisements.

However, you still need to plan which way and how the traffic flows from one site to another under the different scenarios.

Hope this helps.

dtran · ‎08-11-2006

Hi Olorunloba ! Thanks for your response !

The reason you see routes from SBC at the remote site is that the show command outputs I attached to the post were captured during normal operation, with both Sprint and SBC active.

Danny

olorunloba · ‎08-11-2006

OK, that clarifies things a little. If that is the case, the Sprint network should see their T1 as the best path to the remote site, and this should be advertised to the DR and HQ sites. According to the topology tables of HQ and DR, they are not receiving any advertisement from Sprint. Can you confirm why this is so? This could be the reason the network takes so long to converge, as Sprint may not already have the route in their tables.

I fully agree with Jackyoung that you really need a design that defines how you want your traffic paths to be in the different scenarios.