HSRP Issues

Unanswered Question
Mar 31st, 2009

I have had two 1900 series routers in production for the last two years.

Each router has two interfaces, LAN and WAN. HSRP is enabled on the LAN interface, with tracking of the WAN interface. The routers were tested for failover before going into production and worked fine.

Today the active HSRP router suddenly hung and I could not connect to it remotely. Surprisingly, the standby router did not become active. Since this was at a remote location, I asked the remote staff to shut down the faulty router, and only then did the HSRP switchover take place. After 10 minutes the faulty router was powered on and became active again.

I have no logs on the syslog server to identify the issue. How can I pinpoint it? It seems that when the router hung, it did not give up its HSRP priority.

thotsaphon Tue, 03/31/2009 - 05:55

avil,

It seems they were still sending and receiving hello packets while one of them was hung.

Maybe this link can help:

http://www.cisco.com/en/US/tech/tk648/tk362/technologies_tech_note09186a0080094afd.shtml

Let's take a look at the HSRP timer section.

Just a few questions: when the router hung, it could not forward anything, right? You could not telnet to it, but could you ping it?

HTH,

Toshi

avilt Tue, 03/31/2009 - 05:59

When the router hung, I could not telnet to it. I have one question on HSRP priority.

I have set the priority to 80 on the standby router. When the other (active) router hangs, by what value does it decrement its priority?

thotsaphon Tue, 03/31/2009 - 06:06

Avil,

Well, on the standby router you have set the priority to 80. It takes effect when the active router is gone: if the standby router misses all hello messages within the hold time you configured, it promotes itself to active.

So in your case you don't have any tracking applied on the active router. The priority you mention comes into play when the routers first elect an active router, and again when the old active router comes back to life after something on it has been refreshed. (grin)

HTH,

Toshi

avilt Tue, 03/31/2009 - 06:14

So how do I correct this problem? Shall I set the priority on the standby router to 95?

thotsaphon Tue, 03/31/2009 - 06:42

Avil,

I'm afraid that changing the priority will not solve the problem. The first thing to do is log on to the standby router while the active one is hung and use the "show standby" command to see what is going on. While the active router is hung, can you ping it from the standby router? I mean, to test connectivity on the segment. I'm not sure whether the router you cannot telnet to has really stopped forwarding everything, or whether the two routers are still sending and receiving HSRP packets.
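As a rough sketch of the checks suggested above, the following exec commands could be run on the standby router while the active one is unresponsive. The address x.x.90.3 is an assumption (it is the LAN address of the active router mentioned later in the thread); substitute your own.

```
! On the standby router, while the active router appears hung
! Which router does this side believe is active and standby?
show standby brief
! Timers, priority, tracking state and the state-change counter
show standby
! Assumed LAN address of the active router
ping x.x.90.3
! Look for HSRP state-change messages around the time of the failure
show logging
```

If the standby still lists the other router as active and its hold timer keeps being refreshed, hellos are still arriving, which would explain why no failover happens.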

Note: Don't forget the link I provided.

Toshi

avilt Tue, 03/31/2009 - 07:11

From the standby router I could ping the hung router's LAN interface, but I could not telnet to it. I do not have the output of the show standby command. I feel it's safe to set the priority of the standby router to 90 and leave the priority on the active router at the default.

Joseph W. Doherty Tue, 03/31/2009 - 08:37

What may have happened is that part of the hung router was still functioning and part wasn't. This happens very rarely, but when it does, you can encounter strange situations.

What sometimes helps to avoid this is running a later release of the same version, e.g. 12.2(18) rather than 12.2(4).

In later IOS versions there are additional features for defining some self-monitoring, although it can quickly become complex and I don't think it guarantees 100% problem avoidance.

avilt Tue, 04/07/2009 - 21:46

Today the active router hung again, and again the standby router did not take over. So I telnetted into the standby router, and from there I could telnet into the hung router and reboot it. During the reboot the standby router became active. I also took the log before the reboot. Kindly find the attached file. I could not find any useful information in the log.

avilt Tue, 04/07/2009 - 22:25

Yes, the command is enabled on both the routers. Is it causing the problem?

srinivas_816 Tue, 04/07/2009 - 22:54

"%PQ3_TSEC-5-LATECOLL: PQ3/FE(1), Late collision" --

Error Message

%PQ3_FE-5-LATECOLL : PQ3/FE([dec]/[dec]), Late collision

Explanation: Late collisions occurred on the Fast Ethernet interface.

Recommended Action: If the interface is Fast Ethernet, verify that both peers are in the same duplex mode. Otherwise, no action is required.

avilt Wed, 04/08/2009 - 03:46

I have been seeing this log message for a very long time, and it's just a notice.

avilt Wed, 04/08/2009 - 15:40

I also found that when the router hangs and I run show ip interface brief, the output shows both the LAN and WAN interfaces as up, but I am not able to ping the WAN interface from the hung router itself.

My current image is Cisco IOS Software, C181X Software (C181X-ADVIPSERVICESK9-M), Version 12.4(11)T2, RELEASE SOFTWARE (fc4).

Can I upgrade it to c181x-advipservicesk9-mz.124-24.T.bin?

avilt Wed, 04/08/2009 - 19:38

I am attaching the router configurations from both the active and standby routers. Kindly advise.

rpfinneran Sat, 04/11/2009 - 02:58

Hello,

You need to focus more on why the router is hanging, as opposed to why HSRP isn't working when it's hung.

When the router hangs, as you stated, all interfaces are still up/up. I would check CPU utilization (show proc cpu). When utilization gets high, the router must choose what to allocate the processor to, and unfortunately IP routing isn't high on the list, which may result in ICMP being dropped. However, that doesn't mean that the HSRP keepalives are dropped, which could explain why HSRP is not "working correctly".

I would investigate why the router is hanging (likely a bug). In the meantime, you could mitigate it by adding an IP SLA object and tracking ICMP reachability of the router that keeps hanging, since as you stated it is not pingable when hung.
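The attached configuration is not preserved here, so the following is only a hedged sketch of what IP SLA tracking for HSRP can look like, not the config that was posted. It assumes the probe runs on the active router, that 192.0.2.1 stands in for an address that stops answering when the router is in its failed state, and that object numbers 1 and 10 are free. Keywords vary by IOS release: older images use `ip sla monitor` with `type echo protocol ipIcmpEcho`, and newer ones use `track 10 ip sla 1 reachability` instead of `rtr`.

```
! Hypothetical sketch on the active HSRP router (group 60 on FastEthernet0)
! 192.0.2.1 is a placeholder probe target, not an address from this thread
ip sla 1
 icmp-echo 192.0.2.1 source-interface FastEthernet1
 frequency 10
ip sla schedule 1 life forever start-time now
!
! The track object follows the probe result
track 10 rtr 1 reachability
!
! Lower the HSRP priority by 10 while the probe is failing
interface FastEthernet0
 standby 60 track 10 decrement 10
```

For this to trigger a switchover, the standby router needs preemption enabled and a priority higher than the decremented value.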

I have attached the configs that could be used to mitigate this issue, though ultimately I suggest you resolve the router being hung (have I said this enough?). Also, as someone above noted, fix the duplex mismatch.

avilt Sun, 04/12/2009 - 21:06

Rpfinneran,

Thank you for the support. Let me give you my network topology and the troubleshooting steps I followed. The last time HSRP failed was during non-production hours, so I could try some troubleshooting steps.

1. I cannot say that the router hung, as I was able to telnet from the standby router to its LAN interface. Both routers are running OSPF, but I was unable to ping its WAN interface from its OSPF neighbours.

2. I tried making the standby router active by setting its priority to 105, but after some days even that router failed.

3. During the HSRP failure the router's CPU utilization was only around 20%.

4. When HSRP on the router fails, it retains its HSRP state as active and no log is shown on the console.

5. I ran the command show ip interface brief on the failed router. Both interfaces showed status and line protocol as up.

6. On the LAN side I was using a non-Cisco switch. I have now replaced it with a Cisco 2950. The routers have been working fine since 9 April 2009.

7. On the WAN interface I have the command "ip verify unicast reverse-path". When this command is applied I cannot ping the WAN interface from the router's own console. Is this the default behaviour?

Kindly let me know if it's an IOS issue.

Thank you.

rpfinneran Sun, 04/12/2009 - 22:22

Avilt, no problem...I hope we get this resolved.

"1. I cannot say that router got hung as I was able to telnet from the standby router to its LAN interface. Both are running OSPF but I was unable to ping its WAN interface from its OSPF neighbours"

Interesting. Did OSPF stay up? What other symptoms are you seeing when the router fails, and how are you determining that it is failing (is it based only on the inability to ping, or are other issues reported)? When the active router fails, you can telnet to it. Once you are in, are you able to ping its own LAN interface (90.3)? If not, this could help in terms of finding a way to mitigate. I would certainly try to document as much about the failure as possible, get a show tech, and open a TAC case. You need to find out why the routers are failing.

"4. When the HSRP on the router fails, it retains its HSRP state as active and no log is shown in the console."

Okay, so when the router fails, are you able to ping 90.3 from the router itself? If not, implement the mitigation technique I mentioned in a previous post.

"6. On the LAN side I was using non cisco switch. Currently I have replaced this switch with Cisco 2950. The routers are working fine since 09th April-2009"

It would be very interesting to see if this resolves the issue.

"7. On the WAN interface I have the command "ip verify unicast reverse-path" When this command is applied I cannot ping the WAN interface from its own console. Is it the dafault behaviour?"

Yes, this is normal. It is a simple but effective anti-spoofing control. See http://www.cisco.com/en/US/docs/ios/12_2/security/command/reference/srfrpf.html#wp1023632
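For readers who still want to ping the router's own WAN address with uRPF enabled, IOS offers an `allow-self-ping` keyword on the newer form of the command. The fragment below is a sketch that assumes the running release supports `ip verify unicast source reachable-via`. Cisco's command reference cautions that this keyword can open a denial-of-service hole, so treat it as a troubleshooting aid rather than a recommendation.

```
interface FastEthernet1
 ! Strict-mode uRPF (the same check as "ip verify unicast reverse-path"),
 ! but the router may ping its own interface address
 ip verify unicast source reachable-via rx allow-self-ping
```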

Keep me posted on whether the issue occurs again or not. As I said, if the router fails again, be sure to get a show tech and open a TAC case. Try to get details about what interfaces you can ping, OSPF states, BGP states (if running), etc. The more info the better the TAC will be able to assist.
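One possible capture list for the next occurrence, to be run on the failed router before it is rebooted. The interface name FastEthernet1 is taken from the thread; the rest is a generic selection, not a list required by TAC.

```
! Capture while the router is still in the failed state
show standby
show ip interface brief
show interfaces FastEthernet1
show ip ospf neighbor
show ip ospf interface FastEthernet1
show ip route ospf
show processes cpu sorted
show logging
show tech-support
```

Logging the terminal session to a file keeps the output available after the reboot.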

bdmas Sun, 04/12/2009 - 23:43

Have you configured the "priority" command on the active router? It seems there is some problem with the configuration.

avilt Sun, 04/12/2009 - 23:51

On the active router it takes the default priority of 100.

ACTIVE#show standby
FastEthernet0 - Group 60
  State is Active
    13 state changes, last state change 3d19h
  Virtual IP address is x.x.90.2
  Active virtual MAC address is 0000.0c07.ac3c
    Local virtual MAC address is 0000.0c07.ac3c (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 0.380 secs
  Preemption enabled
  Active router is local
  Standby router is x.x.90.1, priority 96 (expires in 9.388 sec)
  Priority 100 (default 100)
    Track interface FastEthernet1 state Up decrement 10
  IP redundancy name is "hsrp-Fa0-60" (default)
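Reading the output above: the active router runs at priority 100 and tracks FastEthernet1 with a decrement of 10, so it drops to 90 only when FastEthernet1 goes down. That is below the standby's 96, so the standby can take over in that case. In the failure described in this thread both interfaces stayed up/up, so no decrement was applied. A configuration consistent with this output might look like the sketch below. It is reconstructed from the output, not from the attached configs, and preemption on the standby router is an assumption.

```
! Active router (LAN = FastEthernet0, WAN = FastEthernet1)
! Priority is left at the default of 100; 100 - 10 = 90 when FastEthernet1 is down
interface FastEthernet0
 standby 60 ip x.x.90.2
 standby 60 preempt
 standby 60 track FastEthernet1 10
!
! Standby router
! 96 > 90, so this router preempts once the active router's tracked interface fails
interface FastEthernet0
 standby 60 ip x.x.90.2
 standby 60 priority 96
 standby 60 preempt
```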

avilt Mon, 04/13/2009 - 16:32

Rpfinneran,

Your feedback is helping me find the root cause of this issue. So far it's running fine without any issue.

The HSRP site is at a remote location connected through WAN links to 4 sites. The active router is connected to ISP-B running OSPF area 1, and the standby router is connected to ISP-A, area 0. Please note that I have configured the passive-interface command for the LAN interface on the routers.

At our site we have a ping monitor tool which pings both the LAN and WAN interfaces of the router. During the HSRP issue the ping monitor reports both interfaces as down. Also, the clients at the remote HSRP site are not able to communicate with other sites. When the HSRP issue occurs I log in to the standby router, and from there I either telnet to the problematic router and reboot it, or increase the priority of the standby router to solve the issue.

Now my guess is that it's an OSPF issue. The last time the router had the problem, I ran the command "show ip interface brief". It showed both interface and protocol as up, but the show log output had the entries below.

004864: .Apr 8 04:20:47: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.252 on FastEthernet1 fr
  Down: Dead timer expired
004865: .Apr 8 04:20:49: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.9 on FastEthernet1 fro
  Down: Dead timer expired
004866: .Apr 8 04:20:51: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.1 on FastEthernet1 fro
  Down: Dead timer expired

Unfortunately I could not perform more OSPF tests. Is there any possibility of OSPF going down even when both interface and protocol are up? I have configured a loopback address on the router for OSPF.

If the WAN interface goes down, the router will definitely decrement its HSRP priority, but that did not happen during the HSRP issue. Instead it lost OSPF communication with its neighbours, and even though its HSRP state was active it did not have a route to reach the destination.

Even though it's working fine now, if the problem happens again, how can I mitigate this issue for good?

I have also performed a manual test by removing the WAN cable on the active router during which the active router decreased its HSRP priority and the standby router took over.

rpfinneran Mon, 04/13/2009 - 23:08

Avilt,

No problem. I am sure we can get this resolved. Is it possible for you to make a drawing (Visio or Paint) of how this whole thing is laid out? I am a little confused: it looks like your router normally has two neighbors on Fa1, but you said you have the LAN interfaces as passive? Also, I don't understand why your two routers are in different OSPF areas. If you could make a drawing and attach the OSPF and interface configs with the IP addresses scrubbed, that would help.

naderzaman Tue, 04/14/2009 - 14:27

On the "active router" you have duplex/speed as "auto". On the "standby router" you have hard coded speed:100 and duplex: full. This is a NO NO. Both sides must be configured exactly the same. When one side is configured as auto and the other hard coded, the auto side will always negotiate to half-duplex. I am assuming that this router is connected to a switch or hub as the means of connecting the local PCs. If there is a switch, check the duplex/speed settings on the ports. Make sure both sides are configured as auto.
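When one end autonegotiates and the other is hard-coded, the autonegotiating end can usually detect the speed but cannot learn the duplex setting, so it falls back to half duplex. As a rough sketch, the settings could be checked and aligned as follows. FastEthernet1 is used as the example interface, and the choice between the two options depends on what the device at the other end of the link supports.

```
! Exec mode: check the negotiated duplex and the collision counters
show interfaces FastEthernet1
!
! Option A (configuration mode): autonegotiate; the peer must also be set to auto
interface FastEthernet1
 speed auto
 duplex auto
!
! Option B (configuration mode): hard-code; the peer needs identical settings
interface FastEthernet1
 speed 100
 duplex full
```

Only one of the two options should be applied, and always the same one on both ends of a link.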

rpfinneran Tue, 04/14/2009 - 22:20

I agree with naderzaman, with one small correction. GigabitEth interfaces will default to 100 Full, not half.

Either way, it has been said several times in this forum that you should fix this duplex mismatch; it could absolutely be causing issues. Please correct it and provide a drawing if you can.

avilt Wed, 04/15/2009 - 03:09

I will upload the network diagram soon.

This network has been in production for the last 2 years without any issues. On the switch side it's set to auto/auto. I will try setting it to auto on the router interface side as well.

avilt Wed, 04/15/2009 - 20:53

Rpfinneran,

I am uploading the OSPF topology diagram.

ISP-1 = KD

ISP-2 = JT

The HSRP issue is at location D.

On those routers, the LAN segment interfaces have negotiated 100 Mbps/full duplex, but the WAN segment has negotiated 100 Mbps/half duplex. I will hard code it to full duplex.

Also let me know your idea on OSPF topology. Do I need to join OSPF area 0 and area 1 at any other location?

rpfinneran Wed, 04/15/2009 - 23:14

Avilt,

Great drawing, this is very helpful.

At site D, on KD router, what area is 90.1 interface in? On JT router, what area is 90.3 in?

As for the OSPF design: why did you choose to use multiple areas for this? I would probably have put everything in area 0 and simply set the OSPF cost of all interfaces that you have in area 1 to something much higher than the primary interfaces. In your current design, you could end up with a discontiguous area 0, depending on how you answer my question.

avilt Wed, 04/15/2009 - 23:34

Except at the Main Office, OSPF is running only on the WAN interfaces. OSPF is not enabled on the LAN interfaces.

Initially we had 14 branch offices connected through OSPF in 2 areas. Now that has been reduced to just 4 locations. Maybe I should put everything in one area. In that case, should I enable OSPF on the LAN interface at each location?

rpfinneran Wed, 04/15/2009 - 23:37

So, how are your LAN segments advertised into OSPF?

Specifically, the LAN segment 90.x?

Can you provide the JT and KD OSPF configs for Site D?

avilt Thu, 04/16/2009 - 04:14

I am attaching the configuration. OSPF authentication is not enabled on the LAN interface. Also, it's defined as a passive interface.

rpfinneran Thu, 04/16/2009 - 04:23

Exactly what I thought. You have created a discontiguous area 0. What happens is that when OSPF on the primary router goes down, as you noted, HSRP works now and the other router becomes active. But now on the LAN you have area 0 trying to traverse area 1 to get to area 0.

avilt Thu, 04/16/2009 - 04:49

Please elaborate. I did not understand your explanation.

Is there any possibility of OSPF going down on the active router even when the WAN link is active? When the WAN link goes down, the active router gives up its HSRP state.

So what is the solution? Shall I define OSPF authentication on the LAN interfaces as well, so that route exchange between Area 0 and Area 1 happens on the LAN segment?

So what is the root cause for OSPF going down even when the link is active? It's a stable 100 Mbps link.

rpfinneran Sun, 04/19/2009 - 21:22

I will get you a detailed explanation and solution today. Sorry for the delay, I have been out of town.

Thanks,

Ryan

rpfinneran Mon, 04/20/2009 - 02:24

Avilt,

Sorry for the confusion, I mis-read your configs. I see that you have JT listed as the active router now.

One thing to note is this: as traffic from your Main Office attempts to reach Site D, it will go in via the KD router.

This is because the KD router is advertising the LAN subnet (90.0) into area 0, whereas the JT router is advertising it into area 1. OSPF always prefers intra-area routes over inter-area routes. The return traffic from D to the Main Office will use area 1, since the JT router is active.

At any rate, this is not an approach that I would prefer. I would make the following changes...

First, ensure you resolve any duplex mismatches as we have discussed above. You shouldn't see collisions incrementing on the interfaces. Next, hard code the OSPF cost of all KD WAN interfaces to 1, and all JT WAN interfaces to 100. Next, move all backup (JT) connections to area 0. Then move the LAN connections on the JT routers to area 0, and allow OSPF to form an adjacency between the KD and JT routers at each site. Finally, restore your HSRP settings so that KD is active and JT is standby.

I have attached a drawing. I believe this will correct any asymmetric routing and simplify your design. There is really no need to have multi-area OSPF with only four sites. I hope this helps,
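The cost and area changes listed above might look roughly like this on the two Site D routers. This is a sketch under assumptions: the process ID 65182 is taken from the log lines earlier in the thread, FastEthernet1 is the WAN interface, the LAN is assumed to be a /24, and the WAN network statement is left as a placeholder because the real subnets are not shown.

```
! KD router at Site D (primary path)
interface FastEthernet1
 ip ospf cost 1
!
! JT router at Site D (backup path)
interface FastEthernet1
 ip ospf cost 100
!
! On both routers: WAN and LAN networks placed in area 0
router ospf 65182
 network <wan-subnet> <wildcard> area 0
 network x.x.90.0 0.0.0.255 area 0
```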

Ryan

avilt Mon, 04/20/2009 - 04:05

Thank you very much for the detailed diagram. What is the reason for hard coding the OSPF cost of all KD WAN interfaces to 1, and all JT WAN interfaces to 100?

Currently I cannot merge the 2 areas into one for non-technical reasons. I will try to merge them later. Can I just define OSPF authentication on the LAN interfaces at all locations so that they form an adjacency?

I will close this case after one week. The issue is most likely with the non-Cisco switch that I was using on the LAN side.

rpfinneran Mon, 04/20/2009 - 04:11

I tend to agree.

The reason to hard code costs is to make one link the primary, and the other a backup. The higher cost links will be backups.

You need to do authentication and also remove the passive-interface command to allow adjacency to form (do this with the new design, you cannot do it in your current design since the routers have different areas for the LAN interface).
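A minimal sketch of that step, assuming both Site D routers already have their LAN interface (FastEthernet0) in area 0, that MD5 interface authentication is wanted (the thread does not say which type is in use), and that `<shared-key>` is a placeholder for the same key on both routers:

```
! Both Site D routers, once their LAN interfaces are in the same area
router ospf 65182
 no passive-interface FastEthernet0
!
interface FastEthernet0
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 <shared-key>
```

Afterwards, `show ip ospf neighbor` on either router should list the other one on FastEthernet0.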

Good luck,

Ryan

avilt Mon, 04/20/2009 - 17:44

Final clarifications:

1. I will use the default cost on both routers so that all the routers and links are utilized.

2. At site D, on the LAN interfaces, I will enable OSPF authentication and remove the passive-interface command so that there is OSPF exchange between Area 0 and Area 1.

Hope it's not going to create any problems.

rpfinneran Mon, 04/20/2009 - 20:34

Avilt,

OSPF will only form a neighborship if both sides are in the same area. So, if you want to setup OSPF between the two routers at Site D on the LAN, you must put both into the same area (0).

Also, the default cost on all interfaces could result in some interesting routing. If you have two separate ISPs, you may want to be very careful how you utilize both. Some applications are very sensitive to jitter and delay, so be sure to route this traffic in a consistent manner.
