Re: HSRP Issues - Page 2

avilt · ‎03-31-2009

I have two 1900 series routers in production for the last two years.

The routers have 2 interfaces LAN & WAN. HSRP is enabled on LAN interface with link monitoring for WAN interface. Routers were tested for failover before putting into production and worked fine.

Today the active HSRP router suddenly hung and I could not connect to it remotely. Surprisingly, the standby router did not become active. Since this was at a remote location, I asked the remote staff to shut down the faulty router. Only then did the HSRP switchover take place. After 10 minutes the faulty router was powered on and became active again.

I have no logs on syslog server to identify the issue. How can I pinpoint this issue? It seems like when the router got hung it did not give up its HSRP priority.

rpfinneran · ‎04-11-2009

Hello,

You need to focus more on why the router is hanging as opposed to why HSRP isn't working when it's hung.

When the router gets hung, as you stated, all interfaces are still up/up. I would check CPU utilization (show proc cpu). When utilization gets high, the router must choose what to allocate the processor to, and unfortunately IP routing isn't high on the list, which may result in ICMP being dropped. However, that doesn't mean that the HSRP keepalives are dropped, which could explain why HSRP is not "working correctly".

I would investigate why the router is getting hung (likely a bug). In the meantime, you could mitigate it by adding an IP SLA object and tracking ICMP reachability of the router that keeps getting hung, since as you stated it is not pingable when hung.

I have attached the configs that could be used to mitigate this issue, though ultimately I suggest you resolve the router being hung (have I said this enough?). Also, as someone above noted, fix the duplex mismatch.
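
A rough sketch of what IP SLA tracking tied to HSRP can look like follows. It is one common variant of the idea, not the attached config: the SLA and track numbers (10), the probe target 192.0.2.1 (a documentation address standing in for an upstream neighbour reached through the WAN interface) and the probe frequency are all assumptions. Older 12.4 images use `ip sla monitor` and `track ... rtr` instead of the syntax shown here.

```
! Probe an upstream address through the WAN interface every 10 seconds
! (192.0.2.1 is a placeholder, not an address from this thread)
ip sla 10
 icmp-echo 192.0.2.1 source-interface FastEthernet1
 frequency 10
ip sla schedule 10 life forever start-time now
!
! Track object goes Down when the probe stops getting replies
track 10 ip sla 10 reachability
!
! Lower the HSRP priority by 10 while the track object is Down
interface FastEthernet0
 standby 60 preempt
 standby 60 track 10 decrement 10
```

The decrement only helps if it takes the active router's priority below the standby's, and if the standby router also has preemption enabled.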

rpfinneran · ‎04-11-2009

Here is the attachment...you may need to tweak the configs just slightly...

avilt · ‎04-12-2009

Rpfinneran,

Thank you for the support. Let me give you my network topology and the troubleshooting steps I followed. The last time HSRP failed it was during non-production hours, so I could try some troubleshooting steps.

1. I cannot say that router got hung as I was able to telnet from the standby router to its LAN interface. Both are running OSPF but I was unable to ping its WAN interface from its OSPF neighbours.

2. I have tried making the standby router active by setting its priority to 105, but after some days even that router failed.

3. During the HSRP hung period the router CPU utilization was only around 20%.

4. When the HSRP on the router fails, it retains its HSRP state as active and no log is shown in the console.

5. I executed the command show ip interface brief on the failed router. Both interfaces showed interface/line protocol as up.

6. On the LAN side I was using a non-Cisco switch. I have now replaced it with a Cisco 2950. The routers have been working fine since 9 April 2009.

7. On the WAN interface I have the command "ip verify unicast reverse-path". When this command is applied I cannot ping the WAN interface from its own console. Is this the default behaviour?

Kindly let me know if it's an IOS issue.

Thank you.

rpfinneran · ‎04-12-2009

Avilt, no problem...I hope we get this resolved.

"1. I cannot say that router got hung as I was able to telnet from the standby router to its LAN interface. Both are running OSPF but I was unable to ping its WAN interface from its OSPF neighbours"

Interesting. Did OSPF stay up? What other symptoms are you seeing when the router fails, and how are you determining that it is failing (is it based only on the inability to ping, or are other issues reported)? When the active router fails, you can telnet to it. Once you are in, are you able to ping its own LAN interface (90.3)? If not, this could help in terms of finding a way to mitigate. I would certainly try to document as much about the failure as possible, get a show tech, and open a TAC case. You need to find out why the routers are failing.

"4. When the HSRP on the router fails, it retains its HSRP state as active and no log is shown in the console."

Okay, so when the router fails, are you able to ping 90.3 from the router itself? If not, implement the mitigation technique I mentioned in a previous post.

"6. On the LAN side I was using a non-Cisco switch. I have now replaced it with a Cisco 2950. The routers have been working fine since 9 April 2009."

It would be very interesting to see if this resolves the issue.

"7. On the WAN interface I have the command "ip verify unicast reverse-path". When this command is applied I cannot ping the WAN interface from its own console. Is this the default behaviour?"

Yes, this is normal. It is a simple but effective anti-spoofing control. See http://www.cisco.com/en/US/docs/ios/12_2/security/command/reference/srfrpf.html#wp1023632
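
If pinging the router's own WAN address is needed for troubleshooting, the newer form of the uRPF command has an `allow-self-ping` keyword. A minimal sketch, assuming the WAN interface is FastEthernet1 as elsewhere in this thread and that the running image supports this syntax. Cisco's documentation cautions that allowing self-ping can open a denial-of-service hole, so it is usually left off outside of troubleshooting:

```
! Strict-mode uRPF (the newer equivalent of "ip verify unicast reverse-path"),
! with self-ping permitted
interface FastEthernet1
 ip verify unicast source reachable-via rx allow-self-ping
```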

Keep me posted on whether the issue occurs again or not. As I said, if the router fails again, be sure to get a show tech and open a TAC case. Try to get details about what interfaces you can ping, OSPF states, BGP states (if running), etc. The more info the better the TAC will be able to assist.
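
As a rough illustration of the data collection described above, the following could be run on the failing router before it is rebooted. This is a suggested set of standard exec commands, not an official TAC checklist, and the ping target is the LAN address as masked in this thread:

```
show standby
show ip interface brief
show ip ospf neighbor
show ip route
show processes cpu sorted
show logging
ping x.x.90.3
show tech-support
```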

bdmas · ‎04-12-2009

Have you configured the "priority" command on the active router? It seems there is some problem with the configuration.

rpfinneran · ‎04-12-2009

Not necessary, the default priority is 100.

avilt · ‎04-12-2009

On active router it takes default priority 100.

ACTIVE#show standby
FastEthernet0 - Group 60
  State is Active
    13 state changes, last state change 3d19h
  Virtual IP address is x.x.90.2
  Active virtual MAC address is 0000.0c07.ac3c
    Local virtual MAC address is 0000.0c07.ac3c (v1 default)
  Hello time 3 sec, hold time 10 sec
    Next hello sent in 0.380 secs
  Preemption enabled
  Active router is local
  Standby router is x.x.90.1, priority 96 (expires in 9.388 sec)
  Priority 100 (default 100)
    Track interface FastEthernet1 state Up decrement 10
  IP redundancy name is "hsrp-Fa0-60" (default)
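
Read against this output, the failover arithmetic is: the active router's priority of 100 minus the tracked-interface decrement of 10 gives 90, which is below the standby's 96, so the standby can take over once FastEthernet1 goes down (provided the standby also has preemption enabled, which this output does not show). A configuration that would produce output like the above might look roughly as follows. It is inferred from the show standby output, not copied from the router, and the standby router's `priority 96` line is likewise an assumption about how that value was set:

```
! Active router (priority left at the default of 100)
interface FastEthernet0
 standby 60 ip x.x.90.2
 standby 60 preempt
 standby 60 track FastEthernet1 10
!
! Standby router (assumed): same group, lower priority, preemption enabled
interface FastEthernet0
 standby 60 ip x.x.90.2
 standby 60 priority 96
 standby 60 preempt
```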

avilt · ‎04-13-2009

Rpfinneran,

Your feedback is helping me find the root cause of this issue. So far it's running fine without any issue.

The HSRP site is at the remote location connected thru WAN links to 4 sites. Active router connected to ISP-B running OSPF area 1 and Standby router is connected to ISP-A, area 0. Please note that I have configured passive interface command on LAN interface on the routers.

On our site we have a ping monitor tool which pings both the LAN and WAN interfaces of the router. During the HSRP issue the ping monitor reports both interfaces as down. Also, the clients at the remote HSRP site are not able to communicate with other sites. When the HSRP issue occurs I log in to the standby router and from there I either telnet to the problematic router and reboot it, or increase the priority of the standby router to solve the issue.

Now my guess is that it's an OSPF issue. Last time the router had the problem, I executed the command "show ip interface brief". It showed both interface/protocol as up, but show log had the below entry.

004864: .Apr 8 04:20:47: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.252 on FastEthernet1 fr

Down: Dead timer expired

004865: .Apr 8 04:20:49: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.9 on FastEthernet1 fro

Down: Dead timer expired

004866: .Apr 8 04:20:51: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.1 on FastEthernet1 fro

Down: Dead timer expired

Unfortunately I could not perform more OSPF tests. Is there any possibility of OSPF going down even when both interface and protocol are up? I have configured a loopback address on the router for OSPF.

If the WAN interface goes down, the router will definitely decrement its HSRP priority; that did not happen during the HSRP issue. Instead, the router lost OSPF communication with its neighbours, so even though its HSRP state was active it had no route to reach the destination.

Even though it's working fine now, if the problem happens again how can I mitigate this issue for good?

I have also performed a manual test by removing the WAN cable on the active router during which the active router decreased its HSRP priority and the standby router took over.
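
The failure described in this post (interface up/up, OSPF neighbours lost, no usable route) is the case that plain interface tracking cannot see. One possible mitigation, sketched under assumptions, is to track a route learned via OSPF rather than the interface state. The track number 20 and the prefix 198.51.100.0/24 (a documentation range standing in for a remote-site network learned only over the WAN) are hypothetical, and syntax support depends on the IOS release:

```
! Object is Up only while a route to the remote prefix is in the routing table
track 20 ip route 198.51.100.0/24 reachability
!
! Lower the HSRP priority by 10 while that route is missing
interface FastEthernet0
 standby 60 track 20 decrement 10
```

With the priorities shown earlier in the thread (100 on the active router, 96 on the standby), a decrement of 10 is enough to let the standby preempt when the route disappears.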

rpfinneran · ‎04-13-2009

Avilt,

No problem. I am sure we can get this resolved. Is it possible for you to make a drawing (Visio or Paint) of how this whole thing is laid out? I am a little confused: it looks like your router normally has two neighbors on Fa1, but you said you have the LAN interfaces as passive? Also, I don't understand why your two routers are in different OSPF areas. If you could make a drawing and attach the OSPF / interface configs with the IP addresses scrubbed, that would help.

naderzaman · ‎04-14-2009

On the "active router" you have duplex/speed as "auto". On the "standby router" you have hard coded speed:100 and duplex: full. This is a NO NO. Both sides must be configured exactly the same. When one side is configured as auto and the other hard coded, the auto side will always negotiate to half-duplex. I am assuming that this router is connected to a switch or hub as the means of connecting the local PCs. If there is a switch, check the duplex/speed settings on the ports. Make sure both sides are configured as auto.

rpfinneran · ‎04-14-2009

I agree with naderzaman, with one small correction. GigabitEth interfaces will default to 100 Full, not half.

Either way, it has been said several times in this forum that you should fix this duplex mismatch, it could absolutely be causing issues. Please correct it and provide a drawing if you can.
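
A minimal sketch of making the two ends agree, assuming the router port in question is FastEthernet0 (the thread does not say which interface carried the mismatch). Either leave both the router and the switch port on auto, or hard-code both; the only point is that the two ends match:

```
! Option A: autonegotiate on the router (switch port must also be auto)
interface FastEthernet0
 speed auto
 duplex auto
!
! Option B: hard-code on the router (switch port must be hard-coded identically)
interface FastEthernet0
 speed 100
 duplex full
```

The negotiated result can then be checked with `show interfaces FastEthernet0 | include duplex`.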

avilt · ‎04-15-2009

I will upload the network diagram soon.

This network has been in production for the last 2 years without any issues. On the switch side it's set to auto/auto. I will try to set it to auto on the interface side as well.

avilt · ‎04-15-2009

Rpfinneran,

I am uploading the OSPF topology diagram.

ISP-1 = KD

ISP-2 = JT

The HSRP issue is at location D.

On those routers the LAN segment interfaces have negotiated to 100 Mbps/full duplex, but on the WAN segment they have negotiated to 100 Mbps/half duplex. I will hard code it to full duplex.

Also let me know your idea on OSPF topology. Do I need to join OSPF area 0 and area 1 at any other location?

rpfinneran · ‎04-15-2009

Avilt,

Great drawing, this is very helpful.

At site D, on KD router, what area is 90.1 interface in? On JT router, what area is 90.3 in?

As for the OSPF design: why did you guys choose to use multiple areas for this? I would have probably put everything in area 0, and simply set the OSPF cost of all interfaces that you have in area 1 to something much higher than the primary interfaces. In your current design, you could end up with a discontiguous area 0 depending on how you answer my question.
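
A rough sketch of the single-area approach suggested here, assuming the OSPF process ID 65182 seen in the log output earlier in the thread. The network statement and the cost value of 1000 are placeholders, not values from this network:

```
! Place the WAN link in area 0 instead of area 1
! (192.0.2.0/24 is a placeholder for the real WAN subnet)
router ospf 65182
 network 192.0.2.0 0.0.0.255 area 0
!
! Raise the cost on the backup-path interface so it is preferred
! only when the primary path is unavailable
interface FastEthernet1
 ip ospf cost 1000
```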

avilt · ‎04-15-2009

Except at the Main Office, OSPF is running only on WAN interfaces. OSPF is not enabled on the LAN interfaces.

Initially we had 14 branch offices connected through OSPF in 2 areas. Now that is reduced to just 4 locations. Maybe I should put everything in one area. In that case, should I enable OSPF on the LAN interface at each location?