I have had two 1900 series routers in production for the last two years.
The routers have two interfaces, LAN and WAN. HSRP is enabled on the LAN interface with link monitoring for the WAN interface. The routers were tested for failover before going into production and worked fine.
Today the active HSRP router suddenly hung and I could not connect to it remotely. Surprisingly, the standby router did not become active. Since this was at a remote location, I asked the remote staff to shut down the faulty router, and only then did the HSRP switchover take place. After 10 minutes the faulty router was powered on and became active again.
I have no logs on the syslog server to identify the issue. How can I pinpoint it? It seems that when the router hung it did not give up its HSRP priority.
It seems they were still sending and receiving hello packets while one of them was hung.
Maybe this link can help you.
Let's take a look at the HSRP timer section.
I just want to ask a few questions. When the router hangs, it cannot forward anything, right? You cannot telnet to it. Can you ping it?
When the router hung, I could not telnet to it. I have one question on HSRP priority.
I have set the priority to 80 on the standby router, so when the other (active) router hangs, by what value does it decrement its priority?
Well, on the standby router you have set the priority to 80. It takes effect when the active router is gone. Let's say the standby router misses all hello messages for the hold time you configured. It will then promote itself to active.
So in your case you don't have any tracking applied on the active router. The priority you mention comes into play when the routers first elect the active router, and again when the old active router comes back to life after you refresh something on it. (grin)
I'm afraid that changing the priority will not solve the problem. The first thing to do is log on to the standby router while the active one is hung and use the "show standby" command to see what is going on. While the active router is hung, can you ping it from the standby router? I mean, to test connectivity on the segment. I'm not sure whether the router you cannot telnet to has really stopped forwarding everything, or whether the two are still sending and receiving HSRP packets.
Note: Don't forget the link I provided.
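To make the hold-timer point concrete, here is a minimal Python sketch of the standby router's decision. It is only a model, not IOS code: the function name and the list of hello timestamps are made up for illustration, and the 10-second hold time is the HSRP default. It shows why a half-hung active router that keeps sending hellos never triggers a takeover.

```python
def standby_takes_over(hello_times, now, hold_time=10.0):
    """Return True if the standby router should promote itself.

    hello_times: timestamps (seconds) at which hellos from the active
                 router were received (hypothetical input for this model).
    now:         current time in seconds.
    hold_time:   HSRP hold time; 10 seconds is the default.
    """
    if not hello_times:
        # No hello was ever heard, so nothing holds the standby back.
        return True
    # The hold timer restarts on every hello received from the active router.
    return now - max(hello_times) >= hold_time


# Active router is "hung" for telnet but still sends a hello every 3 seconds.
still_sending = [t * 3.0 for t in range(11)]   # 0, 3, ..., 30
# Active router is powered off after the hello at t = 12.
powered_off = [0.0, 3.0, 6.0, 9.0, 12.0]

print(standby_takes_over(still_sending, now=31.0))  # False
print(standby_takes_over(powered_off, now=31.0))    # True
```

In other words, as long as hellos keep arriving, the standby router has no reason to act, no matter what else is broken on the active router.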
From the standby router I could ping the hung router's LAN interface, but I could not telnet to it. I do not have the output of the show standby command. I feel it's safe to set the priority of the standby router to 90 and leave the priority on the active router at the default.
What may have happened is that part of the hung router was still functioning and part wasn't. This happens very rarely, but when it does, you can encounter strange situations.
What sometimes helps to avoid this is running a later release of the same version, e.g. 12.2(4) vs. 12.2(18).
In later IOS versions there are additional features for defining some self-monitoring, although it can quickly become complex and I don't think it will guarantee 100% problem avoidance.
Today the active router hung again and the standby router did not take over. So I telnetted into the standby router, and from there I could telnet into the hung router and reboot it. During the reboot the standby router became active. I also took the log before the reboot. Kindly find the attached file. I could not find any useful information in the log.
"%PQ3_TSEC-5-LATECOLL: PQ3/FE(1), Late collision" --
%PQ3_FE-5-LATECOLL : PQ3/FE([dec]/[dec]), Late collision
Explanation Late collisions occurred on the Fast Ethernet interface.
Recommended Action If the interface is Fast Ethernet, verify that both peers are in the same duplex mode. Otherwise, no action is required.
What I also found is that when the router hangs and I execute the command show ip interface brief, the output shows both the LAN and WAN interfaces as up, but I am not able to ping the WAN interface from the hung router itself.
My current image is Cisco IOS Software, C181X Software (C181X-ADVIPSERVICESK9-M), Version 12.4(11)T2, RELEASE SOFTWARE (fc4).
Can I upgrade it to c181x-advipservicesk9-mz.124-24.T.bin?
You need to focus more on why the router is hanging as opposed to why HSRP isn't working when it's hung.
When the router hangs, as you stated, all interfaces are still up/up. I would check CPU utilization (show proc cpu). When utilization gets high, the router must choose what to allocate the processor to, and unfortunately IP routing isn't high on the list, which may result in ICMP being dropped. However, that doesn't mean the HSRP keepalives are dropped, which could explain why HSRP is not "working correctly".
I would investigate why the router is hanging (likely a bug). In the meantime, you could mitigate it by adding an IP SLA object and tracking ICMP reachability on the router that keeps hanging, since, as you stated, it is not pingable when hung.
I have attached the configs that could be used to mitigate this issue, though ultimately I suggest you resolve the router being hung (have I said this enough?). Also, as someone above noted, fix the duplex mismatch.
Thank You for the support. Let me give you my network topology and the troubleshooting steps I followed. Last time the HSRP failed again during non production hours and I could try some troubleshooting steps.
1. I cannot say that the router hung, as I was able to telnet from the standby router to its LAN interface. Both are running OSPF, but I was unable to ping its WAN interface from its OSPF neighbours.
2. I tried making the standby router active by setting its priority to 105, but after some days even that router failed.
3. During the HSRP hang the router's CPU utilization is only around 20%.
4. When HSRP on the router fails, it retains its HSRP state as active and no log is shown on the console.
5. I executed the command show ip interface brief on the failed router. Both interfaces showed status and line protocol as up.
6. On the LAN side I was using a non-Cisco switch. I have now replaced this switch with a Cisco 2950. The routers have been working fine since 9 April 2009.
7. On the WAN interface I have the command "ip verify unicast reverse-path". When this command is applied I cannot ping the WAN interface from its own console. Is this the default behaviour?
Kindly let me know if it's an IOS issue.
Avilt, no problem...I hope we get this resolved.
"1. I cannot say that router got hung as I was able to telnet from the standby router to its LAN interface. Both are running OSPF but I was unable to ping its WAN interface from its OSPF neighbours"
Interesting. Did OSPF stay up? What other symptoms are you seeing when the router fails? How are you determining that it is failing (is it based only on the inability to ping, or are other issues reported)? When the active router fails, you can telnet to it. Once you are in, are you able to ping its own LAN interface (90.3)? If not, this could help in terms of finding a way to mitigate. I would certainly try to document as much about the failure as possible, get a show tech, and open a TAC case. You need to find out why the routers are failing.
"4. When the HSRP on the router fails, it retains its HSRP state as active and no log is shown in the console."
Okay, so when the router fails, are you able to ping 90.3 from the router itself? If not, implement the mitigation technique I mentioned in a previous post.
"6. On the LAN side I was using non cisco switch. Currently I have replaced this switch with Cisco 2950. The routers are working fine since 09th April-2009"
It would be very interesting to see if this resolves the issue.
"7. On the WAN interface I have the command "ip verify unicast reverse-path" When this command is applied I cannot ping the WAN interface from its own console. Is it the dafault behaviour?"
Yes, this is normal. It is a simple but effective anti-spoofing control. See http://www.cisco.com/en/US/docs/ios/12_2/security/command/reference/srfrpf.html#wp1023632
Keep me posted on whether the issue occurs again or not. As I said, if the router fails again, be sure to get a show tech and open a TAC case. Try to get details about what interfaces you can ping, OSPF states, BGP states (if running), etc. The more info the better the TAC will be able to assist.
On the active router it takes the default priority of 100.
FastEthernet0 - Group 60
State is Active
13 state changes, last state change 3d19h
Virtual IP address is x.x.90.2
Active virtual MAC address is 0000.0c07.ac3c
Local virtual MAC address is 0000.0c07.ac3c (v1 default)
Hello time 3 sec, hold time 10 sec
Next hello sent in 0.380 secs
Active router is local
Standby router is x.x.90.1, priority 96 (expires in 9.388 sec)
Priority 100 (default 100)
Track interface FastEthernet1 state Up decrement 10
IP redundancy name is "hsrp-Fa0-60" (default)
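The numbers in this output can be worked through with a short Python sketch. This is an illustrative model only: the function names are made up, preemption is assumed to be enabled on the standby router (the output above does not show it), and the tie-break on equal priority is not modelled.

```python
def effective_priority(configured, tracked):
    """Configured HSRP priority minus the decrement of every tracked
    object that is down. tracked is a list of (is_up, decrement) pairs."""
    return configured - sum(dec for is_up, dec in tracked if not is_up)


def standby_preempts(active_priority, standby_priority, preempt=True):
    """A standby router with preemption takes over only when its priority
    is strictly higher than the active router's (ties not modelled)."""
    return preempt and standby_priority > active_priority


STANDBY = 96  # priority of x.x.90.1 in the output above

# FastEthernet1 up: the active router keeps priority 100.
wan_up = effective_priority(100, [(True, 10)])
# FastEthernet1 down: 100 - 10 = 90, which is below 96.
wan_down = effective_priority(100, [(False, 10)])

print(wan_up, standby_preempts(wan_up, STANDBY))      # 100 False
print(wan_down, standby_preempts(wan_down, STANDBY))  # 90 True
```

If the router misbehaves while FastEthernet1 stays up/up, the first case applies: the priority stays at 100 and the standby router has no reason to take over.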
Your feedback is helping me find the root cause of this issue. So far it's running fine without any issue.
The HSRP site is at a remote location connected through WAN links to 4 sites. The active router is connected to ISP-B running OSPF area 1, and the standby router is connected to ISP-A, area 0. Please note that I have configured the passive interface command on the LAN interface of the routers.
On our site we have a ping monitor tool which pings both the LAN and WAN interfaces of the router. During the HSRP issue the ping monitor reports both interfaces as down. Also, the clients at the remote HSRP site are not able to communicate with other sites. When the HSRP issue occurs I log in to the standby router, and from there I either telnet to the problematic router and reboot it or increase the priority of the standby router to solve the issue.
Now my guess is that it's an OSPF issue. Last time the router had the problem, I executed the command "show ip interface brief". It showed both interface and protocol as up, but the show log had the entries below.
004864: .Apr 8 04:20:47: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.252 on FastEthernet1 fr
Down: Dead timer expired
004865: .Apr 8 04:20:49: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.9 on FastEthernet1 fro
Down: Dead timer expired
004866: .Apr 8 04:20:51: %OSPF-5-ADJCHG: Process 65182, Nbr X.X.X.1 on FastEthernet1 fro
Down: Dead timer expired
Unfortunately I could not perform more OSPF tests. Is there any possibility of OSPF going down even when both the interface and the line protocol are up? I have configured a loopback address on the router for OSPF.
If the WAN interface goes down, the router will definitely decrement its HSRP priority, and that did not happen during the HSRP issue. But it lost OSPF communication with its neighbours, so even though its HSRP state was active it did not have a route to reach the destination.
Even though it's working fine now, if the problem happens again, how can I fix it permanently?
I have also performed a manual test by removing the WAN cable on the active router during which the active router decreased its HSRP priority and the standby router took over.
No problem. I am sure we can get this resolved. Is it possible for you to make a drawing (Visio or Paint) of how this whole thing is laid out? I am a little confused: it looks like your router normally has two neighbors on Fa1, but you said you have the LAN interfaces as passive? Also, I don't understand why your two routers are in different OSPF areas. If you could make a drawing and attach the OSPF and interface configs with the IP addresses scrubbed, that would help.
On the "active router" you have duplex/speed as "auto". On the "standby router" you have hard-coded speed 100 and duplex full. This is a NO NO. Both sides of a link must be configured exactly the same. When one side is configured as auto and the other is hard-coded, the auto side will always fall back to half duplex. I am assuming that this router is connected to a switch or hub as the means of connecting the local PCs. If there is a switch, check the duplex/speed settings on the ports. Make sure both sides are configured as auto.
I agree with naderzaman, with one small correction. GigabitEth interfaces will default to 100 Full, not half.
Either way, it has been said several times in this forum that you should fix this duplex mismatch. It could absolutely be causing issues. Please correct it and provide a drawing if you can.
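For what it's worth, the classic Fast Ethernet behaviour behind this advice can be sketched in a few lines of Python. It is a simplified model with made-up names: it assumes a hard-coded port does not take part in autonegotiation and that both ends support full duplex, and it does not cover the Gigabit case mentioned above.

```python
def resolved_duplex(local, peer):
    """Duplex a port ends up with. None means 'auto'; otherwise the
    setting is the hard-coded value 'full' or 'half'."""
    if local is not None:
        return local          # hard-coded ports use what was configured
    if peer is None:
        return "full"         # both sides negotiate the best common mode
    # The auto side hears no negotiation from a hard-coded peer. It can
    # sense the speed but not the duplex, so it falls back to half.
    return "half"


def has_mismatch(side_a, side_b):
    """True when the two ends of the link resolve to different duplex."""
    return resolved_duplex(side_a, side_b) != resolved_duplex(side_b, side_a)


print(has_mismatch(None, None))      # False: auto/auto
print(has_mismatch("full", "full"))  # False: both hard-coded the same
print(has_mismatch(None, "full"))    # True: auto side is half, peer is full
```

The half-duplex end of a mismatched link is typically the one that logs late collisions, which fits the LATECOLL message quoted earlier.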
I will upload the network diagram soon.
This network has been in production for the last 2 years without any issues. On the switch side it's set to auto/auto. I will try setting it to auto on the interface side as well.
I am uploading the OSPF topology diagram.
ISP-1 = KD
ISP-2 = JT
The HSRP issue is at location D.
On those routers the LAN segment interfaces have negotiated 100 Mbps/full duplex, but on the WAN segment they have negotiated 100 Mbps/half duplex. I will hard-code it to full duplex.
Also let me know your idea on OSPF topology. Do I need to join OSPF area 0 and area 1 at any other location?
Great drawing, this is very helpful.
At site D, on KD router, what area is 90.1 interface in? On JT router, what area is 90.3 in?
As for the OSPF design: why did you guys choose to use multiple areas for this? I would probably have put everything in area 0 and simply set the OSPF cost of all interfaces that you have in area 1 to something much higher than the primary interfaces. In your current design, you could end up with a discontiguous area 0, depending on how you answer my question.
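As a rough illustration of the cost idea, here is a Python sketch. The default cost formula (a reference bandwidth of 100 Mbps divided by the interface bandwidth, minimum 1) is the Cisco default; the path names and the backup cost of 1000 are made-up values, not taken from your network.

```python
def ospf_default_cost(bandwidth_bps, reference_bps=100_000_000):
    """Default Cisco OSPF interface cost: reference bandwidth divided by
    interface bandwidth, truncated, and never lower than 1."""
    return max(1, reference_bps // bandwidth_bps)


def preferred_path(paths):
    """Pick the path with the lowest total cost. paths maps a
    (hypothetical) path name to the list of interface costs along it."""
    return min(paths, key=lambda name: sum(paths[name]))


fast_ethernet = ospf_default_cost(100_000_000)   # 1

paths = {
    "primary": [fast_ethernet, fast_ethernet],   # total 2
    "backup": [1000, fast_ethernet],             # cost raised by hand
}
print(preferred_path(paths))  # primary

# With the primary gone, the backup is the only path left and is used.
del paths["primary"]
print(preferred_path(paths))  # backup
```

With everything in one area, the raised cost alone keeps traffic on the primary links until they fail.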
Except at the Main Office, OSPF is running only on WAN interfaces. OSPF is not enabled on the LAN interfaces.
Well, initially we had 14 branch offices connected through OSPF in 2 areas. Now it's reduced to just 4 locations. Maybe I should put everything in one area. In that case, should I enable OSPF on the LAN interface at each location?