I have a P2P T1 connection between two offices, connected by Cisco 2651XM routers with VWIC-2MFT-T1 cards. Last week I lost connectivity to site B, so no one in that office could do anything because the 2651XM is also the gateway router for their LAN. They were unable to ping the LAN interface (fa0/0) on the router, and I was unable to ping the serial interface on the other end of the P2P. I had them cycle the power and it came back up with everything working.
This weekend the same thing happened. Connectivity was lost but there was no one in the office to reset the router. Since there was nothing mission critical going on at site B we were going to have them reset the router this morning. Then the line came back up 2 1/2 hours later only to go back down again. Then it came back up 9 1/2 hours later and has been up since. We had our provider check the T1 and they stated that they were able to hit both ends of the T1 cleanly when we were having an interruption in connectivity.
The following is what is in the log for the site B router:
Apr 13 05:33:36.001: %CONTROLLER-5-UPDOWN: Controller T1 0/1, changed state to down (AIS detected)
Apr 13 05:33:38.004: %LINK-5-CHANGED: Interface Serial0/1:0, changed state to reset
Apr 13 05:33:38.016: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.30.3.1 (Serial0/1:0) is down: interface down
Apr 13 05:33:39.006: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0/1:0, changed state to down
Apr 13 09:33:41.262: %SYS-3-CPUHOG: Task ran for 35333 msec (3591642/0), process = Logger, PC = 803B13E4.
-Traceback= 803B13E8 8049374C 80496CE8
Apr 13 09:33:41.314: %CONTROLLER-5-UPDOWN: Controller T1 0/1, changed state to up
.Apr 13 09:33:41.458: %LINK-3-UPDOWN: Interface Serial0/1:0, changed state to up
.Apr 13 09:33:42.496: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0/1:0, changed state to up
.Apr 13 09:34:13.439: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor 172.30.3.1 (Serial0/1:0) is up: new adjacency
.Apr 13 13:34:14.169: %SYS-3-CPUHOG: Task ran for 34454 msec (3591340/0), process = ARP Input, PC = 8053FD58.
-Traceback= 8053FD5C 8049374C 80496CE8
I didn't have the log buffered for the router at site A, so I don't know what it saw. The log messages above don't really help me with this issue. I see that an AIS was detected, but I don't know why. What additional logging can I turn on, or troubleshooting can I do, to try to figure out the cause? I have since enabled the logging buffer on the site A router, but that doesn't help me right now.
Do you know if you were getting these %SYS-3-CPUHOG messages before the T1 was failing?
It can be the case that these informational messages have always existed because certain processes intermittently monopolize the CPU for an inordinate amount of time, yet are not causing any other failures.
On the other hand, it can also be the case that these disruptions in normal CPU resource allocation are what's causing other system failures, like your T1 controller failure.
Given the output you're showing us, it seems to me that the router's CPU is experiencing situations in which certain processes are "loading it down" and forcing it to divert resources to accommodate these "CPUHOG"s.
A CPUHOG is a process that holds the CPU for more than 2000 ms (2 seconds) without relinquishing it.
In the case of your messages, the processes lasted for over 30 seconds: %SYS-3-CPUHOG: Task ran for 35333 msec (3591642/0), process = Logger, PC = 803B13E4. -Traceback= 803B13E8 8049374C 80496CE8
It is conceivable that a 30-second interruption in CPU processing may affect the integrity of other processes that are running on your router.
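If you want to catch the hog in the act rather than waiting for the traceback, standard IOS exec commands can show you which processes are consuming CPU and how utilization has trended (assuming a reasonably recent 12.x image on the 2651XM; the `sorted` keyword isn't available on very old code):

```
! Top CPU consumers right now, highest first
show processes cpu sorted

! ASCII graph of CPU utilization over the last 60 seconds,
! 60 minutes, and 72 hours - useful for matching a spike
! to the timestamps in your log
show processes cpu history
```

If the 72-hour history shows sustained spikes lining up with the controller down/up events, that strengthens the case that the CPU problem and the T1 failure are related.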
Typically, CPUHOG problems can occur due to one or more of the following conditions:
Improper operational configuration
Initialization of many interfaces
High momentary error rate
Sustained abnormal condition
Software bug in the IOS during normal operation.
Note the third and last causes in particular, because they bring us back to my initial point. If these tracebacks and CPUHOG messages have always existed, then they can be the result of a bug in the code and may not have been causing other failures. You won't really know for sure until you resolve that issue first.
If you find that upgrading your code has rectified the CPUHOG issue, yet your T1 controller continues to fail, then you may want to look into replacing the VWIC module, which is the hardware behind the T1 controller.
Before noticing the "CPUHOG" traceback messages, I thought you might have a typical case of the service provider washing its hands of your circuit problems, as usual. But given that they tested clean while you were experiencing an outage, coupled with the traceback messages, I have to conclude that your problem involves the router itself, most likely either the IOS or the VWIC module.
Just so you know, an AIS (Alarm Indication Signal), otherwise known as a Blue alarm, is an unframed all-ones pattern sent downstream by a device that has stopped receiving a signal from its upstream side. (A Yellow alarm, or RAI, is the related signal a device sends back toward the far end to say it has lost the incoming signal.) That would make sense given your router's CPU problem and its inability to sustain normal operations.
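You can see the controller's current alarm state and error history directly (standard IOS command; the exact field names vary a bit between IOS versions):

```
! Shows whether the receiver currently has AIS, LOS, LOF, or
! a remote (Yellow) alarm, plus error counters in 15-minute
! intervals for the last 24 hours
show controllers t1 0/1
```

If the "Data in current interval" and prior-interval counters (Line Code Violations, Path Code Violations, Slip Secs, etc.) are clean on both ends during an outage, that supports the provider's claim that the circuit itself tested clean.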
So, clear the logs on both ends, clear all interface counters, and continue monitoring. Meanwhile, use the Cisco IOS planner to decide which IOS upgrade is suitable for your router. If you have a Cisco service contract, open a TAC case and let them help you with this.
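As a sketch of those steps on each router (the buffer size of 16384 is just a reasonable choice, not a requirement; adjust interface numbers to match each end):

```
! Make sure log entries are timestamped and survive between checks
configure terminal
 service timestamps log datetime msec
 logging buffered 16384 debugging
 end

! Reset the baseline on both routers
clear logging
clear counters serial 0/1:0

! Then spot-check periodically
show logging
show controllers t1 0/1
show interfaces serial 0/1:0
```

With clean baselines on both sides, the next event should tell you whether the errors originate at site A, site B, or in between.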