P2P connection dropping

Unanswered Question
Apr 14th, 2008

I have a P2P T1 connection between two offices connected by Cisco 2651XM routers with VWIC-2MFT-T1 cards. Last week I lost connectivity to site B, so no one in the office could do anything because the 2651XM is also the gateway router for their LAN. They were unable to ping the LAN interface (fa0/0) on the router and I was unable to ping the serial interface on the other end of the P2P. I had them cycle the power and it came back up with everything working.

This weekend the same thing happened. Connectivity was lost but there was no one in the office to reset the router. Since there was nothing mission critical going on at site B we were going to have them reset the router this morning. Then the line came back up 2 1/2 hours later only to go back down again. Then it came back up 9 1/2 hours later and has been up since. We had our provider check the T1 and they stated that they were able to hit both ends of the T1 cleanly when we were having an interruption in connectivity.

The following is what is in the log for the site B router:

Apr 13 05:33:36.001: %CONTROLLER-5-UPDOWN: Controller T1 0/1, changed state to down (AIS detected)

Apr 13 05:33:38.004: %LINK-5-CHANGED: Interface Serial0/1:0, changed state to reset

Apr 13 05:33:38.016: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor (Serial0/1:0) is down: interface down

Apr 13 05:33:39.006: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0/1:0, changed state to down

Apr 13 09:33:41.262: %SYS-3-CPUHOG: Task ran for 35333 msec (3591642/0), process = Logger, PC = 803B13E4.

-Traceback= 803B13E8 8049374C 80496CE8

Apr 13 09:33:41.314: %CONTROLLER-5-UPDOWN: Controller T1 0/1, changed state to up

.Apr 13 09:33:41.458: %LINK-3-UPDOWN: Interface Serial0/1:0, changed state to up

.Apr 13 09:33:42.496: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0/1:0, changed state to up

.Apr 13 09:34:13.439: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 1: Neighbor (Serial0/1:0) is up: new adjacency

.Apr 13 13:34:14.169: %SYS-3-CPUHOG: Task ran for 34454 msec (3591340/0), process = ARP Input, PC = 8053FD58.

-Traceback= 8053FD5C 8049374C 80496CE8

I didn't have the log buffered for the router at site A so I don't know what it saw. The log messages above don't really help me with this issue. I see that an AIS was detected but I don't know why. What additional logging can I turn on, or what troubleshooting can I do, to try to figure out the cause? I have since enabled the logging buffer on the site A router, but that doesn't help me right now.
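For reference, here is a sketch of the logging setup I'd want on both routers next time, plus the show commands I'd check after an event (interface and controller numbers assumed from the log above):

```
! On both routers: millisecond timestamps and a larger logging buffer
service timestamps log datetime msec localtime
logging buffered 16384 debugging

! After the next outage, check the controller's error counters and the buffer:
! show controllers t1 0/1
! show logging
```

`show controllers t1 0/1` reports line/path code violations and which alarms (AIS, RAI, LOS) were seen in each 15-minute interval, which can help pin down when and why the alarm appeared.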

lamav Mon, 04/14/2008 - 11:23

Do you know if you were getting these %SYS-3-CPUHOG messages before the T1 was failing?

It may be that these informational messages have always existed because certain processes intermittently monopolize the CPU's resources for an inordinate amount of time, yet without causing any other failures.

On the other hand, it may also be that these disruptions in normal CPU resource allocation are what's causing other system failures, like your T1 controller failure.

Given the output you're showing us, it seems to me that the router's CPU is experiencing situations in which certain processes are "loading it down" and forcing it to divert resources to accommodate these "CPUHOG"s.

A CPUHOG is a process that diverts the processing capabilities and resources of the CPU for more than 2000 ms (2 seconds).

In the case of your messages, the processes lasted for over 30 seconds: %SYS-3-CPUHOG: Task ran for 35333 msec (3591642/0), process = Logger, PC = 803B13E4. -Traceback= 803B13E8 8049374C 80496CE8

It is plausible that a 30-second interruption in CPU processing may affect the integrity of other processes running on your router.
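If you want to see which process is hogging the CPU the next time this happens, the standard IOS show commands are:

```
! Top CPU consumers, sorted highest first
show processes cpu sorted

! 60-second / 60-minute / 72-hour CPU utilization graphs (available on 12.2+ code)
show processes cpu history
```

The history graph is handy after the fact, since it shows whether the CPU was pinned during the window when the line was down.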

Typically, CPUHOG problems can occur due to one or more of the following conditions:

Heavy Traffic

System load

Faulty hardware

Improper operational configuration

Configuration change

Initialization of many interfaces

High momentary error rate

Sustained abnormal condition

Software bug in the IOS during normal operation.

I call out the third and last causes because they bring us back to my initial point. If these tracebacks and CPUHOG messages have always existed, they may be the result of a bug in the code and may not have been causing other failures. You won't really know for sure until you resolve that issue first.

If you find that upgrading your code has rectified the CPUHOG issue, yet your T1 controller continues to fail, then you may want to look into replacing the VWIC module, which is where the controller resides (in software).

Before noticing the "CPUHOG" traceback messages, I thought you might have a typical case of the service provider washing its hands of your circuit problems, as usual. But given that they tested clean while you were experiencing an outage, coupled with the traceback messages, I have to conclude that your problem involves the router itself, most likely either the IOS or the VWIC module.

Just so you know, an AIS, otherwise known as a Blue alarm, is an all-ones bit pattern sent downstream by a device that has stopped receiving a signal from its upstream side. That would make sense given your router's CPU problem and its inability to sustain normal operations.

So, clear the logs on both ends, clear all interface counters, and continue monitoring. Meanwhile, use the Cisco IOS planner to decide which IOS upgrade is suitable for your router. If you have a Cisco service contract, open a TAC case and let them help you with this.
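For the clearing step, that would be something like the following (interface numbers assumed from your earlier output):

```
clear counters serial 0/1:0
clear counters fastethernet 0/0
clear logging
```

`clear logging` empties the buffered log, so whatever shows up next is from the new event rather than old noise.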



qbakies11 Mon, 04/14/2008 - 11:42

Thank you for the detailed response. Do I need to turn on any additional logging debugs to try and capture more useful information in case this happens again?

If the CPUHOG error did cause the issue and only lasted for 30+ seconds would it cause my interfaces to be down for 9+ hours?

lamav Mon, 04/14/2008 - 11:49


I'm inclined to say probably not. If you have a spare VWIC and want to swap it out real quick, go for it (with permission, of course!).

I do think, though, that you should address the CPUHOG messages. If it's not a bug in the code, then there is another undesirable condition that you should address.


qbakies11 Mon, 04/14/2008 - 11:53

Ok, that is strange. I wasn't looking at the times on the log messages, but they seem to indicate something different. Those times look like the line was down for 4 hours:

Apr 13 05:33:39.006 - Apr 13 09:33:41.314

The line was unresponsive to my monitoring software longer than that though.

lamav Mon, 04/14/2008 - 12:07

Well, there you have it, see?

Your line was down for 4 hours, yet you couldn't access the router for longer than that.

Could it be that the CPU was so bogged down that it couldn't support your remote management session?

That's why I suggested addressing both problems, perhaps one at a time.

So, clear the counters and logs on both routers, and let's see what happens. Meanwhile, upgrade the IOS.

I'm walking into a meeting in about a minute.

Good luck for now!



