7200 router reload because of watchdog reset

Dec 3rd, 2008

Hi all,

I have a problem with 7206VXR routers. They reload unexpectedly, and I found that the cause is a watchdog reset. I can't find any syslog messages related to it, but it is stated clearly in the show version output. I found a field notice concerning this problem, but the IOS version I run is 12.4(19), and it seems that this version has already resolved the issue.

My other question:

Why do the logs say nothing? Is a watchdog reset only logged on the console?

Thanks,

Wiyandi

Giuseppe Larosa Wed, 12/03/2008 - 09:55

Hello Wiyandi,

Does the router write a crashinfo file?

Check with

dir /all

It can be written to the bootflash.

About logging: show logging tells you your current settings.

What type of CPU / NPE are you using?

There was another thread from a colleague who was experiencing reloads by watchdog timeout caused by the AAA process in a system used for terminating mobile users (CDMA PDSN).

In that case the information about the reason was inside the crashinfo.

The crashinfo is also included in the output of show tech.

Example from that case:

%SYS-3-CPUHOG: Task is running for (112004)msecs, more than (2000)msecs (0/0),process = AAA ACCT Proc.

-Traceback= 0x60AE6894 0x60AE6824 0x60AE6FBC 0x608C4DE8 0x608C5F80 0x608C598C 0x608B4014 0x608B411C 0x608592BC 0x6137142C 0x61371728 0x61370D9C 0x61370560 0x609A0E68 0x6099E0F4 0x609A1230 found
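
A minimal Python sketch of how a CPUHOG line like the one above could be picked apart when scanning a saved crashinfo or show tech. The regular expression is written against this single sample line only, and the names `CpuHog` and `parse_cpuhog` are made up for illustration; other IOS releases may format the message differently.

```python
import re
from typing import NamedTuple, Optional


class CpuHog(NamedTuple):
    running_ms: int    # how long the task had been running
    threshold_ms: int  # limit quoted in the message
    process: str       # name of the process that held the CPU


# Pattern derived from the one sample line in this thread (an assumption,
# not a documented message grammar).
CPUHOG_RE = re.compile(
    r"%SYS-3-CPUHOG: Task is running for \((\d+)\)msecs, "
    r"more than \((\d+)\)msecs \(\d+/\d+\),\s*process = (.+?)\.?$"
)


def parse_cpuhog(line: str) -> Optional[CpuHog]:
    """Return the fields of a CPUHOG message, or None if the line differs."""
    m = CPUHOG_RE.search(line.strip())
    if m is None:
        return None
    return CpuHog(int(m.group(1)), int(m.group(2)), m.group(3))


sample = ("%SYS-3-CPUHOG: Task is running for (112004)msecs, "
          "more than (2000)msecs (0/0),process = AAA ACCT Proc.")
hog = parse_cpuhog(sample)
if hog is not None:
    # Ratio of observed run time to the quoted limit
    print(hog.process, hog.running_ms / hog.threshold_ms)
```

In the sample, the task ran for 112004 ms against a 2000 ms limit, which is why the process name is the first thing to look at in such a crashinfo.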

If you like you can post a show tech of your router.

Hope to help

Giuseppe

Giuseppe Larosa Thu, 12/04/2008 - 02:54

Hello Wiyandi,

Inside the sh tech I found:

Cisco 7206VXR (NPE-G1) processor (revision A) with 983040K/65536K bytes of memory.

System returned to ROM by bus error at PC 0x60766CF0, address 0xACC111A at 16:58:11 IND Wed Jun 11 2008

System was restarted by bus error at PC 0x60766CF0, address 0xACC111A

--------------------------------------------------------------------

Possible hardware fault. This could be an one time occurence,

upon reccurence please contact Cisco Technical Support.

--------------------------------------------------------------------

The address is not in any memory range listed in sh region (also present in the sh tech).
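
The range check described above can be sketched in Python as follows. The fault address and PC are taken from the show version lines quoted in this post; the region table is purely HYPOTHETICAL, since the actual sh region output is not reproduced in the thread, and the names `Region` and `find_region` are illustrative only.

```python
from typing import Iterable, NamedTuple, Optional


class Region(NamedTuple):
    name: str
    start: int
    end: int  # inclusive end address


def find_region(address: int, regions: Iterable[Region]) -> Optional[Region]:
    """Return the first region containing the address, or None."""
    for region in regions:
        if region.start <= address <= region.end:
            return region
    return None


# HYPOTHETICAL ranges, for illustration only. Replace them with the
# Start/End columns from the router's own sh region output.
EXAMPLE_REGIONS = [
    Region("iomem", 0x0C000000, 0x0FFFFFFF),
    Region("main", 0x60000000, 0x6FFFFFFF),
]

FAULT_ADDRESS = 0x0ACC111A  # "address 0xACC111A" from show version
FAULT_PC = 0x60766CF0       # "PC 0x60766CF0" from show version

hit = find_region(FAULT_ADDRESS, EXAMPLE_REGIONS)
print("fault address region:", hit.name if hit else "none")
```

An address that falls in no region means the code tried to touch memory that does not exist on the box, which is consistent with a bus error.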

So I posted your show tech to the Output Interpreter; here is its feedback from the show stack analysis:

a) spurious interrupts

This device has reported 25 spurious interrupts. These interrupts occur when an unnecessary interrupt is raised for an already processed packet, possibly due to an internal race condition, or due to improper initialization of interrupt handling routines. These may be caused either by defective hardware, or by software. There is no discernable impact on the behavior of the router due to such spurious interrupts. They are only counted for monitoring purposes. They may safely be ignored as long as the number of spurious interrupts is not high and increasing, along with some dropped packets or degraded performance.

b) possible crash reasons

The failure was caused by a software defect.

The stack trace decoded symbols are:

rmi_do_resource_owner_cfg_for_ru

Possible bug matches are listed below. Bugs with a score of .90 or more

are the most likely candidates:

Among them there is one about missing crash info:

0.93 CSCsu66533 R 12.4(23.3)T 12.4(23.3)PI10 None Process watchdog timeout (test crash -w) does not generate any interrupt

CSCuk61282 Bug Details

Ports of bgp_prefix_list_callback need to be the same as 12.4 version

You are using prefix-lists in peer-groups, so this one can apply.

You should consider an IOS upgrade; the list of possible bugs is quite long.

About the missing crashinfo: verify whether the bootflash is full. If multiple crashes have occurred, the flash can fill up.

Upload the older crashinfo files with TFTP and delete them.

There can also be software or hardware issues behind the spurious interrupts (see above).

Hope to help

Giuseppe

sr2290723 Thu, 12/04/2008 - 09:08

Hi Giuseppe,

Thanks for your deep analysis. Let me give you several strange facts:

1. The restart happened on Dec 1 2008 at 00:00:09, not on Jun 11 2008. So the analysis you gave clearly explains the crash on Jun 11 2008, but not the recent one.

2. I checked the show bootflash: all output included in the show tech, and it is not full. We still have 3 MB of free space, and each crashinfo file takes only about 190 KB.
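
The two checks in the list above can be sketched in Python. The restart line is copied from the show version output quoted earlier in the thread; the 10 minute tolerance, the function name `restart_time`, and the reading of 3 MB as 3 * 1024 KB are assumptions made for the example.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

RESTART_LINE = ("System returned to ROM by bus error at PC 0x60766CF0, "
                "address 0xACC111A at 16:58:11 IND Wed Jun 11 2008")


def restart_time(line: str) -> Optional[datetime]:
    """Extract the trailing timestamp; time zone label and weekday are skipped."""
    m = re.search(
        r"at (\d{2}:\d{2}:\d{2}) \S+ \w{3} (\w{3}) +(\d{1,2}) (\d{4})$", line)
    if m is None:
        return None
    clock, month, day, year = m.groups()
    return datetime.strptime(f"{clock} {month} {day} {year}",
                             "%H:%M:%S %b %d %Y")


recorded = restart_time(RESTART_LINE)        # what the crash record says
observed = datetime(2008, 12, 1, 0, 0, 9)    # reload reported in point 1

# Assumed tolerance: treat both as one event only if within 10 minutes.
same_event = (recorded is not None
              and abs(observed - recorded) < timedelta(minutes=10))

# Point 2: how many more crashinfo files would still fit in bootflash.
free_kb = 3 * 1024       # 3 MB free, assuming 1 MB = 1024 KB
crashinfo_kb = 190       # approximate size of one crashinfo file
files_that_fit = free_kb // crashinfo_kb

print(same_event, files_that_fit)
```

With these numbers the recorded crash and the observed reload are months apart, and there is room for several more crashinfo files, so a full bootflash does not explain the missing file.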

I checked the bugs you mentioned:

1. CSCsu66533: it makes sense, but I assume it only affects SAMI devices, not 7200 routers.

2. CSCuk61282: it doesn't affect my IOS version, which is 12.4(19).

Any ideas about this?

Anyway, thanks a lot, bro. Actually I'm quite uncertain whether an IOS upgrade will resolve the problem, but if I don't take a shot, I won't know the result.

Wiyandi

Giuseppe Larosa Thu, 12/04/2008 - 09:32

Hello Wiyandi,

I realized the crashinfo wasn't for the last reload.

However, I decided to continue the analysis in the hope of finding something that could help determine whether there is a hardware problem or a SW bug, but I couldn't find clear signs.

I reported only a couple of bugs; the list was quite long.

It is really strange that there is no crashinfo even though there is space in the bootflash.

You could also open an SR with TAC for this.

Let us know if an IOS upgrade solves the issue.

Hope to help

Giuseppe
