Router Crashes


May 23, 2011 7:45 PM

Cisco devices are everywhere: from schools to small-home deployments, from labs to offices, from military setups to all major ISP environments, Cisco products have become part of nearly every network, large or small. A major reason Cisco products have done so well in the market is the robustness that they bring to the network. There is a wide variety of routers, switches, ASAs and other devices available, each designed to suit a different need, so that all types and aspects of network deployment are covered. It is therefore no surprise that Cisco IOS is one of the most widely deployed network operating systems in the world.


With millions of routers and switches in the field, there is always a slight possibility that a Cisco device may crash, but that doesn’t take away from the credibility and the awesomeness of Cisco IOS.


So, let’s take a minute to discuss why a Cisco router crashes.

So, what are we talking about here?


This article gives a general overview of:


·         What is a router crash;

·         Why does a router crash and what do we need to look for after the crash;

·         Most commonly seen types of router crashes; and

·         What each type of crash means.

NB: This article provides insight into commonly seen crashes on Cisco routers only; crashes on Cisco switches are not covered here.

What is a Router Crash?

A ‘system crash’, in the elementary sense of the term, is where a system detects an unrecoverable error, and reboots itself.

The errors that cause the crash are typically detected by the processor hardware, which automatically branches to a special error handling code in the ROM Monitor. The ROM Monitor identifies the error, prints a message, creates a crash dump, which is stored on a file system, and restarts the system.

The crash dump (called the ‘crashinfo’ file) is our lifeline to determine why the router crashed.

Why does a router crash and what do we need to look for after the crash?

In a broad sense, there are two reasons why a router may crash – software or hardware failures.

·         A software crash is observed when the router detects an unrecoverable error and reloads itself to prevent sending corrupted data. Such crashes are generally caused by bugs in the Cisco IOS.

So what is a bug? A bug is essentially an error in the code: either an outright coding mistake, or code that does not behave as intended under certain conditions.

·         A hardware crash occurs when a piece of hardware (not necessarily the router chassis; it may even be a small module) goes bad and degrades to the point where the device can no longer operate with it and ultimately crashes. A hardware replacement is the way forward in such cases.

In general, when a router crashes, a crashinfo file is generated and stored in the flash of the router (unless otherwise specified and provided that there is sufficient space in the flash to store the file). This crashinfo file is the lifeline for troubleshooting such crashes; it gives us the reason for the crash and the relevant information required for analysis.

Besides the crashinfo file, the other place where we can see the reason for the crash is in the output of the “show version” command. It will be displayed as follows:

                Router uptime is <>

                System returned to ROM by <reason for the crash> at PC <>, address <>

                System restarted at <>

                System image file is <IOS name>
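If you have the “show version” output saved as text, the restart reason can be pulled out with a few lines of Python. This is only a minimal sketch: the function name restart_reason is hypothetical, the hex values in the sample are made up for illustration, and the exact wording of the line varies by platform and IOS release, so the patterns below are an assumption rather than a specification.

```python
import re

# Assumed shape of the restart-reason line; real output differs between
# platforms and IOS releases, so treat these patterns as a starting point.
LINE_RE = re.compile(
    r"^\s*System (?:returned to ROM|restarted) (?:by|due to) (.+)$",
    re.MULTILINE,
)
PC_RE = re.compile(r"\bPC (0x[0-9A-Fa-f]+)")
ADDR_RE = re.compile(r"\baddress (0x[0-9A-Fa-f]+)")


def restart_reason(show_version: str):
    """Return the reason, PC and address from 'show version' text, or None."""
    match = LINE_RE.search(show_version)
    if match is None:
        return None
    rest = match.group(1).strip()
    pc = PC_RE.search(rest)
    addr = ADDR_RE.search(rest)
    # The reason is everything before the first "at PC" or ", PC" marker.
    reason = re.split(r",?\s+(?:at\s+)?PC\b", rest, maxsplit=1)[0].strip()
    return {
        "reason": reason,
        "pc": pc.group(1) if pc else None,
        "address": addr.group(1) if addr else None,
    }


# Illustrative sample only; the values are not from a real device.
sample = """Router uptime is 2 minutes
System returned to ROM by bus error at PC 0x30EE546, address 0xBB4C4
System restarted at 10:15:01 UTC Mon May 23 2011
"""
print(restart_reason(sample))
```

A result of None simply means that no line of the assumed shape was found; it does not prove that the router never crashed.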

Most commonly seen types of router crashes

Here is a list of the most commonly seen types of router crashes and what each of them really means. Please note that this list is not exhaustive.

In each of the cases below, you will need to collect the information listed for that case, open a Service Request with Cisco TAC and provide them with the collected information for initial troubleshooting of the issue.

1.       Bus Error crash (Software-related)

What it means:

The processor tries to access a memory location that doesn’t exist (software problem) or the hardware doesn’t respond properly to a correctly referenced memory location (hardware problem).

What you see in “show version”:

System returned to ROM by bus error at PC <value>, address <value> at <time and date>

What you need to do:

This is most often a software-related issue, and generally points to a bug in the IOS. Hence, you will need to collect the following information for troubleshooting:

·         Output of the “Show tech” command

·         Most recent relevant crashinfo file present in the flash

·         Console logs (if any)

On some devices, you may see a router crash due to a “SegV Exception” error. This is very similar to a bus error crash, both in terms of what it means and what you need to do when you encounter such a crash.

2.       Processor Memory Parity error crash (Hardware-related)

What it means:

Parity errors occur when the hardware attempts to check the validity of data by comparing computed parity values to previous parity values for the same data. A single (or more) bit flip in the data can result in a parity error.
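The check described above can be illustrated with the simplest scheme, a single even-parity bit per data word. This is a sketch of the general idea only; the function names are hypothetical, and the actual protection used on a given router's memory (for example, ECC) may be more elaborate.

```python
def parity_bit(value: int) -> int:
    """Even-parity bit: 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") % 2


def has_parity_error(value: int, stored_parity: int) -> bool:
    """Compare freshly computed parity with the parity stored earlier."""
    return parity_bit(value) != stored_parity


word = 0b10110010            # four 1 bits, so the stored parity bit is 0
stored = parity_bit(word)

one_flip = word ^ (1 << 3)   # a single bit flipped after the parity was stored
two_flips = one_flip ^ 1     # a second bit flipped as well

print(has_parity_error(word, stored))       # False: data unchanged
print(has_parity_error(one_flip, stored))   # True: the flip is detected
print(has_parity_error(two_flips, stored))  # False: two flips cancel out
```

Note the last case: with a single parity bit, an even number of flipped bits in the same word goes undetected.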

Parity errors generally have the following causes:

·         Improper handling of the hardware equipment

·         High temperatures

·         Cosmic radiation

What you see in “show version”:

System returned to ROM by processor memory parity error at PC <value>, address <value> at <time and date>

What you need to do:

If the router crashes due to a processor memory parity error once, monitor the device for a couple of weeks to see whether it crashes again. If it remains stable, the initial crash can be treated as a Single Event Upset (SEU) and overlooked.

However, if the router crashes repeatedly due to a processor memory parity error, then you need to replace the DRAM and the processor of the router.

3.       Software forced crash (Software-related)

What it means:

As mentioned earlier, a software crash is observed when the router detects an unrecoverable error and reloads itself to prevent sending corrupted data. A majority of such crashes are caused by software bugs in the Cisco IOS.

What you see in “show version”:

System restarted by error - Software-forced crash, PC <value> at <date & time>

What you need to do:

A software forced crash will always generate a crashinfo file, which will help you determine why the router crashed. Collect the following outputs for TAC analysis:

·         “show tech” output from the router

·         Crashinfo file generated from the crash

4.       Watchdog timer expired (Software or Hardware related)

What it means:

There are two types of watchdog timeout crashes:

·         Software watchdog timeout

Despite its name, this indicates a hardware problem.

There is a watchdog timer on the motherboard of the router, which counts down while waiting for a response from hardware on the router chassis. If the motherboard does not receive a response before the timer expires, it assumes that the hardware has gone bad, and the router reloads with a software watchdog timeout exception. An infinite loop at interrupt level can also cause this, but that is very rare. In most cases, the hardware that did not respond in the allotted time is at fault and needs to be replaced.

·         Process Watchdog timeout

This indicates a software problem.

This is generally caused by an infinite loop at process level (a process running for a very long time without letting go of the processor).

It is very similar to a software forced crash.
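Both kinds of watchdog follow the same countdown idea: a timer runs down and must be reset (“kicked”) by a response before it reaches zero. The toy model below illustrates that idea only. It is not how IOS implements its watchdogs, and the class, function and event names are all hypothetical.

```python
class Watchdog:
    """Toy countdown timer that must be kicked before it reaches zero."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self) -> None:
        """A response arrived (or a process yielded): restart the countdown."""
        self.remaining = self.timeout

    def tick(self) -> bool:
        """Advance one tick; return True if the timer has expired."""
        self.remaining -= 1
        return self.remaining <= 0


def first_expiry(events, timeout_ticks):
    """Return the index of the event at which the watchdog expires, or None."""
    watchdog = Watchdog(timeout_ticks)
    for index, event in enumerate(events):
        if event == "kick":
            watchdog.kick()
        elif event == "tick" and watchdog.tick():
            return index
    return None


# A responder that answers in time never trips the timer ...
print(first_expiry(["tick", "tick", "kick", "tick", "tick"], 3))  # None
# ... while one that stops answering does.
print(first_expiry(["tick", "kick", "tick", "tick", "tick"], 3))  # 4
```

In this model, a process stuck in an infinite loop corresponds to a stream of "tick" events with no "kick", which is why the timer eventually expires.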

What you see in “show version”:

·         Software watchdog timeout

System restarted due to a Watchdog timeout

Also, the console will have a message similar to:

*** Watchdog Timeout ***

·         Process Watchdog timeout

System restarted due to a software forced crash

The console will show:

SYS-2-WATCHDOG: Process aborted on watchdog timeout
*** System received a software forced crash ***

What you need to do:

In case of a software watchdog timeout crash, you will need to replace the motherboard or one of the modules (the logs will help in determining which module caused the crash).

In case of a process watchdog timeout crash, the troubleshooting is the same as that for “Software forced crash” (see above).

5.       Fatal Hardware Error (Hardware-related)

What it means:

There is a hardware failure in the processor or in one of the modules in the router, which has led to a router crash.

What you see in “show version”:

You will see something similar to:

System returned to ROM by error - a System Error PC <value> at <date & time>

The crashinfo file will have a signal value of Signal=22, which indicates that it is a fatal hardware error.

What you need to do:

A crashinfo file will be generated for this type of crash. Collect the following for TAC analysis:

·         “Show tech” output from the router

·         Crashinfo file generated from the crash

These outputs will be used to determine which hardware component needs to be replaced.

Retrieving the crashinfo from the router:

The crashinfo, by default, is generated and stored in the flash of the router every time the router crashes (provided that there is enough space in the flash to store the crashinfo file). It can be retrieved as follows:

Router# more flash:<crashinfo_filename>

(On some platforms, the flash may be called “disk”, “bootflash”, etc. The appropriate keyword needs to be used in the above command.)

Signal (SIG) values

For crashes that generate a crashinfo file, there is a Signal value associated with each type of crash. This signal value can be seen in the crashinfo file (under the “Context” section).

For example:

========= Context ======================

C2600 Software (C2600-ADVIPSERVICESK9-M), Version 12.4(15)T12, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Compiled Fri 22-Jan-10 00:53 by prod_rel_team
Signal = 23 Vector = 0x700
 
The following table gives a brief list of the most commonly seen types of crashes and the signal values associated with the same.
 

SIG Value    Reason
2            Unexpected HW interrupt
4            Illegal Opcode Exception
10           Bus Error Exception
11           SegV Exception
20           Parity Error Exception
22           Fatal Hardware error
23           Software forced crash

 
In the example above, the router has crashed due to a “Software Forced Crash”.
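If you have several crashinfo files to go through, the same lookup can be scripted. The sketch below assumes that the file contains a line of the form “Signal = <number>” as in the example above. The dictionary holds only the values listed in this article, and the function name crash_signal is hypothetical.

```python
import re

# Signal values taken from the table in this article (not exhaustive).
SIGNAL_REASONS = {
    2: "Unexpected HW interrupt",
    4: "Illegal Opcode Exception",
    10: "Bus Error Exception",
    11: "SegV Exception",
    20: "Parity Error Exception",
    22: "Fatal Hardware error",
    23: "Software forced crash",
}

# Accepts both "Signal = 23" and "Signal=22".
SIGNAL_RE = re.compile(r"Signal\s*=\s*(\d+)")


def crash_signal(crashinfo_text: str):
    """Return (signal, reason) for the first Signal line found, or None."""
    match = SIGNAL_RE.search(crashinfo_text)
    if match is None:
        return None
    signal = int(match.group(1))
    return signal, SIGNAL_REASONS.get(signal, "Unknown (not in this table)")


context = """========= Context ======================
Signal = 23 Vector = 0x700
"""
print(crash_signal(context))  # (23, 'Software forced crash')
```

A signal value that is missing from the dictionary is reported as unknown rather than guessed, since the table above is only a partial list.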


Comments

iantra123 Wed, 10/10/2012 - 23:14

Hello,

Good tip

Changing DRAM is easy.

But tell me how to replace the router processor?

regards,

Antra
