Cisco Support Community

ACE module crashed abruptly with no specific reload reason

Introduction

Cisco ACE modules sit inside Cisco Catalyst 6500 Series Switches and Cisco 7600 Series Routers to provide a high level of load balancing and application delivery. Their robust software and hardware make it possible to handle high volumes of traffic in real time.

Problem

The ACE module crashed unexpectedly. No error message or problem log was found, and no specific reload reason was recorded. The module had been working fine for a long time, then abruptly crashed and reloaded. It has been working fine since then and shows no problems.

The following log is seen on the backup module:

Aug 27 11:29:24 PDT: %FABRIC-SP-6-TIMEOUT_ERR: Fabric in slot 5 detected excessive flow-control on channel 8 (Module 9, fabric connection 0)

Aug 27 11:34:26 PDT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)

Aug 27 11:34:27 PDT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)

Aug 27 11:34:39 PDT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)

Aug 27 11:34:39 PDT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)

Aug 27 11:39:34 PDT: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics...

Aug 27 11:39:34 PDT: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics
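When reviewing a longer log, it can help to pull out only the power-cycle events and the reason text in parentheses. The following is a minimal sketch, assuming the log has been saved as plain text in the format shown above. The names parse_line and power_cycle_reasons are hypothetical helpers written for this example, not part of any Cisco tool.

```python
import re

# Sample lines copied from the log shown above.
SAMPLE_LOG = """\
Aug 27 11:29:24 PDT: %FABRIC-SP-6-TIMEOUT_ERR: Fabric in slot 5 detected excessive flow-control on channel 8 (Module 9, fabric connection 0)
Aug 27 11:34:26 PDT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Reset - Module Reloaded During Download)
Aug 27 11:34:27 PDT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Reset - Module Reloaded During Download)
Aug 27 11:34:39 PDT: %OIR-SP-3-PWRCYCLE: Card in module 9, is being power-cycled off (Module not responding to Keep Alive polling)
Aug 27 11:34:39 PDT: %C6KPWR-SP-4-DISABLED: power to module in slot 9 set off (Module not responding to Keep Alive polling)
Aug 27 11:39:34 PDT: %DIAG-SP-6-RUN_MINIMUM: Module 9: Running Minimal Diagnostics...
Aug 27 11:39:34 PDT: %DIAG-SP-6-DIAG_OK: Module 9: Passed Online Diagnostics
"""

# Matches "<timestamp>: %FACILITY-SEVERITY-MNEMONIC: message".
# The facility may itself contain hyphens (for example FABRIC-SP).
LOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+[\d:]+\s+\w+):\s+"
    r"%(?P<facility>[A-Z0-9_]+(?:-[A-Z0-9_]+)*)"
    r"-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):\s+"
    r"(?P<message>.*)$"
)

# Captures the text inside a trailing pair of parentheses, if present.
DETAIL_RE = re.compile(r"\(([^()]*)\)\s*$")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["severity"] = int(entry["severity"])
    detail = DETAIL_RE.search(entry["message"])
    entry["detail"] = detail.group(1) if detail else None
    return entry


def power_cycle_reasons(log_text):
    """Return the parenthesised reason of every PWRCYCLE message, in order."""
    reasons = []
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry and entry["mnemonic"] == "PWRCYCLE":
            reasons.append(entry["detail"])
    return reasons


if __name__ == "__main__":
    for reason in power_cycle_reasons(SAMPLE_LOG):
        print(reason)
```

Run against the sample above, this lists the two power-cycle reasons reported by the switch. Neither of them names a root cause on the ACE itself, which is the symptom this document describes.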

Explanation

Any event that causes the module to reload is normally logged and reported as the reload reason. However, certain situations can make even the most basic operation impossible, causing the module to reload without showing any reason. Almost always this is related to an SRAM parity error. The SRAM parity error, which can be seen in the core file, is not due to a software issue, although there were software-related issues in earlier code versions.

SRAMs are very sensitive to light, dust, radiation, shock, and temperature, so it is possible to get an SRAM parity error on a healthy ACE. The issue is the result of a "bit-flip" within the SRAM itself, which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which occurs along with the generation of the core file. ACE is susceptible to this because it uses SRAM to store control information and packet data rather than as scratch-pad storage. Almost any 1-bit flip will be detected as a parity error. This is an inherent problem with SRAM, and all equipment makers face the same issue with this type of memory, so an occasional parity error is to be expected.
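To illustrate why a single flipped bit shows up as a parity error, here is a minimal sketch of a generic even-parity check. This is an assumption made for illustration only: the actual parity scheme and word width used by the ACE SRAM are not described in this document, and the function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word.

    The bit is 1 when the word has an odd number of 1 bits, so that the
    word plus its parity bit always contain an even number of 1 bits.
    """
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Return the word with the bit at the given position inverted."""
    return word ^ (1 << position)


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word read back from memory against its stored parity bit."""
    return parity_bit(word) == stored_parity


if __name__ == "__main__":
    original = 0b10110010            # value written to memory (illustrative)
    stored = parity_bit(original)    # parity bit stored alongside it

    one_flip = flip_bit(original, 3)
    two_flips = flip_bit(one_flip, 5)

    print("unchanged word passes:", parity_ok(original, stored))
    print("single bit-flip passes:", parity_ok(one_flip, stored))
    print("double bit-flip passes:", parity_ok(two_flips, stored))
```

With simple parity of this kind, any single flipped bit changes the count of 1 bits from even to odd (or the reverse), so the check fails and the error is detected. Parity only detects the error; it cannot correct it, which is consistent with the module reloading to recover. Two flips in the same word would cancel out and go unnoticed.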

Solution

The recommended action is to upgrade to A2(3.3) or later in order to fix all software-related SRAM bugs. Refer to bug CSCtc53046 for a partial software workaround that mitigates hardware-generated SRAM parity errors by reducing the amount of SRAM access caused by the collection of interface statistics. SRAM errors are expected to occur at a frequency of approximately one per year per ACE module. A single SRAM parity error does not justify an RMA. If a particular module experiences a significantly higher failure rate and is running A2(3.3) or later, then a proactive RMA is required.
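As a rough way to judge whether a module's failure rate is "significantly higher" than expected, the sketch below estimates how likely it is to see a given number of parity errors in a period by chance alone. It assumes errors arrive independently at the stated average of about one per year per module (a Poisson model). That model, and the function name prob_at_least, are assumptions made for this example and not a Cisco-defined RMA criterion.

```python
import math


def prob_at_least(k: int, rate_per_year: float = 1.0, years: float = 1.0) -> float:
    """Probability of seeing k or more errors in the period.

    Assumes a Poisson process with the given average rate; this is an
    illustrative model, not a documented threshold.
    """
    lam = rate_per_year * years
    # Sum the probabilities of seeing 0 .. k-1 errors, then take the complement.
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} or more errors in one year: {prob_at_least(k):.3f}")
```

Under this assumption, one error in a year is unremarkable, while three or more in the same year on one module would be unusual enough to raise with support. Any actual RMA decision should follow the guidance above and Cisco TAC advice.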

Related Information

How to capture on the TenGigabit interface between the ACE module and the Catalyst

SSL URL Rewrite with wildcard on ACE

ACE Crash due to SRAM Parity

ACL memory usage on ACE module

Version history
Revision #: 1 of 1
Last update: 10-11-2011 12:44 AM
Comments
Cisco Employee

Hi Sandeep,

Parity errors are always accompanied by core/crash files as well as the last reload reason. In cases where the ACE reloads without any information, the causes are unknown and are documented in DDTS CSCsy91540, related to "Silent Reload". There is an action plan associated with the DDTS, and an upgrade is suggested as well.

We have a similar DDTS for ACE 30 too.

CSCua77618  ACE - ACE30 A5(1.2): UNKNOWN Silent reboot without Core Dump output

Regards,

Kanwal