ACE Crash due to SRAM Parity

Unanswered Question
Nov 23rd, 2009

Hi Experts,

One of my ACE modules running A2(1.6a) has crashed due to an SRAM parity error.

ACE20Admin#show version
Software  Version A2(1.6a)


last boot reason:  NP 1 Failed : SRAM Parity Error Chan 2

I would like to know whether this is a software bug or whether a hardware replacement is needed.

Thanks in advance.

Regards,

Sum.

Gilles Dufour Mon, 11/23/2009 - 08:38

Sum,

A single SRAM parity error does not justify an RMA.

Unfortunately, SRAMs are very sensitive to light, dust, radiation, shock, temperature, and so on, so it is possible to get an SRAM parity error on a healthy ACE.

Only if you see repeated errors on the same blade is it an indication that there is a hardware problem.

Gilles.

inayathulla1 Mon, 11/23/2009 - 08:45

Hi Gilles,
I have the same issue, and when I researched it I found a bug that has been fixed in the 2.0 version.
Bug: CSCsv52331. Bug details: ACE crashes with SRAM parity error : source OCM ME

This bug has been resolved in the A2(2.1) release.
Resolved Caveats:

CSCsv52331: The ACE becomes unresponsive due to an SRAM parity error. Workaround: None.

What is your opinion on this?

Thanks in Advance.

Regards,

Inayath.

Gilles Dufour Tue, 11/24/2009 - 02:40

Yes, this is a particular case where we tried to access an address that does not actually exist.

There is not really a parity error, but it was detected as one on the assumption that the pointer got corrupted in SRAM.

Anyway, when you do get an ACE crash (especially SRAM parity errors), it is strongly advised to open a service request with the TAC.

We can then determine whether the cause is software or hardware. And if it is a real parity error, we keep track of them to see if there is a "bad" trend.

If we do not get all SRAM parity errors reported to us, we can't detect that there is a problem in the field.

Thanks.

Gilles.

waitejk Thu, 02/18/2010 - 15:33

We had the same issue. Our standby ACE rebooted a couple of nights ago with this SRAM Parity error.

We opened a TAC case and this is the reply we got:

The SRAM parity error presented in the core file is not due to a software issue.
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a
result of environmental conditions. This "bit-flip" is rectified by a simple reboot of
the system, which would occur with the generation of the core file. Cisco internal
testing and customer experience have shown that these types of issues can occur
with very low frequency, but do not require an RMA of the device.
If there are multiple instances of this issue on the same module, a proactive RMA/EFA
of the device would be in order.

ACE is susceptible to this because of the way it uses SRAM to store control information
and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a
parity error. Cisco has recognized the issue and is taking action to ensure this will not be
an issue on the next generation of the ACE module. The next generation module design
and timeline is currently under review.
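
To illustrate the point about 1-bit flips, here is a minimal Python sketch of generic single-bit (even) parity checking. It is not the ACE's actual SRAM protection scheme, whose details are not given in this thread; the function names `parity_bit`, `store`, and `check` are hypothetical and exist only for this example.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple:
    """Pretend to write a word to memory together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word read back still matches its stored parity."""
    return parity_bit(word) == stored_parity


word, p = store(0b10110010)

# Simulate a single bit flipping in memory (bit 5 here, chosen arbitrarily).
flipped_once = word ^ (1 << 5)

# Simulate a second flip in the same word (bit 1).
flipped_twice = flipped_once ^ (1 << 1)

print("clean word passes:      ", check(word, p))
print("one flipped bit passes: ", check(flipped_once, p))
print("two flipped bits pass:  ", check(flipped_twice, p))
```

With one parity bit per word, any single flipped bit changes the parity and is caught, but it cannot be corrected, which fits the reply's description of a reboot being the recovery path. An even number of flips in the same word goes unnoticed.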

We are running A2 2.3 code.
Everyone can derive their own opinion from that response. My take is that it sounds like a hardware design issue. It certainly does not give us the "warm and fuzzies" we've come to expect from Cisco.

Gilles Dufour Mon, 02/22/2010 - 02:15

This is the problem with SRAM memory.

All equipment makers face the same issue with this type of memory.

This is the reason why we are working on a way to get rid of this type of memory.

G.

marciobaesse Thu, 04/05/2012 - 07:58

Hi guys,

My error is: last boot reason:  NP 2 Failed : SRAM Parity Error Chan 3

The issue is the result of a "bit-flip" within the SRAM itself which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which would occur with the generation of the core file. Cisco internal testing and customer experience have shown that these types of issues can occur with very low frequency, but do not require an RMA of the device.

Regards,

Marcio Baesse


jobejara Thu, 04/05/2012 - 08:52

Hardware designers and developers in general have identified this issue with SRAM memory, which can be triggered by environmental conditions. The way SRAM works makes it susceptible to these errors. Cisco is highly focused on this and we are working on it. This behavior may also be linked to some software defects, but if you have experienced this issue before and you are running A2 2.3, then the recommendation is to proceed with a replacement, since the device hardware might be affected. This issue occurs with very low frequency.

J.

marciobaesse Thu, 04/05/2012 - 09:41

I received an update from Cisco, and we will monitor this ACE module.

If the problem appears again, we will upgrade to A2(3.3).

tks,

Marcio Baesse

sogleedy41x Sat, 09/07/2013 - 16:57

Here is the response from Cisco for my issue; hopefully it can shed some light:

Problem Description
As I understand it so far, the issue is that the ACE20 module in slot 9 of the chassis has crashed three times at varying intervals, and the cause of the module failure is a hard parity error.

There is a well-known documented defect for crashes and unexpected reloads caused by parity errors.

tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCsz65679



Symptom:

The ACE module crashed unexpectedly with an NP Control Store Parity Error, which can be due to hardware.

Conditions:

Normal operations.

Workaround:

None. Monitor the ACE module; if this recurs, an RMA should be considered.

Explanation:

The SRAM parity error presented in the core file is not due to a software issue. The issue is the result of a "bit-flip" within the SRAM itself which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which would occur with the generation of the core file. Cisco internal testing and customer experience have shown that these types of issues can occur with very low frequency, but do not require an RMA of the device.

ACE is susceptible to this because of the way it uses SRAM to store control information and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a parity error.


CSCtc53046 is a partial software workaround which mitigates hardware-generated SRAM parity errors by reducing the amount of access to the SRAM due to the collection of the interface statistics. It is highly recommended that you upgrade to A2(3.3) or later to both lower the overall rate of SRAM parity errors and ensure failover occurs appropriately.

SRAM errors are expected to occur at a frequency of approximately one per year per ACE module. If a particular module experiences a significantly higher failure rate and is running A2(3.3) or later, then a proactive RMA would be in order.

Suggestion:

1. Since you are already running A2(3.2), I would suggest that you first upgrade to A2(3.3) and then monitor whether the device crashes again.

2. If the same happens again, we should RMA the module.
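
The two steps above can be written down as a small decision helper. This is only a sketch of the advice in this thread, not a Cisco tool: the version string format `A2(3.2)`, the tuple ordering used to compare releases, and the names `parse_version` and `next_step` are all assumptions for illustration.

```python
import re

# A2(3.3) is the release the response recommends as a minimum.
RECOMMENDED = (2, 3, 3, "")


def parse_version(text: str) -> tuple:
    """Parse a string such as 'A2(1.6a)' into (2, 1, 6, 'a').
    Assumes the 'A<n>(<x>.<y><letter>)' shape seen in this thread."""
    m = re.fullmatch(r"A(\d+)\((\d+)\.(\d+)([a-z]?)\)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)), m.group(4))


def next_step(version: str, repeat_crash: bool) -> str:
    """Return the action suggested in the thread for a module that has hit
    an SRAM parity error."""
    if parse_version(version) < RECOMMENDED:
        return "upgrade to A2(3.3) or later, then monitor"
    if repeat_crash:
        return "open a TAC case and consider an RMA"
    return "monitor"


print(next_step("A2(3.2)", repeat_crash=True))
print(next_step("A2(3.3)", repeat_crash=False))
print(next_step("A2(3.3)", repeat_crash=True))
```

Either way, the earlier advice in this thread still applies: report every SRAM parity crash to the TAC so it can be tracked.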

Posted November 23, 2009 at 8:01 AM