One of my ACE modules running A2(1.6a) has crashed due to an SRAM parity error.
Software Version A2(1.6a)
last boot reason: NP 1 Failed : SRAM Parity Error Chan 2
I would like to know whether this is a software bug or whether a hardware replacement is needed.
Thanks in advance.
A single SRAM parity error does not justify an RMA.
Unfortunately, SRAMs are very sensitive to light, dust, radiation, shock, temperature, and so on, so it is possible to get an SRAM parity error on a healthy ACE.
Only if you see repeated errors on the same blade is it an indication that there is a hardware problem.
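To make the "single error vs. repeated errors on the same blade" rule concrete, here is a minimal Python sketch that counts parity-error reloads per module. The pattern is based only on the boot-reason strings quoted in this thread; the function names, the module labels and the threshold of 2 are assumptions for illustration, not a Cisco tool or an official RMA policy.

```python
import re
from collections import Counter

# Pattern modelled on the boot-reason strings quoted in this thread, e.g.
# "last boot reason: NP 1 Failed : SRAM Parity Error Chan 2"
BOOT_REASON = re.compile(
    r"NP\s+(?P<np>\d+)\s+failed\s*:\s*(?P<error>.*parity error.*)",
    re.IGNORECASE,
)

def parse_boot_reason(line):
    """Return (np_number, error_text) for a parity-error boot reason, else None."""
    match = BOOT_REASON.search(line)
    if match is None:
        return None
    return int(match.group("np")), match.group("error").strip()

def modules_with_repeats(records, threshold=2):
    """records: iterable of (module_label, boot_reason_line) pairs.

    Returns the sorted module labels that reloaded with a parity error
    at least `threshold` times (threshold is an assumed value).
    """
    counts = Counter(
        module for module, line in records if parse_boot_reason(line) is not None
    )
    return sorted(module for module, n in counts.items() if n >= threshold)
```

Feeding it one record per reload (module label plus the "last boot reason" line) gives the list of blades worth raising with the TAC as possible hardware problems; a blade that appears only once would simply be monitored.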
I have the same issue, and when I researched it I found a bug that has been fixed in the 2.0 version.
Bug: CSCsv52331. Bug details: ACE crashes with SRAM parity error : source OCM ME
This bug has been resolved in the A2(2.1) release.
CSCsv52331—The ACE becomes unresponsive due to an SRAM parity error. Workaround: None.
What is your opinion on this?
Thanks in advance.
Yes, this is a particular case where we tried to access an address that does not actually exist.
There is not really a parity error, but it was detected as such on the assumption that the pointer got corrupted in SRAM.
Anyway, when you do get an ACE crash (especially SRAM parity errors), it is strongly advised to open a service request with the TAC.
We can then determine whether this is software or hardware. And if it is a real parity error, we keep track of them to see if there is a "bad" trend.
If we do not get all SRAM parity errors reported to us, we can't detect that there is a problem in the field.
We had the same issue. Our standby ACE rebooted a couple of nights ago with this SRAM Parity error.
We opened a TAC case and this is the reply we got:
The SRAM parity error presented in the core file is not due to a software issue.
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a
result of environmental conditions. This "bit-flip" is rectified by a simple reboot of
the system, which would occur with the generation of the core file. Cisco internal
testing and customer experience have shown that these types of issues can occur
with very low frequency, but do not require an RMA of the device.
If there are multiple instances of this issue on the same module, a proactive RMA/EFA
of the device would be in order.
ACE is susceptible to this because of the way it uses SRAM to store control information
and packet data as opposed to scratch-pad storage. Almost any 1-bit flip will be detected as a
parity error. Cisco has recognized the issue and is taking action to ensure this will not be
an issue on the next generation of the ACE module. The next generation module design
and timeline is currently under review.
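To illustrate why almost any 1-bit flip shows up as a parity error, here is a minimal Python sketch of single-bit even parity over one stored word. The ACE's actual parity scheme and word width are not described in this thread, so the 8-bit word and the function names below are assumptions for illustration only.

```python
def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits (data + parity) even."""
    return bin(word).count("1") % 2

def store(word):
    """Model a write: keep the data word together with its parity bit."""
    return word, even_parity_bit(word)

def parity_ok(word, parity):
    """Model a read: recompute parity and compare with the stored bit."""
    return even_parity_bit(word) == parity

# Write an (assumed) 8-bit control word.
word, parity = store(0b10110010)

# A single bit flipped by an environmental event changes the count of 1 bits
# by one, so the recomputed parity no longer matches the stored bit.
flipped_once = word ^ (1 << 3)

# Two flips in the same word cancel out as far as simple parity is concerned.
flipped_twice = flipped_once ^ (1 << 5)

print(parity_ok(word, parity))           # True
print(parity_ok(flipped_once, parity))   # False
print(parity_ok(flipped_twice, parity))  # True
```

Parity of this kind can only detect the flip, not correct it, which is consistent with the module reloading rather than repairing the word in place.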
This is the problem with SRAM memory.
All equipment makers face the same issue with this type of memory.
This is the reason why we are working on a way to get rid of this type of memory.
My error is: last boot reason: NP 2 Failed : SRAM Parity Error Chan 3
The issue is the result of a "bit-flip" within the SRAM itself which can occur as a result of environmental conditions. This "bit-flip" is rectified by a simple reboot of the system, which would occur with the generation of the core file. Cisco internal testing and customer experience have shown that these types of issues can occur with very low frequency, but do not require an RMA of the device.
Hardware designers and developers in general have identified this issue with SRAM memory, which can be triggered by environmental conditions. The way SRAM works makes it susceptible to these issues; Cisco is highly focused on this and we are working on it. This behavior may also be linked to some software defects, but if you have experienced this issue before and you are running A2(2.3), then the recommendation is to proceed with a replacement, since the device hardware might be affected at that point. This issue occurs with a very low frequency.
I received an update from Cisco, and we will monitor this ACE module.
If the problem appears again, we will upgrade to A2(3.3).
Our ACE20 running A2(3.3) reloaded with "NP 1 failed : NP Control Store Parity Error" on 3/28.
Per TAC, we hit the following bug ID: CSCsz65679
Here was the response from Cisco for my issue; hopefully it can shed some light:
As I understand it so far, the ACE20 module in slot 9 of the chassis has crashed three times at varying intervals, and the cause of the module failure is a hard parity error.
The ACE module crashed unexpectedly with an NP Control Store Parity Error, which can be due to hardware.
None. Monitor the ACE module, and if this recurs an RMA should be considered.