We have recently suffered a crash on one of our Cat 6509s, using a Sup720 / MSFC3 with PFC 3B.
The output of "show ver" states that the "System restarted by processor memory parity error at PC 0x601799C4".
The following document clearly states that this is a possible fault with the system DRAM:
Yet our service partner that holds our maintenance contract simply recommended that we "upgrade to the latest safe harbour release".
On looking at the safe harbour release, I have found it to be a little unreliable.
The recommended release in our current SXF train is 12.2(18)SXF13. Yet when I click on this in the tool, it informs me that there are a number of software defects in that version!
Why is this a safe harbour release if it has defects? I appreciate that they aren't major defects, but a bug is still a bug and can cause issues. Trying to explain this to the non-techie who manages change in our organisation is impossible. You can just imagine the conversation: "We are upgrading to the latest safe harbour release recommended by Cisco... oh, but it does have some bugs." Reply: "Er, then why are we using this release?"...
Please forgive my cynicism, but this hardly provides peace of mind when high availability is of the utmost importance, and simply bringing down a switch to perform a software upgrade is a major issue and isn't something we like to do regularly.
Should I just skip this train and move straight onto 12.2(33)SXH4, which is another recommended safe harbour release and doesn't "yet" seem to have any defects? However, I have been down this road before, only to find defects emerge in the train at a later date.
I would greatly appreciate any response that Cisco might have on this one.
Parity errors in general are unavoidable in our environment.
The problem with a parity error isn't one of software. All electronics are vulnerable to parity errors; the difference is in how these errors are dealt with. A parity error means the value read from a location in memory is detected not to be the same value that was written into that memory.
The software can choose to continue with the incorrect data, or it can determine that the integrity of the data is critical and take corrective action. Part of the problem lies in not knowing which value was corrupted or what its real contents were. With IOS, the system crashes because data integrity is treated as critical.
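To illustrate the detection mechanism, here is a minimal Python sketch of single-bit even parity, assuming one parity bit per stored word (real memory hardware differs in detail, but the principle is the same — detection, not correction):

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2

def store(word: int) -> tuple[int, int]:
    """Write a word to 'memory' together with its computed parity bit."""
    return word, parity_bit(word)

def read(word: int, stored_parity: int) -> int:
    """Read a word back; raise if its parity no longer matches."""
    if parity_bit(word) != stored_parity:
        raise RuntimeError("parity error: memory contents corrupted")
    return word

word, p = store(0b1011_0010)
read(word, p)                       # clean read succeeds
flipped = word ^ 0b0000_0100        # simulate a single-bit flip
try:
    read(flipped, p)
except RuntimeError:
    pass  # the flip is detected; analogous to IOS halting on a parity error
```

Note that a single parity bit can only tell you *that* a bit flipped, not *which* one — which is why the software has no way to recover the real value and IOS chooses to crash rather than continue.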
Parity errors aren't a Cisco-specific problem. I have seen parity errors on other platforms as well, and a search on Google will show many examples of this.
If you had opened a TAC case on this and it was assigned to me I would be giving you the same information.
"The powers that be" as you stated before typically are not technical. You could replace some hardware to appease them, but this would not prevent a parity error in the future.
A lot of times people want to do something just to show that something is being done. Monitoring the device is a course of action.
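As a rough sketch only, basic monitoring on an IOS device might look something like the following (the syslog/SNMP host address and community string are placeholders, and exact command availability varies by release):

```
! Keep a larger local log buffer and send messages to a remote syslog server
logging buffered 64000 informational
logging host 10.1.1.100
! Send SNMP traps to a management station for out-of-band notification
snmp-server host 10.1.1.100 version 2c public
snmp-server enable traps
```

With something like this in place, a future parity event at least leaves a trail off-box, which is usually what "the powers that be" actually want to see.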
It has been a while since I took any probability courses, but since you asked, I believe the probability favours keeping the same device. Say you have two pieces of equipment, each with a chance of experiencing some rare event. If one of the two devices experiences the event, it is then less likely for the event to happen twice on that same device. The device that had not experienced the event would actually now be more likely to see the event than the event happening twice on the other device.
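Putting rough numbers on that comparison — the 1% annual rate below is purely illustrative, not a measured failure rate for this hardware:

```python
p = 0.01  # assumed probability of a parity event per device per year

# Two independent events on the same device:
same_device_twice = p * p   # 0.0001

# A first event on the other, so-far-clean device:
other_device_once = p       # 0.01

# On this comparison, a repeat on the same device is the rarer outcome.
assert same_device_twice < other_device_once
```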