Bugs, Crashes and trusting Safe Harbour Release.

Answered Question
Apr 10th, 2009

Hi,

We have recently suffered a crash on one of our Cat 6509s, using a Sup720 / MSFC3 with PFC 3B.

The result of "show ver" states that the "System restarted by processor memory parity error at PC 0x601799C4".

The following document clearly states that this is a possible fault with the system DRAM:

http://www.cisco.com/en/US/products/hw/routers/ps341/products_tech_note09186a0080094793.shtml#topic1

Yet our service partner that holds our maintenance contract simply recommended that we "upgrade to the latest safe harbour release".

On looking at the safe harbour release I have found it to be a little unreliable.

The recommended release in our current SXF train is 12.2(18)SXF13. Yet when I click on this in the tool, it informs me that there are a number of software defects for the version!!

Why is this a safe harbour release if it has defects? I appreciate that they aren't major defects, but a bug is still a bug that can cause issues. Trying to explain this to a non-techie who manages change in our organisation is impossible. You can just imagine the conversation: "We are upgrading to the latest safe harbour release recommendation from Cisco, oh, but it does have some bugs", reply "Er, then why are we using this release?"...

Please forgive my cynicism, but this hardly provides peace of mind when high availability is of the utmost importance, and simply bringing down a switch to perform a software upgrade is a major issue and isn't something we like to do regularly.

Should I just skip this train and move right on to 12.2(33)SXH4, which is another recommended safe harbour release but doesn't "yet" seem to have any defects? However, I have been down this road before, only to find defects emerge in the train at a later date.

I would greatly appreciate any response that Cisco might have on this one.

Leo Laohoo Fri, 04/10/2009 - 01:22

Hi Chris,

I have been managing >6 Sup720 and >50 Sup32. Due to the Safe Harbour accreditation, we were using SXH13. Although we did not see any crashes, we decided to move to the newer SXI series after receiving a number of good reviews.

No IOS is "crash free". For any of us, stumbling upon a bug is just bad luck.

There are over 10,000 combinations to set up a 6500 and I doubt any organization can test every one of them for network stability.

If we hit a crash, together with TAC, we find out why it crashed. If it's a bug, we determine if this is going to re-occur. If so, we determine if we should proceed with the recommended fix or we just sit on it.

A non-techie will never understand. But you can explain to the non-techie that it's just as bad if the machine fails yet again and the press finds out that BAA "could've" done something to fix the issue but didn't.

The mere utterance of the word "press" or "media" can have profound effects.

In my humble opinion ...

gephelps Tue, 04/14/2009 - 04:20

Leo, this is some good, rational advice.

Cisco as a whole tries not to release code with bugs in it, and Safe Harbor is somewhat unusual in that the results from testing are shared.

Code has failed to become Safe Harbor certified as well. The problem with a pass / fail system is that no solution works for everyone. When crashes occur, the question becomes: why was this passed? For a situation such as this, the real problem was a transient hardware failure and every version of code would have behaved the same way.

As Leo mentions as well, some bugs, while causing crashes, are much less severe if there is a viable workaround to avoid the bug.

gephelps Fri, 04/10/2009 - 04:28

Processor Memory Parity Error is a generic term and does not always point to DRAM. On a Sup720 the DRAM uses ECC. This means that single-bit errors are corrected on the fly by the ECC mechanism. The DRAM could potentially see a multiple-bit error, which ECC would be able to detect but not correct. While possible, I have never seen one.

Now the caches used by the CPU can detect parity errors, but are not capable of correcting errors on the fly. When a parity error is detected the system crashes. The document goes on to say that parity errors come in two types: soft (transient) and hard.
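To make the difference between the two protection schemes concrete, here is a minimal Python sketch. It is a toy model, not the Sup720's actual circuitry: it assumes a 4-bit data word, a single even-parity bit for the "cache" case, and a Hamming(7,4) code with one extra overall parity bit (SECDED) for the "ECC" case. Real ECC memory protects much wider words, and all function names below are made up for illustration.

```python
def parity_bit(bits):
    """Return the even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def parity_ok(word):
    """A word stored with its parity bit must contain an even number of 1s."""
    return parity_bit(word) == 0


def secded_encode(data):
    """Encode 4 data bits as Hamming(7,4) plus one overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    code = [p1, p2, d1, p3, d2, d3, d4]  # bit positions 1..7
    return code + [parity_bit(code)]


def secded_decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    code = list(word[:7])
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single flipped bit
    overall_ok = parity_ok(word)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome:
            code[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [code[2], code[4], code[5], code[6]]


data = [1, 0, 1, 1]

# Parity only: the flip is noticed, but nothing says which bit is wrong.
stored = data + [parity_bit(data)]
stored[2] ^= 1
print("parity check passes:", parity_ok(stored))  # False

# SECDED: one flip is repaired, two flips are only detected.
word = secded_encode(data)
single = list(word)
single[4] ^= 1
double = list(word)
double[1] ^= 1
double[5] ^= 1
print(secded_decode(single))  # ('corrected', [1, 0, 1, 1])
print(secded_decode(double))  # ('uncorrectable', None)
```

With parity alone the only safe reaction to a mismatch is to stop, which matches the crash behaviour described above. With the wider code, a single flipped bit never becomes visible to the software at all.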

For a parity error there is no reason to upgrade (with one or two exceptions not on this platform). A parity error like you stated is a physical problem. The question is if the error was a transient error. A hard parity error will cause a bit flip repeatedly when the memory location is accessed. Since the DRAM is protected on this platform, the vulnerable memory is very small in size and used constantly. I doubt you will see the crash again.
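The distinction between a transient and a hard error can be illustrated with a small simulation. This is only a sketch under simple assumptions: a "hard" fault is modelled as a bit stuck at one value, a transient fault as a one-off flip of otherwise healthy storage, and the class and function names are hypothetical rather than any Cisco diagnostic.

```python
class Cell:
    """One simulated memory bit; stuck_at=None means the cell is healthy."""

    def __init__(self, stuck_at=None):
        self.value = 0
        self.stuck_at = stuck_at

    def write(self, bit):
        self.value = bit

    def read(self):
        # A stuck cell returns the same bit no matter what was written.
        return self.value if self.stuck_at is None else self.stuck_at

    def upset(self):
        """One-off flip of the stored bit (a transient event)."""
        self.value ^= 1


def classify(cell, rounds=8):
    """Rewrite and re-read the cell; a repeating mismatch means a hard fault."""
    failures = 0
    for i in range(rounds):
        expected = i % 2  # alternate 0 and 1 so either stuck value is caught
        cell.write(expected)
        if cell.read() != expected:
            failures += 1
    return "hard" if failures else "transient"


soft = Cell()
soft.write(1)
soft.upset()                 # the bit flips once
print(soft.read())           # 0: the corrupted value is seen once
print(classify(soft))        # transient: rewriting clears the problem

hard = Cell(stuck_at=0)
print(classify(hard))        # hard: the error comes back on every access
```

In this model, memory that is rewritten and read constantly would show a hard fault again almost immediately, which is why a single crash with no recurrence points towards a transient event.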

All releases of IOS contain bugs. The question is really if the bugs could potentially affect you. Why would you care if there were x bugs in a feature you were not using?

Asking people to suggest code to you without them knowing anything about your network is not wise.

If you look at the release notes for SXH4, you will see they contain a list of bugs in the release that were known up front. Safe Harbor documents the bugs found during testing in a controlled environment. You should really read the results and determine if the defects are a concern for you. You should also be aware that you may be using features that were not configured or exercised in the Safe Harbor testing whose results you are looking at.

I hope that helps.

cbeswick Tue, 04/14/2009 - 00:08

Many thanks for your response.

Could you please expand a little on your comment:

"A parity error like you stated is a physical problem. The question is if the error was a transient error. A hard parity error will cause a bit flip repeatedly when the memory location is accessed. Since the DRAM is protected on this platform, the vulnerable memory is very small in size and used constantly. I doubt you will see the crash again."

Are you saying that parity errors in the cache used by the CPU are unavoidable? If they happen (though rarely), is it just something we have to live with, and can no software upgrade rectify the issue? Furthermore, if we do suffer from this type of issue, is the silicon we are using more likely to have these types of defects, or is the probability of a recurrence so remote that it isn't even worth worrying about?

I suppose the million dollar question is whether to upgrade or not (assuming that this is a software issue). Should the crash re-occur, however unlikely that may be, and "the powers that be" saw that no action was taken previously, then some serious questions will be asked.

Thanks in advance.

Correct Answer
gephelps Tue, 04/14/2009 - 04:01

Parity errors in general are unavoidable in our environment.

The problem with a parity error isn't one of software. All electronics are vulnerable to parity errors. The difference is how these errors are dealt with. A parity error means that the value read from a location in memory is detected to be different from the value that was written to it.

The software can choose to continue with this incorrect data. The software can also determine that the integrity of the data is critical and that it needs to take corrective action. Part of the problem lies in which value was actually corrupted. With IOS the system crashes because the integrity is critical.

Parity errors aren't a Cisco-specific problem. I have seen parity errors on other platforms as well, and a search on Google will show many examples of this.

If you had opened a TAC case on this and it was assigned to me I would be giving you the same information.

"The powers that be" as you stated before typically are not technical. You could replace some hardware to appease them, but this would not prevent a parity error in the future.

A lot of times people want to do something just to show that something is being done. Monitoring the device is a course of action.

It has been a while since I took any probability courses, but since you asked, I believe the probability does not favor replacing the device. Say you had two pieces of equipment, each with the same chance of experiencing some rare, random event. Seeing the event twice on the same device is very unlikely to begin with, and once one device has experienced the event, it is no more likely to see it again than the device that has not. Swapping the hardware therefore does not lower the chance of another transient parity error.
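The probability question can be checked with a short exact calculation. The sketch below rests on an assumption that is not stated in the thread: transient events are independent from one period to the next and every device has the same per-period chance `p`. The value of `p` is purely hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional_repeat(p):
    """P(event in period 2 | event in period 1) for independent periods."""
    joint = Fraction(0)   # event in both periods
    first = Fraction(0)   # event in period 1, regardless of period 2
    for e1, e2 in product([True, False], repeat=2):
        prob = (p if e1 else 1 - p) * (p if e2 else 1 - p)
        if e1:
            first += prob
            if e2:
                joint += prob
    return joint / first


p = Fraction(1, 1000)             # hypothetical per-period chance of an event
print(p * p)                      # 1/1000000: two events on one device, judged up front
print(conditional_repeat(p))      # 1/1000: chance of a repeat once the first has happened
```

Under this assumption a device that has already crashed once carries the same risk going forward as a brand-new replacement, so swapping the hardware buys nothing. If the errors kept recurring on one unit, the independence assumption would no longer hold and a hard fault would be the more likely explanation.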

cbeswick Tue, 04/14/2009 - 04:47

Many thanks again, this has proved to be extremely useful.
