This morning we updated a 7606-S router with RSP720s from 12.2(33)SRC6 to 12.2(33)SRE9 (the rp- and sp- rom-monitors were updated too). After the reload some of the cards in the router developed diagnostic failures that weren't there before:
- one X6724 card (that got its rom-monitor updated too) failed TestL3VlanMet
- one X6708 card failed TestLoopback for port 8 (unused, unpopulated) and also keeps failing TestFabricCh0Health; there is another X6708 in the shelf (same Fw and Hw versions) which didn't develop any issues
Since the problematic X6708 only has two used ports (one in each fabric channel), we could afford re-seating it - nothing changed. Traffic on the card doesn't seem to be disturbed, and also "show fabric channel-counters" doesn't output anything suspicious. Besides the "Major Error" status, another thing that bothers me is the "%HA_EM-6-LOG: Mandatory.go_fabrich0.tcl: GOLD EEM TCL policy for TestFabricCh0Health" log entry every 10 seconds.
As for the X6724, what problems could the failed TestL3VlanMet cause? I read that re-seating the card is recommended for that one as well, but if we're going to schedule a maintenance window for this I'd like to know what else I could try in order to get to the root cause of all these test failures... I guess it's not the IOS's fault, right?
I have worked with Cisco gear for 10 years and I have seen weirder things than what you have. I would just back up the whole config, roll back to the previous IOS and reboot to see what happens.
If that does not help, roll back to the previous ROMMON. A friend told me never to upgrade when everything works fine; this could be an example.
There are many bugs and field notices that cause the weirdest things to happen to hardware, so do not be surprised if it is the IOS / ROMMON sp/rp images.
I thought about rolling back, but 12.2(33)SRC is sooo old and end-of-everything (life/sales/support/etc)...
I'm still hoping this is sort of an exact science :), and someone will eventually manage to guide me into debugging this... it's a good thing I can still afford to wait (as I said, so far traffic doesn't seem to be affected).
When two cards that worked fine suddenly show diagnostic failures after a ROMMON and IOS update, a firm re-seat will likely not help; that would be too easy.
I have tested dozens of s720/RSP720 systems with 6700-series line cards, and the 12.2(33)SRx releases are full of bugs; we have seen cosmetic false messages before. We had customers complaining about similar issues, and the 12.2(33)SRx releases were the number one cause of all sorts of problems.
I am not suggesting you change the IOS for good. You mentioned you will schedule a maintenance window, so a rollback is an easy and fast way to confirm whether the diagnostics fail because of the IOS change.
Cisco Generic Online Diagnostics (GOLD)
Generic Online Diagnostics on the Cisco Catalyst 6500 Series Switch
It is possible that the newer IOS has a better or different way of performing the GOLD tests. As far as I know you cannot clear the failed diagnostic result from the tests. You can choose to instruct the supervisor to skip a specific test after a new bootup.
Do you have another chassis in your possession where you could put the two cards during the maintenance window? Also check the backplane and connectors to be sure.
Sent from Cisco Technical Support iPad App
Unfortunately there's no spare chassis around to test the cards in, and also no available slots in this router to move them to. What I can do is wait for an almost identical router in another location to be scheduled for the same update, and see what happens to that one. We've updated another 7606 in the past without issues, but there were some differences: it was 3C (vs. 3CXL in this case), it didn't have any X6708 cards, and the X6724 card didn't get its rom-monitor updated.
Based on your experience, could you maybe recommend a stable train in the 15 series IOS that doesn't suffer so much from false diagnostics results?
1. Re-seat the line cards one-by-one.
2. If the problem persists, move one of the line cards to another slot.
3. If the problem persists, then call TAC and get ready to RMA the chassis.
Step 1 determines if you have a line card problem. Most of the time, the problem clears after a re-seat.
Step 2 determines if you have a backplane problem. It happens, but rarely.
Both steps unavailable unfortunately (no spare chassis around, no free slots in the router). If these are false errors, is it possible that they are caused by the active RSP? I mean, could switching to the stand-by RSP change anything in this behaviour?
And there's another thing that's puzzling me: TestFabricCh0Health is configured to run every 5 seconds and seems to fail exactly every other attempt (one failure every 10 seconds), which explains why the card has never been reset or powered down:
rc1#show diagnostic content module 1 | i Interval|Attributes|TestFabricCh0Health
                                                       Test Interval   Thre-
  ID   Test Name                          Attributes   day hh:mm:ss.ms shold
    2) TestFabricCh0Health -------------> ***N****A*** 000 00:00:05.00 10
On the other hand, all failures for this test are reported as consecutive:
rc1#show diagnostic result module 1 test 2 detail | b TestFabricCh0Health
2) TestFabricCh0Health -------------> F
Error code ------------------> 1 (DIAG_FAILURE)
Total run count -------------> 8592
Last test execution time ----> Nov 12 2013 13:40:11
First test failure time -----> Nov 11 2013 15:24:11
Last test failure time ------> Nov 12 2013 13:40:11
Last test pass time ---------> n/a
Total failure count ---------> 8592
Consecutive failure count ---> 8592
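A quick sanity check on those numbers, as a rough sketch: the timestamps and run count below are copied from the output above, and the calculation assumes the first failure coincides with the first run since the reload (which seems plausible because "Last test pass time" is n/a and the run count equals the failure count, but it is an assumption). Under that assumption the average spacing between runs works out to roughly 9.3 seconds, which is closer to the 10-second log cadence than to the configured 5-second interval. That would be consistent with every run failing, rather than every other one.

```python
from datetime import datetime

# Values copied from "show diagnostic result module 1 test 2 detail" above.
FIRST_FAILURE = "Nov 11 2013 15:24:11"
LAST_FAILURE = "Nov 12 2013 13:40:11"
TOTAL_RUNS = 8592
CONFIGURED_INTERVAL_S = 5.0  # from "show diagnostic content module 1"

TIME_FORMAT = "%b %d %Y %H:%M:%S"


def mean_interval_seconds(first: str, last: str, runs: int) -> float:
    """Average time between runs, assuming `first` is the first run
    and `last` is the most recent one (runs - 1 gaps in between)."""
    start = datetime.strptime(first, TIME_FORMAT)
    end = datetime.strptime(last, TIME_FORMAT)
    return (end - start).total_seconds() / (runs - 1)


interval = mean_interval_seconds(FIRST_FAILURE, LAST_FAILURE, TOTAL_RUNS)
print(f"configured interval: {CONFIGURED_INTERVAL_S:.2f} s")
print(f"observed mean interval: {interval:.2f} s")
```

If the counters were cleared at some point, or the test did not start right at the first failure time, the estimate would shift, so treat it as a hint and not as proof.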
Both steps unavailable unfortunately (no spare chassis around, no free slots in the router).
Raise a TAC Case using this thread (upper right-hand corner) and arrange for an RMA.