We have a 6509 SUP720 IOS switch that has suddenly started to reload randomly on its own.
There is no consistent time for the reloads.
There is nothing in the log files to indicate why this is happening. There are no high CPU utilization problems prior to the reloads.
The power is supplied in redundant mode and is on battery backup.
TACACS logs have been checked for any issued reload commands.
Originally, there were crash files sent to Cisco and they issued an RMA for the SUP720, saying the crash file indicated a hardware problem.
The SUP has been replaced with no change in the results; we are still getting 1 to 2 reloads a day on this switch, and no crash files.
When the SUP was replaced, I loaded the latest image in the train we are running (s72033-ipservicesk9-mz.122-18.SXF17a.bin), with no change.
After being informed there was no change, Cisco TAC said they are unfortunately totally in the dark as to why this could be happening and suggested we look at power issues.
Anyone ever seen a 6509 do this?
System returned to ROM by reload at 22:24:04 UTC Fri Oct 1 2010 (SP by reload)
System restarted at 21:33:38 EDT Mon Oct 4 2010
System image file is "disk0:s72033-ipservicesk9-mz.122-18.SXF17a.bin"
If we assume for a moment that it's not a bug, then the reload may come from:
Check for SNMP RW ACL. Try changing RW community on the switch (without changing it elsewhere).
Run these commands:
"show kron schedule"
Tell me if you're seeing anything.
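If the kron check comes up empty, here are a few other places a timed reload can hide. Note that support for the EEM and "show snmp community" commands depends on your exact 12.2SX release, so treat those as best-effort:

```
show reload                            (any pending "reload in/at"?)
show kron schedule                     (kron occurrences)
show running-config | include kron|event manager
show event manager policy registered   (EEM applets/scripts, if supported)
show snmp community                    (who has RW access, if available)
```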
Thank you Timor,
I had looked at the scheduled reload and did not see anything, but I never thought of a kron job. Here is what I see:
LKLD-6509--2# sh reload
No reload is scheduled.
LKLD-6509-RTR-2#sh kron schedule
We do have a couple of devices with snmp RW access.
Not to be a conspiracy theorist, but we did have an unhappy IT employee leave just prior to this starting.
Thanks for the reply, I will see if anything turns up.
Start graphing the CPU/memory utilization of the Sup using cacti to see if an IOS memory leak or race condition is causing the reboot. The other option is to do a "show power" and see whether your chassis can run on a single power supply. If it cannot and one supply is failing, that will cause periodic reboots like this and be practically untraceable.
Do you have dedicated circuits going to this chassis? Did someone potentially plug something into the same circuit(s) and cause a peak overload issue? I've seen circuits that also have copiers/printers/etc on them that will overload the circuit momentarily when those devices hit peak draw warming up...this could push your 6509 to the edge and cause a reboot.
Just some ideas...
Sounds like the backplane of the 6500 could be faulty. How many PSs do you have installed? Can you make sure each PS is properly seated and the screw is properly secured?
Thanks for all of the input.
I had looked at power and CPU over several days and did not see anything.
I was wondering about a backplane issue also, BUT:
I disabled SNMP on the switch yesterday morning, and so far no reboot from yesterday through this morning at 9:12 AM.
It had rebooted every day prior to that, so we may be on to something.
I have my fingers crossed.
6509 still reloading with SNMP removed completely.
SUP720 has been RMA and replaced
Switch is in redundant mode for power and can run off of a single power supply.
CPU utilization is approx 10%, nothing in the logs to indicate failure or problem of anything.
I had a console set up to monitor and was logging debugging level to console.
Last reboot, the log had no entries for 45 minutes prior to the reload, then shows reboot process.
Still no crash files.
Everything is passing diagnostics on boot.
The only thing I see that is askew is:
000132: Oct 8 21:20:02.631 EDT: %ILPOWER-5-ILPOWER_POWER_DENY: Interface Fa4/15: inline power denied
000133: Oct 8 21:20:09.267 EDT: %ILPOWER-5-ILPOWER_POWER_DENY: Interface Fa4/8: inline power denied
After the last few reboots, I have had to "shut/no shut" some ports (I do not recall if they were all on blade 4) to get them back up.
Is this a clue?
Can you post a 'show tech' for review? You mentioned that you checked the TACACS logs for any commands being run around the time of the reloads. Did these show anything? Also, have there been any power issues reported at the site? Any other devices connected to the same power source that aren't seeing the issue or any other boxes at that site that are or are not seeing the problem?
We just had the entire switch RMAed (every blade and the chassis), moved the switch up to 12.2.33 from 12.2.18, and the switch still did a reload last night.
The switch is on the same power source as other devices that are not being affected.
TACACS log show nothing, no log ins, nothing in accounting or administration logs.
Switch was moved from redundant power to combined to make sure that power was not on the edge.
The only things that were not replaced were the power supplies, fan tray, and POE daughter cards.
TAC has seen the "sh tech", and I cannot really post that here.
Everyone is stumped.
This switch has never done anything like this before.
Can you provide the following outputs -
show environment all
Given that almost everything has been replaced and there are other devices on the same source not seeing the issue, it sounds like the best next step will be to replace the supplies. I want to see if any alarms or negative readings have been reported by the device although I suspect not. Thanks.
I still stand by power being the problem...
Graph the power, you can use these OIDs:
power supplies & switch load are under - 188.8.131.52.184.108.40.206.220.127.116.11 (these values are in centiAmps at 42V so you have to do some math to get the usable readings)
Port POE Values are under - 18.104.22.168.22.214.171.124.402
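Since those power-supply values come back in centiAmps on the 6500's 42 V rail, here is a quick sketch of the conversion math; the raw reading below is made up purely for illustration:

```python
# Convert a Catalyst power-supply SNMP reading from centiAmps (reported
# against the 6500's 42 V internal rail) into amps and watts.

RAIL_VOLTAGE = 42.0  # volts; the rail the MIB reports current against

def centiamps_to_watts(centiamps):
    """Return (amps, watts) for a raw centiAmp SNMP reading."""
    amps = centiamps / 100.0
    return amps, amps * RAIL_VOLTAGE

# Hypothetical raw reading of 2771 centiAmps pulled from the graphing tool
amps, watts = centiamps_to_watts(2771)
print(f"{amps:.2f} A at {RAIL_VOLTAGE:.0f} V = {watts:.1f} W")  # 27.71 A at 42 V = 1163.8 W
```

Graph the watt figure over time; a supply drooping under load right before each reload would show up clearly.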
As a first step, I'd turn off *ALL* POE ports. I know that is annoying, but it is possible a device is sending a short or overload back up the link and causing a crash/reboot.
If that doesn't help, swap power supplies after verifying load support (show power). If that doesn't help, RMA power supplies *only* if you cannot support the full load on a single power supply in redundant mode. If you don't know how to check this, please ask and we'll help.
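To make the single-supply check concrete, the arithmetic behind it is just capacity minus allocation; the numbers below are placeholders, so read the real values off your own "show power" output:

```python
# Rough redundancy check: in redundant mode the chassis can only draw what
# ONE supply provides, so the total power allocated to modules must fit
# within a single supply's capacity. Both figures below are hypothetical.

single_supply_capacity_w = 1153.32   # placeholder: one PS output in watts
allocated_w = 980.0                  # placeholder: sum of module allocations

headroom_w = single_supply_capacity_w - allocated_w
if headroom_w < 0:
    print("NOT safe on a single supply: over by %.1f W" % -headroom_w)
else:
    print("Fits on a single supply with %.1f W of headroom" % headroom_w)
```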
I assume you changed all the SNMP communities, stopped all krons, cleared all TCL scripts, etc. Make sure you are logging all authentications/access to ACS or another server so you can see if someone logs in before the reboot (malicious script or accidental activity).
You guys were on the right track with thinking about the power.
That is what it ended up being related to, but it was one of the POE daughterboards causing the problem.
As mentioned earlier, after replacing everything but the power supplies and the POE cards, we had another reboot.
After looking further, I saw that this time it was not a reboot but a crash. We started seeing errors on blades 7 and 9.
The entire blade 7 was going into a disabled state; after reseating the blade, the ports came back up but would disable again after a few minutes.
The POE module was put in a different blade, and the switch crashed and went to ROMMON mode. This time we pulled the POE module, put the blade back in, and the switch stayed stable for two days.
We got the POE cards RMAed and the switch has been fine for a week now, whereas it was reloading once or twice a day.
The old switch, modules, and POE cards never showed any problems with power inline or anything. Disabling POE was a good thought, but I never got a chance to get that far, as the events sort of led us down that path anyway.
I just wanted you guys to know what it ended up being and I really appreciate your input.