10-19-2006 10:16 AM - edited 03-03-2019 02:24 PM
I run a 72-site hub-and-spoke Frame Relay network. There are dual 3660's at the host site, and each remote site has dual PVCs back to each 3660. The IOS running on the 3660's was rather old (12.2), and I upgraded to 12.4 last night. Before I left, and after the successful upgrade, I checked all the interfaces and pinged a few sites and was satisfied that all was good.
At 6:30 this morning I started getting phone calls that the entire network was down. I raced into work and found that approximately 30 sites were "down" -- that is, the HSSI interfaces were up/up but would not respond to pings, and I could not telnet into them. I reverted to the old image with a boot statement and all sites came back up.
I opened a case with Cisco, and they advised that older IOS versions automatically adjusted the MTU size, but that with the newer release you have to specify it manually. My host MTU size for the HSSI interface is 4470; the remote serial interface MTU size is 1500.
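For anyone hitting the same thing, a minimal sketch of what pinning the MTU explicitly might look like -- assuming, per TAC's advice, that the fix is to configure the values by hand rather than rely on the old auto-adjust behavior. Interface numbers here are illustrative, not from Rob's actual configs:

```
! Hub 3660 -- explicitly set the HSSI MTU so the new IOS
! cannot silently pick a different value after reload
! (interface number is illustrative; adjust to your hardware)
interface Hssi1/0
 mtu 4470
!
! Remote spoke -- serial interfaces stay at the 1500 default,
! but stating it makes the intent survive future upgrades
interface Serial0/0
 mtu 1500
```

After the change, `show interfaces hssi 1/0 | include MTU` should confirm the configured value took effect.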
Now I have to explain why this happened and why I didn't catch it beforehand. Has anyone else ever experienced this? What could I have done differently? I want to learn from this mistake, as I can't afford for an outage of this magnitude to happen again.
Thanks for any constructive criticism or advice you may have.
Regards,
Rob
10-19-2006 10:29 AM
Rob,
Mistakes like that are common, especially when upgrading so many devices at one time, which in my opinion is where the first mistake was. Upgrading to a new image should always be done on a test network first. Doing this upgrade on a test network could have caught this change, and the outage might not have happened. Mind you, hindsight is always 20/20, and like you said, you want to learn from your mistakes -- this is just something you have to experience and correct in the future. What did management have to say about this? Good luck.
~Shannon
10-19-2006 01:53 PM
Along with the first reply, you need to do some sort of bug scrub of your target IOS. That is, dig up all the Sev1 and Sev2 issues and see if any of them hit your environment.
10-19-2006 02:25 PM
Rob,
If I've got you right, you upgraded two devices?
In my experience, I only upgrade the IOS of a device as part of a requirement to access some new feature, as a result of a failure that can be attributed directly to the software, or from a need to standardise the software version across multiple sites using the same model of router.
I would suggest that next time, failing having a test environment, you look at upgrading only one box at a time, with a bedding-in period between upgrades during which live data can be used.
I don't think you should beat yourself up too much about this one. Stuff happens.
Tony Henry
10-20-2006 04:11 AM
Hi,
Agree with the previous comments - stuff happens and with the best will in the world (reading all the release notes, searching the bug toolkit, etc.) you will still get unforeseen issues.
So, knowing that issues will crop up no matter how prepared you are, what can be done? There's two things - a test plan and a rollback plan.
The test plan should include user testing of applications, because at the end of the day only the application users can verify whether things are working properly. (You don't normally need a tester at every remote site -- a representative sample should be adequate.) This can never be too detailed: for a major change I'd get the business managers to nominate testers for all the affected apps. After the change, you simply get the business managers to sign off that the change is OK. If problems arise, you have a pre-prepared rollback plan to revert the change.
That said, we don't always follow best practice, and no one on this forum who's been in networking for any length of time can put hand on heart and say they've never been in your situation (myself included! ;-)
HTH
Andrew.
10-21-2006 04:40 PM
I ran into the exact same issue a couple of weeks ago, going from 12.3T to 12.4. Luckily, I noticed the issue right away. Logs are your friend.
Doing a terminal monitor or show log over a two-minute period would have shown the neighbors going up and down. As part of my test plan, I check the stability of my routing table by doing a show ip eigrp neighbors and a show ip route and verifying that the last-changed timers aren't resetting and the Q counts are at 0.
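As a quick checklist, the post-upgrade stability checks described above boil down to a handful of standard IOS commands (run them twice, a few minutes apart, and compare). This is a sketch for an EIGRP network like Rob's; substitute your own IGP's equivalents if you run something else:

```
! Watch for neighbor flaps in real time on the console session
terminal monitor

! Neighbor uptimes should keep climbing and Q Cnt should stay 0;
! a resetting uptime or nonzero queue means instability
show ip eigrp neighbors

! Route age timers should not keep resetting to 00:00:xx
show ip route

! Scan the buffered log for adjacency changes since the upgrade
show logging | include EIGRP
```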
Other things I do, not specific to your particular issue, include pumping some traffic through the network with Solarwinds WAN Killer tool. Some bugs are load related.
Using the Bug Toolkit beforehand and checking every Sev1 or Sev2 bug to make sure it won't bite me.
Making a test call to a remote site (we run VoIP)
Doing a show proc cpu and show mem and looking for weirdness.
And capturing a show tech for a baseline.
If it's a hub router, I typically bounce around the network (telnet, SSH) from my workstation to make sure I have connectivity end to end as well. As you found out, pings to directly connected interfaces can sometimes be deceiving.
HTH,
Cliff
10-25-2006 09:44 AM
Thanks everyone for the advice. This was definitely a learning experience that has taught some valuable lessons for the future.
I did have to do a little song and dance about why this happened and what I would be doing in the future to prevent it from happening again. The thoughts and suggestions presented here enabled me to present a clear-cut test plan and rollback plan that made everyone (including myself) comfortable.
Regards,
Rob