Today on one of our redundant 6509s the CPU load went to 90%, and both 6509s had EIGRP neighbor messages in syslog. Where is the best spot to start troubleshooting the cause? What could have caused this? A possible broadcast storm?
There could be numerous reasons for high CPU utilization.
Check out this link
The EIGRP neighbors would not have received hellos from each other during this period, which likely explains the log messages.
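When it happens again, a quick first look on the console helps narrow things down. These are standard IOS show commands, but exact options vary a bit by IOS version, so treat them as a starting point:

```
6509# show processes cpu sorted
6509# show processes cpu history
6509# show logging | include EIGRP
```

If the top process is an interrupt-driven or input-queue process rather than a routing process, that usually points at traffic being punted to the CPU rather than a protocol problem.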
Thanks for the link... I read that before and it seems like a good article, but I am unsure how to determine if we had a broadcast storm that caused the issue. We did see a lot of our interfaces go into high utilization. Also, if we did have a broadcast storm, how do we determine what caused it?
Check the switch Layer 2/3 interfaces and see if there is a lot of broadcast traffic on them. The only way to determine the source would be to mirror the traffic on the switch by setting up a SPAN session.
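For example, to spot which interfaces are taking the broadcast hit (the interface name here is just a placeholder), clear the counters and then watch the broadcast deltas:

```
6509# clear counters GigabitEthernet1/1
6509# show interfaces GigabitEthernet1/1 | include rate|broadcasts
```

A broadcast counter climbing by thousands per second on an access or uplink port is a strong sign of a storm on that segment.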
Whenever there is another instance such as you are describing, the first thing you should do is set up a SPAN session and mirror the traffic coming from the uplink interface. You can also sniff the CPU queues by using CoPP (control plane policing) to check which unwanted traffic is being processed by the CPU. Check which IOS version you are running and see if it supports control plane policing.
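A minimal local SPAN session on the 6509 might look like this (interface numbers are placeholders; plug your sniffer into the destination port):

```
6509(config)# monitor session 1 source interface GigabitEthernet1/1 rx
6509(config)# monitor session 1 destination interface GigabitEthernet1/2
6509# show monitor session 1
```

Remember to remove the session (`no monitor session 1`) when you're done, since the destination port stops behaving like a normal switchport while the SPAN is active.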
I have used CoPP many times on the 6500/4500 and it has helped me a lot. One of my clients' networks went down when it was hit by the MS-SQL worm, which we diagnosed by sniffing the CPU queues on a 4500; that told us which hosts were sending the traffic. The whole network was back up again in less than an hour.
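As a rough sketch of the CoPP idea (exact police syntax and sensible rates vary by platform and IOS train, and the ACL here just matches the MS-SQL Slammer worm's UDP/1434 as an example):

```
6509(config)# access-list 120 permit udp any any eq 1434
6509(config)# class-map match-all WORM-TRAFFIC
6509(config-cmap)# match access-group 120
6509(config)# policy-map COPP-POLICY
6509(config-pmap)# class WORM-TRAFFIC
6509(config-pmap-c)# police 32000 conform-action drop exceed-action drop
6509(config)# control-plane
6509(config-cp)# service-policy input COPP-POLICY
```

`show policy-map control-plane` then shows per-class hit counters, which is how you identify what is actually reaching the CPU.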
I came in this morning and found out what happened. I was looking through our helpdesk tickets and there was one from this weekend that stated there was a switch under someone's desk and it had a patch cable that was plugged into itself :)
OK... I am looking for other comments. A co-worker is stating that this would not cause the issue. The switch was just a Netgear (I think) and was under someone's desk. The switch was plugged into itself and then plugged into a switch that was on VLAN 100 and had spanning-tree portfast. Would this possibly cause a broadcast storm and drive the 6509 CPU load high?
If the Netgear was plugged into itself, and then uplinked to the 6509, then yes, it was the cause of the problem. The Netgear would have taken every Ethernet frame it saw, looped it, and repeated it back to the 6509. This is known as a topology loop, and it usually brings a network to its knees in a matter of seconds.
The looped cable would certainly cause a broadcast storm, in that every broadcast frame inbound from the 6509 to the Netgear would get looped and repeated back to the 6509 (and everyone else on the segment) over and over again as fast as their little hardware minds could do it. In the case of a 6509, that's pretty darn fast. :-) And yes, this event would kick the 6509's CPU right square in the head - high CPU utilization during the topology loop, and all the nasty problems that can go along with a CPU running out of gas. (Routers losing touch with neighbors, output queue drops, all sorts of network communications failures, sluggish console response if any, and just general hell on earth until you find the problem.)
Do some cisco.com searches for UDLD (unidirectional link detection) and spanning-tree loopguard. Those protocols, if supported on your 6509, would probably have saved the day for you.
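The global versions of those features are just one line each (verify support on your IOS train first):

```
6509(config)# udld enable
6509(config)# spanning-tree loopguard default
```

Note that the global `udld enable` only turns UDLD on for fiber-optic interfaces; on copper ports you'd need `udld port` at the interface level.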
Also take a look at BPDUguard, which disables a portfast port if it sees a BPDU. BPDU's (bridge protocol data units) are sent out by switches running spanning-tree. A port configured with portfast should never receive a BPDU in a properly designed network. If the Netgear switch runs spanning-tree (very doubtful, but possible), then the Cisco would have err-disabled the port when the Netgear sent out BPDU's.
Since this Netgear switch was plugged into itself, then into a Cisco switch (3550), and from the 3550 back to the 6509, would you put the UDLD (unidirectional link detection) and spanning-tree loopguard on the Cisco switch (3550) or on the 6509?
Okay - so you've got this sort of switch arrangement:
NetGear <-> 3550 <-> 6509
I was missing the part about the 3550 in the middle, but it doesn't change the application of the technology. One thing that's worth mentioning here is that UDLD is intended to work with other UDLD switches. In other words, UDLD works because the switches on either side of the link are talking UDLD to one another and keeping track of one another because they're having a nifty UDLD conversation. If one side doesn't hear the UDLD stuff that he's expecting to hear from the other side, then he knows that something went wrong with the link, and he shuts it down. Most often, this scenario comes up with fiber links. Fiber uplinks have 2 strands, a send and a receive. If one or the other strand goes bad, you risk a topology loop, so UDLD shuts the link down before spanning-tree can move that broken link to a forwarding state inappropriately.
So... to use UDLD to prevent the kind of loop you described (the NetGear plugged into itself) here is admittedly a gamble on my part, as your NetGear doesn't speak UDLD. My understanding from a fellow engineer (who says he's done it before) is that if a UDLD-enabled port sees its own UDLD hello, it's going to know something is wrong and shut the port down. I've never had the chance to test it myself, nor was I able to find a Cisco article that documented this functionality.
Loopguard takes a different approach than UDLD, but is aimed at a similar task: preventing spanning-tree loops from occurring.
To deal specifically with your NetGear situation, you'd need to apply these on your 3550, so that the 3550 could detect the problem and shut down the port the NetGear is uplinked to. You can run both protocols at the same time. To gain the maximum benefit from either protocol, consider applying them on all the Cisco switches in your network. Make sure you understand what you're telling the switches to do, though. It's never a good idea to just blast commands into your switch config because some guy told you they were cool. Read up and be sure they're appropriate in your world, then roll it out slowly, one switch at a time.
One other spanning-tree feature that I should have mentioned yesterday, had I been more awake: BPDUguard. BPDUguard is simple in operation. If you have a "portfast" enabled port (say, the 3550 port that uplinks to the NetGear), and that portfast-enabled port receives a BPDU (the mechanism spanning-tree devices use to talk to each other), a switch running BPDUguard will disable that port.
When a port is configured with portfast, there's an assumption that what's connected to it is an edge device - a PC, a printer, a server, a web-capable coffee-maker - but NOT a switch. Portfast puts a port into forwarding mode immediately, when normally the port would go through listening and learning states first, causing a delay in passing traffic that a PC might experience as being unable to obtain a DHCP address or login, failure to map drives, etc. when first booting up. Portfast eliminates the delay, but there's this ugly risk that if someone, for instance, plugs a Netgear switch into itself, a topology loop will get created.
BPDUguard would solve this by shutting down the port when it sees a BPDU originating from the 3550 get looped back around. The 3550, as part of the spanning-tree, sends BPDU's all the time. The BPDU would go up to the NetGear, get looped around, and get sent back to the 3550. When the 3550 running BPDUguard sees the unexpected inbound BPDU, he shuts the port down.
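Turning this on is a couple of lines (the first enables BPDUguard on all portfast-enabled ports globally; the errdisable recovery lines are optional and bring an err-disabled port back automatically after the timer expires, so you don't have to bounce it by hand):

```
3550(config)# spanning-tree portfast bpduguard default
3550(config)# errdisable recovery cause bpduguard
3550(config)# errdisable recovery interval 300
```

If you'd rather be selective, `spanning-tree bpduguard enable` at the interface level does the same thing for a single port.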
I'd dig in a bit and see what you like best as it might apply to your world; there's no one right answer to this problem: