HSRP Duplicate address event

epeeler · ‎04-12-2007

Hello all,

I had an event on my network core last night that has me a bit stumped.

I have two 6509s, each with an msfc in it. There are about 25 vlans/subnets, with hsrp configured for each subnet between the two msfcs.

What I saw last night was a duplicate IP address reported by each msfc on two different vlans (1 and 15). The duplicate report was for the physical interface addresses, not the hsrp virtual address. Each msfc saw its own hsrp packet, I assume.

Obviously a loop happened somewhere. What has me a bit confused is how it happened on two vlans at the same time. If a user accidentally plugged a cable into two switch ports, I could see a loop happening, but that would only be on that particular vlan. As well, if it were just some STP issue, why did it only happen on two vlans and not the others?

As a side effect (I think), I see that all of the hsrp interfaces on the standby msfc went active. I think this is a result of a cpu spike that occurred because of the loop. I can see the spike in Cricket on both routers.

I don't have any other messages in the logs. No interfaces bouncing, nothing, just:

1. Loop on vlan1

2. Loop on vlan 15

Then all of the hsrp state change messages on the standby router as it went active and back to standby.

Whatever was causing the loop must've stopped since the whole event was over after 5 minutes and things have been normal for 14 hours now.

I've checked the uplinks and cpu stats for my edge switches to see if there are any corresponding spikes in usage there and there's nothing.

I'm at something of a loss as to what else to check to try and determine what happened.

Any ideas?

mihanlin · ‎04-12-2007

What I would check is to see what common aspects you have between vlan 1 and 15.

So, what switch is the root for vlan 1 and 15 - are they different to the other vlans? (check vtp pruning if you have this enabled)

Are there many TCNs generated on 1 and 15?

Are there any trunks which carry only vlan 1 and 15?

Are there any duplex mismatches that you can see?

That should give you a few ideas on where to look initially.

Thanks

Michael

Cisco TAC

epeeler · ‎04-12-2007

Could sustained high cpu utilization on a switch cause a port to drop out of blocking and start forwarding for random vlans? I know this is separate from the MSFC cpu, but perhaps one of the edge switches became overloaded, started forwarding on a port where it should have been blocking, and caused the whole mess?

Every now and then someone does something insane in my lab which spikes everything but I have no CPU history for the switches in there to check it.

Just wondering...and grasping at straws since I really don't have anything common about vlan1 and vlan15.

epeeler · ‎04-12-2007

Well...never mind. One of my edge switches had somehow had its syslog server entry disabled, so I didn't see any messages from it in my central log file.

I logged into it and checked it and found this:

2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:Null Cluster while getting next from address 0x30ae5770

2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:No clusters left while allocating for address 0x409aec00

2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:No clusters left while allocating for address 0x40a03500

2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:Null Cluster while getting next from address 0x30ae5770

2007 Apr 11 17:05:30 %SYS-4-P2_WARN: 1/Astro(5/5) - timeout occurred

2007 Apr 11 17:05:56 %SYS-4-P2_WARN: 1/Astro(3/5) - timeout occurred

2007 Apr 11 17:05:57 %SYS-4-P2_WARN: 1/Astro(4/6) - timeout occurred

2007 Apr 11 17:06:01 %SYS-4-P2_WARN: 1/Host 00:02:fc:bd:0a:78 is flapping between port 1/2 and port 1/1

2007 Apr 11 17:06:12 %SYS-4-P2_WARN: 1/Astro(2/6) - timeout occurred

2007 Apr 11 17:06:14 %SYS-4-P2_WARN: 1/Host 00:02:fc:bd:0a:78 is flapping between port 1/2 and port 1/1

2007 Apr 11 17:06:16 %SYS-4-P2_WARN: 1/Host 08:00:3e:28:d5:e2 is flapping between port 1/1 and port 1/2

And on and on.
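
With this many flap lines, a short script can help see which MACs and port pairs dominate. The following is only a rough sketch in Python: the regular expression is modelled on the flap lines quoted above, the function name `summarize_flaps` is made up for illustration, and other software versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the flap lines quoted in this post (an assumption:
# other software versions may format the message differently).
FLAP_RE = re.compile(
    r"Host (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) is flapping "
    r"between port (?P<a>\d+/\d+) and port (?P<b>\d+/\d+)",
    re.IGNORECASE,
)

def summarize_flaps(lines):
    """Count flap messages per (MAC, port pair), ignoring port order."""
    counts = Counter()
    for line in lines:
        m = FLAP_RE.search(line)
        if m:
            # Sort the two ports so "1/2 and 1/1" and "1/1 and 1/2" group together.
            ports = tuple(sorted((m.group("a"), m.group("b"))))
            counts[(m.group("mac").lower(), ports)] += 1
    return counts

# Sample lines copied from the log excerpt above.
sample = [
    "2007 Apr 11 17:06:01 %SYS-4-P2_WARN: 1/Host 00:02:fc:bd:0a:78 is flapping between port 1/2 and port 1/1",
    "2007 Apr 11 17:06:12 %SYS-4-P2_WARN: 1/Astro(2/6) - timeout occurred",
    "2007 Apr 11 17:06:14 %SYS-4-P2_WARN: 1/Host 00:02:fc:bd:0a:78 is flapping between port 1/2 and port 1/1",
    "2007 Apr 11 17:06:16 %SYS-4-P2_WARN: 1/Host 08:00:3e:28:d5:e2 is flapping between port 1/1 and port 1/2",
]

flap_counts = summarize_flaps(sample)
for (mac, ports), n in flap_counts.most_common():
    print(f"{mac}  {ports[0]} <-> {ports[1]}  {n} flap message(s)")
```

Feeding the full log file in place of `sample` would give a per-MAC tally to start tracing from.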

If this looks like a SUP failure, I'm wondering if I should just go ahead and pull the primary Supervisor to keep it from happening again until I can get a replacement.

Advice?

Thanks.

mihanlin · ‎04-12-2007

Not sure if it's a Sup failure. Null cluster errors can indicate the system is being overloaded.

Astro errors indicate the Sup and the linecard have lost connection (via a keepalive message sent internally).

Then, you see two MAC addresses which are flapping. Your best option is to track down those MAC addresses to see where they are in the network.

It's of course possible that there is a hardware issue with this but by no means does it indicate a definite one.

A further way of testing is to schedule a maintenance window and reset the switch. When it powers on, it will do a diagnostic test of each linecard including the Sup. This should give a fairly good indication of whether there is a hardware fault with any of the modules.

Good luck,

Michael

Cisco TAC

epeeler · ‎04-12-2007

One other thing of note. A show system on all of my edge switches (which all carry these two vlans on their trunked uplinks to the core) shows the highest CPU utilization during the time of this event last night. So whatever it was affected every switch in the house.

milan.kulik · ‎04-12-2007

Hi,

what if someone connected VLAN1 and VLAN15 together by a cable somewhere at your LAN perimeter?

Creating a common STP within those two VLANs could cause some loops (it takes STP some time to converge) and could cause HSRP hello packets to come back to the originating MSFC, with the effects described.

Best regards,

Milan Kulik

epeeler · ‎04-13-2007

I didn't post the entire log from the switch but there were over a thousand memory error messages in there all from the same memory address.
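
For anyone wanting to check a saved log the same way, here is a minimal Python sketch that tallies the memory error messages by address. It assumes the `%SYS-3-SYS_MEMERR` lines look like the ones quoted earlier in the thread; the function name `memerr_by_address` is just an illustrative choice.

```python
import re
from collections import Counter

# Matches the address at the end of a SYS_MEMERR line, based on the
# lines quoted earlier in this thread (format assumed, not guaranteed).
MEMERR_RE = re.compile(r"%SYS-3-SYS_MEMERR:.*\baddress (0x[0-9a-fA-F]+)")

def memerr_by_address(lines):
    """Return a Counter of memory-error messages keyed by address."""
    counts = Counter()
    for line in lines:
        m = MEMERR_RE.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts

# The four memory-error lines from the excerpt posted earlier.
sample = [
    "2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:Null Cluster while getting next from address 0x30ae5770",
    "2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:No clusters left while allocating for address 0x409aec00",
    "2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:No clusters left while allocating for address 0x40a03500",
    "2007 Apr 11 17:05:26 %SYS-3-SYS_MEMERR:Null Cluster while getting next from address 0x30ae5770",
]

memerr_counts = memerr_by_address(sample)
for addr, n in memerr_counts.most_common():
    print(f"{addr}: {n}")
```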

As well, there were mac address flap messages where a mac would be flapping between, for example, port 5/45 and 1/1. That indicates that the mac was seen first on a port that goes out into the switch's service area (where the users are), and then it was seen on one of the uplinks between the switch and the network core. Unless someone ran a cable from a user's lan drop to the data center (impossible), I'm not sure how this could happen, other than some crazy hardware fault.

In addition, I saw this behavior from mac addresses on three different vlans that this switch carries. So I saw a mac on vlan 15 flap between a user port and the uplink and another mac on vlan 6 do the same thing. I can't think of a physically feasible way that someone could have made connections in the user area that would explain this behavior.

If I'm missing something I would be grateful for any possible enlightenment. I'm no genius and welcome any education that the experts are willing to offer.

Thanks everyone.

allen.watson · ‎04-16-2007

I have run into this a few times. Each time it has been an issue with spanning tree. As suggested, you should trace down the MAC addresses that are showing up in the logs. I will bet those trace down to access switches. In our case it has been server enclosures that are dual homed to two different switches. One switch is in a forwarding state and the other is blocking. Not sure why, but something in the server enclosure goes crazy and suddenly both sides go into a forwarding state. This then causes a spanning-tree loop, CPU utilization spikes, and you get the duplicate IP messages. Hope this helps.

jimmyc_2 · ‎04-17-2007

Hi Allen,

We may have a similar problem, where an HSRP group in standby goes active while the primary is still up and running. This occurs when we modify a VLAN trunk that is well downstream from the core HSRP switches. We are fairly sure it is STP related, and possibly tied to a server that is dual homed. Besides mapping out the STP for the affected VLANs, where do you begin troubleshooting?

thanks,

Jimmyc

allen.watson · ‎04-23-2007

We look for the duplicate IP address message in the logs. Trace down the MAC that shows up in these messages; that usually leads to the switch that started the issue. Sometimes it is the actual server MAC that shows up.
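
As a rough illustration of that first step, the Python sketch below pulls the source MACs out of duplicate address messages so they can be traced. Note the assumptions: the thread never quotes the message itself, so the pattern assumes the usual IOS wording (`%IP-4-DUPADDR: Duplicate address ... on ..., sourced by ...`), the sample lines use invented addresses, and `macs_to_trace` is a hypothetical name.

```python
import re
from collections import defaultdict

# Assumed message format; verify against your own logs before relying on it.
DUP_RE = re.compile(
    r"%IP-4-DUPADDR: Duplicate address (?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"on (?P<intf>[^,\s]+), sourced by "
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})"
)

def macs_to_trace(lines):
    """Map each source MAC to the set of (interface, IP) pairs it was reported on."""
    seen = defaultdict(set)
    for line in lines:
        m = DUP_RE.search(line)
        if m:
            seen[m.group("mac").lower()].add((m.group("intf"), m.group("ip")))
    return dict(seen)

# Invented sample lines (documentation-range IPs, made-up MACs).
sample = [
    "Apr 11 17:05:40: %IP-4-DUPADDR: Duplicate address 192.0.2.2 on Vlan1, sourced by 0011.2233.4455",
    "Apr 11 17:05:41: %IP-4-DUPADDR: Duplicate address 192.0.2.130 on Vlan15, sourced by 0011.2233.4455",
    "Apr 11 17:05:42: %STANDBY-6-STATECHANGE: Vlan15 Group 15 state Standby -> Active",
]

dup_macs = macs_to_trace(sample)
for mac, where in dup_macs.items():
    print(mac, sorted(where))
```

Each MAC in the result can then be looked up in the switches' MAC address tables, hop by hop, until it lands on an access port.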