Problems with Etherchannel across blades on Sup IV 4006's

Mar 7th, 2007

We are running an EtherChannel link between two 4006 switches with Sup IV engines on 12.2.25EWA8 code. Periodically we receive HSRP duplicate-address errors for the real IP addresses of the SVIs. We triple-checked the HSRP configurations (no errors) and confirmed we have no spanning-tree loops.

The EtherChannel between the core switches uses fiber on G1/1 on the Sup IV; the second interface is copper, G4/1 on a WS-X4548-GB-RJ45 blade.

Below is the configuration on the portchannel and the interfaces in the channel:

Note: The switch automatically places the "flowcontrol send off" command on the copper G4/1 interfaces. If the command is removed, the EtherChannel works, but when the switch is reloaded the system adds the command back onto the interface.

Core1:

interface Port-channel1
 description trunk to core2 G1/1 G4/1
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk

interface GigabitEthernet1/1
 description core2 etherchannel (bonded with g4/1)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode on

interface GigabitEthernet4/1
 description core2.corp etherchannel (bonded with G1/1)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 flowcontrol send off
 channel-group 1 mode on

Core2:

interface Port-channel1
 description trunk to core1 G1/1 G4/1
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk

interface GigabitEthernet1/1
 description core1 etherchannel (bonded with G4/1)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode on

interface GigabitEthernet4/1
 description core1.corp etherchannel (bonded with G1/1)
 switchport trunk encapsulation dot1q
 switchport mode trunk
 flowcontrol send off
 channel-group 1 mode on

Any thoughts on running EtherChannel across blades on different media? Could this be related to the duplicate HSRP address errors we are seeing on random SVIs?

Example error on Core1. There is no corresponding error on Core2, and no STP change on the VLAN at the time of the event:

%HSRP-4-DUPADDR: Duplicate address 10.1.14.3 on Vlan114, sourced by 000d.bcad.72ff

10.1.14.3 is the real IP of the SVI, and the MAC address is the virtual MAC address of the standby group for Vlan114.

Any help would be appreciated, and yes, I've already looked at this cisco link, hence the inquiry on the Etherchannel:

http://www.cisco.com/en/US/customer/tech/tk648/tk362/technologies_tech_note09186a0080094afd.shtml#t20

Many thanks in advance,

-Scott

swharvey Sat, 03/10/2007 - 19:20

Thanks, Steve, for the link. I rebuilt Port-channel 1 between the core 4006s: I removed G4/1, the copper gig link (on both cores), and replaced it with G1/2 on the Sup IVs. In the rebuild I also replaced "channel-group 1 mode on" with "mode desirable" on all interfaces in the new Port-channel 1, so it is now using PAgP. Afterwards I confirmed there were no interface errors on any of the 4 physical interfaces, nor on the Po1 interface.
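
For anyone following along, the rebuilt channel described above would look roughly like this on core1. This is a sketch reconstructed from the description, not a paste from the running config: the description text is made up, and core2 would mirror it.

```text
interface Port-channel1
 description trunk to core2 G1/1 G1/2
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
! Both members now sit on the Sup IV; "desirable" lets PAgP negotiate the bundle
interface range GigabitEthernet1/1 - 2
 switchport trunk encapsulation dot1q
 switchport mode trunk
 channel-group 1 mode desirable
```

"show etherchannel summary" on each core is a quick way to confirm that both ports ended up bundled.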

Additionally, I turned on debug spanning-tree events on both core switches. Seventeen hours later I had a duplicate HSRP message in the logs on core1 for VLAN 129. I also had a duplicate HSRP message for VLAN 121 on core2, 21 hours after the changes were made. I checked spanning tree for the corresponding VLANs: no topology changes for the past 28 hours (the last one was caused by an IOS upgrade on a closet switch).

I looked at all 49 VLANs in our environment, and none of them had a topology change at the time of the HSRP duplicate-address messages. All of them show the last change 28 hours ago, so I'm thinking our spanning-tree topology is stable. The debugs only showed STP changes during the closet switch reload, which supports that.
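
The per-VLAN check above can be done with the usual filter on the detail output. A sketch, using VLAN 129 from the error above; the exact wording of the output lines can vary by release, so the include pattern may need adjusting:

```text
! Run on each core for the VLAN named in the HSRP message
show spanning-tree vlan 129 detail | include ieee|occurr|from|is exec
```

This trims the output to the topology change count, the time since the last change, and the port the last change came from, which is what needs comparing against the HSRP error timestamps.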

As I may have already shared, I checked our DHCP scopes to ensure nothing is handing out the real IP addresses of our SVIs.

Additionally, the TAC engineer on the case I opened reviewed the configuration; the HSRP and EtherChannel setup between the cores looks good, as does the stability of spanning tree on the VLANs.

I read about duplicate frames in the following HSRP troubleshooting document, but I don't know how to tackle this problem and confirm it is my issue:

http://www.cisco.com/en/US/customer/tech/tk648/tk362/technologies_tech_note09186a0080094afd.shtml#t20

Any advice is appreciated.

hoogen_82 Sat, 03/10/2007 - 22:29

I think the document clearly explains what you need to do. 224.0.0.2 is the all-routers multicast address, which HSRP uses for its hello messages.

Just another thing: you mentioned you have 49 VLANs. My guess is that the HSRP config uses group 0 for all of them. I remember that some time back, when I had similar issues, I was told to use group 0 for 16 VLANs, group 1 for another 16, group 2 for another 16, and so on, so that no more than 16 VLANs share a group. This had something to do with eliminating some kind of duplicate MAC address. If possible, do try this.

Cheers

Hoogen

steve.busby Wed, 03/14/2007 - 14:06

If this is still an issue, then can you provide the HSRP configuration(s) for a couple of the SVI's that are experiencing problems? I'd also like to see how your spanning-tree is configured.

My experience has been, if Core-1 is configured with a higher HSRP priority for Vlan X, then it should also be the spanning-tree root for Vlan X. Typically we take the dynamic process out of the equation to ensure our traffic flows as we expect it to. It should only take different paths when something breaks.
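
As a rough illustration of lining up layer 2 and layer 3 this way, here is a sketch for a single VLAN. All values are illustrative: VLAN 114 is taken from the original post, while the bridge priorities, the standby group number and the HSRP priority are assumptions, not taken from anyone's running config.

```text
! core1: STP root and HSRP active for VLAN 114
spanning-tree vlan 114 priority 24576
interface Vlan114
 standby 114 priority 110
 standby 114 preempt
!
! core2: next-best bridge priority, HSRP left at the default priority (100)
spanning-tree vlan 114 priority 28672
```

"show spanning-tree vlan 114" and "show standby vlan 114" on both cores should then agree on which switch is primary.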

HTH

Steve

swharvey Sat, 03/17/2007 - 12:36

Hi Steve,

I've been working on several other projects and am just now getting the chance to get back to you. I totally understand what you are saying: core1 has the higher HSRP priorities and is also the root bridge for all VLANs, so layer 2 and layer 3 are both primary on core1. Below are snippets from an HSRP/VLAN. Also, I changed Port-channel 1 so it no longer goes across blades (it was G1/1 and G4/1); it now uses both ports on the Sup IV (G1/1, G1/2). I also changed from mode on to mode desirable, so PAgP is now in effect on the 4 physical interfaces in the EtherChannel on both cores.

Note that the duplicate HSRP messages report the real IP of the SVI as the duplicate address, but the source MAC given in the error message is the HSRP virtual MAC for that SVI, which both cores share.

Previously (until TAC requested the change last week), all HSRP instances were in group 0 (no standby group numbers), so I created a standby group for each SVI to give each SVI a separate virtual MAC. The TAC engineer looked at the new configs and did not see any problems. Note the errors occurred prior to the HSRP group changes.
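
A sketch of what one standby group per SVI looks like, using VLAN 114. Only 10.1.14.3 (core1's real address) comes from the error message earlier in the thread; the mask, the virtual IP, the core2 address and the choice of group number = VLAN number are assumptions for illustration, and some platforms limit how many distinct group numbers can be used.

```text
! core1
interface Vlan114
 ip address 10.1.14.3 255.255.255.0
 standby 114 ip 10.1.14.1
!
! core2
interface Vlan114
 ip address 10.1.14.2 255.255.255.0
 standby 114 ip 10.1.14.1
```

With HSRP version 1 the default virtual MAC is 0000.0c07.acXX, where XX is the group number in hex, so group 114 would give 0000.0c07.ac72, while leaving every SVI in group 0 gives them all 0000.0c07.ac00.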

Most importantly, the timestamps of the duplicate HSRP errors do not coincide with any STP changes on the VLAN they occur on, nor with STP changes on any other VLAN, so STP is stable. (I ran "show spanning-tree vlan xxx det" on all VLANs, and no changes have occurred in over a week, since the EtherChannel between the cores was rebuilt during last week's maintenance window.)

Below is an instance of an HSRP duplicate event that occurred on core1. Core2 has no record of this error, but I included the show commands from core2 for this VLAN as well. Additionally, look at the bridge identifier priority. We do not use the default: core1 is at 24576 for all VLANs vs. 28672 on core2. HSRP is active for the VLAN on core1. Note that "show spanning-tree vlan 125 detail" has the last topology change as 4 days ago, which doesn't correlate with the HSRP duplicate address timestamp. ALL network devices have an NTP source and are synced, so the timestamps are correct.

It is my belief at this point that:

1) Something is causing duplicate frames to be received by the core with its own MAC address information

2) There is a possible hardware fault

3) There is a possible bug in the IOS (the same errors occur on the various code versions we have tried, from 12.2.25 EWA6 through EWA8)

Attached are excerpts of command output from the core switches that I referenced in my write-up. Let me know what you think, as this is a perplexing problem and I have already checked the basic issues that would cause this error:

1) HSRP misconfiguration

2) STP loops or topology changes

3) Etherchannel misconfigurations/errors

Thanks,

-Scott

w-schultz Sun, 04/19/2009 - 00:12

Curious how this ended up getting resolved.

I've got this exact issue, running 12.2(25). Was running great for over a year and decided yesterday that this was going to start being an issue.

w-schultz Sun, 04/19/2009 - 08:55

So, updating the IOS to 12.2(46) did not help with the problem but it did provide better logging.

In my case, the logs have changed to %SW_MATM-4-MACFLAP_NOTIF, which gave me enough direction to find the problem host, and promptly shut them down.
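
One way to follow a MACFLAP message back to a port is sketched below. The MAC address and interface are placeholders, not values from this case, and older releases spell the second command "show mac-address-table".

```text
! Find the flap notifications, then see where the reported MAC is currently learned
show logging | include MACFLAP
show mac address-table address aaaa.bbbb.cccc
!
! Once the offending access port is identified, take it out of service
configure terminal
 interface GigabitEthernet4/10
  shutdown
```

If the MAC shows up on an uplink rather than an access port, the same lookup has to be repeated on the downstream switch.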
