Weird LAN problem: Help appreciated!

acherbina · ‎05-30-2007

Hello All,

I hope someone will be able to help with a weird problem we started experiencing a few weeks ago:

Our LAN setup:

We have close to 70 servers on the LAN, connected to the main Cisco 3750 stack switch on the 5th floor.

Flat structure, no VLANs configured.

Software version is c3750-ipbasek9-mz.122-35.SE2

We have Dell PE servers, various models.

Network cards are Intel or Broadcom, configured as a Team with Failover redundancy.

The 3750 stack has 9 switches.

There are also other 3750 stacks on other floors of the building, connected with SFP connectors to the main switch.

One C2950 switch in the test lab is uplinked to the main 3750 stack.

Problem:

Intermittently server1 (S1) cannot ping S2, and vice versa.

S2 is pingable by S3, S4 and other servers at the same time.

S3 and S4 can ping S1 as well.

In general this can happen every 10-15 min and last for a minute or two.

Sometimes a workstation is involved: it can't ping one of the servers, but can ping other servers fine.

If I am logged on to the main 3750 switch and, during the "outage time", I ping the server that does not reply, connectivity is restored right away.

It seems that the switch does not retain the MAC entry of the "culprit" server or workstation.

But I am not an expert...

If I reload the main 3750 stack, it can fix some "culprits", but eventually there will be new ones.

What has been done:

Software was upgraded to c3750-ipbasek9-mz.122-35.SE2

Built-in diagnostics did not show any issues.

We unplugged the second NIC on every server; they now run on a single NIC.

The Cisco TAC team has worked with us, with no result so far.

All non-Cisco equipment (Linksys personal switches, D-Links) has been disconnected.

Since we unplugged every second card, the problem reappears less often. I think we ran fine for a few days, but it still shows up occasionally, as a few users have reported it.

PS: The servers' event logs do not show any related info.

I hope someone had seen this before, or knows the solution.

What could cause it?

Any help greatly appreciated

Thank you!

jasonrandolph · ‎05-30-2007

Have you physically looked at the switch ports during the issue?

Is there a correlation between which switch a server is connected to and which switches can/cannot see it?

acherbina · ‎05-30-2007

Yes, we identified the ports the "problematic" servers/workstations were connected to, and unfortunately they could be on any switch in the stack...

Thanks!

glen.grant · ‎05-30-2007

Could be something like this.

CSCsh11040 Bug Details

ARP/CEF Table Changes May Affect Host Connectivity

Symptom:

On a Catalyst 3750 stack, network connectivity may be impacted if the MAC address or VLAN ID associated with an IPv4 or IPv6 address were to change, causing a change in the ARP and CEF adjacency tables.

Conditions:

When the change occurs, the update is not correctly propagated to the stack members

Workaround:

clear ip arp will resolve it until another such change occurs.

Status: Fixed (Verified)

Severity: 3

Product: Cisco IOS software

Technology: (none listed)

1st Found-In: 12.2(35)SE1

Fixed-In: 12.2(35)SE2, 12.2(37)SE

Component(s): ip-unicast-routing

andrewdykes · ‎05-30-2007

Sounds close, but he already upgraded to 12.2(35)SE2
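
For anyone cross-checking, the image name from the original post maps onto the release string used in the bug's Fixed-In list. A minimal Python sketch of that mapping, assuming the usual `<major><minor>-<release>.<train>` image naming; the function name is made up for illustration:

```python
import re

def image_to_release(image_name):
    """Convert an IOS image name to a release string.

    Assumes names shaped like 'c3750-ipbasek9-mz.122-35.SE2',
    where '122-35.SE2' stands for 12.2(35)SE2. Returns None if
    the name does not follow that pattern.
    """
    match = re.search(r"\.(\d{2})(\d)-(\d+)\.([A-Za-z0-9]+)", image_name)
    if match is None:
        return None
    major, minor, release, train = match.groups()
    return f"{major}.{minor}({release}){train}"

running = image_to_release("c3750-ipbasek9-mz.122-35.SE2")
fixed_in = {"12.2(35)SE2", "12.2(37)SE"}  # from the bug details quoted above
print(running, running in fixed_in)
```

So the stack is already on one of the listed fixed releases, which is why this bug alone does not explain the symptoms.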

jasonrandolph · ‎05-30-2007

And it's a flat, single-VLAN network, where ARP entries really shouldn't be changing very often with regard to a server's IP/MAC association.

At least one would hope it wouldn't change often.

Are the teamed NICs on the same switches? I wonder if you're getting MAC collisions between switches.
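
One way to check for that, sketched in Python: collect (MAC, port) pairs from repeated MAC address table snapshots and flag any MAC that shows up on more than one port. The input format, function name, and sample addresses are assumptions for illustration, not output from this network or from any particular tool.

```python
from collections import defaultdict

def find_moving_macs(observations):
    """Return {mac: sorted ports} for MACs seen on more than one port.

    observations: iterable of (mac, port) pairs gathered from several
    MAC address table snapshots (hypothetical input format).
    """
    ports_by_mac = defaultdict(set)
    for mac, port in observations:
        ports_by_mac[mac.lower()].add(port)
    return {
        mac: sorted(ports)
        for mac, ports in ports_by_mac.items()
        if len(ports) > 1
    }

# Made-up sample data: one MAC appears on two different stack members.
sample = [
    ("0015.c5aa.0001", "Gi1/0/10"),
    ("0015.c5aa.0002", "Gi2/0/11"),
    ("0015.c5aa.0001", "Gi3/0/10"),
]
moving = find_moving_macs(sample)
print(moving)
```

A MAC that keeps alternating between the two ports of a team would be a strong hint that the teaming mode and the switch are disagreeing about where that server lives.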

bryan.lofland · ‎05-30-2007

Have you checked STP to make sure you don't have any learning/listening ports on any of your stacks?

It does sound like it could be some ARP table issue, though, since you say that once you ping S1 from something else it immediately starts working again.

HTH

roy_fisher · ‎07-27-2007

Hi,

Did you come to a resolution on this?

Today I had a very similar issue: a stack of two 3750s with IBM Broadcom NIC teams split across them.

We were pinging the 3 HSRP addresses (two physical, one virtual) for testing.

If we failed the 2nd switch everything went on as expected, but when we brought it back online we either had no responses or responses from just one physical interface.

Thanks,

Roy

acherbina · ‎07-27-2007

Hi Roy,

We have not found a solution yet...

We made the network stable by disabling one of the team members on each server...

We had some Cisco consultants on site monitoring traffic; they did not find any loops.

At some point we deployed a new server, and the tech who deployed it configured NIC teaming and plugged in both cards... a few hours later we had the same problem again. We unplugged the second card and rebooted the 3750 stack, and the network is stable again.

Our research shows that we started having issues after deploying the latest batch of servers with Broadcom NICs and the latest drivers.

Older drivers and Intel teams did not cause us problems.

All teams are configured for SLB (Smart Load Balancing); this mode does not require any special configuration on the switch.

We are planning to get back to this issue in the near future.

HTH,

Alex

roy_fisher · ‎07-30-2007

Alex,

Thanks for your response. I have looked further into this and I hope it is the IOS version.

We're running 12.2(25)SEB4 & 12.2(25)SEE on our stacks and according to ....

http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCsh11040&Submit=Search

http://www.ciscotaccc.com/kaidara-advisor/lanswitching/showcase?case=K69181053

.... the problem was fixed after 12.2(35)SE1

So it looks like an upgrade is required.

Thanks,

Roy

acherbina · ‎07-30-2007

Hi Roy,

As per my original post we did upgrade our stack to:

c3750-ipbasek9-mz.122-35.SE2

This did not help...

I hope the upgrade works for you.

Please post your results, thanks.

Alex