defies all logic

brian.wilson
Level 1

Hope someone has some ideas. We have a client that has an intermittent problem. A workstation will suddenly go into a state where it cannot talk to a specific server, but continues to talk to other servers on the same subnet. It may be a single PC that develops the problem or it might be several, with no common connectivity between them.

While this is happening, you can talk to either the server or the PC from anywhere else in the network. The server cannot talk to the affected PC. A laptop placed on the same cable as the affected PC can talk to the server that the PC cannot reach. But the laptop might not be able to get to another server.

The network is all Catalysts of various flavors, has run fine for years until about 4 months ago and may run fine for weeks before the problem surfaces again. When it does, it comes and goes for several days, affecting different PCs at different times.

We found a quick fix that may or may not last very long: configure a static IP address on the NIC, then switch it back to DHCP. Once that is done, we run ipconfig /release and ipconfig /renew and the PC works.
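For anyone wanting to repeat that workaround consistently, here is a minimal sketch in Python that only builds the command sequence and does not run anything. The adapter name and the temporary addresses are hypothetical placeholders, and the netsh syntax shown is the classic positional form, which may differ between Windows versions, so treat it as an assumption to verify on the affected PC.

```python
def workaround_commands(adapter, temp_ip, mask, gateway):
    """Return the Windows commands for the static-then-DHCP workaround.

    Nothing is executed here; the caller decides how to run the commands.
    All argument values are placeholders supplied by the caller.
    """
    return [
        # 1. Temporarily pin a static address on the NIC (trailing 1 = gateway metric).
        f'netsh interface ip set address name="{adapter}" static {temp_ip} {mask} {gateway} 1',
        # 2. Switch the NIC back to DHCP.
        f'netsh interface ip set address "{adapter}" dhcp',
        # 3. Drop the current lease and request a fresh one.
        "ipconfig /release",
        "ipconfig /renew",
    ]


if __name__ == "__main__":
    # Hypothetical adapter name and documentation-range addresses.
    for cmd in workaround_commands(
        "Local Area Connection", "192.0.2.50", "255.255.255.0", "192.0.2.1"
    ):
        print(cmd)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be pasted into an elevated command prompt on the affected PC.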

I've been doing this for 20 years and have never seen anything like this. Anyone have any ideas? We're getting desperate. Can provide more details if someone has an idea. Thanks in advance.

12 Replies

thisisshanky
Level 11

Brian,

Do you use NIC teaming on the servers?

Sankar.

Sankar Nair
UC Solutions Architect
Pacific Northwest | CDW
CCIE Collaboration #17135 Emeritus

No. Single NICs on the servers. Thanks.

mheusinger
Level 10

Nice puzzle.

Well, are the server and the PC in the same segment/VLAN? Then it might be an ARP issue. Can you check the ARP table on the server and on the PC, when the problem occurs?
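One way to make that ARP check repeatable is to save the output of `arp -a` from the PC and the server while the problem is occurring and compare it against known-good MAC addresses. The sketch below assumes the usual Windows `arp -a` column layout (IP address, physical address, type); the function names and the sample addresses are hypothetical and only for illustration.

```python
import re

# Matches a Windows "arp -a" entry line: IP, MAC (dash or colon separated), type.
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)"
)


def normalize_mac(mac):
    """Lower-case a MAC address and use dashes as separators."""
    return mac.lower().replace(":", "-")


def parse_arp_table(text):
    """Return {ip: mac} for every entry line found in 'arp -a' output."""
    table = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:
            ip, mac, _entry_type = match.groups()
            table[ip] = normalize_mac(mac)
    return table


def arp_mismatches(observed, expected):
    """Return {ip: (expected_mac, observed_mac_or_None)} for wrong or missing entries."""
    problems = {}
    for ip, mac in expected.items():
        want = normalize_mac(mac)
        seen = observed.get(ip)
        if seen != want:
            problems[ip] = (want, seen)
    return problems


if __name__ == "__main__":
    # Hypothetical capture taken on an affected PC.
    sample = """
Interface: 192.0.2.25 --- 0x2
  Internet Address      Physical Address      Type
  192.0.2.1             00-11-22-33-44-55     dynamic
  192.0.2.40            00-aa-bb-cc-dd-ee     dynamic
"""
    known_good = {"192.0.2.1": "00:11:22:33:44:55", "192.0.2.40": "00:AA:BB:CC:DD:01"}
    print(arp_mismatches(parse_arp_table(sample), known_good))
```

If the server and the PC sit on different VLANs, the entry worth comparing on each side is the one for the default gateway rather than the far-end host.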

You are not talking by any chance about an AS400 server? There were some issues with this specific box, which gave intermittent connectivity problems.

There is a post in another thread describing the problem:

http://forum.cisco.com/eforum/servlet/NetProf?page=netprof&forum=Network%20Infrastructure&topic=Enterprise%20Data%20Center%20Networking&CommCmd=MB%3Fcmd%3Dpass_through%26location%3Doutline%40%5E1%40%40.1dda234b/11#selected_message

Hope this helps! Please rate all posts.

Regards, Martin

The servers are on a separate VLAN from the PCs, but all servers are on the same VLAN. The problem happens only at one location; other locations never exhibit this. Yes, ARP is my primary diagnosis - BUT clearing ARP tables on every network device from (and including) the server to the affected PC has no effect on the problem. No AS400, all various Windows platforms.

Brian,

Have you tried using a sniffer to find out what's going on?

Sankar.


brian.wilson
Level 1

For the curious... I reset the 4500s in the data path and the problem went back into remission. Still no idea what might cause this, but forensic research continues. Thanks everyone for the responses. I rated them all.

Thank you.

The dhcp-issue gave me the following idea:

Could it be that your DHCP pool for the subnet is getting close to exhaustion? Another PC would then claim a random IP, causing the lease for that address to be lost. Your "workaround" is one way to solve this.
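A rough way to test the exhaustion idea is to export the active leases from the DHCP server and count how many fall inside the scope. The sketch below assumes you already have the scope boundaries and a list of leased addresses; the function name and the sample addresses are hypothetical, and how the lease list is exported depends on the DHCP server in use.

```python
import ipaddress


def pool_utilization(first, last, leased):
    """Return (used, size, fraction_used) for a DHCP scope from first to last inclusive.

    Leased addresses outside the scope and duplicates are ignored.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    if end < start:
        raise ValueError("last address must not be lower than first address")
    size = end - start + 1
    in_scope = {
        ip for ip in leased if start <= int(ipaddress.IPv4Address(ip)) <= end
    }
    return len(in_scope), size, len(in_scope) / size


if __name__ == "__main__":
    # Hypothetical scope of ten addresses with two active leases inside it.
    used, size, fraction = pool_utilization(
        "192.0.2.10", "192.0.2.19", ["192.0.2.10", "192.0.2.11", "192.0.2.200"]
    )
    print(f"{used} of {size} addresses leased ({fraction:.0%})")
```

A fraction that sits close to 1.0 around the times the problem appears would support the exhaustion theory; a mostly empty scope would argue against it.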

You should see some duplicate IP messages flying around though. You did not mention that, so I might be missing the key issue here. Still, I hope you find this insight useful.

Regards,

Leo

w-hansen
Level 1

You mentioned that resetting your Cat4500 cleared the problem. Also, you mentioned that this happens only in one location -- I assume a different physical location and VLAN. That would lead me to believe that the Cat4500 is the nexus of the issue. Perhaps something is happening in the main memory space of that particular Cat4500 -- cam, route table, switching path cache, maybe even the CEF tables -- that's why a reboot would "fix" the issue.

So I would consider isolating the troubleshooting to that Cat4500 switch initially. Check the VLAN (subnet) configurations as well as all associated processes including memory-related tables. Consider rebuilding the VLAN(s) configuration (or more) on that switch, too.

1. Is memory sufficient for the 4500 to do what's expected of it? Is the memory faulty? Have you swapped the memory out with known-good? Have you opened a TAC case? Is there a bug in the current IOS? Is this the correct IOS? Remember, the 4500 is a shared memory switch -- the cpu switches the packets (no offloading, but is shared with main memory); DRAM handles the caches, tables, IOS, etc.

2. What's different about this 4500 compared to your other switches (config, memory, etc)? What's the same?

3. What changed 4 months ago in your network?

Of course, I've made a lot of assumptions here e.g. no rogue routers or servers coming up during these times, etc. But I'd think that there'd be some log of that activity.

You've got a tough one. Wish I had more to help you. Have you tried Ethereal?

Thanks for the thoughts. There are actually 6 Cats involved and it took rebooting every one of them to clear the problem. I have always felt this is not a 'network' problem and we are near proving that. There is at least one Microsoft bulletin regarding this and the local techs are following up on that. I have not been able to use a sniffer (seems the obvious thing to do, doesn't it?) but the location is 100 miles away and we have not been able to catch the problem while I've been on site. There is no one there that is even close to being qualified to do this.

I'd love to be able to open a case with TAC, but they have lost all my confidence on past issues. I long for the days when TAC were the experts and could answer a question directly and right away. The last three issues I've presented to TAC resulted in very tedious (as in I can't understand a word they say) and completely meaningless exchanges. They have failed to understand even the simplest of concepts recently and have been of absolutely no help. Sorry to say it, but it is my experience.

Thanks again for your post.

w-hansen
Level 1

Can you post a pic of your topo including core and leafs? That might help. And a sanitized config would help, too.

It looks like we've narrowed this down to a Microsoft issue. Thanks for the assistance.

Glad to hear you've narrowed it down. I'm interested in the issue -- and the "fix". Please let me know when the spirit moves you.

Thank you.
