Sudden Ethernet incompatibility on AP1230...

Nov 7th, 2009

I have a laptop I've used for several years while configuring AP1230's in the field. A couple of weeks ago, it suddenly stopped working behind AP1230's configured as WGB's (workgroup bridges). When the laptop is connected to the AP1230's Ethernet port, the two can see each other just fine, but the laptop's traffic is not passed through the WGB to the AP on the far end (another AP1230, acting as root bridge).


To be more precise, the laptop can ping the WGB 100%. However, the laptop loses about 98% of pings to the root bridge. Not all of them, just most.


Using ANY other laptop/desktop in place of the problem laptop works perfectly. Inserting an Ethernet switch between the laptop and WGB does not fix it (but other devices simultaneously plugged into the same Eswitch work perfectly).


Just a problem on the laptop's Ethernet port, right? Except... if you bypass the bridges and plug it directly into the LAN, it works perfectly. Hubs, switches, directly into router interfaces - it works with everything except that it can no longer reliably pass IP traffic over an AP1200 bridge.


I've confirmed that the laptop is getting ARP for the root bridge and upstream devices, so traffic is passing through the WGB and the RF link. But pings and other TCP/IP traffic aren't reliable. It's not **completely** dead... for example, every time you start a fresh ping test, generally the first ping comes back successful and then the timeouts start. Minutes later you'll get 4-5 successful pings in a row, then more minutes of total loss.


I've tried several different AP1230's on both ends. I've tried several versions of IOS on them. I've tried forcing speed and duplex on both ends of the Ethernet connection. Nothing makes any difference.


Interestingly, replace the AP1230 in WGB mode with an el-cheapo Linksys WET11 and the laptop works perfectly (note that the AP1230 in root bridge mode is still on the other end). So there's nothing inherently wrong with the laptop working over an RF link; it only complains if its end of the bridge is an AP1230.


Given this situation, I can use the laptop to configure WGB's - but then I can't use it to test them! It can see devices on its side of the WGB but cannot see the other bridge nor anything beyond it. Meanwhile, other devices work just fine... and the laptop works fine in any other environment.


There must be some sort of configuration detail I'm missing on the WGB or the laptop. Something that makes them incompatible. Any hints, test suggestions, or ideas will be greatly appreciated. This is driving me crazy.


Thanks!

richardhartman Mon, 11/09/2009 - 21:19

I'm trying to debug this by enabling ACL logging on the AP1230. It accepts the log keyword on the ACL entries but I'm not getting anything in the logging buffer. I've set "logging buffered 4096" but when I say "sho logging" I get nothing for ACL entries. I've done this on routers and it works, so what am I missing to get it working on AP1230's?


Thanks!

richardhartman Mon, 11/09/2009 - 21:23

I've discovered that if I use "debug ip packet" I can get some packet-level debugging information - but only if either the source or destination IP is the AP1230 itself. Traffic passing THROUGH the device isn't logged. That's not very useful.


Any guidance would be greatly appreciated!

richardhartman Tue, 11/10/2009 - 07:01

OK, new (at least the first I've noticed it) behavior. I have the troubled laptop running continuous pings to three devices: the near bridge (WGB), the far bridge (rootbridge), and a router on the other side of the rootbridge. The laptop ties into the WGB through a dumb 10Mbps Ethernet hub. There is also a desktop plugged into the hub, so that both computers are able to talk over the bridged connection.


Get this:


* The laptop pings to the WGB are 100% reliable.

* The laptop pings to the rootbridge and to the router work five times in a row, in sync, and then one ping times out. This pattern repeats indefinitely.

* Meanwhile, the desktop pings everything all the time with no packet loss at all.


"Something" is happening every five seconds that affects the laptop but not the desktop. That "something" doesn't affect the direct connection to the WGB, but does affect the laptop's communication across the RF link. Meanwhile, the desktop - connected exactly the same way as the laptop - is unaffected. I don't understand how the problem can be so selective to the laptop since it's on the other side of the WGB.


Since Windows pings occur once per second... what happens every five seconds in this environment? I've tried locking the near bridge and the laptop to 10/half without effect (they started on auto/auto), so it's not speed/duplex negotiation.
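
A minimal sketch for putting a number on that repeating pattern, assuming the ping results have been copied out by hand as a list of booleans (True for a reply, False for a timeout). The function name and the sample data are hypothetical. The result is counted in pings, not seconds; turning it into seconds depends on the ping interval and on how long each timeout takes.

```python
from collections import Counter


def loss_period(results):
    """Return the most common gap (in pings) between consecutive losses, or None.

    `results` is a sequence of booleans, one per ping:
    True = reply received, False = timed out.
    """
    lost = [i for i, ok in enumerate(results) if not ok]
    gaps = [b - a for a, b in zip(lost, lost[1:])]
    if not gaps:
        return None  # fewer than two losses, so no period can be measured
    gap, _count = Counter(gaps).most_common(1)[0]
    return gap


# Hypothetical capture: five replies, then one timeout, repeated four times.
sample = ([True] * 5 + [False]) * 4
print(loss_period(sample))
```

Running the same function on the desktop's results should return None if it really loses nothing, which makes the two machines easy to compare side by side.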


Any assistance appreciated!

richardhartman Tue, 11/10/2009 - 11:44

Latest update....


In the above test environment, I can now turn this weird behavior on and off at will.


How?


By connecting or disconnecting an Ethernet switch that happens to be hanging off the subnet.


All along this has "felt" and "smelled" like an ARP problem. My current working theory is that the eswitch isn't handling its ARP cache properly. It presumably has an entry for the laptop's MAC associated with one of its ports. When I move the laptop to another subnet (one which has some connection, somehow, to this eswitch), obviously the laptop will announce itself via ARP on that new subnet. But if the eswitch doesn't retire its ARP cache entry associated with that other port, then the eswitch will announce that it has a DIFFERENT route to that MAC address. Connected devices will therefore be receiving conflicting ARP data from two places (the laptop and the eswitch), causing a sort of "route flapping" on the subnet. This would cause the intermittent operation I'm seeing.
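
To make the "conflicting announcements" idea concrete, here is a toy model, not a description of how the AP1230 or the eswitch actually works. It assumes a bridge that simply remembers the last port on which it saw each source MAC (bridges and switches keep this kind of MAC-to-port table, which is separate from an ARP cache mapping IP to MAC). The class, the port names, and the MAC address are all made up for illustration.

```python
class LearningBridge:
    """Toy MAC-learning table: remembers the last port each source MAC was seen on."""

    def __init__(self):
        self.table = {}  # MAC -> port where it was last seen
        self.moves = {}  # MAC -> number of times it changed port

    def learn(self, src_mac, port):
        old = self.table.get(src_mac)
        if old is not None and old != port:
            self.moves[src_mac] = self.moves.get(src_mac, 0) + 1
        self.table[src_mac] = port

    def out_port(self, dst_mac):
        """Port a frame for dst_mac would be sent to, or None if unknown."""
        return self.table.get(dst_mac)


LAPTOP = "00:11:22:33:44:55"  # made-up address

bridge = LearningBridge()
delivered = 0
for second in range(12):
    # The real laptop sends a ping from the Ethernet side.
    bridge.learn(LAPTOP, "ethernet")
    # Every sixth second, a stray frame with the same source MAC
    # arrives from the radio side (the assumed "second announcer").
    if second % 6 == 5:
        bridge.learn(LAPTOP, "radio")
    # The reply is forwarded according to the most recent table entry.
    if bridge.out_port(LAPTOP) == "ethernet":
        delivered += 1

print(delivered, "of 12 replies delivered;", bridge.moves[LAPTOP], "port moves")
```

In this model only the machine whose MAC is duplicated loses replies, and only for traffic that has to cross the bridge, while every other device on the same hub is untouched.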


What this doesn't explain is why other devices aren't similarly affected. Perhaps the laptop (running XP) isn't handling ARP in a way that causes the eswitch to retire its entry in favor of the newly announced ARP data.


I ran out of time to test further this morning, but will continue after normal business hours this evening. The good news is that, at the moment, it appears I at least have a way to cause it on demand.


More soon....


richardhartman Wed, 11/11/2009 - 08:13

I finally have this resolved, though I don't know **precisely** how it happened.


The switch itself wasn't the problem. However, it was the connection through which the problem was conveyed into the test environment. Turns out there is some device, somewhere on the larger network, that is advertising the laptop's MAC address.


We have several AP1230's configured as root bridges, and those connect to several AP1230's configured as WGB's. I would see that the laptop's MAC was being reported by a specific WGB, flush its ARP, and then the laptop's MAC would be reported by a different WGB! WTF?!?


I chased that for a while. I figured the laptop's MAC was being picked up by the various WGB's round-robin style, so perhaps if I could flush them close to simultaneously I could break the chain. I opened up multiple connections to five separate WGB's, had the commands pretyped, and tried to send them as fast as possible... but apparently I couldn't get in front of it. The source of the laptop's MAC address kept jumping from WGB to WGB. Keep in mind here that the laptop was not actually connected to ANY of the WGB's in question.


Interestingly, when I would go to the reporting device its ARP cache never held the laptop's MAC. In other words, the laptop's MAC was being reported by WGB's that claimed to not know about it. Weird, weird, weird.


Finally, I went into the laptop's registry and changed its MAC address by one. Presto - everything works properly again. I can move the laptop all over the network, make it a client of any WGB, of any rootbridge, on any port of the switch, and it all works just fine.
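
For anyone wanting to script the "change it by one" step, a small helper along these lines would compute the new address. This is only a sketch: the function name and the example address are hypothetical, and it does not touch the registry or the adapter, it just does the arithmetic.

```python
def increment_mac(mac, step=1):
    """Return `mac` (colon- or hyphen-separated hex) plus `step`, wrapping at 48 bits."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {mac!r}")
    value = (int(digits, 16) + step) % (1 << 48)
    out = f"{value:012x}"
    # Re-insert a colon after every two hex digits.
    return ":".join(out[i:i + 2] for i in range(0, 12, 2))


# Made-up address chosen to show the carry into the next octet.
print(increment_mac("00:11:22:33:44:ff"))
```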


And yes, the old MAC address is still floating around on the network. I still haven't been able to get rid of it. I guess I now truly have a "ghost in the machine".


This is definitely one of the weirder debugging episodes.
