Strange wireless hiccup

Unanswered Question
Dec 2nd, 2008
User Badges:
  • Silver, 250 points or more

Hey all,

We had a bizarre problem this morning with our wireless APs. All APs across two management subnets went down - 58 on one and 44 on the other. They are mostly 1231G access points. During this time, they were not reachable via ping, WLSE could not see them, and they would not associate clients. After 10-15 minutes, they all came back up.

The APs are spread over two attached buildings and connect to many different 4006 and 4507 chassis switches. Each chassis homeruns back to two 6509 campus switches, which use HSRP on each VLAN interface for redundancy.

The logs on the APs, WLSE, and switches show absolutely nothing happening at this time. The APs did not lose power or reboot, and the campus switches did not show HSRP having issues. WLSE did not show the APs fail, but it does show them all coming back up.

We're trying to find the common denominator here. Does anyone have any idea what might have happened? Has anyone seen anything like this before?

Overall Rating: 5 (2 ratings)
gamccall Tue, 12/02/2008 - 08:47
User Badges:
  • Silver, 250 points or more

There was an IOS bug a few years ago which would occasionally cause an AP's F0 interface to become nonfunctional, although the AP did not lose power, the radio kept transmitting, and nothing showed up in the logs. The bug was triggered by receiving a particular type of packet over the wire.

This doesn't sound at all like that bug, though: the bad packet would hit one AP at a time rather than all at once, and the AP did not recover spontaneously; it had to be rebooted.

So, in short, I don't have any good advice for you, other than to console into one of the APs during the outage if it happens again and see if you can figure anything out that way.

dziminski Tue, 12/02/2008 - 08:56

Were any other devices on the management subnets affected? Or do you only have APs on those subnets?

jeff.kish Tue, 12/02/2008 - 08:59
User Badges:
  • Silver, 250 points or more

Just APs on those subnets.

Come to think of it, all we know is that clients were unable to connect. I don't know for sure that clients were unable to associate. So either the radios were still working and all client subnets were also down, or the radios simply shut down. It's more likely that the radios were shut down... I doubt that each client subnet also dropped.

tkhan Tue, 12/02/2008 - 09:32

What type of authentication are your clients using? We had a similar problem with EAP-TLS and it turned out to be a bad memory leak in our RADIUS servers.

dziminski Tue, 12/02/2008 - 09:35

Since you are using standalone APs, if the APs were still up while the wired network was down, users would still have been able to see the SSID and associate.

I don't see how all your APs would simultaneously power off their radios, especially since they are standalone and not LWAPP. (unless a change was pushed out to them from the WLSE?)

Are you able to see if the APs' Catalyst ports went down during the outage?
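
One way to check that from saved switch logs is to scan them for link-state messages. The sketch below is only an illustration: it assumes the switches run IOS and log the usual `%LINK-3-UPDOWN` and `%LINEPROTO-5-UPDOWN` messages. The function name `find_link_events` is made up for the example, and the pattern would need adjusting for CatOS or other log formats.

```python
import re

# Matches IOS-style link-state syslog messages such as
#   %LINK-3-UPDOWN: Interface <name>, changed state to down
#   %LINEPROTO-5-UPDOWN: Line protocol on Interface <name>, changed state to up
LINK_EVENT = re.compile(
    r"%(?:LINK-3|LINEPROTO-5)-UPDOWN: "
    r"(?:Line protocol on )?Interface (?P<interface>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)


def find_link_events(log_lines):
    """Return (interface, state, original_line) for every link-state message found."""
    events = []
    for line in log_lines:
        match = LINK_EVENT.search(line)
        if match:
            events.append(
                (match.group("interface"), match.group("state"), line.rstrip())
            )
    return events
```

Running this over a saved copy of `show logging` from each access switch and reading the entries around the outage window would show whether any AP-facing ports bounced.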

jeff.kish Tue, 12/02/2008 - 09:42
User Badges:
  • Silver, 250 points or more

We aren't using a RADIUS server, so I doubt it's related. Thanks for your post, tkhan.

The outage caused all APs to be unreachable, so we couldn't ping or telnet to the devices. We didn't get to console into any of them before the outage ended.

The APs are powered via PoE, so I doubt that the ethernet interfaces shut down. The logs show no rebooting and no resetting of any interface, wireless or wired. So the radios likely stayed up.

It's so confusing, haha...

tkhan Tue, 12/02/2008 - 09:46

It's looking more like a network event, most likely a spanning-tree event. This type of event (for example, someone looping a hub on one of your VLANs) could have prevented your WLSE from seeing your APs and kept your clients from reaching beyond their authenticated subnet.

dziminski Tue, 12/02/2008 - 09:48

That is very confusing. Honestly, it seems like a routing, spanning-tree or other network issue, especially since the APs are the only devices on those networks. How else could all the APs be simultaneously unreachable?

With PoE, I still see my Catalyst ports go up and down when APs reboot. So if your APs were physically powered off, you would have seen the Catalyst port bounce.

jeff.kish Tue, 12/02/2008 - 09:57
User Badges:
  • Silver, 250 points or more

Well, we have no hubs on the network, so it would require someone placing a hub somewhere AND looping it. Possible, but unlikely.

I originally thought it might be a broadcast storm, but it doesn't quite match the symptoms. A storm would cause more of an outage than those two subnets. We have IPT, and clogging the access layer uplinks with broadcasts would have resulted in complaints about phones not working too. If the uplinks weren't so clogged as to cause phone issues, we would have gotten occasional pings through to APs. And they certainly wouldn't have all dropped and come back up at once.
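
If it happens again, the "all dropped and came back at once" part can be checked against data rather than memory. Here is a minimal sketch, assuming some poller already records `(timestamp, ap_name, reachable)` samples in time order. The names `outage_windows` and `simultaneous` and the 30-second tolerance are illustrative choices, not anything WLSE provides.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# One poll result: when it was taken, which AP, and whether it answered.
Sample = Tuple[datetime, str, bool]


def outage_windows(samples: Iterable[Sample]):
    """Turn time-ordered samples into per-AP outage windows.

    Returns {ap: [(first_failed_sample, first_good_sample_after), ...]}.
    An outage still open at the end of the data has None as its end.
    """
    windows = {}
    down_since = {}
    for ts, ap, reachable in samples:
        windows.setdefault(ap, [])
        if not reachable and ap not in down_since:
            down_since[ap] = ts
        elif reachable and ap in down_since:
            windows[ap].append((down_since.pop(ap), ts))
    for ap, start in down_since.items():
        windows[ap].append((start, None))
    return windows


def simultaneous(windows, tolerance=timedelta(seconds=30)):
    """True if every AP had an outage and all first outages began within `tolerance`."""
    starts = [w[0][0] for w in windows.values() if w]
    if not starts or len(starts) != len(windows):
        return False
    return max(starts) - min(starts) <= tolerance
```

If the start times are tightly clustered, that points away from a storm that gradually chokes the uplinks and toward a single upstream event.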

It's possible that a bad route came through for some reason, which caused all traffic directed to those subnets to go to a black hole somewhere. But that's unlikely since we use EIGRP, and it's also unlikely that it would take down every VLAN on these two specific groups of APs.

Thanks for the ideas, guys, you're both being quite helpful. It's good to talk this out for sure. I do have a TAC case open, and I'm waiting for a reply from an engineer since I just forwarded him logs and show techs. If we come to any conclusion, I'll post it. Until then, I'm all for continuing the discussion.

tkhan Tue, 12/02/2008 - 10:21

Are you running any of these mechanisms on your network: UDLD, BPDU guard, loop guard, Rapid-PVST+? If not, a flaky fibre connection on the uplinks or a looped switchport could cause network issues like this. I can't see only your AP subnets being affected by anything other than a blatant DoS attack on them.

jeff.kish Wed, 12/03/2008 - 10:58
User Badges:
  • Silver, 250 points or more

We have some protection as far as spanning-tree is concerned. Again though, it doesn't seem to be a broadcast storm for the reasons above.

I doubt it's a DoS attack simply because it started and stopped so suddenly. If it had gone on indefinitely, I would definitely consider that as a possibility.

tkhan Wed, 12/03/2008 - 11:09

Hijacked SSID? Just thinking if I were to try and cause these symptoms, what would I do?

jeff.kish Fri, 12/05/2008 - 14:05
User Badges:
  • Silver, 250 points or more

Well, I'm stumped and the TAC engineer appears stumped as well. I'm leaning toward this resulting from our management VLANs going down, though I don't know why they would. I was informed by the TAC engineer that if the management VLAN is lost then the AP shuts down all client VLANs. That was news to me.

5 points to you both for trying to help :D

scottmac Sat, 12/06/2008 - 06:44
User Badges:
  • Green, 3000 points or more

Any chance that the connection to the management system was interrupted? That would make it look to the monitoring system as if all the APs/controllers went away, even though they didn't actually go down (hence no log activity).

... or whatever link connects the wireless system to the LAN / network that you're monitoring from?

Did the users see the drop?

Did your AAA servers see a surge of activity?

jeff.kish Mon, 12/08/2008 - 06:08
User Badges:
  • Silver, 250 points or more

Well, it's an autonomous environment using WLSE, and it didn't go down during that time. It showed that the APs were all down on those two subnets. Losing connection to WLSE wouldn't make the APs drop like they did - we had no telnet access to them and could not ping them.

The users were dropped, but they don't use AAA so there's nothing to see there.

Again, such a mystery.

