We had a bizarre problem this morning with our wireless APs. All APs across two management subnets went down - 58 on one and 44 on the other. They are mostly 1231G access points. During this time, they were not reachable via ping, WLSE could not see them, and they would not associate clients. After 10-15 minutes, they all came back up.
The APs are spread over two attached buildings and connect to many different 4006 and 4507 chassis switches. Each chassis homeruns back to two 6509 campus switches, which use HSRP on each VLAN interface for redundancy.
The logs on the APs, WLSE, and switches show absolutely nothing happening at this time. The APs did not lose power or reboot, and the campus switches did not show HSRP having issues. WLSE did not show the APs fail, but it does show them all coming back up.
We're trying to find the common denominator here. Does anyone have any idea what might have happened? Has anyone seen anything like this before?
There was an IOS bug a few years ago which would occasionally cause an AP's F0 interface to become nonfunctional, although the AP did not lose power, the radio kept transmitting, and nothing showed up in the logs. The bug was triggered by receiving a particular type of packet over the wire.
This doesn't sound at all like that bug, though: the bad packet would hit one AP at a time rather than all at once, and the AP did not recover spontaneously; it had to be rebooted.
So, in short, I don't have any good advice for you, other than to console into one of the APs during the outage if it happens again and see if you can figure anything out that way.
Just APs on those subnets.
Come to think of it, all we know is that clients were unable to connect. I don't know for sure that clients were unable to associate. So either the radios were still working and all client subnets were also down, or the radios simply shut down. It's more likely that the radios were shut down... I doubt that each client subnet also dropped.
What type of authentication are your clients using? We had a similar problem with EAP-TLS and it turned out to be a bad memory leak in our RADIUS servers.
Since you are using standalone APs, if the APs were still up while the wired network was down, users would still have been able to see the SSID and associate.
I don't see how all your APs would simultaneously power off their radios, especially since they are standalone and not LWAPP (unless a change was pushed out to them from the WLSE?).
Are you able to see whether the APs' Catalyst ports went down during the outage?
We aren't using a RADIUS server, so I doubt it's related. Thanks for your post, tkhan.
The outage caused all APs to be unreachable, so we couldn't ping or telnet to the devices. We didn't get to console into any of them before the outage ended.
The APs are powered via PoE, so I doubt that the ethernet interfaces shut down. The logs show no rebooting and no resetting of any interface, wireless or wired. So the radios likely stayed up.
It's so confusing, haha...
It's looking more like a network event, most likely a spanning-tree event. This type of event (for example, someone looping a hub on one of your VLANs) could have prevented your WLSE from seeing your APs and kept your clients from reaching anything beyond their authenticated subnet.
That is very confusing. Honestly, it seems like a routing, spanning-tree or other network issue, especially since the APs are the only devices on those networks. How else could all the APs be simultaneously unreachable?
With PoE, I still see my Catalyst ports go up and down when APs reboot. So if your APs were physically powered off, you would have seen the Catalyst port bounce.
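If it helps, here is a rough Python sketch for checking a saved switch log for port bounces on the AP ports during the outage window. It assumes IOS-style `%LINK-3-UPDOWN` / `%LINEPROTO-5-UPDOWN` messages with `Mon DD HH:MM:SS` timestamps; CatOS chassis log differently, so the pattern would need adjusting there. The function name, the port names, and the year argument are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed IOS-style line, e.g.:
# "Mar  3 09:14:02: %LINK-3-UPDOWN: Interface FastEthernet3/12, changed state to down"
LINK_RE = re.compile(
    r"^\*?(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})(?:\.\d+)?:\s+"
    r"%(?:LINK-3|LINEPROTO-5)-UPDOWN:\s+"
    r"(?:Line protocol on )?Interface (?P<intf>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)


def port_flaps(lines, ap_ports, start, end, year):
    """Return (timestamp, interface, state) for AP ports that changed
    state between start and end (inclusive).

    The log timestamps carry no year, so the caller supplies one.
    """
    events = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue  # not a link up/down message
        # Collapse the double space IOS uses for single-digit days.
        ts_text = " ".join(m["ts"].split())
        ts = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
        if m["intf"] in ap_ports and start <= ts <= end:
            events.append((ts, m["intf"], m["state"]))
    return events
```

An empty result for every AP port over the outage window would back up the idea that the APs never lost power or link.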
Well, we have no hubs on the network, so it would require someone placing a hub somewhere AND looping it. Possible, but unlikely.
I originally thought it might be a broadcast storm, but it doesn't quite match the symptoms. A storm would cause more of an outage than those two subnets. We have IPT, and clogging the access layer uplinks with broadcasts would have resulted in complaints about phones not working too. If the uplinks weren't so clogged as to cause phone issues, we would have gotten occasional pings through to APs. And they certainly wouldn't have all dropped and come back up at once.
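To put that reasoning into something checkable if it happens again, here is a small Python sketch. It assumes you have ping results per AP recorded as (seconds, replied) pairs from some poller; the function name and data layout are hypothetical. The idea: with congestion you would expect some replies to sneak through and ragged start/end times, while a clean cut shows a zero reply ratio and nearly identical start and end times on every AP.

```python
def summarize_outage(samples):
    """Summarize ping loss across APs.

    samples: {ap_name: [(t_seconds, replied_bool), ...]}
    Returns None if no AP lost any ping, otherwise a dict with:
      aps_affected    - number of APs with at least one lost ping
      start_spread    - seconds between earliest and latest first loss
      end_spread      - seconds between earliest and latest last loss
      max_reply_ratio - highest fraction of replies seen by any AP
                        between its first and last lost ping
    """
    per_ap = {}
    for ap, probes in samples.items():
        probes = sorted(probes)
        lost = [t for t, ok in probes if not ok]
        if not lost:
            continue  # this AP never dropped a ping
        first, last = lost[0], lost[-1]
        in_window = [ok for t, ok in probes if first <= t <= last]
        per_ap[ap] = (first, last, sum(in_window) / len(in_window))

    if not per_ap:
        return None

    firsts = [v[0] for v in per_ap.values()]
    lasts = [v[1] for v in per_ap.values()]
    return {
        "aps_affected": len(per_ap),
        "start_spread": max(firsts) - min(firsts),
        "end_spread": max(lasts) - min(lasts),
        "max_reply_ratio": max(v[2] for v in per_ap.values()),
    }
```

A max_reply_ratio of 0 with spreads close to the polling interval would point away from a storm and toward something cutting the subnets off cleanly.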
It's possible that a bad route came through for some reason, which caused all traffic directed to those subnets to go to a black hole somewhere. But that's unlikely since we use EIGRP, and it's also unlikely that it would take down every VLAN on these two specific groups of APs.
Thanks for the ideas, guys, you're both being quite helpful. It's good to talk this out for sure. I do have a TAC case open, and I'm waiting for a reply from an engineer since I just forwarded him logs and show techs. If we come to any conclusion, I'll post it. Until then, I'm all for continuing the discussion.
Are you running any of these mechanisms on your network: UDLD, BPDU Guard, Loop Guard, Rapid PVST+? If not, then a flaky fibre connection on the uplinks or a looped switchport could cause your network issues. Short of a blatant DoS attack on them, I can't see what would affect only your AP subnets.
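For a quick check across many chassis, something like the Python sketch below could scan saved running-configs for the global forms of those features. It assumes IOS syntax (CatOS uses different commands) and only looks at global defaults, so per-interface settings such as BPDU Guard enabled on individual ports would be missed. The function and dictionary names are my own.

```python
import re

# Global IOS config lines for each protection. Anchoring at the start
# of the line keeps "no ..." forms from matching.
PROTECTIONS = {
    "Rapid PVST+": r"^spanning-tree mode rapid-pvst\b",
    "BPDU Guard (global default)": r"^spanning-tree portfast bpduguard default\b",
    "Loop Guard (global default)": r"^spanning-tree loopguard default\b",
    "UDLD (global)": r"^udld (enable|aggressive)\b",
}


def protections_enabled(running_config):
    """Map each protection name to True if its global line is present."""
    return {
        name: bool(re.search(pattern, running_config, re.MULTILINE))
        for name, pattern in PROTECTIONS.items()
    }
```

Running it against the output of "show running-config" from each switch gives a rough picture of where the gaps are.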
We have some protection as far as spanning-tree is concerned. Again though, it doesn't seem to be a broadcast storm for the reasons above.
I doubt it's a DoS attack simply because it started and stopped so suddenly. If it had gone on indefinitely, I would definitely consider that as a possibility.
Well, I'm stumped and the TAC engineer appears stumped as well. I'm leaning toward this resulting from our management VLANs going down, though I don't know why they would. I was informed by the TAC engineer that if the management VLAN is lost then the AP shuts down all client VLANs. That was news to me.
5 points to you both for trying to help :D
Any chance that the connection to the management system was interrupted? To the monitoring system it would look like all the APs/controllers went away, even though they didn't (hence no log activity).
... or whatever link connects the wireless system to the LAN / network that you're monitoring from?
Did the users see the drop?
Did your AAA servers see a surge of activity?
Well, it's an autonomous environment using WLSE, and it didn't go down during that time. It showed that the APs were all down on those two subnets. Losing connection to WLSE wouldn't make the APs drop like they did - we had no telnet access to them and could not ping them.
The users were dropped, but they don't use AAA so there's nothing to see there.
Again, such a mystery.