Welcome to the Party!
Some good news. By the time I went to bed Saturday night, I was feeling comfortable the network was up. We had the IDF and room switches mostly deployed. Wireless was working, and we finally got all of the digital media players online. Nonetheless, I planned to be in early Sunday morning (well, 7:45 is early for me). Jason, on the other hand, made it in much earlier. I think he was there around 5:45. (It also didn't help that his hotel was practically in Los Angeles.)
Now comes the bad news. No sooner had I arrived at the open NOC desk on the first floor of the San Diego Convention Center, unpacked my laptop, and looked up (circa 7:48 am) than I noticed something...wrong. I wasn't getting an IP address on the wireless network. Jason, who was on the network already, said that he had lost his connection to our NMS tools. The registration lines were already long, and some of the staff started to walk up to the NOC complaining about network connectivity. While Jason manned the NOC, I ran upstairs to the NOC office to find out what was going on. No one knew. We called into the security and core teams, then ran down to the MDF/core to start troubleshooting.
After a considerable amount of time analyzing the ASA firewalls and core 6500s, we found that we were hitting a relatively new bug in the ASA, and that both the active and standby contexts were taking turns crashing. We enabled a workaround, but we had already been down for a while. This was not a good way to start the show.
Once the network came back, it was reassuring to see that LMS had recorded the outage (as it, too, was disconnected from the main part of the network). We continued to monitor things, but we didn't want to make any changes or do any deep analysis at this time since we didn't want to cause another outage. That night, I texted some of the RTP firewall guys who were onsite as speakers. They happily came over to the NOC office to help dig deeper into the ASA problem. David White, Jay Johnston, and Magnus Mortensen worked with us until about 1:00 am analyzing the crash dumps, talking with development, and trying to reproduce the problem. Watching them work together was like watching a Borg collective...awesome stuff. By the end of the night, we knew what the problem was, how to reproduce it, and how to work around it. The bug was CSCua27134, and we now had safety measures in effect to prevent it from happening again. I guess I can go one more night without food...
Monday started out okay, but by late morning, people started complaining about the wireless. As our wireless guy, Mir, will tell you, it's never the wireless. It's always one of us wired guys making him look bad. Turns out, he was right this time. During the shipment of the equipment from San Jose, one of the heat sinks on one of the 6500 distribution switch modules had come loose. This caused the module to exhibit some strange problems. The wireless LAN controllers were dual-homed, but because the module was not failing outright, it caused some packet loss issues. You'll be happy to know that LMS did start picking up some syslog messages from the switch in question. Once the module was replaced, the wireless stabilized. This also represented the last of our network issues (not too bad for a show you build in a week).
Another bit of good news: that night I managed to sneak away to dinner with our resident fax guru, Gonzalo Salgueiro, and his family. We had some pretty good tapas (and I had a good paella).
Tuesday marked the first keynote of the show. Given some of the bumps we had earlier, Jason and I wanted to make sure the keynote was extra-monitored. We set up an additional video IPSLA collector to the switch we had located in the keynote area. We also monitored the syslog stream intently. However, the best network monitoring one can do in this situation is to put the stream of the keynote up on the big screen (well, wall really) in the NOC. If the video got interrupted, we had a problem. John Chambers' keynote went off without a hitch. The network worked like a boss.
I guess I should take some time to mention that Jason didn't give up getting in around 5:45 in the morning. He would come in early to sweep the network to make sure everything was ready for the day's festivities. If he found any issues, he would send an email out to the NOC team. For example:
Issue 1 - LMS is seeing sdcc_idf_3.5a port gi0/4 down - this goes to sdcc_ballrm_20d
Issue 2 - LMS is seeing sdcc_idf_3.8a port gi0/2 down - this goes to sdcc_ballrm_20bc
To do this, he used LMS's Fault Monitor view, Syslog view, and the performance monitor graphs on the monitoring dashboard. This way, when the various teams arrived, they could look at the issues in their domain.
Even though we didn't have any other major issues with the network, there were always people messing with the room switches (e.g., maybe they tripped over a cable, unplugged something they shouldn't, or powered down things when they left). Because of these issues, we would occasionally lose APs or DMPs. When that happened, we used User Tracking to look up the MAC of the AP (Network Control System reported the AP was down), and found the switch to which it connected. If that switch was up, we'd reset the port. If not, we used Topology Services in LMS to find the switch to which the down switch was connected and troubleshoot from there. From time of report, it usually took no more than a few minutes to remediate the situation.
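The triage steps above can be written down as a small decision procedure. What follows is only a sketch in Python, not anything we ran at the show and not an LMS API: the `Switch` class, the `triage_down_device` function, and the dictionaries standing in for User Tracking and Topology Services data are all hypothetical, and the MAC address and port in the example are made up. It also keeps walking upstream until it finds a reachable switch, which is a small generalization of the single hop described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Switch:
    """Hypothetical stand-in for what a topology view would tell us."""
    name: str
    reachable: bool
    uplink: Optional[str] = None  # name of the upstream switch, if known


def triage_down_device(
    mac: str,
    user_tracking: Dict[str, Tuple[str, str]],  # MAC -> (switch name, port)
    switches: Dict[str, Switch],
) -> str:
    """Return the next action for a down AP or DMP identified by its MAC."""
    if mac not in user_tracking:
        return f"no User Tracking entry for {mac}; check manually"

    switch_name, port = user_tracking[mac]
    switch = switches[switch_name]

    # Access switch is up: the quick fix is to bounce the port.
    if switch.reachable:
        return f"reset port {port} on {switch_name}"

    # Access switch is down: walk upstream until a reachable switch is found.
    seen = {switch_name}
    current = switch
    while current.uplink is not None and current.uplink not in seen:
        seen.add(current.uplink)
        parent = switches[current.uplink]
        if parent.reachable:
            return f"troubleshoot link from {parent.name} to {current.name}"
        current = parent

    return f"no reachable upstream switch found for {switch_name}; escalate"


if __name__ == "__main__":
    # Illustrative data only; the MAC and port are invented.
    switches = {
        "sdcc_idf_3.5a": Switch("sdcc_idf_3.5a", reachable=True),
        "sdcc_ballrm_20d": Switch(
            "sdcc_ballrm_20d", reachable=False, uplink="sdcc_idf_3.5a"
        ),
    }
    user_tracking = {"00:11:22:33:44:55": ("sdcc_ballrm_20d", "gi0/7")}
    print(triage_down_device("00:11:22:33:44:55", user_tracking, switches))
```

In this example the room switch is unreachable, so the function points at the IDF switch it hangs off of instead of suggesting a port reset.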
By the end of the show, we were gathering our statistics for the big BRKNMS-1035 "NOC at CiscoLive!" session. We were sending screenshots, stats, and pictures to Jason, who consolidated everything into the slide deck. It was a long ten days, but the end was in sight. The network team marched up to ballroom 6D to present what we had done and the lessons we learned. Despite all of the early problems, the network at CiscoLive! US 2012 beat records. We pushed out 51.7 TB of traffic with ~2.4% of that IPv6. We saw over 11,000 wireless clients associated. We finished with almost as many Network Academy students as we started (I think we still have a few MIAs in Marriott). We had an iPad named GigoloPad on the network.
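If you're curious what that IPv6 share means in absolute terms, here is a quick back-of-the-envelope calculation in Python. It assumes the 2.4% figure applies to the full 51.7 TB total and treats both numbers as exact, even though the percentage was quoted as approximate; the function name is just for illustration.

```python
TOTAL_TRAFFIC_TB = 51.7   # total traffic quoted for the show
IPV6_SHARE = 0.024        # ~2.4% of that traffic was IPv6 (approximate)


def ipv6_volume_tb(total_tb: float, share: float) -> float:
    """Return the traffic volume, in TB, represented by the given share."""
    return total_tb * share


if __name__ == "__main__":
    ipv6_tb = ipv6_volume_tb(TOTAL_TRAFFIC_TB, IPV6_SHARE)
    # 51.7 * 0.024 = 1.2408, so roughly 1.24 TB of IPv6 traffic
    print(f"IPv6 traffic: about {ipv6_tb:.2f} TB")
```

So a little over a terabyte of the show's traffic went over IPv6.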
This was a network that Cisco, with a team of great engineers, students, and coordinators, made possible. Let's see what 2013 has to offer.
For all of the stats, details, lessons, pictures, and sauciness, check out the attached slide deck we presented on the last day of the show. And yes, I did get to eat on the last night (before flying out at 6:25 am the next morning).