We have an issue which is related to bug CSCuj73571 .
IP traffic in all VLANs works fine. As soon as there is ARP traffic, the switch stops processing it. We also noted 100% utilisation on CPU core 0 during this issue. The client is using SUP8Es in 4506-E chassis with cat4500es8-universal.SPA.03.03.00.XO.151-1.XO.bin
How long it takes to fail will differ from what we saw, depending on how much ARP traffic is on your VLAN(s). It could fail in seconds or, in our client's case, after 45 minutes, which we could set a clock by.
This is exactly what we see (as per bug CSCuj73571):
#show platform cpu packet driver
Forerunner Packet Engine 0.28 (0)
Receive Queues: received packets summary
Qu  Capac  Guara  CurPo  Unpro  Accum  Kept  BperP  Packets
 2   2512    112   2303      0      3  2511     64   339959  <--- Kept stays at 2511, Packets does not increment
 8   1008    512     67      0      3     3     64       67
 9   2512    304     96      0      0     0    433       96
Receive Queues: dropped packets summary
Qu  Total Packets  Drop No Cell  Drop Overrun  Drop Underrun
 2         339959     100390067             0              0  <--- Drop No Cell increments
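For anyone who wants to watch for this condition automatically, here is a minimal sketch in Python that compares two captures of the output above and flags any receive queue whose Kept count sits at Capac - 1 while Packets does not move. The function names and the "Capac - 1" threshold are my own assumptions based only on the symptom shown in this post. It is not a Cisco tool, and how you collect the two captures (SSH, console log, etc.) is left out.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    capacity: int
    kept: int
    packets: int


def parse_receive_queues(output: str) -> dict[int, QueueStats]:
    """Parse the rows under 'Receive Queues: received packets summary'."""
    queues: dict[int, QueueStats] = {}
    in_summary = False
    for line in output.splitlines():
        # Drop any hand-written "<---" annotation before splitting columns.
        text = line.split("<---")[0].strip()
        if text.startswith("Receive Queues:"):
            in_summary = "received packets summary" in text
            continue
        fields = text.split()
        # Data rows have nine numeric columns: Qu .. Packets.
        if in_summary and len(fields) == 9 and all(f.isdigit() for f in fields):
            qu, capac, _, _, _, _, kept, _, packets = map(int, fields)
            queues[qu] = QueueStats(capac, kept, packets)
    return queues


def stuck_queues(before: str, after: str) -> list[int]:
    """Return queues that are full (assumed: Kept >= Capac - 1) and frozen."""
    first = parse_receive_queues(before)
    second = parse_receive_queues(after)
    stuck = []
    for qu, now in second.items():
        prev = first.get(qu)
        if prev is None:
            continue
        full = now.kept >= now.capacity - 1
        frozen = now.packets == prev.packets and now.kept == prev.kept
        if full and frozen:
            stuck.append(qu)
    return sorted(stuck)
```

Feed it two captures taken some seconds apart, for example `stuck_queues(capture_1, capture_2)`. A queue that is merely busy will show Packets increasing between captures and will not be reported.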
This issue is 100% reproducible with a traffic generator, in any VLAN, generating random ARP traffic at any speed or flow rate. The workaround listed in the bug does not work for us, as we could reproduce the problem in any VLAN, not just VLAN 1.
We found that entering these commands from conf t:
ip arp inspection vlan 1-4094
no ip arp inspection vlan 1-4094
(Note: we did it on all VLANs because we wanted to test what we had in the lab against all of the client's VLANs.)
This solves the issue and brings CPU load on core 0 back to normal. The kept buffers also operate correctly again and increment/decrement as designed. However, after a reboot (or power loss) you will be back to the issue. Using Event Manager to write an applet that runs this at startup is not a good workaround either.
The issue is only around ARP traffic on your network. You can rate limit all you want, or remove rate limiting on ARP packet inspection, but unless you toggle inspection on and then off, it will continue to fail. We also noted while debugging this in the lab that if you turn ARP inspection back on after disabling it, you will eventually have a network failure on ARP packets, i.e. ARP will not work on your network.
There is no fix for this on the SUP8E in our release. A fix appears to be available only in 3.6.E, which is not acceptable. Our client wants to return the SUP8s and put in SUP7s, which have a fix. This is also not good, so we are looking to Cisco to solve this issue ASAP.
Anyone from Cisco is welcome to contact me on this subject.
Robert Thompson, CCIE #10302
Please open a TAC case on this if possible. Once it is open, feel free to message me the SR number and I can assist.
We recently hit the same issue, and solved it with the following EEM script:
event manager applet toggleIpv6Snooping authorization bypass
event syslog occurs 1 pattern "Terminal state reached"
action 100 cli command "enable"
action 200 cli command "configure terminal"
action 300 cli command "vlan configuration 1"
action 400 cli command "ipv6 snooping"
action 500 cli command "no ipv6 snooping"
action 600 cli command "end"
action 700 syslog msg "******** TAC_EEM Complete: Vlan1 Workaround Applied ********"
Although this is a good workaround, you don't really want it in a production network, and that is why we pushed Cisco for the answer.
The problem comes down to how Cisco monitors your ARP traffic, DHCP traffic and a bunch of other data. It's an undocumented feature, and the fix is to turn off the macros that are turned on by default. The feature was introduced in IOS XE 3.3.0 (15.1), which is the version we were running.
no macro auto monitor
That command sorts the problem out: packets are no longer directed to the CPU queue for processing (inspection), so the queues no longer fill up.
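If you need to check which switches already have the workaround, a quick sketch in Python along these lines could be run against saved configs. It assumes that, once entered, the line "no macro auto monitor" appears verbatim in the running config. I have not confirmed that on every release, so treat it as an assumption. The function name is just for illustration.

```python
def auto_monitor_disabled(running_config: str) -> bool:
    """True if the config text contains the line 'no macro auto monitor'.

    Assumption: the disabled state is written to the running config
    exactly as the command was typed.
    """
    return any(
        line.strip() == "no macro auto monitor"
        for line in running_config.splitlines()
    )
```

It only inspects text you give it and makes no changes to the switch.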
Hope people find that information useful. The original bug will be updated today with reasons and the workaround.
Robert Thompson, CCIE #10302
Definitely test this on a single site before deploying to your enterprise. We are running into these same ARP issues on our SUP8Es, which are installed in our 4500 chassis.
After entering this command I noticed lots of network links going down. As you can imagine, it was quite a fire drill to bring things back as quickly as possible.
One thing I noticed is that any trunk connected by a single link seemed to go down after this command.
However our trunks that were connected via a Port Channel stayed up and online.
These devices were still seen as CDP neighbors of the 4500, with IP information available via cdp neighbor detail, but they were not reachable until I entered the command macro auto monitor back into the switch and then cleared ARP.
I have a client who has had some problems with his 4500X running an IOS that has this same bug, but in his case he did not have any device with IPv6 configured. Are you using IPv6?
No, we were not running IPv6.
Can you check that all your traffic or at least traffic with issues is not in VLAN 1? If it is, move the traffic to a new VLAN.