I have a Cisco 7606 and several Cisco 6509s with Sup 720 3BXLs (along with the compatible 3BXL distributed forwarding cards on my other line cards). I am running into some resource problems whereby the TCAMs will get overrun at various times. I'll see stuff like this:
%EARL_NETFLOW-DFC1-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [95%]
I am graphing L3 learned flow failures via SNMP (cseL3FlowLearnFailures), and I am seeing that a lot of flows are getting kicked over to the 6509/7606 CPU when the TCAM resources get exhausted.
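For anyone wanting to graph the same thing, here is a minimal Python sketch of turning two successive polls of cseL3FlowLearnFailures into a failures-per-second rate. It assumes the object is a 32-bit SNMP counter that can wrap between polls; the function name and the sample values are hypothetical, and the SNMP polling itself is left out.

```python
# Assumption: the polled object behaves like a 32-bit counter (wraps at 2**32).
COUNTER32_MODULUS = 2 ** 32


def counter_rate(prev_value, curr_value, interval_seconds, modulus=COUNTER32_MODULUS):
    """Return the per-second rate between two counter samples.

    The modulo handles a single counter wrap between the two polls.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (curr_value - prev_value) % modulus
    return delta / interval_seconds


# Hypothetical samples taken 300 seconds apart.
print(counter_rate(1000, 4000, 300))
# Same interval, but the counter wrapped between the two polls.
print(counter_rate(2 ** 32 - 100, 200, 300))
```

A spike in this rate lining up with the TCAM threshold messages is what suggests that new flows are being punted to the CPU.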
When this happens, the DNS infrastructure on our campus gets really mad. I am assuming that punting new flows to the Cisco CPU is causing some performance issues, since our beefy DNS infrastructure will start pounding the network with more and more DNS requests as there are more and more outstanding recursive queries going off-campus. So it looks like our DNS infrastructure is only making matters worse for our Cisco routers by demanding more flow resources just when the routers need to start purging more flow entries in the TCAMs to make room for new entries!
Our 7606 sits on our campus perimeter, so it bears the brunt of the load. Sometimes the router will even reboot as the NDE process sucks up the CPU:
%SYS-DFC6-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (15/8),process = NDE - IPV4.
So, I am trying to figure out a way to tune the ability of the router to be more efficient when handling TCAM resources. For now, I have left the mls normal and long aging timers at their defaults of 300 and 1920 seconds respectively, with no packet threshold.
I am focusing on the fast aging timer. When I first changed the default setting to 128 seconds with a 100 packet threshold, things did get better. However, it still isn't good enough (several router crashes).
My requirements are that I do full flow collection, with NO sampling, and no aggregation.
Given that, are there any recommendations for setting the mls fast aging timer to help me better deal with my DNS issues without unnecessarily overloading the export process and the downstream collectors with too many new flow records per second?
Here is what I am trying as of today:
sh mls netflow aging
             enable timeout  packet threshold
             ------ -------  ----------------
normal aging true   300      N/A
fast aging   true   30       16
long aging   true   1920     N/A
I am running SXH3 on the 6509s and SRC1 on the 7606.
College of William and Mary
I recommend increasing the fast aging packet threshold to 50 or 100. A higher threshold (or a lower aging time) makes fast aging more aggressive. Watch your switch processor performance, since an overly aggressive configuration may cause problems with overall stability. You can use the following commands to get switch CPU stats:
attach (active PFC)
show proc cpu
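To make the point about the threshold concrete, here is a rough Python sketch of the fast-aging rule as I understand it: a flow entry is purged once it has lived for the fast aging time and has still seen fewer packets than the threshold. This is a simplified model for reasoning about the settings, not how the PFC hardware is actually implemented, and the Flow class, function names, and sample packet counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    packets: int       # packets seen on this flow so far
    age_seconds: int   # seconds since the entry was created


def is_fast_aged(flow, fast_timeout, packet_threshold):
    """Simplified rule: old enough for the fast timer and below the packet threshold."""
    return flow.age_seconds >= fast_timeout and flow.packets < packet_threshold


def count_fast_aged(flows, fast_timeout, packet_threshold):
    """Count how many flows the simplified fast-aging rule would purge."""
    return sum(is_fast_aged(f, fast_timeout, packet_threshold) for f in flows)


# Hypothetical flow table: short DNS-like flows next to larger transfers.
sample = [Flow(2, 30), Flow(10, 30), Flow(40, 30), Flow(80, 30), Flow(500, 30)]

print(count_fast_aged(sample, fast_timeout=30, packet_threshold=16))   # fewer flows purged
print(count_fast_aged(sample, fast_timeout=30, packet_threshold=100))  # more flows purged
```

With the same 30 second timer, raising the threshold from 16 to 100 purges more of the sample flows, which frees TCAM entries sooner but also produces more export records for the collectors.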
Regarding the full TCAM:
I think Cisco handles a full TCAM badly. I suspect that when the TCAM fills up, the PFC clears all (or most) of the flows from it ;-( That would explain why utilization is only about 20% right after a TCAM overflow. I have seen this many times (maybe a Cisco engineer can explain it?)
In any case, if you have many connections you will not be able to export all of the flow information. A good command for getting the number of TCAM creation failures is:
show mls netflow table-contention aggregate
And sorry for the delay, I was on a business trip.
Have a nice day,