Solved: NetFlow Issue

balla-zoltan · ‎01-19-2010

Last Friday we had a 10-minute outage because of the following errors:

2010-01-15 16:26:45    Local7.Warning    10.65.63.2    148586: Jan 15 16:28:55.663 EDT: %EARL-DFC2-4-NF_USAGE: Current Netflow Table Utilization is 74%
2010-01-15 16:26:55    Local7.Warning    10.65.63.2    148587: Jan 15 16:29:06.015 EDT: %EARL_NETFLOW-DFC2-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [99%]
2010-01-15 16:27:01    Local7.Warning    10.65.63.3    180881: Jan 15 16:29:11.628 EDT: %EARL_NETFLOW-DFC2-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [90%]
2010-01-15 16:27:07    Local7.Warning    10.65.63.3    180882: Jan 15 16:29:18.004 EDT: %EARL_NETFLOW-DFC1-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [99%]
2010-01-15 16:27:09    Local7.Warning    10.65.63.2    148588: Jan 15 16:29:20.039 EDT: %EARL_NETFLOW-SP-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [97%]
2010-01-15 16:27:15    Local7.Warning    10.65.63.2    148589: Jan 15 16:29:25.423 EDT: %EARL_NETFLOW-SPSTBY-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [96%]
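To see which module crossed which level, the utilization figure can be pulled out of messages like the ones above. This is only a minimal sketch: the regular expression is an assumption fitted to these sample lines, not a general Cisco syslog grammar, and `parse_utilization` is a hypothetical helper name.

```python
import re

# Pattern fitted to the two message types shown above (an assumption,
# not an official message format definition).
PATTERN = re.compile(
    r"%EARL(?:_NETFLOW)?-(?P<module>[A-Z0-9]+)-4-(?P<mnemonic>NF_USAGE|TCAM_THRLD):"
    r".*?(?:is |\[)(?P<pct>\d+)%"
)

def parse_utilization(line):
    """Return (module, mnemonic, percent), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("module"), m.group("mnemonic"), int(m.group("pct"))

sample = [
    "148586: Jan 15 16:28:55.663 EDT: %EARL-DFC2-4-NF_USAGE: Current Netflow Table Utilization is 74%",
    "148587: Jan 15 16:29:06.015 EDT: %EARL_NETFLOW-DFC2-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [99%]",
    "148588: Jan 15 16:29:20.039 EDT: %EARL_NETFLOW-SP-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [97%]",
]
parsed = [parse_utilization(s) for s in sample]
```

Grouping the parsed tuples by module would show whether the line cards and the supervisor fill up at the same time.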

I looked on Cisco's website for answers and found one that said we need to disable service internal and make the flow aging more aggressive. Has anyone had this issue before? Does anyone know what would cause this kind of issue?

Thanks

Giuseppe Larosa · ‎01-19-2010

Hello Zoltan,

>> Does anyone know what would cause this kind of issue?

Traffic variety can cause this. The NetFlow TCAM table has a limited size, so when the switch tries to track many distinct flows for NetFlow accounting purposes, the table fills up, and in your case that had an impact.

I see DFC2 in the messages, so my guess is that your C6500 uses a supervisor2, MSFC2, PFC2 combination.

the Netflow TCAM size for sup2/MSFC2 + PFC2 is reported here:

Table 50-3 NetFlow table utilization
PFC	Recommended NetFlow Table Utilization	Total NetFlow Table Capacity
PFC3BXL	235,520 (230 K) entries	262,144 (256 K) entries
PFC3B	117,760 (115 K) entries	131,072 (128 K) entries
PFC3A	65,536 (64 K) entries	131,072 (128 K) entries
PFC2	32,768 (32 K) entries	131,072 (128 K) entries

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/netflow.html#wp1106378
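As a rough aid, the table above can be turned into a percentage check. This is a sketch only: it assumes the percentage printed in the log messages is relative to the total table capacity, which the thread does not confirm, and the function names are hypothetical.

```python
# Values copied from Table 50-3 above: (recommended entries, total capacity).
NETFLOW_TABLE = {
    "PFC3BXL": (235_520, 262_144),
    "PFC3B": (117_760, 131_072),
    "PFC3A": (65_536, 131_072),
    "PFC2": (32_768, 131_072),
}

def recommended_pct(pfc):
    """Recommended utilization expressed as a percentage of total capacity."""
    recommended, total = NETFLOW_TABLE[pfc]
    return 100.0 * recommended / total

def over_recommended(pfc, utilization_pct):
    """True if a reported utilization exceeds the recommended level.

    Assumes utilization_pct is measured against total capacity.
    """
    return utilization_pct > recommended_pct(pfc)
```

Under that assumption the recommended level works out to 25% on a PFC2 and just under 90% on a PFC3B, so a reported 99% is over the recommendation on any of the listed cards.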

An explanation of MLS timers for netflow table is here:

To keep the NetFlow table size below the recommended utilization, enable the following parameters when using the mls aging command:

•normal—Configures an inactivity timer. If no packets are received on a flow within the duration of the timer, the flow entry is deleted from the table.

•fast aging—Configures an efficient process to age out entries created for flows that only switch a few packets, and then are never used again. The fast aging parameter uses the time keyword value to check if at least the threshold keyword value of packets have been switched for each flow. If a flow has not switched the threshold number of packets during the time interval, then the entry is aged out.

•long—Configures entries for deletion that have been active for the specified value even if the entry is still in use. Long aging is used to prevent counter wraparound, which can cause inaccurate statistics.

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/netflow.html#wp1147986
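The three timers described above can be modelled roughly as follows. This is a simplified sketch of the quoted descriptions, not the actual hardware behaviour: the `Flow` and `AgingConfig` types, the order in which the timers are checked, and the idea that fast aging is evaluated once `fast_time` has elapsed since the entry was created are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    created: float    # time the entry was created, in seconds
    last_seen: float  # time of the most recent packet, in seconds
    packets: int      # packets switched so far

@dataclass
class AgingConfig:    # hypothetical container for the mls aging values
    normal: float         # inactivity timer
    fast_time: float      # fast aging time window
    fast_threshold: int   # minimum packets expected within fast_time
    long: float           # maximum lifetime of an entry

def aging_reason(flow, now, cfg):
    """Return which timer would remove the entry, or None to keep it."""
    age = now - flow.created
    if age >= cfg.long:
        return "long"      # removed even though the flow may still be active
    if now - flow.last_seen >= cfg.normal:
        return "normal"    # no packets within the inactivity timer
    if age >= cfg.fast_time and flow.packets < cfg.fast_threshold:
        return "fast"      # too few packets during the fast aging window
    return None

# Illustrative values only; they are not defaults or recommendations.
cfg = AgingConfig(normal=300, fast_time=128, fast_threshold=10, long=1920)
short_flow = Flow(created=0, last_seen=100, packets=3)
busy_flow = Flow(created=0, last_seen=100, packets=50)
```

With these values, `aging_reason(short_flow, 130, cfg)` returns `"fast"` while `aging_reason(busy_flow, 130, cfg)` returns `None`, which is the distinction fast aging is meant to draw.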

The suggestion for tuning is the following:

If you need to enable MLS fast aging time, initially set the value to 128 seconds. If the size of the NetFlow table continues to grow over the recommended utilization, decrease the setting until the table size stays below the recommended utilization. If the table continues to grow over the recommended utilization, decrease the normal MLS aging time.

You can only attempt to prevent the filling of the table and the price to pay is less accuracy in netflow accounting.

With aggressive timers you can "miss" some flows: those that last only a short time, or that do not carry enough packets within the time window explained above, are removed from the table and so are not exported to the NetFlow collector.

There is a feature called NDE flow filter but it influences only what flows are exported to NFC not what is in the netflow TCAM table.

see

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/nde.html#wp1140829

It is a trade off between accuracy and scalability / stability of device.

Other users have reported similar issues, though perhaps without an impact on traffic forwarding.

Where is the C6500 placed: in a service provider POP, or at an internet exchange point?

It is not traffic volume that counts but how many IP flows classified per your NDE mask are seen.

So another point to investigate is which flow mask you are using now: a more detailed definition of flows generates more entries in the table.

see

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/netflow.html#wp1057334

This is probably the first aspect to check. As other users have reported, there are cases where different features require different flow masks.

Hope to help

Giuseppe

balla-zoltan · ‎01-19-2010

Thank you for the quick answer, Giuseppe. I have included the output from sh mod. As you can see, we use SUP720s, and the 6509s are in the core network with 4948s connected to them by 10G. Each 4948 connects to two 6509s for redundancy. The servers are attached to the 4948s. I also saw a lot of NetFlow creation failures on both switches. It was strange that this caused an outage. We have Mazu devices to collect and analyse flows.

Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
1    8 CEF720 8 port 10GE with DFC            WS-X6708-10GE      SAD112002N4
2    8 CEF720 8 port 10GE with DFC            WS-X6708-10GE      SAD111904F8
3    4 CEF720 4 port 10-Gigabit Ethernet      WS-X6704-10GE      SAL1109JABK
4   48 CEF720 48 port 10/100/1000mb Ethernet WS-X6748-GE-TX     SAL1119NJ1P
5    2 Supervisor Engine 720 (Active)         WS-SUP720-3B       SAL1122Q527
6    2 Supervisor Engine 720 (Hot)            WS-SUP720-3B       SAL1122QEZX
7   48 CEF720 48 port 1000mb SFP              WS-X6748-SFP       SAL1122QCP4
8   48 CEF720 48 port 10/100/1000mb Ethernet WS-X6748-GE-TX     SAL09316RWY

Mod MAC addresses                       Hw    Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
1 001b.d483.5624 to 001b.d483.562b   1.3   12.2(18r)S1 12.2(33)SXI Ok
2 001b.539d.2820 to 001b.539d.2827   1.3   12.2(18r)S1 12.2(33)SXI Ok
3 001a.6cf5.bf54 to 001a.6cf5.bf57   2.5   12.2(14r)S5 12.2(33)SXI Ok
4 001b.d452.55f0 to 001b.d452.561f   2.5   12.2(14r)S5 12.2(33)SXI Ok
5 0016.c85e.ab24 to 0016.c85e.ab27   5.4   8.4(2)       12.2(33)SXI Ok
6 0017.9568.eb48 to 0017.9568.eb4b   5.4   8.4(2)       12.2(33)SXI Ok
7 001b.d45d.cb30 to 001b.d45d.cb5f   1.10 12.2(14r)S5 12.2(33)SXI Ok
8 0014.f212.3a58 to 0014.f212.3a87   2.3   12.2(14r)S5 12.2(33)SXI Ok

Mod Sub-Module                  Model              Serial       Hw     Status
---- --------------------------- ------------------ ----------- ------- -------
1 Distributed Forwarding Card WS-F6700-DFC3C     SAD112104MW 1.0    Ok
2 Distributed Forwarding Card WS-F6700-DFC3C     SAD112104FB 1.0    Ok
3 Distributed Forwarding Card WS-F6700-DFC3B     SAD111605VH 4.6    Ok
4 Distributed Forwarding Card WS-F6700-DFC3B     SAD11160027 4.6    Ok
5 Policy Feature Card 3       WS-F6K-PFC3B       SAL1122Q5TN 2.3    Ok
5 MSFC3 Daughterboard         WS-SUP720          SAL1122Q3QC 3.0    Ok
6 Policy Feature Card 3       WS-F6K-PFC3B       SAL1122Q9LX 2.3    Ok
6 MSFC3 Daughterboard         WS-SUP720          SAL1123QJ6W 3.0    Ok
7 Distributed Forwarding Card WS-F6700-DFC3B     SAL1110JMHG 4.6    Ok
8 Distributed Forwarding Card WS-F6700-DFC3A     SAL08486L4F 2.2    Ok

Mod Online Diag Status
---- -------------------
1 Pass
2 Pass
3 Pass
4 Pass
5 Pass
6 Pass
7 Pass
8 Pass

balla-zoltan · ‎01-19-2010

Giuseppe,

Here is the configuration we have on the 6509s for NetFlow:

no mls acl tcam share-global
mls aging fast time 30 threshold 128
mls aging long 64
mls aging normal 32
mls netflow interface
mls flow ip interface-full
no mls flow ipv6
mls nde sender

ip flow-export source Vlan120
ip flow-export version 9
ip flow-export destination 10.65.63.151 4000

And under each VLAN interface:

interface Vlan90
 ip flow ingress

The mls aging fast was added after the problem we had.
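When comparing the aging settings across several switches, it can help to pull the timers out of the configuration text. The following is a small sketch with a hypothetical `parse_mls_aging` helper; it only understands the three `mls aging` line shapes shown in the configuration above.

```python
import re

# The three aging lines from the configuration posted above.
CONFIG = """\
mls aging fast time 30 threshold 128
mls aging long 64
mls aging normal 32
"""

def parse_mls_aging(text):
    """Extract the mls aging values from configuration text into a dict."""
    timers = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"mls aging fast time (\d+) threshold (\d+)", line)
        if m:
            timers["fast_time"] = int(m.group(1))
            timers["fast_threshold"] = int(m.group(2))
            continue
        m = re.fullmatch(r"mls aging (long|normal) (\d+)", line)
        if m:
            timers[m.group(1)] = int(m.group(2))
    return timers

timers = parse_mls_aging(CONFIG)
```

Lines that are not `mls aging` commands are simply ignored, so the whole running configuration could be passed in.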

Giuseppe Larosa · ‎01-19-2010

Hello Zoltan,

you have provided a lot of details.

You have PFC3B sup720 and this is good news.

Note that the use of the most detailed flowmask

mls flow ip interface-full

is going to create more entries in the table.

Let's take an example: if we define an IP flow using only the IP source address and IP destination address, any conversation between two given hosts is classified as a single flow in the table.

If we define a flow using more details, for example by adding the upper-layer protocol and ports, we get one entry for telnet, another for web access, and so on.

so depending on what features are on the device a flow mask like

destination-source—A more-specific flow mask. The PFC maintains one entry for each source and destination IP address pair. Statistics for all flows between the same source IP address and destination IP address aggregate into this entry.

can be of help in containing size of netflow TCAM table.

you are currently using the most specific flowmask

full—A more-specific flow mask. The PFC creates and maintains a separate table entry for each IP flow. A full entry includes the source IP address, destination IP address, protocol, and protocol ports.

•full-interface—The most-specific flow mask. Adds the source VLAN SNMP ifIndex to the information in the full-flow mask.

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/netflow.html#wp1057334

Check Table 50-1 and analyze the configuration of your device. If possible, moving to the destination-source flow mask can help, or alternatively to full instead of full-interface.
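The effect of the flow mask on table size can be illustrated by counting distinct keys for the same traffic. This is a sketch only: the packets are made-up sample data, the `Packet` type and `count_entries` function are hypothetical, and the key fields are simplified from the flow mask definitions quoted above.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "src dst proto sport dport vlan")

# Fields that make up one table entry under each mask
# (simplified from the quoted flow mask definitions).
MASKS = {
    "destination-source": ("src", "dst"),
    "full": ("src", "dst", "proto", "sport", "dport"),
    "full-interface": ("src", "dst", "proto", "sport", "dport", "vlan"),
}

def count_entries(packets, mask):
    """Count how many distinct table entries the packets would create."""
    fields = MASKS[mask]
    return len({tuple(getattr(p, f) for f in fields) for p in packets})

# Made-up traffic between one pair of hosts.
packets = [
    Packet("10.0.0.1", "10.0.1.1", "tcp", 40000, 23, 90),  # telnet
    Packet("10.0.0.1", "10.0.1.1", "tcp", 40001, 80, 90),  # web
    Packet("10.0.0.1", "10.0.1.1", "tcp", 40002, 80, 90),  # second web session
    Packet("10.0.0.1", "10.0.1.1", "tcp", 40002, 80, 91),  # same session seen on another VLAN
]
```

For this sample the same host pair produces one entry with destination-source, three with full, and four with full-interface, which is why the table fills faster with the more specific masks.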

Hope to help

Giuseppe

balla-zoltan · ‎01-19-2010

Giuseppe,

The configuration we have was given to us by Mazu when we originally put the Mazu devices on the network. What would the drawback be if we used something other than interface-full? Would we have less information on traffic? We use Mazu to see the traffic flows and to check whether we have any issues on the network.

Giuseppe Larosa · ‎01-19-2010

Hello Zoltan,

If the objective of NetFlow collection is to gather data for security purposes, defining flows at Layer 4, including protocol and ports, is highly desirable.

If so, you can monitor the current situation now that the fast aging timers have been changed.

If the objective is just to be able to classify traffic to and from the internet, a destination-source mask can be enough.

I see Mazu has been bought by Riverbed.

I agree that application performance analysis requires at least the full flow mask.

Hope to help

Giuseppe

horizons6 · ‎09-08-2010

Giuseppe,

Thank you for the detailed reply above. I'm facing the exact same issue (but with no traffic outage) and was looking for possible workarounds, so this helps a lot.

Regards,

Titus