6509 with MSFC Dropped packets/timeouts

epeeler · ‎04-27-2006

Hey all,

I have a pair of 6509s with MSFCs in them. They are set up as an HSRP pair, with one being active for all VLANs and the other being backup.

We have a monitoring server that pings each interface on the msfc. Lately I've been getting random failures on these interfaces. The vast majority of failures are on the primary 6509. It only fails once and then it's okay.

The primary msfc is running at about 33% cpu utilization, with peak traffic on the switch backplane showing at 16%. If I ping any interface on the primary msfc I'm getting response times between 0.5 ms and 7.0 ms with anywhere from 4 to 7 percent packet loss. This is consistent across all of the interfaces. Pinging any interface on the backup msfc gives me a consistent 0.5 ms response with 0% packet loss. These two switches are connected to each other via redundant gigabit links.
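For anyone who wants to put numbers like these side by side for the primary and the backup, here is a minimal sketch of how loss percentage and response-time range could be computed from a list of ping samples. The function name `summarize_pings` and the sample values are assumptions made up for illustration; they are not taken from the monitoring server.

```python
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize ping samples; None marks an echo request that timed out."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    lost = sent - len(replies)
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "min_ms": min(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }


# Hypothetical samples: 19 replies between 0.5 ms and 7.0 ms, plus 1 timeout.
samples = [0.5] * 15 + [2.1, 3.4, 7.0, 0.6] + [None]
print(summarize_pings(samples))
```

Running the same summary against each MSFC from the same source makes the difference between the two boxes easy to compare over time.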

I notice that when I telnet directly to the primary msfc, I get noticeably slower response on the console than if I telnet to the switch first and then connect to the msfc via session 15. Is this console performance disparity something that might be indicative of a bottleneck somewhere? It doesn't matter where I source the ping from. Even from the backup msfc I get the same (bad) ping performance from the primary.

Any ideas?

The cpu load and backplane traffic levels wouldn't seem to indicate a cpu or memory bottleneck but the fact that all interfaces show the same crummy response and reliability as well as the slow console make me think it's something system wide that's getting overloaded. I just can't see what it is.

sundar.palaniappan · ‎04-27-2006

Are you losing packets and seeing a slow connection only when you try to reach the addresses configured on the switch, or is this also affecting the user traffic that is passing through the switch?

If all traffic is affected, this might sound funny, but have you tried rebooting the switch since the problem started happening?

Actually, if you have a 2nd Sup/MSFC card in the switch just failover to the standby Sup and check the outcome.

The symptoms you mentioned above don't indicate any sort of resource utilization or network traffic problem.

Pls. rate the post if it helped.

HTH,

Sundar

epeeler · ‎04-28-2006

I only seem to be losing echo requests when they are directed at an interface on the primary msfc. If I ping any other device I get solid replies with no drops. The path to these devices is through the primary msfc so user traffic appears to be okay. My monitoring software also pings the servers in the data center and I'm not seeing the random timeouts on any of the servers. Just the router.

glen.grant · ‎04-27-2006

33% is somewhat high for the msfc. You must have a lot of traffic that needs to be process switched for some reason. Most traffic should be hardware switched and never hit the cpu. It might do you well to get a sniffer on there and see what the traffic is doing. Possible client infections somewhere?

a.tolstykh · ‎04-27-2006

Please post a list of processes sorted by the CPU usage, show int + show int stats on the monitored interface, OS version running on the SW

epeeler · ‎04-28-2006

Requested information in attached file.

Thanks so much.

vladrac-ccna · ‎04-28-2006

Hello there,

Checking the output, it does seem that the CPU is high due to IP Input, which means that a large portion of your traffic is hitting the CPU. I think we can see that in the "show int stat" output too, and that is not a good thing.

BTW, pings destined for the router itself will be processed by the CPU.

This high IP INPUT could indicate:

Interrupt switching is disabled on an interface (or interfaces) that has (have) a lot of traffic

Fast switching on the same interface is disabled

Fast switching on an interface providing policy routing is disabled

Traffic that cannot be interrupt-switched is arriving, which can be caused by a long list of things:

Packets for which there is no entry yet in the switching cache.

Broadcast traffic

IP packets with options

Packets that require protocol translation

Multilink Point-to-Point Protocol (supported in Cisco Express Forwarding switching)

Compressed traffic

Encrypted traffic

Packets destined for the router

So, I think we will need to identify why you are getting all this traffic process switched.

Vlad

BTW, just as a piece of advice, if your IOS permits it, try:

show proc cpu | e 0.00

show proc cpu sorted | e 0.00

to get a shorter output.

Also, show interfaces switching is a good command for troubleshooting these issues.
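If the IOS on the box does not accept the "| e" filter, the same trimming can be done offline on a saved capture. The sketch below is only an illustration: the helper name `exclude` is made up, and the process rows in `capture` are hypothetical values, not output from the attached file. It treats the pattern as a regular expression, the way the IOS exclude filter does.

```python
import re


def exclude(output: str, pattern: str) -> str:
    """Drop every line that matches the regex, like the IOS '| exclude' filter."""
    rx = re.compile(pattern)
    return "\n".join(
        line for line in output.splitlines() if not rx.search(line)
    )


# Hypothetical capture of 'show proc cpu'; the rows are made up for illustration.
capture = """\
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
   1           0         1          0  0.00%  0.00%  0.00%   0 Chunk Manager
  24     1234567   7654321        161 21.50% 20.10% 19.80%   0 IP Input
  30        4321      9876        437  0.15%  0.12%  0.10%   0 ARP Input
"""
print(exclude(capture, "0.00"))
```

Note that, as on the router, any line containing the pattern anywhere is dropped, so a process that was busy in the last 5 seconds but idle over 5 minutes would also be hidden.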

epeeler · ‎04-28-2006

Messed up the attachment first time round..

Here is the info.

vladrac-ccna · ‎04-28-2006

Hello, if you open the first attachment with WordPad, it opens with no problems (and is clear to read).

Could you provide a sample configuration for 1 of your vlans and physical interfaces?

Vlad

glen.grant · ‎04-30-2006

Vlan 11 seems to be the one putting the load on the CPU. If you look at the show int stat and also the show interface output, they show a lot of traffic hitting the switch processor, and you even have packets in the input queue, which means it is backing up somewhat. I think I would sniff vlan 11 and see what is going on. It is pretty amazing: we recently had something like this and finally traced it down to one person ghosting some stations using broadcast mode. That one person was enough to bury a Sup 720 at between 90 and 100% cpu, and basically you couldn't do anything with the box when he did this.

epeeler · ‎05-01-2006

You hit that one on the head. Vlan11 is a "lab" network and has a significant amount of weird traffic, including a lot of broadcasts.

I guess I've been under the mistaken impression that 33% cpu load is not "that" high but I guess it's enough to cause the issues I'm having. Unfortunately, there's not much I can do about the traffic levels in the lab since they have to do some odd things to test our product in there.

I'm considering just making the current hsrp backup the primary for vlan11, which will hopefully spread the load out enough to keep my random packet losses from happening.

epeeler · ‎05-01-2006

Another question. Vlan11 (the lab network) has several sub-interfaces, which allow us to have multiple layer 3 networks in a single broadcast domain.

Would it be advisable to turn on the "fast switching on same interface" for this vlan? That might have the potential to reduce the hits on the ip_input process.

I notice in the docs that this is not recommended because it can interfere with "redirection". Can anyone elaborate on what problems I might have if I do this?

vladrac-ccna · ‎05-02-2006

What kind of switching do you have now: process, fast, or CEF?

Fast switching will probably make things worse, as the first packet in every flow will hit the CPU.

You should have CEF wherever it's possible.

You can use the show ip interface command to see which switching methods are used on each interface.

Vlad

epeeler · ‎05-02-2006

Output pasted below for the lab interface. Does this mean that Fast Switching AND CEF are enabled?

=====================================

IP fast switching is enabled

IP fast switching on the same interface is disabled

IP Flow switching is disabled

IP CEF switching is enabled

IP Fast switching turbo vector

IP Normal CEF switching turbo vector

IP multicast fast switching is enabled

IP multicast distributed fast switching is disabled

IP route-cache flags are Fast, CEF

=========================================

6509L-MSFCs15>sho ip cache verbose

IP routing cache 0 entries, 0 bytes

30927 adds, 30927 invalidates, 0 refcounts

Minimum invalidation interval 2 seconds, maximum interval 5 seconds,

quiet interval 3 seconds, threshold 0 requests

Invalidation rate 0 in last second, 0 in last 3 seconds

Prefix/Length Age Interface Next Hop
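
For comparing several interfaces, the enabled/disabled lines in the show ip interface excerpt above can be pulled into a small table. This is a rough sketch only: the function name `switching_paths` is made up, and it simply restates what each line of the output says. It does not tell you which path a given packet actually takes.

```python
import re


def switching_paths(show_ip_interface: str) -> dict:
    """Map each 'IP <feature> is enabled|disabled' line to a boolean."""
    states = {}
    for line in show_ip_interface.splitlines():
        m = re.match(r"\s*IP (.+?) is (enabled|disabled)\s*$", line)
        if m:
            states[m.group(1)] = m.group(2) == "enabled"
    return states


# Excerpt copied from the lab interface output in this thread.
excerpt = """\
IP fast switching is enabled
IP fast switching on the same interface is disabled
IP Flow switching is disabled
IP CEF switching is enabled
IP Fast switching turbo vector
IP Normal CEF switching turbo vector
IP multicast fast switching is enabled
IP multicast distributed fast switching is disabled
IP route-cache flags are Fast, CEF
"""
for feature, enabled in switching_paths(excerpt).items():
    print(f"{feature}: {'enabled' if enabled else 'disabled'}")
```

Lines that do not end in "is enabled" or "is disabled" (the turbo vector and route-cache flags lines) are skipped on purpose.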