6509 CPU Utilization Keeps Climbing

May 22nd, 2007

I have a Catalyst 6509 with a SUP720 running a modular IOS. The IOS filename is s72033-adventerprisek9_wan-vz.122-18.SXF6.bin. I have noticed that the CPU utilization on this switch has increased steadily ever since it was last rebooted. On all of our other switches and routers the CPU is higher during the day and lower during the evening, but on this 6509 it only climbs and never decreases. The climb is slow: utilization starts at about 10% after a reboot, and within 2 months it is near 40%.


I have another 6509 that was just deployed and is also running a modular IOS (but a different version), and I am experiencing the exact same thing. On other 6509s that are running the standard (non-modular) IOS we do not see this.


Does anyone know if there are any known issues like this? I tried searching the bug lists, but I didn't see any obvious bugs.


Thanks,


-Steve
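
One way to tell a steady climb like this apart from a normal day/night cycle is to fit a straight line to polled CPU readings and look at the slope. The following is only a minimal sketch: the function name and the sample data are made up for illustration (they mimic the 10% to 40% over 2 months described above), and it assumes you already collect CPU readings somewhere, for example from your monitoring system.

```python
from statistics import mean

def cpu_trend_per_day(samples):
    """Least-squares slope of CPU % versus time in days.

    samples: list of (day, cpu_percent) pairs. Hypothetical polled data;
    needs readings from at least two different days.
    """
    days = [d for d, _ in samples]
    cpu = [c for _, c in samples]
    d_bar, c_bar = mean(days), mean(cpu)
    num = sum((d - d_bar) * (c - c_bar) for d, c in samples)
    den = sum((d - d_bar) ** 2 for d in days)
    return num / den

# Made-up readings: 10% at day 0 rising to 40% at day 60.
samples = [(day, 10 + 0.5 * day) for day in range(0, 61, 5)]
slope = cpu_trend_per_day(samples)
print(f"{slope:.2f} percentage points per day")
```

A device with only a daily cycle should show a slope close to zero over several weeks, while a slow leak shows a small but persistent positive slope.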



salmodov Tue, 05/22/2007 - 10:12

Please post the output of show proc cpu and let's see what is using up your resources.


Thanks

Steve

sbader48220 Tue, 05/22/2007 - 10:20

Here is the output from the 'show proc cpu' command.


CPU utilization for five seconds: 30%; one minute: 33%; five minutes: 34%

PID     5Sec    1Min    5Min    Process
1       0.3%    15.0%   16.0%   kernel
3       0.0%    0.0%    0.0%    qdelogger
4       0.0%    0.0%    0.0%    devc-pty
5       0.0%    0.0%    0.0%    devc-mistral.proc
6       0.0%    0.0%    0.0%    pipe
7       0.0%    0.0%    0.0%    dumper.proc
4104    0.0%    0.0%    0.0%    pcmcia_driver.proc
4105    0.0%    0.0%    0.0%    bflash_driver.proc
20490   0.0%    0.0%    0.0%    mqueue
20491   0.0%    0.0%    0.0%    flashfs_hes.proc
20492   0.0%    0.0%    0.0%    dfs_bootdisk.proc
20493   0.0%    0.0%    0.0%    ldcache.proc
20494   0.0%    0.0%    0.0%    watchdog.proc
20495   0.0%    0.0%    0.0%    syslogd.proc
20496   0.0%    0.0%    0.0%    name_svr.proc
20497   0.0%    0.0%    0.0%    wdsysmon.proc
20498   0.0%    0.0%    0.0%    sysmgr.proc
24578   0.0%    0.0%    0.0%    chkptd.proc
24595   0.0%    0.0%    0.0%    sysmgr.proc
24596   0.0%    0.0%    0.0%    syslog_dev.proc
24597   0.0%    0.0%    0.0%    itrace_exec.proc
24598   0.0%    0.0%    0.0%    packet.proc
24599   0.0%    0.0%    0.0%    installer.proc
24600   25.9%   16.6%   16.7%   ios-base
24601   0.0%    0.0%    0.0%    fh_fd_oir.proc
24602   0.0%    0.0%    0.0%    fh_metric_dir.proc
24603   0.0%    0.0%    0.0%    fh_fd_snmp.proc
24604   0.0%    0.0%    0.0%    fh_fd_none.proc
24605   0.0%    0.0%    0.0%    fh_fd_intf.proc
24606   0.0%    0.0%    0.0%    fh_fd_gold.proc
24607   0.0%    0.0%    0.0%    fh_fd_timer.proc
24608   0.0%    0.0%    0.0%    fh_fd_ioswd.proc
24609   0.0%    0.0%    0.0%    fh_fd_counter.proc
24610   0.0%    0.0%    0.0%    fh_fd_rf.proc
24611   0.0%    0.0%    0.0%    fh_fd_cli.proc
24612   0.0%    0.0%    0.0%    fh_server.proc
24613   0.0%    0.0%    0.0%    fh_policy_dir.proc
24614   2.8%    0.3%    0.2%    tcp.proc
24615   0.0%    0.0%    0.0%    ipfs_daemon.proc
24616   0.4%    0.2%    0.2%    raw_ip.proc
24617   0.0%    0.0%    0.0%    inetd.proc
24618   0.0%    0.1%    0.2%    udp.proc
24619   0.0%    0.1%    0.1%    iprouting.iosproc
24620   0.2%    0.1%    0.1%    cdp2.iosproc
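
With this many idle processes, it can help to filter the output down to the rows that matter. The following is a rough sketch, assuming the text was captured from the switch and that rows follow the "PID 5Sec 1Min 5Min Process" layout shown above; the function name and threshold are illustrative choices, not part of any Cisco tool.

```python
import re

# Matches rows like "24600 25.9% 16.6% 16.7% ios-base"
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+(\S+)\s*$")

def busy_processes(text, threshold=0.5):
    """Return (pid, five_min, name) for rows whose 5Min value is at least
    threshold, sorted with the busiest process first."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            pid, _five_sec, _one_min, five_min, name = m.groups()
            if float(five_min) >= threshold:
                rows.append((int(pid), float(five_min), name))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# A few rows copied from the output above.
sample = """\
PID 5Sec 1Min 5Min Process
1 0.3% 15.0% 16.0% kernel
24600 25.9% 16.6% 16.7% ios-base
24614 2.8% 0.3% 0.2% tcp.proc
24618 0.0% 0.1% 0.2% udp.proc
"""
top = busy_processes(sample)
for pid, five_min, name in top:
    print(pid, five_min, name)
```

On the sample rows this leaves only ios-base and kernel, which between them account for nearly all of the five-minute utilization.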


salmodov Tue, 05/22/2007 - 10:50

What about your logs? Is there anything in them that indicates a problem?

I wonder if you are having issues because packets are getting punted to the CPU instead of being hardware switched by CEF.

Please include the output of:

show logging
show ip arp sum
show processes cpu | exclude 0.00
show mls statistics

sbader48220 Tue, 05/22/2007 - 11:01

There is nothing in the logs to indicate any sort of a problem. Like I said, the CPU ramps up over several months. After a reboot, the cycle starts again. I was also wondering about packets being punted to the CPU, but everything appears to be running CEF. No ACLs or anything either.


Here are the outputs you requested. I omitted the 'show log' output, as it has nothing useful in it, but is quite long.


6509#show ip arp sum

1222 IP ARP entries, with 16 of them incomplete


6509#show processes cpu | exclude 0.0

CPU utilization for five seconds: 19%; one minute: 42%; five minutes: 43%

PID     5Sec    1Min    5Min    Process
1       0.1%    11.9%   15.6%   kernel
24600   17.1%   26.8%   24.7%   ios-base
24614   1.7%    1.8%    1.4%    tcp.proc
24616   0.2%    0.3%    0.3%    raw_ip.proc
24619   0.1%    0.2%    0.1%    iprouting.iosproc


6509#show mls statistics

Statistics for Earl in Module 5

L2 Forwarding Engine
  Total packets Switched                : 48640339585

L3 Forwarding Engine
  Total packets L3 Switched             : 48594000495 @ 15575 pps

  Total Packets Bridged                 : 26677053072
  Total Packets FIB Switched            : 20778492539
  Total Packets ACL Routed              : 0
  Total Packets Netflow Switched        : 0
  Total Mcast Packets Switched/Routed   : 127984744
  Total ip packets with TOS changed     : 2
  Total ip packets with COS changed     : 2
  Total non ip packets COS changed      : 0
  Total packets dropped by ACL          : 0
  Total packets dropped by Policing     : 0
  Total packets exceeding CIR           : 0
  Total packets exceeding PIR           : 0

Errors
  MAC/IP length inconsistencies         : 13
  Short IP packets received             : 0
  IP header checksum errors             : 0

Total packets L3 Switched by all Modules: 48594000495 @ 15575 pps




salmodov Tue, 05/22/2007 - 11:13

Can we try clearing the ARP table and see whether the issue is related to the MAC/IP length inconsistencies?

sbader48220 Tue, 05/22/2007 - 11:25

I cannot clear the ARP table at this time. This switch is in production.


I don't believe the 13 errors are the cause of this issue though. There have only been 13 of them, and the switch has been up almost 8 weeks. We have other switches with almost 40 inconsistencies, and they have no problems.


-Steve



sbader48220 Tue, 05/22/2007 - 12:21

I am probably going to revert to a non-modular IOS. I need to schedule downtime to do this. I was just hoping someone had some input or had seen this before.


Thanks.


-Steve


avmabe Fri, 05/25/2007 - 06:05

I haven't seen anyone in this thread ask, but do you by chance run BGP on this box? We had this issue with 7600s (same architecture) when using SUP720-B.


What version of SUP720 are you using and are you taking the full BGP table?

John Patrick Lopez Fri, 05/25/2007 - 08:08

I just want to share this: also check the IOS version of the switch. We had an incident in which both of our 4500 core switches crashed at the same time because they had been turned on at the same time. In that case, redundancy was totally useless. The cause was a memory leak.

JEFFREY SESSLER Wed, 10/31/2007 - 16:51

Did you ever resolve this?


I'm also seeing this exact problem with my 6509 running 12.2(33)SXH modular. Slow creep over time.


The show proc cpu output seems to indicate that this is in the kernel process. Looking at the details, the high CPU use comes from TID 14 of PID 1:


1 14 10 Running 0 (128K) 5h26m procnto-cisco


I can't find any other information past this.

sbader48220 Wed, 10/31/2007 - 18:22

The only resolution I found was to switch to a non-modular IOS image. I had to do this on 2 switches and haven't had a problem since.


After I had this problem I had a Cisco engineer on site about 2 weeks later and he said not to use modular in production systems until it evolves some more.


Hope this helps!


-Steve

cbeswick Thu, 11/01/2007 - 08:31

Hi,


We have had exactly the same issues with Modular IOS on the Sup720.


CPU was high on the ios-base process. We also suffered from excessively high CPU when issuing show tech-support, or even show run.


We are currently downgrading all sups to non-modular IOS on the latest Safe Harbor release, 12.2(18)SXF8.


This seems to have solved all our issues.


Stay away from the modular IOS!

JEFFREY SESSLER Thu, 11/01/2007 - 12:10

I've got a TAC case open now on this, so I'll let you know what happens. I've got a graph of CPU over time, and there is a nice stair-step progression of the CPU climbing. On Oct 27th I was at my normal baseline of 8%; now I'm at almost 30%. It looks like it jumps up every 5.5 to 6 hours.


Worst case, I'm thinking I'll just force a fail-over to the redundant Sup and see if that at least helps while Cisco sorts it out.
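
For anyone who wants to confirm a stair-step pattern like this from their own graph data, a rough sketch is to look for sudden rises between consecutive samples and then check the spacing between them. The function name, the threshold, and the hourly sample values below are all invented for illustration; they are not taken from the actual graph.

```python
def find_steps(samples, min_jump=0.5):
    """Return (hour, jump) for each consecutive pair of samples where
    CPU rises by at least min_jump percentage points."""
    steps = []
    for (_t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c1 - c0 >= min_jump:
            steps.append((t1, c1 - c0))
    return steps

# Hypothetical hourly averages: baseline 8%, one point higher every 6 hours.
samples = [(h, 8.0 + 1.0 * (h // 6)) for h in range(24)]
steps = find_steps(samples)
intervals = [b[0] - a[0] for a, b in zip(steps, steps[1:])]
print(steps)      # hours at which a jump was seen, with its size
print(intervals)  # hours between consecutive jumps
```

A constant interval between jumps points toward something periodic (a timer, a poller, a scheduled job) rather than traffic load.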

Steffen Lindemann Wed, 04/02/2008 - 01:43

Hi,


Did you ever get a solution from your TAC case?

I see the same issue.

6509, sup720, IOS 12.2(33)SXH and no fancy features enabled.

The CPU load increases in steps of approx. 15% every 3 weeks, and it is the ios-base process that takes the CPU.



sbader48220 Wed, 04/02/2008 - 06:01

The only solution I ever received to this problem was to run the non-modular IOS. Ever since switching to the non-modular IOS, I have not had any problems at all.

JEFFREY SESSLER Wed, 04/02/2008 - 08:13

Yes, Cisco did fix the problem for us. There was a memory leak in UDP that would slowly consume all available memory. As memory became scarce, the CPU was spending more and more time on freeing memory.


Cisco supplied us with a patched IOS, but the fix was then rolled into 12.2(33)SXH1. With the patch and now SXH1 installed, we're no longer seeing the issue.


If you've got dual SUPs and you can't upgrade IOS right away, just force a fail-over between the two sups every couple of weeks to keep the problem at bay.


Jeff

Steffen Lindemann Wed, 04/02/2008 - 12:31

Okay, here is the status after a day of working on it (and a possible way to replicate the issue).


The jumps in CPU load seem to originate from an open source network tool, netdisco (search Google or SourceForge for more info on this very nice tool).

It is based on the Net-SNMP package, which does all the SNMP I/O.

I was running the latest CVS version.


I don't have a spare 6509 to replicate the issue, but as part of normal netdisco operation there was a cron job that performed a topology discovery. The command is ./netdisco -r CORESWITCH


I have not had time to dig into why -r would behave differently from other netdisco operations, for instance mac-suck or arp-suck. Only -r seems to cause the additional CPU load.


I don't have a backup sup; the money was spent on a 2nd 6509 (to get the number of ports needed), so I really need software that is stable.


I have involved my vendor, and if anybody is interested I will report back here regarding the outcome.

Joseph W. Doherty Wed, 04/02/2008 - 05:13

I haven't checked recently, but I recall that none of the modular versions had passed Safe Harbor testing. If that is true, then unless the modular feature is really, really important to you, you might consider moving back to a non-modular image.

jschweng Wed, 07/30/2008 - 13:34

We recently upgraded 6 of our 6506 switches in our Development environment. On 2 of our switches, we are seeing a steady CPU utilization increase. "Show proc cpu detail sort" shows it is due to ios-base. That really doesn't tell us much. Our logs on the 2 switches show some SNMP authentication error messages but not much else. The other switches don't have this SNMP error message. Any ideas? We are running 12.2(33)SXH2a. I have a Cisco ticket too.

Steffen Lindemann Wed, 07/30/2008 - 23:08

I am 90% sure that my situation was as described below; however, I have not been able to test it afterward.


The problem appeared after I installed a network tool that did SNMP reads via snmpwalk.


I have not read through the RFCs, but there is an older snmpwalk and a newer bulkget for retrieving multiple SNMP values.

On my 6509s and 3750s there are large routing tables, and sometimes the boxes were slow to complete a snmpwalk of the whole OID tree.


What happened was that the job got stuck and had not ended before a new job was initiated. This happened once a week!


My solution was to tweak the tool to do bulkget, and I have not seen the problem since.
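
To illustrate why bulk requests can shorten a poll of a large table: a plain walk fetches one object per request (GETNEXT), while an SNMPv2c GETBULK request can return up to max-repetitions objects in a single response. The sketch below only counts requests under that simplified model. The function names and the object count are made up, and real behaviour also depends on response size limits and on the agent.

```python
import math

def walk_requests(num_objects):
    # Simplified: one GETNEXT request per object returned.
    return num_objects

def bulk_requests(num_objects, max_repetitions=10):
    # Simplified: each GETBULK response carries up to max_repetitions objects.
    return math.ceil(num_objects / max_repetitions)

n = 50_000  # hypothetical number of objects in a large table
print(walk_requests(n), bulk_requests(n, max_repetitions=25))
```

Net-SNMP ships both snmpwalk and snmpbulkwalk, so the two approaches can be compared against the same device from the command line.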


I still believe that IOS has an SNMP bug, because it looks like a DoS if the box cannot complete a snmpwalk of the full tree within one week.


So you might check whether you have the same situation: a tool doing snmpwalk (mine was the open source tool netdisco).
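
Independent of the SNMP side, a scheduled job can be kept from piling up on top of a stuck earlier run by taking a lock before it starts. This is a minimal sketch for Unix-like systems using an advisory file lock; the function name and lock file path are hypothetical, and this is not something netdisco itself is known to do.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Return an open file holding an exclusive lock on path,
    or None if another run already holds the lock."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f

# Hypothetical lock file location, created in a fresh temp directory for the demo.
lock_path = os.path.join(tempfile.mkdtemp(), "discovery-job.lock")

first = try_lock(lock_path)    # the run that is still in progress
second = try_lock(lock_path)   # simulates the next scheduled run starting early
print(first is not None, second is None)

first.close()                  # closing the file releases the lock
third = try_lock(lock_path)    # a later run can now proceed
print(third is not None)
third.close()
```

In a real job, the wrapper would exit quietly when try_lock returns None, so a slow poll delays the next run instead of stacking a second one on the switch.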
