Cisco 7204VXR with NPE-G1 high CPU and output drops

Unanswered Question
Apr 22nd, 2010

Hello community,

I Have a Cisco 7204VXR router with NPE-G1 that  started acting weird for 24 hours. This router is dual attached to two  6k5 routers using multimode FO and SX GBICs.

The router is used  as an LNS termination point for PPPoVPDN sessions, we have a bunch of  them.

Here is a show ver output :

LNS1.IX1#sh version
Cisco IOS  Software, 7200 Software (C7200-ADVENTERPRISEK9-M), Version 12.2(33)SRD,  RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2008 by Cisco Systems, Inc.
Compiled Thu 23-Oct-08  12:58 by prod_rel_team

ROM: System Bootstrap, Version  12.3(4r)T1, RELEASE SOFTWARE (fc1)
BOOTLDR: 7200 Software  (C7200-KBOOT-M), Version 12.3(5a), RELEASE SOFTWARE (fc1)

LNS1.IX1 uptime is 21 weeks, 23 hours, 22 minutes
System  returned to ROM by reload at 13:36:53 UTC Wed Nov 25 2009
System  restarted at 13:39:45 UTC Wed Nov 25 2009
System image file is  "disk0:c7200-adventerprisek9-

mz.122-33.SRD.bin"
Last reload type: Normal Reload
Last reload reason: Reload command

This  product contains cryptographic features and is subject to United
States  and local country laws governing import, export, transfer and
use. Delivery of Cisco cryptographic products does not imply
third-party  authority to import, export, distribute or use encryption.
Importers,  exporters, distributors and users are responsible for
compliance  with U.S. and local country laws. By using this product you
agree to comply with applicable laws and regulations. If you are unable
to  comply with U.S. and local laws, return this product immediately.

A  summary of U.S. laws governing Cisco cryptographic products may be  found at:
http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If  you require further assistance please contact us by sending email to
export@cisco.com.

Cisco 7204VXR (NPE-G1) processor (revision B) with 983040K/65536K  bytes of memory.
Processor board ID 29498611
SB-1 CPU at 700Mhz,  Implementation 0x401, Rev 0.2, 512KB L2 Cache
4 slot VXR midplane,  Version 2.7

Last reset from power-on

PCI bus mb1 (Slots 1, 3 and 5) has a  capacity of 600 bandwidth points.
Current configuration on bus mb1  has a total of 0 bandwidth points.
This configuration is within the  PCI bus capacity and is supported.

PCI bus mb2 (Slots 2, 4 and 6) has a capacity of 600 bandwidth  points.
Current configuration on bus mb2 has a total of 0 bandwidth  points.
This configuration is within the PCI bus capacity and is  supported.

Please refer to the following document "Cisco 7200 Series Port  Adaptor
Hardware Configuration Guidelines" on Cisco.com <http://www.cisco.com>
for  c7200 bandwidth points oversubscription and usage guidelines.

1 FastEthernet interface
3 Gigabit Ethernet interfaces
509K  bytes of NVRAM.

1000944K bytes of ATA PCMCIA card at slot 0  (Sector size 512 bytes).
1000944K bytes of ATA PCMCIA card at slot 1  (Sector size 512 bytes).
62592K bytes of ATA PCMCIA card at slot 2 (Sector size 512 bytes).
16384K  bytes of Flash internal SIMM (Sector size 256K).
Configuration  register is 0x2102



Starting yesterday afternoon, I saw appear  high CPU usage and numerous output drops. My first instinct was that  GBICs started dying so I replaced them on both siedes, no change.

Then, I thought we were victim of a DDoS but my graphs show no  increase of number of packets or things like that.

I have been  debugging this and found out this :

LNS1.IX1#sh interfaces gi0/2  controller
GigabitEthernet0/2 is up, line protocol is up
  Hardware is BCM1250  Internal MAC, address is 000b.fcdd.c41a (bia 000b.fcdd.c41a)
   Description: F=B, E=BB2.IX1, P=Gi9/1
  Internet address is 77.246.80.101/31
   MTU 9216 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability  255/255, txload 29/255, rxload 34/255
  Encapsulation 802.1Q Virtual  LAN, Vlan ID  1., loopback not set
  Keepalive set (10 sec)
  Full  Duplex, 1000Mbps, link type is auto, media type is SX
   output flow-control is XON, input flow-control is XON
  ARP type:  ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:00,  output hang never
  Last clearing of "show interface" counters  00:22:35
   Input queue: 1/150/0/0 (size/max/drops/flushes); Total output drops:  2712
  Queueing strategy: Class-based queueing
  Output  queue: 311/1000/0 (size/max total/drops)
  5 minute input rate  133959000 bits/sec, 28885 packets/sec
   5 minute output rate 115013000 bits/sec, 20137 packets/sec
      39283651 packets input, 1536817418 bytes, 0 no buffer
     Received  535 broadcasts (0 IP multicasts)
     0 runts, 0 giants, 0 throttles
      164 input errors, 0 CRC, 0 frame, 164 overrun, 0 ignored
      0 watchdog, 539 multicast, 0 pause input
     27456129 packets  output, 2625296538 bytes, 0 underruns
     0 output errors, 0  collisions, 0 interface resets
     0 babbles, 0 late collision, 0  deferred
     0 lost carrier, 0 no carrier, 0 pause output
      0 output buffer failures, 0 output buffers swapped out
Interface  GigabitEthernet0/2 (idb 0x50098218)
Hardware is BCM1250 Internal MAC  (Revision B2/B3)
  Network connection mode is AUTO
  network link  is up 
   Config is 1000Mbps, Full Duplex
  Selected media-type is GBIC
   GBIC type is 1000BaseSX
MAC Registers:
  mac_cfg =           0x000000C8000A0176, mac_thrsh_cfg = 0x0000080400084004
  mac_vlantag  =      0x0000000000000000, mac_frame_cfg = 0x241C400000280200
   mac_adfilter_cfg = 0x0000000000000E28, mac_enable    =  0x0000000000000C11
  mac_status =       0x0000000000040004,  mac_int_mask  = 0x00004F0000C300C3
  mac_txd_ctl =       0x000000000000000F, mac_eth_addr  = 0x00001AC4DDFC0B00
   mac_fifo_ptrs =    0x241C400000280200, mac_eopcnt    =  0x000044001B1B1B1B
  MAC RX is enabled  RX DMA - channel 0 is  enabled, channel 1 is disabled
  MAC TX is enabled  TX DMA - channel 0  is enabled, channel 1 is disabled
   Device status = 1000 Mbps, Full-Duplex
PHY Registers:
  PHY is  Marvell 88E1011S (Rev 1.3)
  Control                =  0x1000           Status                 = 0x796D
  PHY ID  1               = 0x0141           PHY ID 2               = 0x0C62
   Auto Neg Advertisement = 0x01A0           Link Partner Ability   =  0x4120
  Auto Neg Expansion     = 0x0000           Next Page  Tx           = 0x2001
  Link Partner Next Page = 0x0000            1000BaseT Control      = 0x0000
   1000BaseT Status       = 0x0000           Extended Status        =  0xC000
  PHY Specific Control   = 0x0008           PHY Specific  Status    = 0xAD04
  Interrupt Enable       = 0x6C00            Interrupt Status       = 0x0000
   Ext PHY Spec Control   = 0x0C64           Receive Error Counter  =  0x0000
  LED Control            = 0x4100
  Ext PHY Spec Control 2 =  0x006A           Ext PHY Spec Status    = 0xA017
  PHY says Link is  UP, Speed 1000Mbps, Full-Duplex [AUTONEG Done]
   Physical Interface - GBIC
  AUTONEG - Our ability is     1000M/FD  Pause Capable (Asymmetric)
  AUTONEG - Partner ability is 1000M/FD
GBIC  registers:
  Register 0x00:   01  04  01  00  00  00  01  20
   Register 0x08:   40  0C  01  01  0D  00  00  00
   Register 0x10:   37  1E  00  00  4F  45  4D  20
  Register 0x18:    20  20  20  20  20  20  20  20
  Register 0x20:   20  20  20  20  00   00  00  00
  Register 0x28:   47  42  49  43  2D  53  58  20
   Register 0x30:   20  20  20  20  20  20  20  20
   Register 0x38:   00  00  00  00  03  52  00  BA
  Register 0x40:    00  1A  00  00  42  31  30  34
  Register 0x48:   38  31  39  31  20   20  20  20
  Register 0x50:   20  20  20  20  30  39  30  39
   Register 0x58:   32  39  20  20  68  B0  01  5A
   Register 0x60:   20  20  20  20  20  20  20  20
  Register 0x68:    20  20  20  20  20  20  20  20
  Register 0x70:   20  20  20  20  20   20  20  20
  Register 0x78:   20  20  20  20  20  20  20  20
   PartNumber: GBIC-SX        
   PartRev: B
  SerialNo: B1048191       
  Options:  0
   Length(9um/50um/62.5um): 000/550/300
  Date Code: 090929 
   Gigabit Ethernet Codes:  1
Internal Driver Information:
   lc_ip_turbo_fs = 0x6236BE18, ip_routecache = 0x11 (dfs = 0/mdfs = 0)
   rx cache size = 1000, rx cache end = 15
  max_mtu = 9244
Software  MAC address filter(hash:length/addr/mask/hits):
need_af_check =  0
  0x00:  0  ffff.ffff.ffff  0000.0000.0000         0
  0x2E:   0  0900.2b00.0005  0000.0000.0000         0
   0x2F:  0  0900.2b00.0004  0000.0000.0000         0
  0x5C:  0   0100.5e00.0002  0000.0000.0000         0
  0xC0:  0  0100.0ccc.cccc   0000.0000.0000         0
  0xD6:  0  0180.c200.0014   0000.0000.0000         0
   0xD7:  0  0180.c200.0015  0000.0000.0000         0
  0xE6:  0   000b.fcdd.c41a  0000.0000.0000         0
  ring sizes: RX = 128, TX =  256
  rx_particle_size: 512
Rx Channel 0:
  dma_config0 =      0x0010002000800888, dma_config1 =    0x002D000000600029
   dma_dscr_base =   0x000000000C218A40, dma_dscr_cnt =    0x0000000000000080
  dma_cur_dscr_a =  0x000010000C29FC82,  dma_cur_dscr_b = 0x02D4000000000001
  dma_cur_daddr  =   0x000080000C2190E0
  rxring = 0x0C218A40, shadow = 0x50098FA4, head =  26 (0x0C218BE0)
   rx_overrun=78512, rx_nobuffer=0, rx_discard=0
  Error Interrupts:  rx_int_dscr = 0, rx_int_derr = 0, rx_int_drop = 53
Tx Channel 0:
   dma_config0 =     0x0000000001001088, dma_config1 =     0x00B6000000000010
   dma_dscr_base =   0x000000000C219280, dma_dscr_cnt =    0x0000000000000000
  dma_cur_dscr_a =  0x00000F000C27F980,  dma_cur_dscr_b = 0x0000000000000000
  dma_cur_daddr  =   0x000000000C219430
  txring = 0x0C219280, shadow = 0x657D0A14, head =  164, tail = 165, tx_count = 1
   Error Interrupts: tx_int_dscr = 0, tx_int_derr = 0, tx_int_dzero = 0
   chip_state = 2, ds->tx_limited = 0
  throttled = 0, enabled = 0,  disabled = 0
  reset=6(init=1, restart=5), auto_restart=1
   tx_underflow = 0, tx_overflow = 0
   rx_underflow = 0, rx_overflow = 0, filtered_pak=0
  descriptor  mismatch = 0, fixed alignment = 52530
  bad length = 0 dropped, 0  corrected
  unexpected sop = 0
Address Filter:
  Promiscuous  mode OFF
  Exact match table (for unicast, maximum 8 entries):
     Entry 0 MAC Addr = 000b.fcdd.c41a
    (All other entries are  empty)
  Hash match table (for multicast, maximum 8 entries):
     Entry 0 MAC Addr = 0100.0ccc.cccc
    Entry 1 MAC Addr =  0900.2b00.0004
    Entry 2 MAC Addr = 0900.2b00.0005
     Entry 3 MAC Addr = 0180.c200.0014
    Entry 4 MAC Addr =  0180.c200.0015
    Entry 5 MAC Addr = 0100.5e00.0002
    (All  other entries are empty)
Statistics:
  Rx Bytes                  19767582054   Tx Bytes                 19944011802
   Rx Good Packets             27498925   Tx Good Packets              27480126
  Rx Multicast                     544
  Rx  Broadcast                       0

  Rx Bad Pkt  Errors                  0   Tx Bad Pkt Errors                  0
   Rx FCS Errors                      0   Tx FCS  Errors                      0
  Rx Runt Errors                      0   Tx Runt Errors                     0
  Rx Oversize  Errors                 0   Tx Oversize Errors                 0
   Rx Length Errors                   0   Tx  Collisions                      0
  Rx Code  Errors                     0   Tx Late Collisions                 0
   Rx Dribble Errors                  0   Tx Excessive  Collisions            0
                                          Tx Abort  Errors                    0


My queuing strategy  went from FIFO to Class-based queuing !?! How is that possible ?


Some might this could be related to CPU but it's not :

LNS1.IX1#sh processes cpu history
                             
                                
    2222222223333322222444444444455555333332222222222222222222
     6666666666666699999444449999911111666666666688888888888888
100                                                           
90                                                           
80                                                           
  70                                                          
60                                                           
50                         **********                       
40           *****     ********************                  
  30 **********************************************************
20  **********************************************************
10  **********************************************************
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5   
                CPU% per second (last 60 seconds)
                                                              
    4434344555655555455455555454555555555555555555555555565555
     8253536886175400933900221909246456366223331226663324503316
100                                                           
90                                                           
80                                                           
  70                                                          
60         ******                 * ** **        ***    **   *
50 *      **###########*####################################*#
40  *******#################################################*#
  30 ##########################################################
20  ##########################################################
10  ##########################################################
    0....5....1....1....2....2....3....3....4....4....5....5....
              0    5    0    5    0    5    0    5    0    5   
                CPU% per minute (last 60 minutes)
              * = maximum CPU%   #  = average CPU%
                                                                          
     6567544433244568899987786666543422242347557976575555443422242335455655
     1917692522895458476603000342305575449131462755826358236305308599538984
100                   ***                       *                         
  90                * ***                        *                         
80    *           ******   *                   **                        
70    *           ***###****               *  **** *                   * 
  60 *****         *#######******           * ******** **           *   ***
50 ####** *   ***##############*  *        *************           *******
40 ######**    **################***   *  *######***********   * ***####**
  30 ########****#################***** ***##############**** *  ***########
20 #################################*######################***###########
10  ######################################################################
    0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
              0    5    0    5    0    5    0    5    0    5    0    5    0
                    CPU% per hour (last 72 hours)
                  * = maximum CPU%   #  = average CPU%

LNS1.IX1#sh processes cpu sorted | e 0.00
CPU utilization for  five seconds: 32%/28%; one minute: 37%; five minutes: 36%
PID  Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
167    148648548 669722691        221  1.11%  0.94%  0.71%   0 LFDp Input  Proc 
   92    84501540 658096034        128  0.55%  0.48%  0.38%   0 IP  Input        
158    29882772  19491105       1533  0.47%  0.50%   0.50%   0 CEF: IPv4 proces
  44    21898608   2506816       8735   0.31%  0.16%  0.17%   0 Net Background  
  163      1811203169702554          0  0.23%  0.15%  0.16%   0 HQF Input  Shaper
230    13727516   2560015       5362  0.23%  0.19%  0.18%    0 Compute load avg
238     4050992  17914481        226  0.15%   0.04%  0.05%   0 L2TP mgmt daemon
  237     3963624 303695077         13  0.15%  0.14%  0.15%   0 L2X Data  Daemon 
260     1429340 404591201          3  0.15%  0.08%  0.08%    0 PPP Events      
162      2102403169697698          0  0.15%   0.28%  0.27%   0 HQF Shaper Backg
  234       10364      4211       2461  0.07%  0.08%  0.03%   2 Virtual  Exec    
  84       61120 398983803          0  0.07%  0.03%   0.02%   0 ACCT Periodic Pr
254      126412 399597741          0   0.07%  0.06%  0.07%   0 PPP manager


I have been digging around in the archives, looks like I'm victim of  microburst trafic :

http://puck.nether.net/pipermail/cisco-nsp/2009-April/060158.html


You'll find attached some screenshots from my Cacti Box. FYI, the GAP at around 9:30P.M corresponds to when I rerouted the trafic to the other interface and changed the GBICs.


Any  ideas on what might be causing this ?

Thanks.

Regards.

Y.
  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 0 (0 ratings)
Loading.
Giuseppe Larosa Thu, 04/22/2010 - 12:06

Hello Youssef,

>> CPU utilization for  five seconds: 32%/28%; one minute: 37%; five  minutes: 36%

most of cpu is used by SW interrupts this means that most of traffic is process switched, this could be caused by the specific role of the device or it can be a problem

use the following link as a reference for troubleshooting this:

http://www.cisco.com/en/US/partner/products/hw/routers/ps359/products_tech_note09186a00801c2af0.shtml

About output drops:

they are few in comparison with total output packets

because you have CBWFQ you can use sh policy-map interface gi0/2 to see what traffic class suffers drops.

>> Output  queue: 311/1000/0 (size/max total/drops)

this is interesting there are too many packets in queue for the traffic offered but packets are not dropped here you see drops=0 they may be dropped by CBWFQ so again the sh policy-map int gi0/2 can be helpful

Hope to help

Giuseppe

user@720dg Fri, 04/23/2010 - 02:10

Hello,

As I have said before, the equipment in question is used as an LNS in order to terminate DSL links (PPPoVPDN).

The interesting thing is that the problem stopped on this equipment and  moved to another identical one.

As for your diagnostic, I have come to the same conclusion. I think one of my clients' link is sending huge amounts of malformed packets to the router, probably zombie PCs infected with a virus or something like that.

It's not a matter of DATA volume because the gigabit link is far from being overloaded. It's not a matter of number of packets because my graphs show smooth curves. It's more a problem of crappy packets.

I have to try to pin point the source link that is causing the packets packets to be process switched instead of being fast switched.

Is there any magical command to protect my CPU from this ? Or something that could help me pin point the source of the packets ? Some said netflow...

Another thing is that my interface changed her queueuing strategy from FIFO to CBFWQ by herself !?! Any magical command to come-back to FIFO ? I think no fair-queue under the interface makes the trick.

Thanks for the feedback.

Y.

Giuseppe Larosa Fri, 04/23/2010 - 02:46

Hello Youssef,

>> Another thing is that my interface changed her queueuing strategy from  FIFO to CBFWQ by herself !?

there is no service-policy out configured on gi0/2? it would be strange

I agree with your analysis malformed packets may be the cause of the problem.

If supported on your devices you could think of using control plane policing to rate-limit IP packets with options in header to protect main cpu

Hope to help

Giuseppe

user@720dg Fri, 04/23/2010 - 02:55

Hello again,

No, no service policies configured or applied on my interfaces !?! Weird huh !!!

As for CoPP, I will have to do some research about it... This needs to be studied before being implemented...

Thanks.

Y.

Actions

This Discussion