How to monitor / troubleshoot ISR-G2 crypto engine under load

 

 

Systems under test

 

This document covers the ISR G2 Cisco 3945 and Cisco 3945e routers with the on-board crypto engine.

Both devices are connected back to back via GigabitEthernet0/1.

An Agilent packet injector is connected to GigabitEthernet0/2 on both routers.

 

The IOS version tested in this lab is 15.1(4)M5.

 

The traffic pattern sent to the routers is bursty [ bursts of 1000 packets sent at 1 Gbit/s ].

This flow goes through features such as complex [ parent/child ] QoS, GRE encapsulation, and encryption / decryption.

 

Crypto engine command output may differ on a 1900 or 2900; still, the same logic applies on other platforms.

 

1. 3945 - what to check

 

  • show proc cpu

 

 

bsns-3945-2#sh proc cpu | e 0.00%

CPU utilization for five seconds: 64%/63%; one minute: 62%; five minutes: 72%

PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process

  16     1384592      494466       2800  0.31%  0.06%  0.02%   0 Environmental mo

  69     4263712      528295       8070  0.07%  0.09%  0.07%   0 Per-Second Jobs 

  87      624592     1977539        315  0.15%  0.13%  0.11%   0 Netclock Backgro

105      327624    60431937          5  0.07%  0.09%  0.09%   0 SEC BATCH       

109      155144    62150729          2  0.07%  0.11%  0.13%   0 Ethernet Msec Ti

209      359008   120354323          2  0.15%  0.13%  0.14%   0 HQF Shaper Backg

291      127524    24465492          5  0.07%  0.05%  0.06%   0 MFIB_mrib_write 

295        2200         810       2716  0.15%  0.19%  0.08% 644 Virtual Exec    

326      326172    11161929         29  0.07%  0.04%  0.06%   0 PIM Process 

 

Here the CPU reports an average usage of 64% with 63% under interrupt [ aka packets routed by CEF ].

 

We will see later that the router is in fact oversubscribed.

 

Key learning: CPU usage is reported as an average calculated over 5 seconds.

This is something to always keep in mind, since the real-time CPU utilization might reach 100% for a few msec and this will not be seen in any show command.

 

Because of this averaging, you should never take the CPU usage 'as is'; you have to understand the traffic pattern in order to get the right picture.
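
For a slightly finer-grained view, the CPU history graphs can help spot short spikes that the 5-second average hides [ note that even the per-second graph will still miss spikes lasting only a few msec ]:

3945-2#show processes cpu history

The first graph shows per-second samples for the last 60 seconds, the second per-minute samples for the last hour, and the third per-hour samples for the last 72 hours.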

 

  • show platform interrupt

 

 

By design, interrupts [ aka packet forwarding under CEF ] have the highest priority from a scheduling standpoint.

 

Under load, a safeguard mechanism gives the other IOS processes a few msec of CPU time. This is what interrupt throttling is used for.

 

3945-2#sh platform interrupt  | section Interrupt Throttling

Interrupt Throttling:

  Throttle Count    =    31065   Timer Count            =    31060

  Netint(usec)      =    20000   Netint Mask(usec)      =     1000

  Real Netint(usec) =     4000   Real Netint Mask(usec) =      200

  Active            =        0   Configured             =        1

  Longest IRQ(usec) =     8856

When 4000 usec [ 4 msec ] of continuous traffic under interrupt is reached, Net I/O interrupts are throttled for 200 usec, giving some CPU time to processes.
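
In other words, under sustained interrupt load, processes are only guaranteed roughly 200 / ( 4000 + 200 ) ≈ 5% of the CPU time; the safeguard keeps the box responsive enough to manage, but it cannot make up for sustained oversubscription.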

 

If we take a few consecutive outputs, we can see the throttle count increasing:

 

 

3945-2#terminal exec prompt timestamp

3945-2#show plat interrupt | i Time | Throttle Count

Time source is hardware calendar, 08:04:14.637 UTC Mon Oct 15 2012

  Throttle Count    =    37313   Timer Count            =    37308

bsns-3945-2#show plat interrupt | i Time | Throttle Count

Time source is hardware calendar, 08:04:17.513 UTC Mon Oct 15 2012

  Throttle Count    =    37359   Timer Count            =    37354

When we throttle [ aka ignore ] an interrupt, packets are queued on the interface and most likely dropped as overruns [ aka not polled quickly enough from the interface ].
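
To track the throttle count over time without sitting on the CLI, a small EEM applet can log it to syslog periodically. This is only a sketch [ the applet name and the 10-second interval are arbitrary, and it assumes EEM is available on the image ]:

event manager applet LOG-INT-THROTTLE
 event timer watchdog time 10
 action 1.0 cli command "enable"
 action 2.0 cli command "show platform interrupt | include Throttle Count"
 action 3.0 syslog msg "$_cli_result"

Comparing the logged values against the traffic bursts gives a better picture than a single snapshot.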

 

  • show platform error

 

bsns-3945-2#show plat error

Load for five secs: 64%/63%; one minute: 64%; five minutes: 64%

Time source is hardware calendar, 08:11:10.301 UTC Mon Oct 15 2012

 

Interface GigabitEthernet0/1:

----------------------------

Input Errors            : 17529

Input Overrun           : 17529

 

Interface GigabitEthernet0/2:

----------------------------

Input Errors            : 489481

Input Overrun           : 489481



 

Indeed, we can see traffic getting dropped on GigabitEthernet0/1 due to the burstiness of the flow. We can cross-check with the interface statistics:

 

 

3945-2#sh int gig0/1

Load for five secs: 64%/63%; one minute: 60%; five minutes: 60%

Time source is hardware calendar, 08:14:53.261 UTC Mon Oct 15 2012

 

GigabitEthernet0/1 is up, line protocol is up

  Hardware is PQ3_TSEC, address is 0007.7dfb.6e01 (bia 0007.7dfb.6e01)

  Description: Connected to bsns-2960-1

  Internet address is 172.146.1.2/30

  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,

     reliability 169/255, txload 4/255, rxload 1/255

  Encapsulation ARPA, loopback not set

  Keepalive set (10 sec)

  Full Duplex, 1Gbps, media type is RJ45

  output flow-control is unsupported, input flow-control is unsupported

  ARP type: ARPA, ARP Timeout 04:00:00

  Last input 00:00:00, output 00:00:00, output hang never

  Last clearing of "show interface" counters 00:03:45

  Input queue: 0/75/54785/0 (size/max/drops/flushes); Total output drops: 0

  Queueing strategy: Class-based queueing

  Output queue: 0/1000/0 (size/max total/drops)

  30 second input rate 2612000 bits/sec, 2177 packets/sec

  30 second output rate 17442000 bits/sec, 14535 packets/sec

     463075 packets input, 69454186 bytes, 0 no buffer

     Received 32 broadcasts (0 IP multicasts)

     0 runts, 0 giants, 392 throttles

     1082575 input errors, 0 CRC, 0 frame, 1082575 overrun, 0 ignored

     0 watchdog, 12 multicast, 0 pause input

     2407704 packets output, 361152652 bytes, 0 underruns

     0 output errors, 0 collisions, 0 interface resets

     8 unknown protocol drops

     0 babbles, 0 late collision, 0 deferred

     0 lost carrier, 0 no carrier, 0 pause output

     0 output buffer failures, 0 output buffers swapped out

 

We can see the overruns increasing and that this interface is being throttled. [ We see similar output on Gig0/2 ].
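
The same output also shows input queue drops [ 0/75/54785/0 ]. Those can be reduced by raising the input hold queue, but this does not help with overruns, which happen before the packet ever reaches the input queue. A sketch [ the value 4096 is only an example ]:

3945-2(config)#interface GigabitEthernet0/1
3945-2(config-if)#hold-queue 4096 in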

  • Crypto engine statistics

3945-2#sh crypto engine accelerator statistic | i Error

Errors:

Pad Error                    = 0

Data Error                   = 0

Packet Error                 = 0

Null IP Error                = 0

Hardware Error               = 0

Link Down Error              = 0

Queue Full Error             = 0

API Request Error            = 0

Invalid Flow Error           = 0

QOS Queue Full Error         = 0

Packet too Big Error         = 0

Too Many Particles Error     = 0

Input Queue Full Error       = 0

Output Queue Full Error      = 0

Pre-batch Queue Full Error   = 0

Post-batch Queue Full Error  = 0

 

In this test scenario, the crypto engine is still in a healthy state since no drops are reported. Depending on the traffic pattern, it is still possible to fill up the queue [ eg a burst of packets to decrypt ].

 

3945-2#sh crypto engine accelerator ring pool | i used

The no entries currently being used   = 996

 

On the 3945, the number of entries used on the crypto engine encryption/decryption ring is 1024 [ queue length ] - 996 [ remaining free queue items ] = 28.

 

2. What to check on the 3945e

 

On a 3945e, the show crypto engine accelerator statistic command gives a slightly different output.

 

 

  • show proc cpu

 

 

3945e-1#sh proc cpu | e 0.00%

CPU utilization for five seconds: 19%/17%; one minute: 18%; five minutes: 32%

PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process

  16     1392544      500615       2781  0.63%  0.12%  0.06%   0 Environmental mo

  69     4264372      534582       7977  0.07%  0.10%  0.08%   0 Per-Second Jobs 

  87      626064     2002190        312  0.23%  0.19%  0.18%   0 Netclock Backgro

109      156864    62926488          2  0.07%  0.07%  0.07%   0 Ethernet Msec Ti

209      360772   121869657          2  0.07%  0.08%  0.09%   0 HQF Shaper Backg

291      127736    24772141          5  0.07%  0.03%  0.04%   0 MFIB_mrib_write 

326      327060    11302145         28  0.07%  0.04%  0.05%   0 PIM Process     

 

CPU usage looks much better. Still, since the traffic is bursty, we need to check whether there is any interrupt throttling.

 

 

  • show platform interrupt

 

bsns-3945e-1#sh platform interrupt  | section Throttl                         

Interrupt Throttling Statistics:

Throttle Count   = 00002186   Timer Count = 00002186

A few interrupts have been throttled, but it never reached the point of drops [ packets were simply queued and then processed ].

There is no show platform error on the 3945e platform. Still, we can gather the information from the traditional show interface command.

 

  • show interface

 

3945e-1#sh interfaces  | i Gigabit|over

GigabitEthernet0/0 is up, line protocol is up

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

GigabitEthernet0/1 is up, line protocol is up

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

GigabitEthernet0/2 is up, line protocol is up

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

Show interface does not reveal any issues.

 

 

  • Crypto engine statistics

 

3945e-1#sh crypto engine accelerator statistic  | section ipsec

 

        Received pkts(ipsec)          : 930037

        Reinjected pkts(ipsec)        : 930036

        Dropped pkts(ipsec)           : 0

        Drop over capacity(ipsec)     : 0

        Drop no particle(ipsec)       : 0

        Drop no session(ipsec)        : 0

        Errors(ipsec)                 : 0

        Pending pkts(ipsec)           : 0

 

On the 3945e, no issues are seen. On this platform, the encryption/decryption queue is empty most of the time.

 

 

3945e-1#sh crypto engine accelerator ring pool

 

 

Device:   Onboard VPN

Location: Onboard: 0

The Crypto Packet IPSEC Queue Information

==========================================

The Queuesize is                          :2048

The no of entries currently being used    : 0

The Read Index is                         :1809

The Write Index is                        :1809



Conclusion:

 

Even if, on a 5-second average, the CPU usage is not at 90+%, the nature of the traffic can create a port-speed burst that lasts a few msec.

 

During that time, a router may have issues polling each interface [ which leads to overruns].

 

Always size your router according to the nature of the traffic, or shape the flow on the next-hop device on the LAN side [ in order to queue packets instead of dropping them as overruns on the IPsec gateway ].
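
As an illustration of the second option, a simple MQC shaping policy on the upstream LAN device could look like the sketch below [ the policy name and the 100 Mbit/s rate are arbitrary, and it assumes the next-hop device is an IOS router that supports MQC shaping ]:

policy-map SHAPE-TO-VPN-GW
 class class-default
  shape average 100000000
!
interface GigabitEthernet0/1
 service-policy output SHAPE-TO-VPN-GW

The shaper buffers the bursts so that they arrive at the IPsec gateway at a rate its crypto engine and interfaces can sustain.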

Comments

Great contribution Olivier, thanks!!
