Troubleshooting and understanding Task/Process failures, Session Recovery, and PAC/PSC migrations and SPC/SMC switchovers

Document

Jan 24, 2012 3:23 PM
Jan 24th, 2012

Overview.. 1

Basic hardware and software architecture. 1

Session Recovery. 8

Task Warnings and Alarms. 11

NON-critical Task Failures. 12

Migrations/Switchovers. 14

Critical Task Failures with or without migration. 16

No Migration and failed card fails to reboot successfully and remains offline. 16

No Migration and failed card reboots successfully. 19

Migration with successful boot of failed card. 20

Migration with failed boot of failed card. 21

Standby SMC reboot 22

SMC Switchover 23

Overview

This article explains the basics of PAC/PSC migrations and SPC/SMC switchovers, along with session/task failure recovery when hardware or software failures occur, how to recognize when such actions take place, and what actions can/should be taken as a result. It includes all the relevant CLI commands that can be run, what to look for in the output and how to interpret it, and what to report to Starent.

The STxx uses a redundant architecture, so that if the hardware and/or software fails for any reason, the system can continue to run with minimal impact. A failure could be software or hardware based, or even a combination, and the ultimate action to be taken as a result could be the issuance of an RMA for hardware or a new software build containing a fix for a software bug. The steps of detecting the failure, understanding the recovery action taken and what the impact was, and determining whether or not it needs further addressing are covered here.

Basic hardware and software architecture

The following is a very basic discussion of the system architecture, including card types, tasks,

The STxx chassis houses slots numbered 1 through 48, containing cards types as follows:

PAC/PSC: 1 – 7, 10 – 16

SPC/SMC: 8, 9

Linecards: 17 – 23, 26 – 39, 42 – 48 (two rows, with redundant cards located directly underneath active ones)

SPIO cards: 24, 25

RCC: 40, 41

See the System Overview manual for a much more in depth description of the chassis components, including diagrams, as this is not the focus of this article.

To see a list of all the cards in the system, run “show card table all”. Note the output below indicates an ST16 because the card types are PAC and SPC instead of PSC and SMC. Note the Attach column indicates which Linecard is “attached” to a given PAC card. Normally the linecard directly behind a PAC card (PAC card number + 16) will be attached by default, and that linecard’s redundant card will be numbered + 16.  Note all the PAC cards are active except for one that is standby. A standby card is available in case there is a failure of an active card, and it will be made the active card in case one of the active cards fail – this will be discussed further in this article.

******** show card table all *******

Slot         Card Type                         Oper State     SPOF  Attach

-----------  --------------------------------  -------------  ----  ------

1: PAC/TAC  None                              -              -       -  -

2: PAC/TAC  Packet Accelerator Card           Active         No     18 34

3: PAC/TAC  Packet Accelerator Card           Active         No      -  -

4: PAC/TAC  Packet Accelerator Card           Active         No      -  -

5: PAC/TAC  None                              -              -       -  -

6: PAC/TAC  Packet Accelerator Card           Active         No      -  -

7: PAC/TAC  None                              -              -       -  -

8: SPC      Switch Processor Card             Active         No     24 25

9: SPC      Switch Processor Card             Standby        -       -  -

10: PAC/TAC  None                              -              -       -  -

11: PAC/TAC  Packet Accelerator Card           Active         No      -  -

12: PAC/TAC  None                              -              -       -  -

13: PAC/TAC  Packet Accelerator Card           Active         No      -  -

14: PAC/TAC  None                              -              -       -  -

15: PAC/TAC  Packet Accelerator Card           Standby        -       -  -

16: PAC/TAC  None                              -              -       -  -

17: LC       None                              -              -       -  

18: LC       1000 Ethernet Line Card           Active         No      2  

19: LC       None                              -              -       -  

20: LC       None                              -              -       -  

21: LC       None                              -              -       -  

22: LC       None                              -              -       -  

23: LC       None                              -              -       -  

24: SPIO     Switch Processor I/O Card         Active         No      8  

25: SPIO     Switch Processor I/O Card         Standby        -       8  

26: LC       None                              -              -       -  

27: LC       None                              -              -       -  

28: LC       None                              -              -       -  

29: LC       None                              -              -       -  

30: LC       None                              -              -       -  

31: LC       None                              -              -       -  

32: LC       None                              -              -       -  

33: LC       None                              -              -       -  

34: LC       1000 Ethernet Line Card           Standby        -       2  

35: LC       None                              -              -       -  

36: LC       None                              -              -       -  

37: LC       None                              -              -       -  

38: LC       None                              -              -       -  

39: LC       None                              -              -       -  

40: RCC      Redundancy Crossbar Card          Standby        -          

41: RCC      Redundancy Crossbar Card          Standby        -          

42: LC       None                              -              -       -  

43: LC       None                              -              -       -  

44: LC       None                              -              -       -  

45: LC       None                              -              -       -  

46: LC       None                              -              -       -  

47: LC       None                              -              -       -  

48: LC       None                              -              -       -   

Other useful commands for troubleshooting and collecting information about cards are:

show card hardware

show card info

show card diag

The SMC (System Management Card, ST40) or SPC (System Processor Card, ST16) serve as the primary controller for the chassis and is responsible for initializing the entire system and loading the system configuration into other cards in the chassis. For redundancy, there are two of these cards in a system, one active and the other standby, and they reside in slots 8 and 9.

The PSC (Packet Services Card, ST40) or PAC (Packet Accelerator Card, ST16) is the guts of the system that does the subscriber packet processing and forwarding, implements the various protocols required to setup, teardown, and maintain calls, manages statistics, and forwards data traffic to the destination.

The ST16 contains 4 CPUs (control processors) while the ST40 contains 2 CPUs, and numbering starts at 0 (not 1). Both contain an NPU (network processor).

Tasks or Processes (interchangeable terms) run on PSC/PAC and SMC/SPC cards and can be viewed with the command “show task table”, while the amount of resources (CPU and memory) used by those tasks can be viewed with “show task resources”. An explanation of all the possible tasks that can be running, their function, how they are distributed across all cards, and all of their interdependencies is complex and beyond the scope of this document, but there is some basic information worth being aware of. Chapter 3, Software Architecture, in the System Overview manual, has an example high level software architecture block diagram.

Significant Tasks, some of which may be mentioned further are:

vpnmgr – Virtual Private Network – controls contexts

hatcpu – High Availability Task

sitmain – System Initialization Task

a11mgr – A11 protocol for call setup between PDSN and PCF

aaamgr – Radius protocol for authentication and accounting between PDSN and Radius servers

sessmgr – maintaining knowledge of subscriber sessions

acsmgr – processing for active charging service (merged into sessmgr starting in v8.1)

npumgr – processing of inbound and outbound packets and intra chassis packet routing

famgr – foreign agent service processing

hamgr – home agent service processing

The following only run on SPC/SMC:

csp – Card/Slot/Port – coordinates events occurring when cards inserted/locked/removed/shutdown/migrated

rct – Recovery Control Task – executes a defined recovery action for a failure

hatsystem – monitors all hatcpu tasks on PACs

rmmgr – Recource Manager – assigns resources to every task

cli – one started for each CLI session

bulkstat – processes bulkstat data

snmp – processes SNMP traps

orbs – communicates with EMS server

Each sessmgr task is responsible for a specific number of subscriber sessions, with a maximum number (6600 in example below) depending on the chassis type and licensing. Sessmgr tasks are spawned across all cards and sessions are load balanced across them. Here is an example from “show task resources” showing this on an ST16 (4 CPUs/card). Note the sessmgr processes on CPUs 1 - 3 are housing around 800 sessions each, while CPU 0 sessmgrs are housing slightly less. This same allocation pattern would be seen across all active PAC cards, as can be seen with the included output from CPU 4/1 (SR 7047)

******** show task resources *******

                   task   cputime        memory     files      sessions

cpu facility      inst used allc   used  alloc used allc  used  allc S status

----------------------- --------- ------------- --------- ------------- ------

3/0 sitmain         30 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

3/0 sitparent       30 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

3/0 hatcpu          30 0.1%  10%   1.6M   5.0M   13  500    --    -- -   good

3/0 rmmgr           30 1.7%  10%   3.3M   7.0M   23  500    --    -- -   good

3/0 dhmgr           30 0.0%   5%   2.0M   6.0M   18 2000    --    -- -   good

3/0 dcardmgr         3 0.0%  60%  12.7M  20.0M   13  500    --    -- -   good

3/0 npumgr           3 7.5% 100% 117.5M 225.3M   35 1000    --    -- -   good

3/0 sft            300 6.9%  50%   2.8M  12.0M   43  500    --    -- -   good

3/0 sessmgr       5013 0.1%  50%  18.1M  85.0M   18  500    --    -- S   good

3/0 aaamgr        5014 0.0%  50%   6.5M  20.0M   16  500    --    -- -   good

3/0 sessmgr          4 5.1% 100%  81.4M 155.0M   28  500   505  6600 I   good

3/0 sessmgr         31 6.0% 100%  79.4M 155.0M   27  500   519  6600 I   good

3/0 sessmgr         51 6.2% 100%  80.0M 155.0M   29  500   537  6600 I   good

3/0 aaamgr          74 2.0%  94%  39.6M  77.0M   18  500    --    -- -   good

3/0 aaamgr          79 2.0%  94%  39.4M  77.0M   18  500    --    -- -   good

3/0 aaamgr          85 1.6%  94%  39.6M  77.0M   19  500    --    -- -   good

3/0 aaamgr          94 1.9%  94%  39.4M  77.0M   18  500    --    -- -   good

3/0 sitreap    3000509 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

3/0 msgd       3000507 0.9%   5%   0.8M   5.0M    6  100    --    -- -   good

3/0 nscontrol  3030563 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

3/1 sitmain         31 0.1%  15%   1.8M   6.0M   14 1000    --    -- -   good

3/1 sitparent       31 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

3/1 hatcpu          31 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

3/1 rmmgr           31 2.1%  10%   3.2M   7.0M   26  500    --    -- -   good

3/1 dhmgr           31 0.0%   5%   1.8M   6.0M   18 2000    --    -- -   good

3/1 sessmgr       5002 0.2%  50%  18.0M  85.0M   18  500    --    -- S   good

3/1 aaamgr        5003 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

3/1 aaamgr           1 1.7%  94%  38.8M  77.0M   19  500    --    -- -   good

3/1 aaamgr           2 1.3%  94%  35.9M  77.0M   19  500    --    -- -   good

3/1 aaamgr           3 1.8%  94%  39.3M  77.0M   18  500    --    -- -   good

3/1 sessmgr         18  10% 100%  93.2M 155.0M   29  500   799  6600 I   good

3/1 sessmgr         23 7.8% 100%  93.3M 155.0M   29  500   786  6600 I   good

3/1 aaamgr          52 1.4%  94%  36.0M  77.0M   19  500    --    -- -   good

3/1 sessmgr         56 7.6% 100%  91.0M 155.0M   28  500   787  6600 I   good

3/1 sessmgr         61 9.2% 100%  90.4M 155.0M   29  500   798  6600 I   good

3/1 sessmgr         81 8.9% 100%  91.1M 155.0M   28  500   802  6600 I   good

3/1 sessmgr         93 9.5% 100%  92.0M 155.0M   28  500   783  6600 I   good

3/1 aaamgr         102 2.1%  94%  38.7M  77.0M   19  500    --    -- -   good

3/1 sitreap    3100472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

3/1 msgd       3100470 1.0%   5%   0.8M   5.0M    6  100    --    -- -   good

3/1 nscontrol  3130563 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

3/2 sitmain         32 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

3/2 sitparent       32 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

3/2 hatcpu          32 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

3/2 rmmgr           32 2.2%  10%   3.4M   7.0M   25  500    --    -- -   good

3/2 dhmgr           32 0.0%   5%   1.8M   6.0M   18 2000    --    -- -   good

3/2 sessmgr       5021 0.2%  50%  17.9M  85.0M   18  500    --    -- S   good

3/2 aaamgr        5022 0.0%  50%   6.5M  20.0M   16  500    --    -- -   good

3/2 aaamgr           5 1.5%  94%  36.1M  77.0M   18  500    --    -- -   good

3/2 sessmgr          7  10% 100%  93.1M 155.0M   29  500   813  6600 I   good

3/2 aaamgr          25 1.6%  94%  38.5M  77.0M   19  500    --    -- -   good

3/2 sessmgr         35  10% 100%  95.2M 155.0M   26  500   810  6600 I   good

3/2 sessmgr         41  10% 100%  95.5M 155.0M   28  500   803  6600 I   good

3/2 aaamgr          55 1.5%  94%  36.2M  77.0M   19  500    --    -- -   good

3/2 aaamgr          62 1.8%  94%  39.4M  77.0M   20  500    --    -- -   good

3/2 sessmgr         66 8.3% 100%  93.1M 155.0M   28  500   804  6600 I   good

3/2 aaamgr          84 2.0%  94%  38.6M  77.0M   19  500    --    -- -   good

3/2 sessmgr         86 7.5% 100%  93.3M 155.0M   28  500   804  6600 I   good

3/2 sessmgr         98 8.6% 100%  94.1M 155.0M   27  500   784  6600 I   good

3/2 sitreap    3200472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

3/2 msgd       3200470 1.0%   5%   0.8M   5.0M    6  100    --    -- -   good

3/2 nscontrol  3230563 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

3/3 sitmain         33 0.1%  15%   1.8M   6.0M   14 1000    --    -- -   good

3/3 sitparent       33 0.0%  20%   1.3M   4.0M   12  500    --    -- -   good

3/3 hatcpu          33 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

3/3 rmmgr           33 2.1%  10%   3.4M   7.0M   25  500    --    -- -   good

3/3 dhmgr           33 0.0%   5%   2.0M   6.0M   18 2000    --    -- -   good

3/3 sessmgr       5026 0.2%  50%  18.1M  85.0M   18  500    --    -- S   good

3/3 aaamgr        5027 0.0%  50%   6.5M  20.0M   16  500    --    -- -   good

3/3 aaamgr          11 1.8%  94%  39.6M  77.0M   19  500    --    -- -   good

3/3 sessmgr         13 9.4% 100%  92.3M 155.0M   28  500   802  6600 I   good

3/3 aaamgr          21 1.3%  94%  35.6M  77.0M   18  500    --    -- -   good

3/3 aaamgr          34 2.0%  94%  39.5M  77.0M   19  500    --    -- -   good

3/3 aaamgr          38 2.2%  94%  38.6M  77.0M   19  500    --    -- -   good

3/3 sessmgr         40  10% 100%  91.3M 155.0M   29  500   802  6600 I   good

3/3 sessmgr         46  10% 100%  91.7M 155.0M   30  500   814  6600 I   good

3/3 aaamgr          70 2.0%  94%  39.3M  77.0M   19  500    --    -- -   good

3/3 sessmgr         71 8.3% 100%  90.9M 155.0M   29  500   808  6600 I   good

3/3 sessmgr         80 7.4% 100%  91.5M 155.0M   29  500   793  6600 I   good

3/3 sessmgr         90 8.6% 100%  91.3M 155.0M   28  500   801  6600 I   good

3/3 sitreap    3300472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

3/3 msgd       3300470 1.0%   5%   0.8M   5.0M    6  100    --    -- -   good

3/3 nscontrol  3330557 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

......

4/1 sitmain         41 0.1%  15%   1.8M   6.0M   14 1000    --    -- -   good

4/1 sitparent       41 0.0%  20%   1.5M   4.0M   12  500    --    -- -   good

4/1 hatcpu          41 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

4/1 rmmgr           41 2.1%  10%   3.3M   7.0M   26  500    --    -- -   good

4/1 dhmgr           41 0.0%   5%   1.9M   6.0M   18 2000    --    -- -   good

4/1 sessmgr       5004 0.2%  50%  18.0M  85.0M   18  500    --    -- S   good

4/1 aaamgr        5005 0.1%  50%   6.5M  20.0M   16  500    --    -- -   good

4/1 sessmgr          1 7.3% 100%  93.1M 155.0M   30  500   793  6600 I   good

4/1 sessmgr         24  10% 100%  91.5M 155.0M   27  500   804  6600 I   good

4/1 aaamgr          37 2.2%  94%  38.7M  77.0M   19  500    --    -- -   good

4/1 aaamgr          41 2.2%  94%  39.5M  77.0M   19  500    --    -- -   good

4/1 aaamgr          44 2.2%  94%  39.8M  77.0M   20  500    --    -- -   good

4/1 aaamgr          49 1.7%  94%  38.6M  77.0M   19  500    --    -- -   good

4/1 sessmgr         57  11% 100%  93.1M 155.0M   29  500   807  6600 I   good

4/1 sessmgr         62 7.9% 100%  95.3M 155.0M   30  500   791  6600 I   good

4/1 sessmgr         82 9.5% 100%  91.9M 155.0M   28  500   803  6600 I   good

4/1 sessmgr         94 8.2% 100%  94.1M 155.0M   29  500   791  6600 I   good

4/1 aaamgr         101 2.2%  94%  38.8M  77.0M   19  500    --    -- -   good

4/1 sitreap    4100472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

4/1 msgd       4100470 1.0%   5%   0.8M   5.0M    6  100    --    -- -   good

4/1 nscontrol  4130563 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

The task table maps the task instance number with the process id (pid), as well as with the parent task instance and pid, for example:

******** show task table *******

     task                           parent

cpu facility      inst    pid pri  facility      inst    pid

---- -----------------------------  -------------------------

3/0 aaamgr          94    639   0  sessctrl         0    582

3/0 aaamgr          88    632   0  sessctrl         0    582

3/0 aaamgr          79    625   0  sessctrl         0    582

3/0 aaamgr          70    624   0  sessctrl         0    582

3/0 sessmgr         52    617   0  sessctrl         0    582

3/0 sessmgr         37    610   0  sessctrl         0    582

3/0 aaamgr        5026    585   0  sessctrl         0    582

3/0 sessmgr          3    572   0  sessctrl         0    582

3/0 sessmgr       5015    571   0  sessctrl         0    582

3/0 sft            300    528   0  npuctrl          0    564

3/0 npumgr           3    521  -8  npuctrl          0    564

3/0 dcardmgr         3    520   0  dcardctrl        0    567

3/0 dhmgr           30    519  -4  sitparent       30    510

3/0 rmmgr           30    512   0  sitparent       30    510

3/0 hatcpu          30    511 -12  sitparent       30    510

3/0 sitparent       30    510 -12  sitmain         30    420

3/0 sitmain         30    420 -12  sitmain         30    420

3/0 sitreap    3000509    509   0  sitmain         30    420

3/0 msgd       3000507    507   0  -                0      0

3/0 nscontrol  3024701  24701   0  -                0      0

Note that on PAC 15, the standby card, all the task instances are numbered greater than 5000, which is the signature of a standby task. Note above that there are also standby sessmgr and aaamgr tasks on active cards ready to take over for task failures (more on this discussed below), but most are active.

15/0 sitmain        150 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

15/0 sitparent      150 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

15/0 hatcpu         150 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

15/0 rmmgr          150 1.3%  10%   3.2M   7.0M   22  500    --    -- -   good

15/0 dhmgr          150 0.0%   5%   1.7M   6.0M   18 2000    --    -- -   good

15/0 dcardmgr        15 0.0%  60%  12.7M  20.0M   13  500    --    -- -   good

15/0 npumgr          15 5.1% 100% 117.5M 225.3M   34 1000    --    -- -   good

15/0 sft           1500 0.3%  50%   2.5M  12.0M   15  500    --    -- -   good

15/0 sessmgr       5045 0.1%  50%  18.0M  85.0M   17  500    --    -- S   good

15/0 sessmgr       5046 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/0 sessmgr       5047 0.1%  50%  18.1M  85.0M   16  500    --    -- S   good

15/0 aaamgr        5048 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/0 aaamgr        5049 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/0 aaamgr        5050 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/0 aaamgr        5051 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/0 sitreap   15000509 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

15/0 msgd      15000507 0.7%   5%   0.8M   5.0M    6  100    --    -- -   good

15/0 nscontrol 15030489 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

15/1 sitmain        151 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

15/1 sitparent      151 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

15/1 hatcpu         151 0.2%  10%   1.5M   5.0M   13  500    --    -- -   good

15/1 rmmgr          151 1.2%  10%   3.2M   7.0M   25  500    --    -- -   good

15/1 dhmgr          151 0.0%   5%   1.9M   6.0M   18 2000    --    -- -   good

15/1 sessmgr       5040 0.1%  50%  17.9M  85.0M   17  500    --    -- S   good

15/1 sessmgr       5041 0.0%  50%  18.1M  85.0M   17  500    --    -- S   good

15/1 sessmgr       5042 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/1 sessmgr       5043 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/1 sessmgr       5044 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/1 sessmgr       5052 0.1%  50%  18.1M  85.0M   16  500    --    -- S   good

15/1 aaamgr        5053 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/1 aaamgr        5054 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/1 aaamgr        5055 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/1 aaamgr        5056 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/1 aaamgr        5057 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/1 aaamgr        5058 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/1 sitreap   15100472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

15/1 msgd      15100470 0.7%   5%   0.8M   5.0M    6  100    --    -- -   good

15/1 nscontrol 15130514 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

15/2 sitmain        152 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

15/2 sitparent      152 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

15/2 hatcpu         152 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

15/2 rmmgr          152 1.1%  10%   3.3M   7.0M   25  500    --    -- -   good

15/2 dhmgr          152 0.0%   5%   2.0M   6.0M   18 2000    --    -- -   good

15/2 sessmgr       5059 0.1%  50%  17.9M  85.0M   17  500    --    -- S   good

15/2 sessmgr       5060 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/2 sessmgr       5061 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/2 sessmgr       5062 0.1%  50%  18.1M  85.0M   16  500    --    -- S   good

15/2 sessmgr       5069 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/2 sessmgr       5070 0.1%  50%  18.1M  85.0M   16  500    --    -- S   good

15/2 aaamgr        5071 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/2 aaamgr        5072 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/2 aaamgr        5073 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/2 aaamgr        5074 0.0%  50%   6.0M  20.0M   16  500    --    -- -   good

15/2 aaamgr        5075 0.0%  50%   6.5M  20.0M   16  500    --    -- -   good

15/2 aaamgr        5076 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/2 sitreap   15200472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

15/2 msgd      15200470 0.5%   5%   0.8M   5.0M    6  100    --    -- -   good

15/2 nscontrol 15230514 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

15/3 sitmain        153 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

15/3 sitparent      153 0.0%  20%   1.5M   4.0M   12  500    --    -- -   good

15/3 hatcpu         153 0.2%  10%   1.6M   5.0M   13  500    --    -- -   good

15/3 rmmgr          153 1.2%  10%   3.2M   7.0M   25  500    --    -- -   good

15/3 dhmgr          153 0.0%   5%   1.8M   6.0M   18 2000    --    -- -   good

15/3 sessmgr       5063 0.1%  50%  17.9M  85.0M   17  500    --    -- S   good

15/3 sessmgr       5064 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/3 sessmgr       5065 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/3 sessmgr       5066 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/3 sessmgr       5067 0.1%  50%  18.1M  85.0M   17  500    --    -- S   good

15/3 sessmgr       5068 0.1%  50%  18.1M  85.0M   16  500    --    -- S   good

15/3 aaamgr        5077 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5078 0.0%  50%   6.2M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5079 0.0%  50%   6.4M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5080 0.0%  50%   6.5M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5081 0.0%  50%   6.2M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5082 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/3 aaamgr        5083 0.0%  50%   6.1M  20.0M   16  500    --    -- -   good

15/3 sitreap   15300472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

15/3 msgd      15300470 0.6%   5%   0.8M   5.0M    6  100    --    -- -   good

15/3 nscontrol 15330545 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

To round out the picture, here is output for the tasks running on the SPC:

******** show task resources *******

                   task   cputime        memory     files      sessions

cpu facility      inst used allc   used  alloc used allc  used  allc S status

----------------------- --------- ------------- --------- ------------- ------

8/0 sitmain         80 0.0%  15%   1.9M   6.0M   15 1000    --    -- -   good

8/0 sitparent       80 0.0%  20%   1.5M   4.0M   13  500    --    -- -   good

8/0 evlogd           0  18%  94%  29.7M  60.0M  134 4000    --    -- -   good

8/0 drvctrl          0 1.4%  15%   3.3M   5.0M   15  500    --    -- -   good

8/0 hatsystem        0 1.1%  10%   2.0M   5.0M   18  500    --    -- -   good

8/0 hatcpu          80 0.2%  10%   1.6M   5.0M   16  500    --    -- -   good

8/0 rct              0 0.0%   5%   1.4M   5.0M   13  500    --    -- -   good

8/0 sct              0 4.0%  50%   9.4M  65.0M   18  500    --    -- -   good

8/0 rmmgr           80 2.5%  10%   3.5M   7.0M   19  500    --    -- -   good

8/0 npumgr           8 2.6% 100%  46.0M  90.5M   24 1000    --    -- -   good

8/0 vpnmgr           1 0.6%  30%   8.4M  12.0M   34 1000    --    -- -   good

8/0 zebos            1 0.1%  50%   7.9M  12.0M   18  500    --    -- -   good

8/0 aaamgr         231 0.1%  30%  11.7M  25.0M   20  500    --    -- -   good

8/0 dhmgr           80 0.0%   5%   2.0M   6.0M   18 2000    --    -- -   good

8/0 rmctrl           0 1.4%  20%   7.5M  22.0M   45 1000    --    -- -   good

8/0 npuctrl          0 0.6%  10%   2.7M  10.0M   27  500    --    -- -   good

8/0 dcardctrl        0 0.0%  10%   1.5M   8.0M   12  500    --    -- -   good

8/0 cspctrl          0 1.7%  50%   3.5M  12.0M   16  500    --    -- -   good

8/0 vpnctrl          0 0.1%  15%   3.5M   7.0M   16 1500    --    -- -   good

8/0 cssctrl          0 0.0%  15%   1.4M   7.0M   12  500    --    -- -   good

8/0 sessctrl         0 0.6%  89%   7.9M  57.0M   35 2000    --    -- -   good

8/0 vmgctrl          0 0.0%   7%   1.5M  20.0M   13  500    --    -- -   good

8/0 snmp             0 6.6%  34%  11.9M  12.0M  360 1500    --    -- -   warn

8/0 orbs             0 0.0%  50%   9.2M  28.0M   16 1500    --    -- -   good

8/0 bulkstat         0  20%  89%  16.1M  35.0M  316 1500    --    -- -   good

8/0 threshold        0 0.5%  89%   4.2M  23.0M   12 1500    --    -- -   good

8/0 ipsecctrl        0 0.1%  15%   2.2M  15.0M   23  500    --    -- -   good

8/0 orbns            0 0.0%  50%   2.6M  12.0M   17 1500    --    -- -   good

8/0 cli        8017052 0.0%  50%   4.0M  20.0M   11 2000    --    -- -   good

8/0 cli        8017526  31%  50%   9.8M  20.0M  385 2000    --    -- -   good

8/0 sitreap    8000546 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

8/0 msgd       8000542 1.0%   5%   0.4M   5.0M    6  100    --    -- -   good

8/0 msgd       8000543 0.0%   5%   0.1M   5.0M    7  100    --    -- -   good

8/0 inetd      8000593 0.0%   5%   0.1M   5.0M    4  100    --    -- -   good

8/0 sshd       8001310 0.0%  30%   0.2M   5.0M    4  100    --    -- -   good

8/0 ntpd       8001320 0.0%   5%   0.2M   5.0M    8  100    --    -- -   good

8/0 orbnsd     8001323 0.0%   5%   0.7M  10.0M    8  100    --    -- -   good

8/0 sshd       8017050 0.0%  30%   0.3M   5.0M    9  100    --    -- -   good

8/0 nscontrol  8017481 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

8/0 sshd       8017522 2.1%  30%   0.3M   5.0M    9  100    --    -- -   good

Session Recovery

Each sessmgr process is paired with a aaamgr process that is handling the aaa requests for the subscribers on that sessmgr, and that aaamgr process is located on a different card. The task instance ids for each pair is the same. The process pairings are across various cards, so that the processes are not paired just between two given cards.

Also, if session recovery feature is enabled, then that aaamgr also contains the state information of the sessions managed by the sessmgr process it is paired with. As well, pre-v8.1, acsmgr processes are also mirrored (acsmgr functionality has been combined into the sessmgr process with v8.1). In the case of a failed PAC, the sessions can be recreated on the standby card that has become active seamlessly based on the redundant information. Also, an additional PAC card is largely used for the recovery process (in addition to the standby card that takes over for the failed card). This PAC card is called the Demux card, and there are no session related tasks running on this card, and it is always in the active state. So, the session recovery feature requires at least three active cards and one standby card. Here is example output for the demux card (2) taken from the same chassis as the above output did.

2/0 sitmain         20 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

2/0 sitparent       20 0.0%  20%   1.3M   4.0M   12  500    --    -- -   good

2/0 hatcpu          20 0.1%  10%   1.6M   5.0M   13  500    --    -- -   good

2/0 rmmgr           20 1.0%  10%   3.2M   7.0M   19  500    --    -- -   good

2/0 dhmgr           20 0.0%   5%   1.7M   6.0M   16 2000    --    -- -   good

2/0 dcardmgr         2 0.0%  60%  12.5M  20.0M   13  500    --    -- -   good

2/0 npumgr           2 8.1% 100% 117.5M 225.3M   34 1000    --    -- -   good

2/0 sft            200 0.3%  50%   2.5M  12.0M   15  500    --    -- -   good

2/0 vpnmgr           2 0.2% 100%   5.4M  12.0M  136 1000    --    -- -   good

2/0 zebos            2 0.0%  50%   7.2M  12.0M   17  500    --    -- -   good

2/0 sitreap    2000509 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

2/0 msgd       2000507 0.5%   5%   0.8M   5.0M    6  100    --    -- -   good

2/0 inetd      2000554 0.0%   5%   0.1M   5.0M    4  100    --    -- -   good

2/0 nscontrol  2030377 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

2/1 sitmain         21 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

2/1 sitparent       21 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

2/1 hatcpu          21 0.1%  10%   1.6M   5.0M   13  500    --    -- -   good

2/1 rmmgr           21 0.7%  10%   3.3M   7.0M   20  500    --    -- -   good

2/1 dhmgr           21 0.0%   5%   1.9M   6.0M   18 2000    --    -- -   good

2/1 l2tpdemux        1 0.0%  94%   2.3M  20.0M   14  500    --    -- -   good

2/1 l2tpmgr          1 0.1% 100%  12.7M 266.0M   16 1500     0  200k -   good

2/1 sitreap    2100472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

2/1 msgd       2100470 0.5%   5%   0.8M   5.0M    6  100    --    -- -   good

2/1 nscontrol  2130198 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

2/2 sitmain         22 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

2/2 sitparent       22 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

2/2 hatcpu          22 0.1%  10%   1.6M   5.0M   13  500    --    -- -   good

2/2 rmmgr           22 1.3%  10%   3.2M   7.0M   21  500    --    -- -   good

2/2 dhmgr           22 0.0%   5%   1.7M   6.0M   14 2000    --    -- -   good

2/2 vpnmgr           3 7.4% 100%   9.1M  41.4M  241 1000    --    -- -   good

2/2 zebos            3 0.0%  50%   7.9M  15.0M   17  500    --    -- -   good

2/2 famgr            1 9.9% 100%  34.7M  96.0M  236 1000   63k  200k -   good

2/2 vpnmgr           4 0.3% 100%   4.9M  12.0M   28 1000    --    -- -   good

2/2 zebos            4 0.0%  50%   7.4M  12.0M   17  500    --    -- -   good

2/2 l2tpmgr          2 0.6% 100%  14.2M 150.0M  196 1500   144  130k -   good

2/2 cdrmod           2 0.3%  94%  42.2M 315.9M   87 1024    --    -- -   good

2/2 sitreap    2200472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

2/2 msgd       2200470 0.8%   5%   0.8M   5.0M    6  100    --    -- -   good

2/2 inetd      2200515 0.0%   5%   0.1M   5.0M    4  100    --    -- -   good

2/2 inetd      2200526 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

2/2 sshd       2200527 0.0%  30%   0.2M   5.0M    4  100    --    -- -   good

2/2 sshd       2200577 0.0%  30%   0.3M   5.0M    8  100    --    -- -   good

2/2 sftp-serv  2200578 0.0%  10%   0.3M   5.0M    5  100    --    -- -   good

2/2 sshd       2200579 0.0%  30%   0.3M   5.0M    8  100    --    -- -   good

2/2 sftp-serv  2200586 0.0%  10%   0.4M   5.0M    5  100    --    -- -   good

2/2 nscontrol  2203737 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

2/3 sitmain         23 0.0%  15%   1.8M   6.0M   14 1000    --    -- -   good

2/3 sitparent       23 0.0%  20%   1.4M   4.0M   12  500    --    -- -   good

2/3 hatcpu          23 0.1%  10%   1.6M   5.0M   13  500    --    -- -   good

2/3 rmmgr           23 0.8%  10%   3.1M   7.0M   20  500    --    -- -   good

2/3 dhmgr           23 0.0%   5%   2.0M   6.0M   18 2000    --    -- -   good

2/3 a11mgr           1  26% 100%  50.9M 171.0M  243 1000   79k  200k -   good

2/3 sitreap    2300472 0.0%   5%   0.1M   5.0M    5  100    --    -- -   good

2/3 msgd       2300470 0.8%   5%   0.8M   5.0M    6  100    --    -- -   good

2/3 nscontrol  2330179 0.0%   5%   1.0M   5.0M   10  500    --    -- -   good

The command “show session recovery verbose” also indicates which card is the demux card:

******** show session recovery status verbose *******

Session Recovery Status:

  Overall Status        : Ready For Recovery

  Last Status Update    : 7 seconds ago

              ----sessmgr---  ----aaamgr----  demux

cpu state    active standby  active standby  active  status

---- -------  ------ -------  ------ -------  ------  -------------------------

2/0 Active   0      0        0      0        1       Good (Demux)            

2/1 Active   0      0        0      0        2       Good (Demux)            

2/2 Active   0      0        0      0        5       Good (Demux)            

2/3 Active   0      0        0      0        1       Good (Demux)            

3/0 Active   3      1        4      1        0       Good                    

3/1 Active   6      1        5      1        0       Good                    

3/2 Active   6      1        5      1        0       Good                    

3/3 Active   6      1        5      1        0       Good                    

4/0 Active   3      1        4      1        0       Good                    

4/1 Active   6      1        5      1        0       Good                    

4/2 Active   6      1        5      1        0       Good                    

4/3 Active   6      1        6      1        0       Good                    

6/0 Active   3      1        4      1        0       Good                    

6/1 Active   6      1        6      1        0       Good                    

6/2 Active   6      1        6      1        0       Good                    

6/3 Active   6      1        6      1        0       Good                    

11/0 Active   3      1        4      1        0       Good                    

11/1 Active   6      1        5      1        0       Good                    

11/2 Active   6      1        6      1        0       Good                    

11/3 Active   6      1        7      1        0       Good                    

13/0 Active   3      1        4      1        0       Good                    

13/1 Active   6      1        5      1        0       Good                    

13/2 Active   6      1        6      1        0       Good                    

13/3 Active   6      1        7      1        0       Good                    

15/0 Standby  0      3        0      4        0       Good                    

15/1 Standby  0      6        0      6        0       Good                    

15/2 Standby  0      6        0      6        0       Good                    

15/3 Standby  0      6        0      7        0       Good

Another command will allow examining what card CPUs are devoted to what kind of activity. Note the demux card is 2, the sessmgr cards are 3, 4, 6, 11, 13. That leaves the remaining cards as standby (only one card, 15, in this case, but there could be multiple standbys)

******** show resources *******

Active CPUs:

  Total CPUs            : 24

  Session CPUs          : 20 [3/0 3/1 3/2 3/3 4/0 4/1 4/2 4/3 6/0 6/1 6/2 6/3 11/0 11/1 11/2 11/3 13/0 13/1 13/2 13/3]

  Demux CPUs            : 4 [2/0 2/1 2/2 2/3]

  Highest Load          : 2.09 (CPU 3/2)

  Total Memory          : 48.0G

  Total Used            : 18.7G

  Least Free            : 1059M (CPU 13/2)

  Total Temporary Files :    5M

  Most Temporary Files  :    2M (CPU 2/3)

Standby CPUs:

  Total CPUs            : 4

  Highest Load          : 0.07 (CPU 15/2)

  Total Memory          :  8.0G

  Total Used            :  1.2G

  Least Free            : 1642M (CPU 15/0)

  Total Temporary Files :    0M

  Most Temporary Files  :    0M (CPU 15/0)

Session Information:

  In-Use Session Managers:

    Number of Managers  : 105

    Capacity            : 173250 min / 346500 typical / 693000 max

    Usage               : 79347

  Busy-Out Session Managers:

    Number of Managers  : 0

    Capacity            : 0 min / 0 typical / 0 max

    Usage               : 0

  Standby Session Managers:

    Number of Managers  : 41

  PDSN Service:

    In Use              : 79347

    Max Used            : 86135

    Limit               : 200000

    License Status      : Within Acceptable Limits

As always the case with STxx commands, you can see that there are multiple views with overlapping information, each serving a specific purpose in the information gathering, monitoring, and troubleshooting process.

Task Warnings and Alarms

This is a good opportunity to mention the status column of show task resources. If the amount of memory that has been allocated to the process has been exceeded, then the status will indicate “warn” or “alarm”. If this occurs, contact Starent support so that the appropriate action can be taken. This might involve resetting the task, which only Starent can do, but also collecting data from the chassis that might indicate how it got into that state, and then possibly reporting this to engineering as a problem report. If a task does get into this state, it is likely that an SNMP trap and/or error log entry has been saved, and possibly an alarm triggered:

show snmp trap history (or view your SNMP trap server for more detailed information)

show logs

show alarm outstanding

Here is an example of multiple logs for a bulkstats process exceeding the allocated memory limit.

show task resources
8/0 bulkstat         0 0.3%  89%  52.2M  35.0M  332 1500    --    -- -  alarm

[local]BRHOALTBPN3-HA# show logs | grep bulk
2009-Feb-21+20:18:09.342 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 57588K
2009-Feb-18+17:50:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 55636K
2009-Feb-16+11:56:09.351 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 53812K
2009-Feb-13+05:34:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 51948K

NON-critical Task Failures

Besides a task getting into the warning or alarm state, a task could also crash (unrelated to warning or alarm). Task monitoring is handled by the hatcpu processes on each card, and if the failure is not critical, then the parent task (name ends in main, ctrl, or parent) is notified, and the task can simply be respawned without any other action (such as a migration).

If session recovery is enabled, then data from any lost sessmgr, aaamgr, (or acsmgr for pre-v8.1) process(es) is transferred to standby process(es) which become active, and then those process(es) are then re-assigned the instance number of failed process(es). This re-assignment is done for convenience because other tasks in the system know about the instance number of the failed task number, not the one that took over. Also, the task that takes over has a value greater than 5000 which by definition is standby, but the task is now active. Finally, a new standby task is spawned to replace the one that became active, and that task’s instance number is usually the previous standby task instance number + 1 (see the task table for evidence of this)

The SNMP trap history and associated logs may look like this (SR 5898)

Mon Feb 16 04:06:50 2009 Internal trap notification 73 (ManagerFailure) facility acsmgr instance 50 card 13 cpu 0

Mon Feb 16 04:06:50 2009 Internal trap notification 150 (TaskFailed) facility acsmgr instance 50 on card 13 cpu 0

2009-Feb-16+04:06:51.313 [sitmain 4075 trace] [8/0/912 <sitparent:80> crashd.c:579] [software internal system critical-info] Crash handler file transfer ended (type=1 size=127535 child_ct=0 core_ct=0 pid=2257 status=1 elapsed=1s)

2009-Feb-16+04:06:50.444 [sitmain 4023 info] [13/0/718 <sitmain:130> sittask.c:4282] [software internal system critical-info] Readdress successful for facility acsmgr instance 6009 to instance 50

2009-Feb-16+04:06:50.247 [sitmain 4074 trace] [8/0/912 <sitparent:80> crashd.c:851] [software internal system critical-info] Crash handler file transfer starting (type=1 size=127535 child_ct=1 core_ct=0 pid=2257)

2009-Feb-16+04:06:50.245 [sitmain 4080 info] [8/0/912 <sitparent:80> crashd.c:812] [software internal system critical-info] Core file transfer to SPC complete, received 127535/127535 bytes

2009-Feb-16+04:06:50.238 [sitmain 4022 info] [13/0/718 <sitmain:130> sittask.c:4128] [software internal system critical-info] Readdress requested for facility acsmgr instance 6009 to instance 50

2009-Feb-16+04:06:50.207 [rct 13038 info] [8/0/918 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task acsmgr/50 on 13/0 sent to parent task acsctrl/0 with failover of acsmgr/6009 on 13/0

2009-Feb-16+04:06:50.205 [sitmain 4017 critical] [13/0/718 <sitmain:130> sitproc.c:1471] [software internal system] Process acsmgr pid 7846 died on card 13 cpu 0 signal=11 wstatus=0x8b

2009-Feb-16+04:06:50.204 [sitmain 4027 critical] [13/0/718 <sitmain:130> crash_mini.c:795] [software internal system] Process Crash Info: time 2009-Feb-16+04:06:47(hex time 4998e657) card 13 cpu 00 pid 07846 procname acsmgr   crash_details

Fatal Signal 11: Segmentation fault

  PC: [ a280732/X] acs_control_charge()

  Faulty address: (nil)

  Signal detail: address not mapped to object

  Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr

  Crash time: 2009-Feb-15+23:06:47

Recent errno: 11 Resource temporarily unavailable

........

In this case, the backed up acsmgr task data from the failed task 50 was populated into standby task 6009 which took over and became active, and then was re-assigned the task instance number 50 of the failed task. The term “failover” means that session recovery was used to failover to standby task 6009. Note the logs use process id (pid) 7846 and task instance 6009 interchangeably.

Though the log output is intended for engineering to analyze, if you read through the logs, some of it may actually make some sense, especially in light of reading this document. Logs are in reverse chronological order, so you should read in the up direction. The SNMP trap history is the best place to START most of your troubleshooting, because it lays out a timeline of the most significant events. When you see ManagerFailure and/or TaskFailed, you can then check the logs for a matching timestamp and you should be able to find significant details on what occurred, even if the logging level for the chassis is set to error level for all facilities (default behavior). That is because event data associated with process crashes are often either error level (second to highest level) or critical level (highest level). In addition, many log events associated with important activities are set to critical-info (vs. non-critical-info), so even if the log event is less than error level and the facility in question is set to error level, it will still be saved to logs because it is critical-info.

In the above example, the full crash details have not been included here to save space (........), so the crash itself is considered one log event that spans many lines. The full crash itself will also be found in the crash list which can be retrieved with the command “show crash list”. Note the card, CPU and Process ID, software version and SPC/SMC and PAC/PSC card serial numbers are saved. The list itself can be useful for quick scanning to note any patterns of tasks or cards or timestamps of when crashes have occurred. Up to 120 crashes can be saved in the list, after which it needs to be cleared with “clear crash list” in order for any new crashes to be saved.

Finally note mention of crash files being transferred. If the crash enable feature is set in the configuration, then any crash dump files will be sent to the URL specified (it is encrypted when viewing the config), and Starent may or may not request such files (one per crash) depending on whether a root cause can be determined from the logs alone.

******** show crash list *******

==         ====         =======  =========== ========== ================

#          Time         Process   Card/CPU/      SW        HW_SER_NUM

                                     PID       VERSION  SMC / Crash Card

==         ====         =======  =========== ========== ================

1  2009-Feb-16+04:06:47 acsmgr   13/00/07846 7.1(21029) PLB19082829/PLB24084573

********************* CRASH #01 ***********************

Fatal Signal 11: Segmentation fault

  PC: [ a280732/X] acs_control_charge()

  Faulty address: (nil)

  Signal detail: address not mapped to object

  Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr

  Crash time: 2009-Feb-15+23:06:47

Recent errno: 11 Resource temporarily unavailable

........

Here’s another example with just a tiny snippet to make the point, though there were actually hundreds of log entries associated with these crashes. In this case both sessmgr processes were respawned. (SR 6032)

2009-Mar-06+08:40:10.297 [sitmain 4023 info] [16/0/1146 <sitmain:160> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5022 to instance 22

2009-Mar-06+08:40:10.272 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5022 to instance 22

2009-Mar-06+08:40:10.244 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/22 on 16/0 sent to parent task sessctrl/0 with failover of sessmgr/5022 on 16/0

2009-Mar-06+08:40:10.243 [sitmain 4017 critical] [16/0/1146 <sitmain:160> sitproc.c:1514] [software internal system] Process sessmgr pid 1292 died on card 16 cpu 0 signal=6  wstatus=0x86

2009-Mar-06+08:40:09.889 [sitmain 4023 info] [3/0/1146 <sitmain:30> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5002 to instance 50

2009-Mar-06+08:40:09.859 [sitmain 4022 info] [3/0/1146 <sitmain:30> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5002 to instance 50

2009-Mar-06+08:40:09.829 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/50 on 3/0 sent to parent task sessctrl/0 with failover of sessmgr/5002 on 3/0

2009-Mar-06+08:40:09.786 [sitmain 4017 critical] [3/0/1146 <sitmain:30> sitproc.c:1514] [software internal system] Process sessmgr pid 1336 died on card 3 cpu 0 signal=6  wstatus=0x86

******** show crash list *******

==         ====         =======  =========== ========== ================

#          Time         Process   Card/CPU/      SW        HW_SER_NUM

                                     PID       VERSION  SMC / Crash Card

==         ====         =======  =========== ========== ================

1  2009-Mar-06+08:40:07 sessmgr  16/00/01292 8.1(22707) PLB27085483/PLB46075509

2  2009-Mar-06+08:40:09 sessmgr  03/00/01336 8.1(22707) PLB27085483/PLB46075497

********************* CRASH #01 ***********************

Assertion failure at sess/smgr/sessmgr_ipv4.c:11833

  Function: acsmgr_send_packet_merge_mode()

  Expression: clp

  Proclet: sessmgr (f=66000,i=22)

  Process: card=16 cpu=0 arch=X pid=1292 cpu=~5% argv0=sessmgr

  Crash time: 2009-Mar-06+03:40:07

  Recent errno: 115 Operation now in progress

********************* CRASH #02 ***********************

Assertion failure at sess/smgr/sessmgr_ipv4.c:11833

  Function: acsmgr_send_packet_merge_mode()

  Expression: clp

  Proclet: sessmgr (f=66000,i=50)

  Process: card=3 cpu=0 arch=X pid=1336 cpu=~5% argv0=sessmgr

  Crash time: 2009-Mar-06+03:40:09

Recent errno: 115 Operation now in progress

Migrations/Switchovers

The information presented thus far is important in developing an understanding of the various steps that take place when a PAC/PSC migration occurs.

The system will always try to maintain the same number of active cards as have been specified in the configuration with:

card x

mode active pac

#end

This is due to the fact that it is not trivial to decrease the number of certain tasks in the system without putting it into an unstable state.

So, whenever an active card goes into the offline or standby state, there should always be another standby card ready to become active and take its place. Whether an active card failure occurs or whether an active card is purposely made inactive (see below), the system is designed to ensure another card will become active automatically (without user intervention).

First let’s cover the cases where the migration was intentional or “planned”, that is, it was initiated by an administrator. These are the various ways that will cause a migration from one card to another to occur. This is assuming that the card being migrated from is an active card. In all cases, the migration happens FIRST before the reset action for the card in order to minimize impact.

Executive mode:

card pac migrate X to X – unlike the other commands below, this ensures which card will become active (if there are more than one standby card)

card restart – re-initializes card without reboot first (faster – it just restarts the software tasks, it doesn’t reboot the kernel)

card reboot – does both a reboot and re-initialization

card halt – card goes offline instead of standby (cannot apply this command to an offline card)

Restart, reboot, and halt all trigger a CardRebootRequest SNMP trap.

card configuration mode (for a specific card):

shutdown – card goes offline

Note: executing “mode standby” in card config mode will NOT result in a migration and so should not be used to accomplish this task. (It will also immediately disable the line card behind the card (if there is one)).

Certainly you can do a card reboot, restart, or halt to a standby card, but nothing will be migrated because it is standby.

All successful migrations have the following traps:

Internal trap notification 16 (PACMigrateStart) from card 6 to card 5

Internal trap notification 55 (CardActive) card 5

Internal trap notification 60 (CardDown) card 6

Internal trap notification 17 (PACMigrateComplete) from card 6 to card 5

And if the card being migrated from is destined for the standby state (migrate, reboot, or restart), then additionally:

Internal trap notification 5 (CardUp) card 6

Internal trap notification 93 (CardStandby) card 6

The state changes show up in the card table during the migration process:

5: PAC/TAC  Packet Accelerator Card           Migrate To     -       -  -

6: PAC/TAC  Packet Accelerator Card           Migrate From   -      20  -

There is a command that gets run when running “show support details” that shows all recovery attempts (you cannot run this command at the command line).

******** show rct stats *******

RCT stats Details (Last 1 Actions)

Action         Type      From To   Start Time                Duration

-------------- --------- ---- ---- ------------------------  ---------- 

PAC migration  Planned     6    5  2009-May-24+12:49:19.969   8.852 sec

Whether or not session recovery is enabled, the above approaches should all result in a graceful recreation of tasks to a standby card with very minimal subscriber impact. This is because nothing has gone wrong with the existing tasks, there is no lost information, and so the tasks and data have sufficient time to simply be recreated on another card, without relying on a backup/mirrored card (which would be the case for session recovery).

Note: Another way to force a graceful migration is to lift the latch on the card. If you do this, do not pull out the card from the slot while the migration is taking place. If the card is physical pulled out, then migration of tasks is not possible without session recovery because all the tasks and data are lost when the card is pulled and loses power.

SPC/SMC switchovers work the same way as PAC/PSC migrations. Note that SPC/SMC cards are not “configured” to be active in the configuration like PAC/PSC cards are. The commands to run are:

card spc switchover – no need to specify the card to switch to because there are only two cards

card reboot (8 or 9)

card restart (8 or 9)

There is no version of the halt command as there is for PAC/PSC cards.

For more information on the impact of various failover scenarios in table format, see the System Overview manual, Chapter 4 Redundancy and Availability Features, Maintenance and Failure Scenarios section.

Critical Task Failures with or without migration

A critical task failure often is the symptom of a hardware failure on a card and the card must be failed/reset. The system uses heartbeat or ping UDP-based messages every second to determine if the system is running normally. On the Active SPC/SMC, the hatsystem task executes heartbeat messages that are sent to the heartbeat CPU tasks (hatcpu) located on all CPU-based cards (PAC/PSC/SPC/SMC) in the system. If two consecutive messages are lost (not responded to), plus some other criteria (too detailed/low-level to cover here), then a critical task failure will be declared and will need to be reset. Depending on the severity, a migration of whatever tasks are migratable may or may not be attempted first, but in either case, an available standby card will be made active immediately. If session recovery is enabled, then minimal subscriber impact will be felt as those tasks are populated with data from backup processes from the various other active cards.

It is often asked if it is necessary or recommended to put all the cards in the chassis back into the state that they are configured for in the configuration, so that all the active cards are active, and the standby card(s) are standby. This is not necessary unless you like to keep things “tidy and neat”; the system will function fine without doing so.

No Migration and failed card fails to reboot successfully and remains offline

In this example, no migration is attempted and the failed card did not boot back to standby status but goes offline (some traps removed): (SR 7095). Note that the logs do not mention anything about the new card (16) becoming active (because they are “non-critical-info”, “info” level logs and the logging level for this capture was “error” level), but the SNMP traps do.

Sun May 24 17:18:36 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0

Sun May 24 17:18:36 2009 Internal trap notification 55 (CardActive) card 16

...

Sun May 24 17:18:37 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0

...

Sun May 24 17:18:38 2009 Internal trap notification 60 (CardDown) card 4

...

Sun May 24 17:21:40 2009 Internal trap notification 9 (CardBootFailed) card 4

...

Sun May 24 17:24:42 2009 Internal trap notification 9 (CardBootFailed) card 4

Sun May 24 17:27:45 2009 Internal trap notification 9 (CardBootFailed) card 4

Sun May 24 17:30:53 2009 Internal trap notification 9 (CardBootFailed) card 4

The logs show the card being booted multiple times but not responding to pings as it should, and finally no more attempts are made after the fourth attempt. Note “Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0” indicating that pings to both CPUs 0 and 1 (PSC cards have 2 CPUs) failed to be responded to.

2009-May-24+17:30:53.360 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and kept down. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:30:53.353 [alarmctrl 65201 info] [8/0/1378 <evlogd:0> alarmctrl.c:189] [software internal system critical-info] Alarm condition: id 50056e19844d0000 (Critical): Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [hat 3018 critical] [8/0/1380 <hatsystem:0> atsystem_fail.c:719] [hardware internal system] Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:27:45.054 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be power down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:27:45.037 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:27:45.029 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:24:42.794 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:24:42.789 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:24:42.789 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:21:40.578 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:21:40.573 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:21:40.572 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:18:37.130 [hat 3017 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:935] [software internal system] Ignoring multiple failure on card 4 cpu 0 reason CPU_CRITICAL_TASK_FAILURE

2009-May-24+17:18:36.834 [rct 13014 info] [8/0/1382 <rct:0> rct_pac.c:1105] [software internal system critical-info] Card 4 shutdown completed, took 0.479 sec

2009-May-24+17:18:36.436 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility aaamgr instance 5031 to instance 1

2009-May-24+17:18:36.361 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:18:36.356 [rct 13013 info] [8/0/1382 <rct:0> rct_pac.c:978] [software internal system critical-info] Card 4 shutdown started

2009-May-24+17:18:36.210 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1610] [software internal system critical-info] finish taking over card 4

2009-May-24+17:18:36.204 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1584] [software internal system critical-info] start taking over card 4

2009-May-24+17:18:36.136 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: CC], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:18:36.129 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-May-24+17:18:35.208 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0

2009-May-24+17:18:35.208 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond

2009-May-24+17:18:35.168 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/41 failed on cpu 4/1. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192349.

2009-May-24+17:18:35.167 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0

2009-May-24+17:18:35.167 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond

2009-May-24+17:18:35.127 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/40 failed on cpu 4/0. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192343.

In this case, no migration is even attempted:

******** show rct stats *******

RCT stats Details (Last 1 Actions)

Action         Type      From To   Start Time                Duration

-------------- --------- ---- ---- ------------------------  ---------- 

PSC shutdown   N/A         4   16  2009-May-24+17:18:36.355   0.478 sec

RCT stats Summary

-----------------

PSC Migrations =      0

SMC switchover =      0

To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

No Migration and failed card reboots successfully

This example is similar to the previous one except the failed card reset successfully (“Boxer starting …”) to standby status. (SR 6304)

Fri Mar 27 11:16:22 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 110 card 9 cpu 0
Fri Mar 27 11:16:22 2009 Internal trap notification 55 (CardActive) card 4
Fri Mar 27 11:16:25 2009 Internal trap notification 60 (CardDown) card 11
...
Fri Mar 27 11:18:43 2009 Internal trap notification 5 (CardUp) card 11
Fri Mar 27 11:18:43 2009 Internal trap notification 93 (CardStandby) card 11

2009-Mar-27+11:18:32.136 [sitmain 4003 info] [11/3/370 <sitmain:113> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 3 (PAC)

2009-Mar-27+11:18:32.105 [sitmain 4003 info] [11/2/370 <sitmain:112> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 2 (PAC)

2009-Mar-27+11:18:32.085 [sitmain 4003 info] [11/1/370 <sitmain:111> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 1 (PAC)

2009-Mar-27+11:18:31.425 [sitmain 4003 info] [11/0/397 <sitmain:110> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 0 (PAC)

2009-Mar-27+11:16:23.363 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 11 shutdown completed, took 0.616 sec

2009-Mar-27+11:16:22.747 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 11 shutdown started

2009-Mar-27+11:16:22.391 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW05061700 in slot 11 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: F3], Boot Done [hb: E3], Boot Done [hb: F3], Boot Done [hb: E3])

2009-Mar-27+11:16:22.346 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 11 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-27+11:16:22.252 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 11: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms

2009-Mar-27+11:16:22.250 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/110 failed on cpu 11/0. Missed heartbeat sequence 5794530 ns_age 4 bounce_code N/A.

Migration with successful boot of failed card

Here is another example where the system determined that a migration of tasks was possible, as specified in the line “Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13”. Note the reboot of card 4 starts AFTER the migration is complete. A few task (facility) migration logs are displayed below, but most have been removed to save space. (SR 6304)

2009-Mar-08+04:46:35.767 [sitmain 4003 info] [4/2/370 <sitmain:42> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 2 (PAC)

2009-Mar-08+04:46:35.707 [sitmain 4003 info] [4/3/370 <sitmain:43> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 3 (PAC)

2009-Mar-08+04:46:35.687 [sitmain 4003 info] [4/1/370 <sitmain:41> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 1 (PAC)

2009-Mar-08+04:46:35.059 [sitmain 4003 info] [4/0/397 <sitmain:40> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 0 (PAC)

2009-Mar-08+04:44:30.069 [csp 7022 info] [9/0/915 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 4 to card 13 is complete.

2009-Mar-08+04:44:29.186 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 4 shutdown completed, took 0.071 sec

2009-Mar-08+04:44:29.115 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 4 shutdown started

2009-Mar-08+04:44:26.958 [rct 13002 info] [9/0/570 <rct:0> rct_pac.c:464] [software internal system critical-info] PAC task migration completed from card 4 to card 13

2009-Mar-08+04:44:15.364 [sitmain 4025 info] [4/1/370 <sitmain:41> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 208 from card 4 cpu 1 to card 13 cpu 1

2009-Mar-08+04:44:15.363 [sitmain 4025 info] [4/0/397 <sitmain:40> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 5154 from card 4 cpu 0 to card 13 cpu 0

2009-Mar-08+04:44:14.356 [rct 13001 info] [9/0/570 <rct:0> rct_pac.c:148] [software internal system critical-info] PAC task migration begin from card 4 to card 13

2009-Mar-08+04:44:14.181 [csp 7009 info] [9/0/915 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13.

2009-Mar-08+04:44:13.980 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW02061482 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_3, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0])

2009-Mar-08+04:44:13.850 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 4 device CPU_3 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-08+04:44:13.808 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 4: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms

2009-Mar-08+04:44:13.806 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/43 failed on cpu 4/3. Missed heartbeat sequence 4130607 ns_age 4 bounce_code N/A.

                                                   

Note the migration type is “Unplanned”.

******** show rct stats *******

RCT stats Details (Last 6 Actions)

Action         Type      From To   Start Time                Duration

-------------- --------- ---- ---- ------------------------  ---------- 

PAC migration  Unplanned  15   11  2009-Feb-06+14:29:30.229  10.594 sec

SPC switchover Unplanned   8    9  2009-Feb-11+21:56:08.687  65.313 sec

SPC switchover Unplanned   9    8  2009-Feb-22+08:24:07.776  71.215 sec

SPC switchover Unplanned   8    9  2009-Feb-23+11:48:10.135  74.927 sec

PAC migration  Unplanned  13   15  2009-Mar-02+09:42:09.621  12.253 sec

PAC migration  Unplanned   4   13  2009-Mar-08+04:44:14.357  12.601 sec

Migration with failed boot of failed card

A similar example as the previous one except the failed card fails to boot after migration:

Sun Mar 22 21:28:51 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 151 card 8 cpu 0

Sun Mar 22 21:28:52 2009 Internal trap notification 16 (PACMigrateStart) from card 15 to card 13

...

Sun Mar 22 21:29:36 2009 Internal trap notification 60 (CardDown) card 15

Sun Mar 22 21:29:38 2009 Internal trap notification 55 (CardActive) card 13

Sun Mar 22 21:29:38 2009 Internal trap notification 17 (PACMigrateComplete) from card 15 to card 13

Sun Mar 22 21:29:49 2009 Internal trap notification 139 (CardSPOFAlarm) card 13

Sun Mar 22 21:32:38 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:35:41 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:38:43 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:41:51 2009 Internal trap notification 9 (CardBootFailed) card 15

...

2009-Mar-22+21:32:38.782 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-Mar-22+21:32:38.782 [csp 7005 critical] [8/0/930 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The Packet Services Card in slot 15 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-Mar-22+21:29:38.943 [csp 7022 info] [8/0/930 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 15 to card 13 is complete.

2009-Mar-22+21:29:36.820 [sitmain 4019 warning] [8/0/927 <npuctrl:0> sit_api.c:1219] [software internal system critical-info] Message bounced to facility sitmain instance 150

2009-Mar-22+21:29:36.343 [rct 13014 info] [8/0/923 <rct:0> rct_pac.c:1072] [software internal system critical-info] PAC Card 15 shutdown completed, took 0.018 sec

2009-Mar-22+21:29:36.325 [rct 13013 info] [8/0/923 <rct:0> rct_pac.c:952] [software internal system critical-info] PAC Card 15 shutdown started

2009-Mar-22+21:29:36.316 [rct 13002 info] [8/0/923 <rct:0> rct_pac.c:466] [software internal system critical-info] PAC task migration completed from card 15 to card 13

...

2009-Mar-22+21:28:54.786 [sitmain 4026 info] [15/0/712 <sitmain:150> sittask.c:2800] [software internal system critical-info] Task migration completed facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0 took 2.054931 sec

2009-Mar-22+21:28:52.731 [sitmain 4025 info] [15/0/712 <sitmain:150> sittask.c:3860] [software internal system critical-info] Task migration started facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0

2009-Mar-22+21:28:52.179 [rct 13001 info] [8/0/923 <rct:0> rct_pac.c:149] [software internal system critical-info] PAC task migration begin from card 15 to card 13

2009-Mar-22+21:28:52.154 [csp 7009 info] [8/0/930 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Services Card in slot 15 to the one in slot 13.

2009-Mar-22+21:28:51.935 [csp 7019 critical] [8/0/930 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The Packet Services Card with serial number PLB48076369 in slot 15 has failed and will be brought down and brought back online. (Device=CPU_1, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E1], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 00,ff,ff,ff)

2009-Mar-22+21:28:51.928 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_1 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-22+21:28:50.464 [hat 3050 error] [8/0/921 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 15: cpu 1

2009-Mar-22+21:28:50.464 [hat 3051 warning] [8/0/921 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 15: rtt(0,0)=0ms

2009-Mar-22+21:28:50.424 [hat 3014 critical] [8/0/921 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/151 failed on cpu 15/1. Missed heartbeat sequence 12400479 ns_age 4 bounce_code N/A.

To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

Standby SMC reboot

Conceptually, a SPC/SMC switchover works the same way as a PAC migration. In this example, the standby SMC rebooted, so there is no need for a switchover, just a reboot. (SR 6208)

Fri Mar 20 12:46:39 2009 Internal trap notification 73 (ManagerFailure) facility hatsystem instance 1 card 8 cpu 0

Fri Mar 20 12:46:39 2009 Internal trap notification 60 (CardDown) card 9

Fri Mar 20 12:46:42 2009 Internal trap notification 139 (CardSPOFAlarm) card 8

Fri Mar 20 12:48:09 2009 Internal trap notification 5 (CardUp) card 9

Fri Mar 20 12:48:09 2009 Internal trap notification 93 (CardStandby) card 9

2009-Mar-20+12:47:52.448 [sitmain 4003 info] [9/0/852 <sitmain:90> sitmain.c:469] [software internal system critical-info] Boxer starting on card 9 cpu 0 (Standby)

2009-Mar-20+12:46:57.555 [resmgr 14571 warning] [8/0/916 <rmctrl:0> ource_license.c:1109] [software internal system critical-info] License status changed to: good. It only matches one card. License is valid but is not redundant.

2009-Mar-20+12:46:39.732 [rct 13013 info] [8/0/913 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 9 shutdown started

2009-Mar-20+12:46:39.707 [csp 7019 critical] [8/0/920 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number PLB32071119 in slot 9 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_C

RITICAL_TASK_FAILURE, Status=Read Fail)

2009-Mar-20+12:46:39.698 [hat 3033 error] [8/0/911 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 9 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-20+12:46:39.698 [hat 3050 error] [8/0/911 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 9: cpu 0

2009-Mar-20+12:46:39.698 [hat 3051 warning] [8/0/911 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 9: All CPUs failed to respond

2009-Mar-20+12:46:39.657 [hat 3014 critical] [8/0/911 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatsystem/1 failed on cpu 9/0. Missed heartbeat sequence 12459150 ns_age 5 bounce_code N/A.

SMC Switchover

In this example, SMC 8 encountered a missed heartbeat and switched over to SMC 9, and then failed to boot properly the first time but did so on the second attempt (SR 6183)

Note that when an SMC switchover occurs, the SNMP traps and logs up to the point of the switchover are lost (though they would be available on your trap and syslog servers if configured).

2009-Mar-15+12:14:38.759 [sitmain 4003 info] [8/0/854 <sitmain:80> sitmain.c:469] [software internal system critical-info] Boxer starting on card 8 cpu 0 (Standby)

2009-Mar-15+12:13:25.392 [rct 13013 info] [9/0/956 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 8 shutdown started

2009-Mar-15+12:13:25.392 [csp 7019 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number Unknown in slot 8 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=Read Fail)

2009-Mar-15+12:13:25.391 [hat 3005 warning] [9/0/954 <hatsystem:0> hatsystem_msg.c:3147] [software internal system critical-info] message bounced from facility hatcpu instance 80

2009-Mar-15+12:13:25.391 [hat 3033 error] [9/0/954 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 8 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-Mar-15+12:13:25.391 [csp 7005 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The System Management Card in slot 8 did not boot in the allowed time. CPU 0 did not boot.

2009-Mar-15+12:13:16.368 [system 1000 critical] [9/0/10282 <evlogd:0> evlgd_syslogd.c:146] [software internal system] CPU[8/0]: Gave up waiting for kernel, booting...

2009-Mar-15+12:09:31.552 [rct 13027 critical] [9/0/956 <rct:0> rct_spc.c:525] [software internal system critical-info] RCT readdress success instance 1 to instance 0

2009-Mar-15+12:09:31.447 [rct 13010 info] [9/0/956 <rct:1> rct_spc.c:431] [software internal system critical-info] SPC card 9 upgrading to Primary SPC

2009-Mar-15+12:09:31.409 [rct 13007 info] [9/0/956 <rct:1> rct_spc.c:245] [software internal system critical-info] Unplanned SPC switchover started from card 8 to card 9

2009-Mar-15+12:09:31.395 [hat 3050 error] [9/0/954 <hatsystem:1> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 8: cpu 0

2009-Mar-15+12:09:31.395 [hat 3051 warning] [9/0/954 <hatsystem:1> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 8: All CPUs failed to respond

2009-Mar-15+12:09:31.347 [hat 3014 critical] [9/0/954 <hatsystem:1> hat_hb_lib.c:611] [software internal system] HAT instance 1 found Critical task hatsystem/0 failed on cpu 8/0. Missed heartbeat sequence 12460328 ns_age 5 bounce_code N/A.

******** show rct stats *******

RCT stats Details (Last 2 Actions)

Action         Type      From To   Start Time                Duration

-------------- --------- ---- ---- ------------------------  ---------- 

PSC migration  Planned    11    6  2009-Mar-05+09:54:16.824  34.811 sec

SMC switchover Unplanned   8    9  2009-Mar-15+12:09:31.447  62.395 sec

Because the card did not boot properly the first time, it would make sense to provide the logs to Starent for further analysis. If the card remained offline, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

Imported from Starent Networks Knowledgebase Article # 10440

Average Rating: 0 (0 ratings)

Actions

Login or Register to take actions

This Document

Posted January 24, 2012 at 3:23 PM
Stats:
Comments:0 Avg. Rating:0
Views:1817 Contributors:0
Shares:0
Tags: asr5k
+

Documents Leaderboard