Overview

David Damerjian · ‎01-24-2012

Overview.. 1

Basic hardware and software architecture. 1

Session Recovery. 8

Task Warnings and Alarms. 11

NON-critical Task Failures. 12

Migrations/Switchovers. 14

Critical Task Failures with or without migration. 16

No Migration and failed card fails to reboot successfully and remains offline. 16

No Migration and failed card reboots successfully. 19

Migration with successful boot of failed card. 20

Migration with failed boot of failed card. 21

Standby SMC reboot 22

SMC Switchover 23

Overview

This article explains the basics of PAC/PSC migrations and SPC/SMC switchovers, along with session/task failure recovery when hardware or software failures occur, how to recognize when such actions take place, and what actions can/should be taken as a result. It includes all the relevant CLI commands that can be run, what to look for in the output and how to interpret it, and what to report to Starent.

The STxx uses a redundant architecture, so that if the hardware and/or software fails for any reason, the system can continue to run with minimal impact. A failure could be software or hardware based, or even a combination, and the ultimate action to be taken as a result could be the issuance of an RMA for hardware or a new software build containing a fix for a software bug. The steps of detecting the failure, understanding the recovery action taken and what the impact was, and determining whether or not it needs further addressing are covered here.

Basic hardware and software architecture

The following is a very basic discussion of the system architecture, including card types, tasks,

The STxx chassis houses slots numbered 1 through 48, containing cards types as follows:

PAC/PSC: 1 – 7, 10 – 16

SPC/SMC: 8, 9

Linecards: 17 – 23, 26 – 39, 42 – 48 (two rows, with redundant cards located directly underneath active ones)

SPIO cards: 24, 25

RCC: 40, 41

See the System Overview manual for a much more in depth description of the chassis components, including diagrams, as this is not the focus of this article.

To see a list of all the cards in the system, run “show card table all”. Note the output below indicates an ST16 because the card types are PAC and SPC instead of PSC and SMC. Note the Attach column indicates which Linecard is “attached” to a given PAC card. Normally the linecard directly behind a PAC card (PAC card number + 16) will be attached by default, and that linecard’s redundant card will be numbered + 16. Note all the PAC cards are active except for one that is standby. A standby card is available in case there is a failure of an active card, and it will be made the active card in case one of the active cards fail – this will be discussed further in this article.

******** show card table all *******

Slot Card Type Oper State SPOF Attach

----------- -------------------------------- ------------- ---- ------

1: PAC/TAC None - - - -

2: PAC/TAC Packet Accelerator Card Active No 18 34

3: PAC/TAC Packet Accelerator Card Active No - -

4: PAC/TAC Packet Accelerator Card Active No - -

5: PAC/TAC None - - - -

6: PAC/TAC Packet Accelerator Card Active No - -

7: PAC/TAC None - - - -

8: SPC Switch Processor Card Active No 24 25

9: SPC Switch Processor Card Standby - - -

10: PAC/TAC None - - - -

11: PAC/TAC Packet Accelerator Card Active No - -

12: PAC/TAC None - - - -

13: PAC/TAC Packet Accelerator Card Active No - -

14: PAC/TAC None - - - -

15: PAC/TAC Packet Accelerator Card Standby - - -

16: PAC/TAC None - - - -

17: LC None - - -

18: LC 1000 Ethernet Line Card Active No 2

19: LC None - - -

20: LC None - - -

21: LC None - - -

22: LC None - - -

23: LC None - - -

24: SPIO Switch Processor I/O Card Active No 8

25: SPIO Switch Processor I/O Card Standby - 8

26: LC None - - -

27: LC None - - -

28: LC None - - -

29: LC None - - -

30: LC None - - -

31: LC None - - -

32: LC None - - -

33: LC None - - -

34: LC 1000 Ethernet Line Card Standby - 2

35: LC None - - -

36: LC None - - -

37: LC None - - -

38: LC None - - -

39: LC None - - -

40: RCC Redundancy Crossbar Card Standby -

41: RCC Redundancy Crossbar Card Standby -

42: LC None - - -

43: LC None - - -

44: LC None - - -

45: LC None - - -

46: LC None - - -

47: LC None - - -

48: LC None - - -

Other useful commands for troubleshooting and collecting information about cards are:

show card hardware

show card info

show card diag

The SMC (System Management Card, ST40) or SPC (System Processor Card, ST16) serve as the primary controller for the chassis and is responsible for initializing the entire system and loading the system configuration into other cards in the chassis. For redundancy, there are two of these cards in a system, one active and the other standby, and they reside in slots 8 and 9.

The PSC (Packet Services Card, ST40) or PAC (Packet Accelerator Card, ST16) is the guts of the system that does the subscriber packet processing and forwarding, implements the various protocols required to setup, teardown, and maintain calls, manages statistics, and forwards data traffic to the destination.

The ST16 contains 4 CPUs (control processors) while the ST40 contains 2 CPUs, and numbering starts at 0 (not 1). Both contain an NPU (network processor).

Tasks or Processes (interchangeable terms) run on PSC/PAC and SMC/SPC cards and can be viewed with the command “show task table”, while the amount of resources (CPU and memory) used by those tasks can be viewed with “show task resources”. An explanation of all the possible tasks that can be running, their function, how they are distributed across all cards, and all of their interdependencies is complex and beyond the scope of this document, but there is some basic information worth being aware of. Chapter 3, Software Architecture, in the System Overview manual, has an example high level software architecture block diagram.

Significant Tasks, some of which may be mentioned further are:

vpnmgr – Virtual Private Network – controls contexts

hatcpu – High Availability Task

sitmain – System Initialization Task

a11mgr – A11 protocol for call setup between PDSN and PCF

aaamgr – Radius protocol for authentication and accounting between PDSN and Radius servers

sessmgr – maintaining knowledge of subscriber sessions

acsmgr – processing for active charging service (merged into sessmgr starting in v8.1)

npumgr – processing of inbound and outbound packets and intra chassis packet routing

famgr – foreign agent service processing

hamgr – home agent service processing

The following only run on SPC/SMC:

csp – Card/Slot/Port – coordinates events occurring when cards inserted/locked/removed/shutdown/migrated

rct – Recovery Control Task – executes a defined recovery action for a failure

hatsystem – monitors all hatcpu tasks on PACs

rmmgr – Recource Manager – assigns resources to every task

cli – one started for each CLI session

bulkstat – processes bulkstat data

snmp – processes SNMP traps

orbs – communicates with EMS server

Each sessmgr task is responsible for a specific number of subscriber sessions, with a maximum number (6600 in example below) depending on the chassis type and licensing. Sessmgr tasks are spawned across all cards and sessions are load balanced across them. Here is an example from “show task resources” showing this on an ST16 (4 CPUs/card). Note the sessmgr processes on CPUs 1 - 3 are housing around 800 sessions each, while CPU 0 sessmgrs are housing slightly less. This same allocation pattern would be seen across all active PAC cards, as can be seen with the included output from CPU 4/1 (SR 7047)

******** show task resources *******

task cputime memory files sessions

cpu facility inst used allc used alloc used allc used allc S status

----------------------- --------- ------------- --------- ------------- ------

3/0 sitmain 30 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

3/0 sitparent 30 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

3/0 hatcpu 30 0.1% 10% 1.6M 5.0M 13 500 -- -- - good

3/0 rmmgr 30 1.7% 10% 3.3M 7.0M 23 500 -- -- - good

3/0 dhmgr 30 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good

3/0 dcardmgr 3 0.0% 60% 12.7M 20.0M 13 500 -- -- - good

3/0 npumgr 3 7.5% 100% 117.5M 225.3M 35 1000 -- -- - good

3/0 sft 300 6.9% 50% 2.8M 12.0M 43 500 -- -- - good

3/0 sessmgr 5013 0.1% 50% 18.1M 85.0M 18 500 -- -- S good

3/0 aaamgr 5014 0.0% 50% 6.5M 20.0M 16 500 -- -- - good

3/0 sessmgr 4 5.1% 100% 81.4M 155.0M 28 500 505 6600 I good

3/0 sessmgr 31 6.0% 100% 79.4M 155.0M 27 500 519 6600 I good

3/0 sessmgr 51 6.2% 100% 80.0M 155.0M 29 500 537 6600 I good

3/0 aaamgr 74 2.0% 94% 39.6M 77.0M 18 500 -- -- - good

3/0 aaamgr 79 2.0% 94% 39.4M 77.0M 18 500 -- -- - good

3/0 aaamgr 85 1.6% 94% 39.6M 77.0M 19 500 -- -- - good

3/0 aaamgr 94 1.9% 94% 39.4M 77.0M 18 500 -- -- - good

3/0 sitreap 3000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

3/0 msgd 3000507 0.9% 5% 0.8M 5.0M 6 100 -- -- - good

3/0 nscontrol 3030563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

3/1 sitmain 31 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good

3/1 sitparent 31 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

3/1 hatcpu 31 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

3/1 rmmgr 31 2.1% 10% 3.2M 7.0M 26 500 -- -- - good

3/1 dhmgr 31 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good

3/1 sessmgr 5002 0.2% 50% 18.0M 85.0M 18 500 -- -- S good

3/1 aaamgr 5003 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

3/1 aaamgr 1 1.7% 94% 38.8M 77.0M 19 500 -- -- - good

3/1 aaamgr 2 1.3% 94% 35.9M 77.0M 19 500 -- -- - good

3/1 aaamgr 3 1.8% 94% 39.3M 77.0M 18 500 -- -- - good

3/1 sessmgr 18 10% 100% 93.2M 155.0M 29 500 799 6600 I good

3/1 sessmgr 23 7.8% 100% 93.3M 155.0M 29 500 786 6600 I good

3/1 aaamgr 52 1.4% 94% 36.0M 77.0M 19 500 -- -- - good

3/1 sessmgr 56 7.6% 100% 91.0M 155.0M 28 500 787 6600 I good

3/1 sessmgr 61 9.2% 100% 90.4M 155.0M 29 500 798 6600 I good

3/1 sessmgr 81 8.9% 100% 91.1M 155.0M 28 500 802 6600 I good

3/1 sessmgr 93 9.5% 100% 92.0M 155.0M 28 500 783 6600 I good

3/1 aaamgr 102 2.1% 94% 38.7M 77.0M 19 500 -- -- - good

3/1 sitreap 3100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

3/1 msgd 3100470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good

3/1 nscontrol 3130563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

3/2 sitmain 32 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

3/2 sitparent 32 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

3/2 hatcpu 32 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

3/2 rmmgr 32 2.2% 10% 3.4M 7.0M 25 500 -- -- - good

3/2 dhmgr 32 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good

3/2 sessmgr 5021 0.2% 50% 17.9M 85.0M 18 500 -- -- S good

3/2 aaamgr 5022 0.0% 50% 6.5M 20.0M 16 500 -- -- - good

3/2 aaamgr 5 1.5% 94% 36.1M 77.0M 18 500 -- -- - good

3/2 sessmgr 7 10% 100% 93.1M 155.0M 29 500 813 6600 I good

3/2 aaamgr 25 1.6% 94% 38.5M 77.0M 19 500 -- -- - good

3/2 sessmgr 35 10% 100% 95.2M 155.0M 26 500 810 6600 I good

3/2 sessmgr 41 10% 100% 95.5M 155.0M 28 500 803 6600 I good

3/2 aaamgr 55 1.5% 94% 36.2M 77.0M 19 500 -- -- - good

3/2 aaamgr 62 1.8% 94% 39.4M 77.0M 20 500 -- -- - good

3/2 sessmgr 66 8.3% 100% 93.1M 155.0M 28 500 804 6600 I good

3/2 aaamgr 84 2.0% 94% 38.6M 77.0M 19 500 -- -- - good

3/2 sessmgr 86 7.5% 100% 93.3M 155.0M 28 500 804 6600 I good

3/2 sessmgr 98 8.6% 100% 94.1M 155.0M 27 500 784 6600 I good

3/2 sitreap 3200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

3/2 msgd 3200470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good

3/2 nscontrol 3230563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

3/3 sitmain 33 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good

3/3 sitparent 33 0.0% 20% 1.3M 4.0M 12 500 -- -- - good

3/3 hatcpu 33 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

3/3 rmmgr 33 2.1% 10% 3.4M 7.0M 25 500 -- -- - good

3/3 dhmgr 33 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good

3/3 sessmgr 5026 0.2% 50% 18.1M 85.0M 18 500 -- -- S good

3/3 aaamgr 5027 0.0% 50% 6.5M 20.0M 16 500 -- -- - good

3/3 aaamgr 11 1.8% 94% 39.6M 77.0M 19 500 -- -- - good

3/3 sessmgr 13 9.4% 100% 92.3M 155.0M 28 500 802 6600 I good

3/3 aaamgr 21 1.3% 94% 35.6M 77.0M 18 500 -- -- - good

3/3 aaamgr 34 2.0% 94% 39.5M 77.0M 19 500 -- -- - good

3/3 aaamgr 38 2.2% 94% 38.6M 77.0M 19 500 -- -- - good

3/3 sessmgr 40 10% 100% 91.3M 155.0M 29 500 802 6600 I good

3/3 sessmgr 46 10% 100% 91.7M 155.0M 30 500 814 6600 I good

3/3 aaamgr 70 2.0% 94% 39.3M 77.0M 19 500 -- -- - good

3/3 sessmgr 71 8.3% 100% 90.9M 155.0M 29 500 808 6600 I good

3/3 sessmgr 80 7.4% 100% 91.5M 155.0M 29 500 793 6600 I good

3/3 sessmgr 90 8.6% 100% 91.3M 155.0M 28 500 801 6600 I good

3/3 sitreap 3300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

3/3 msgd 3300470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good

3/3 nscontrol 3330557 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

......

4/1 sitmain 41 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good

4/1 sitparent 41 0.0% 20% 1.5M 4.0M 12 500 -- -- - good

4/1 hatcpu 41 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

4/1 rmmgr 41 2.1% 10% 3.3M 7.0M 26 500 -- -- - good

4/1 dhmgr 41 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good

4/1 sessmgr 5004 0.2% 50% 18.0M 85.0M 18 500 -- -- S good

4/1 aaamgr 5005 0.1% 50% 6.5M 20.0M 16 500 -- -- - good

4/1 sessmgr 1 7.3% 100% 93.1M 155.0M 30 500 793 6600 I good

4/1 sessmgr 24 10% 100% 91.5M 155.0M 27 500 804 6600 I good

4/1 aaamgr 37 2.2% 94% 38.7M 77.0M 19 500 -- -- - good

4/1 aaamgr 41 2.2% 94% 39.5M 77.0M 19 500 -- -- - good

4/1 aaamgr 44 2.2% 94% 39.8M 77.0M 20 500 -- -- - good

4/1 aaamgr 49 1.7% 94% 38.6M 77.0M 19 500 -- -- - good

4/1 sessmgr 57 11% 100% 93.1M 155.0M 29 500 807 6600 I good

4/1 sessmgr 62 7.9% 100% 95.3M 155.0M 30 500 791 6600 I good

4/1 sessmgr 82 9.5% 100% 91.9M 155.0M 28 500 803 6600 I good

4/1 sessmgr 94 8.2% 100% 94.1M 155.0M 29 500 791 6600 I good

4/1 aaamgr 101 2.2% 94% 38.8M 77.0M 19 500 -- -- - good

4/1 sitreap 4100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

4/1 msgd 4100470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good

4/1 nscontrol 4130563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

The task table maps the task instance number with the process id (pid), as well as with the parent task instance and pid, for example:

******** show task table *******

task parent

cpu facility inst pid pri facility inst pid

---- ----------------------------- -------------------------

3/0 aaamgr 94 639 0 sessctrl 0 582

3/0 aaamgr 88 632 0 sessctrl 0 582

3/0 aaamgr 79 625 0 sessctrl 0 582

3/0 aaamgr 70 624 0 sessctrl 0 582

3/0 sessmgr 52 617 0 sessctrl 0 582

3/0 sessmgr 37 610 0 sessctrl 0 582

3/0 aaamgr 5026 585 0 sessctrl 0 582

3/0 sessmgr 3 572 0 sessctrl 0 582

3/0 sessmgr 5015 571 0 sessctrl 0 582

3/0 sft 300 528 0 npuctrl 0 564

3/0 npumgr 3 521 -8 npuctrl 0 564

3/0 dcardmgr 3 520 0 dcardctrl 0 567

3/0 dhmgr 30 519 -4 sitparent 30 510

3/0 rmmgr 30 512 0 sitparent 30 510

3/0 hatcpu 30 511 -12 sitparent 30 510

3/0 sitparent 30 510 -12 sitmain 30 420

3/0 sitmain 30 420 -12 sitmain 30 420

3/0 sitreap 3000509 509 0 sitmain 30 420

3/0 msgd 3000507 507 0 - 0 0

3/0 nscontrol 3024701 24701 0 - 0 0

Note that on PAC 15, the standby card, all the task instances are numbered greater than 5000, which is the signature of a standby task. Note above that there are also standby sessmgr and aaamgr tasks on active cards ready to take over for task failures (more on this discussed below), but most are active.

15/0 sitmain 150 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

15/0 sitparent 150 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

15/0 hatcpu 150 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

15/0 rmmgr 150 1.3% 10% 3.2M 7.0M 22 500 -- -- - good

15/0 dhmgr 150 0.0% 5% 1.7M 6.0M 18 2000 -- -- - good

15/0 dcardmgr 15 0.0% 60% 12.7M 20.0M 13 500 -- -- - good

15/0 npumgr 15 5.1% 100% 117.5M 225.3M 34 1000 -- -- - good

15/0 sft 1500 0.3% 50% 2.5M 12.0M 15 500 -- -- - good

15/0 sessmgr 5045 0.1% 50% 18.0M 85.0M 17 500 -- -- S good

15/0 sessmgr 5046 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/0 sessmgr 5047 0.1% 50% 18.1M 85.0M 16 500 -- -- S good

15/0 aaamgr 5048 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/0 aaamgr 5049 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/0 aaamgr 5050 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/0 aaamgr 5051 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/0 sitreap 15000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

15/0 msgd 15000507 0.7% 5% 0.8M 5.0M 6 100 -- -- - good

15/0 nscontrol 15030489 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

15/1 sitmain 151 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

15/1 sitparent 151 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

15/1 hatcpu 151 0.2% 10% 1.5M 5.0M 13 500 -- -- - good

15/1 rmmgr 151 1.2% 10% 3.2M 7.0M 25 500 -- -- - good

15/1 dhmgr 151 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good

15/1 sessmgr 5040 0.1% 50% 17.9M 85.0M 17 500 -- -- S good

15/1 sessmgr 5041 0.0% 50% 18.1M 85.0M 17 500 -- -- S good

15/1 sessmgr 5042 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/1 sessmgr 5043 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/1 sessmgr 5044 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/1 sessmgr 5052 0.1% 50% 18.1M 85.0M 16 500 -- -- S good

15/1 aaamgr 5053 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/1 aaamgr 5054 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/1 aaamgr 5055 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/1 aaamgr 5056 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/1 aaamgr 5057 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/1 aaamgr 5058 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/1 sitreap 15100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

15/1 msgd 15100470 0.7% 5% 0.8M 5.0M 6 100 -- -- - good

15/1 nscontrol 15130514 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

15/2 sitmain 152 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

15/2 sitparent 152 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

15/2 hatcpu 152 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

15/2 rmmgr 152 1.1% 10% 3.3M 7.0M 25 500 -- -- - good

15/2 dhmgr 152 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good

15/2 sessmgr 5059 0.1% 50% 17.9M 85.0M 17 500 -- -- S good

15/2 sessmgr 5060 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/2 sessmgr 5061 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/2 sessmgr 5062 0.1% 50% 18.1M 85.0M 16 500 -- -- S good

15/2 sessmgr 5069 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/2 sessmgr 5070 0.1% 50% 18.1M 85.0M 16 500 -- -- S good

15/2 aaamgr 5071 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/2 aaamgr 5072 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/2 aaamgr 5073 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/2 aaamgr 5074 0.0% 50% 6.0M 20.0M 16 500 -- -- - good

15/2 aaamgr 5075 0.0% 50% 6.5M 20.0M 16 500 -- -- - good

15/2 aaamgr 5076 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/2 sitreap 15200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

15/2 msgd 15200470 0.5% 5% 0.8M 5.0M 6 100 -- -- - good

15/2 nscontrol 15230514 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

15/3 sitmain 153 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

15/3 sitparent 153 0.0% 20% 1.5M 4.0M 12 500 -- -- - good

15/3 hatcpu 153 0.2% 10% 1.6M 5.0M 13 500 -- -- - good

15/3 rmmgr 153 1.2% 10% 3.2M 7.0M 25 500 -- -- - good

15/3 dhmgr 153 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good

15/3 sessmgr 5063 0.1% 50% 17.9M 85.0M 17 500 -- -- S good

15/3 sessmgr 5064 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/3 sessmgr 5065 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/3 sessmgr 5066 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/3 sessmgr 5067 0.1% 50% 18.1M 85.0M 17 500 -- -- S good

15/3 sessmgr 5068 0.1% 50% 18.1M 85.0M 16 500 -- -- S good

15/3 aaamgr 5077 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/3 aaamgr 5078 0.0% 50% 6.2M 20.0M 16 500 -- -- - good

15/3 aaamgr 5079 0.0% 50% 6.4M 20.0M 16 500 -- -- - good

15/3 aaamgr 5080 0.0% 50% 6.5M 20.0M 16 500 -- -- - good

15/3 aaamgr 5081 0.0% 50% 6.2M 20.0M 16 500 -- -- - good

15/3 aaamgr 5082 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/3 aaamgr 5083 0.0% 50% 6.1M 20.0M 16 500 -- -- - good

15/3 sitreap 15300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

15/3 msgd 15300470 0.6% 5% 0.8M 5.0M 6 100 -- -- - good

15/3 nscontrol 15330545 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

To round out the picture, here is output for the tasks running on the SPC:

******** show task resources *******

task cputime memory files sessions

cpu facility inst used allc used alloc used allc used allc S status

----------------------- --------- ------------- --------- ------------- ------

8/0 sitmain 80 0.0% 15% 1.9M 6.0M 15 1000 -- -- - good

8/0 sitparent 80 0.0% 20% 1.5M 4.0M 13 500 -- -- - good

8/0 evlogd 0 18% 94% 29.7M 60.0M 134 4000 -- -- - good

8/0 drvctrl 0 1.4% 15% 3.3M 5.0M 15 500 -- -- - good

8/0 hatsystem 0 1.1% 10% 2.0M 5.0M 18 500 -- -- - good

8/0 hatcpu 80 0.2% 10% 1.6M 5.0M 16 500 -- -- - good

8/0 rct 0 0.0% 5% 1.4M 5.0M 13 500 -- -- - good

8/0 sct 0 4.0% 50% 9.4M 65.0M 18 500 -- -- - good

8/0 rmmgr 80 2.5% 10% 3.5M 7.0M 19 500 -- -- - good

8/0 npumgr 8 2.6% 100% 46.0M 90.5M 24 1000 -- -- - good

8/0 vpnmgr 1 0.6% 30% 8.4M 12.0M 34 1000 -- -- - good

8/0 zebos 1 0.1% 50% 7.9M 12.0M 18 500 -- -- - good

8/0 aaamgr 231 0.1% 30% 11.7M 25.0M 20 500 -- -- - good

8/0 dhmgr 80 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good

8/0 rmctrl 0 1.4% 20% 7.5M 22.0M 45 1000 -- -- - good

8/0 npuctrl 0 0.6% 10% 2.7M 10.0M 27 500 -- -- - good

8/0 dcardctrl 0 0.0% 10% 1.5M 8.0M 12 500 -- -- - good

8/0 cspctrl 0 1.7% 50% 3.5M 12.0M 16 500 -- -- - good

8/0 vpnctrl 0 0.1% 15% 3.5M 7.0M 16 1500 -- -- - good

8/0 cssctrl 0 0.0% 15% 1.4M 7.0M 12 500 -- -- - good

8/0 sessctrl 0 0.6% 89% 7.9M 57.0M 35 2000 -- -- - good

8/0 vmgctrl 0 0.0% 7% 1.5M 20.0M 13 500 -- -- - good

8/0 snmp 0 6.6% 34% 11.9M 12.0M 360 1500 -- -- - warn

8/0 orbs 0 0.0% 50% 9.2M 28.0M 16 1500 -- -- - good

8/0 bulkstat 0 20% 89% 16.1M 35.0M 316 1500 -- -- - good

8/0 threshold 0 0.5% 89% 4.2M 23.0M 12 1500 -- -- - good

8/0 ipsecctrl 0 0.1% 15% 2.2M 15.0M 23 500 -- -- - good

8/0 orbns 0 0.0% 50% 2.6M 12.0M 17 1500 -- -- - good

8/0 cli 8017052 0.0% 50% 4.0M 20.0M 11 2000 -- -- - good

8/0 cli 8017526 31% 50% 9.8M 20.0M 385 2000 -- -- - good

8/0 sitreap 8000546 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

8/0 msgd 8000542 1.0% 5% 0.4M 5.0M 6 100 -- -- - good

8/0 msgd 8000543 0.0% 5% 0.1M 5.0M 7 100 -- -- - good

8/0 inetd 8000593 0.0% 5% 0.1M 5.0M 4 100 -- -- - good

8/0 sshd 8001310 0.0% 30% 0.2M 5.0M 4 100 -- -- - good

8/0 ntpd 8001320 0.0% 5% 0.2M 5.0M 8 100 -- -- - good

8/0 orbnsd 8001323 0.0% 5% 0.7M 10.0M 8 100 -- -- - good

8/0 sshd 8017050 0.0% 30% 0.3M 5.0M 9 100 -- -- - good

8/0 nscontrol 8017481 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

8/0 sshd 8017522 2.1% 30% 0.3M 5.0M 9 100 -- -- - good

Session Recovery

Each sessmgr process is paired with a aaamgr process that is handling the aaa requests for the subscribers on that sessmgr, and that aaamgr process is located on a different card. The task instance ids for each pair is the same. The process pairings are across various cards, so that the processes are not paired just between two given cards.

Also, if session recovery feature is enabled, then that aaamgr also contains the state information of the sessions managed by the sessmgr process it is paired with. As well, pre-v8.1, acsmgr processes are also mirrored (acsmgr functionality has been combined into the sessmgr process with v8.1). In the case of a failed PAC, the sessions can be recreated on the standby card that has become active seamlessly based on the redundant information. Also, an additional PAC card is largely used for the recovery process (in addition to the standby card that takes over for the failed card). This PAC card is called the Demux card, and there are no session related tasks running on this card, and it is always in the active state. So, the session recovery feature requires at least three active cards and one standby card. Here is example output for the demux card (2) taken from the same chassis as the above output did.

2/0 sitmain 20 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

2/0 sitparent 20 0.0% 20% 1.3M 4.0M 12 500 -- -- - good

2/0 hatcpu 20 0.1% 10% 1.6M 5.0M 13 500 -- -- - good

2/0 rmmgr 20 1.0% 10% 3.2M 7.0M 19 500 -- -- - good

2/0 dhmgr 20 0.0% 5% 1.7M 6.0M 16 2000 -- -- - good

2/0 dcardmgr 2 0.0% 60% 12.5M 20.0M 13 500 -- -- - good

2/0 npumgr 2 8.1% 100% 117.5M 225.3M 34 1000 -- -- - good

2/0 sft 200 0.3% 50% 2.5M 12.0M 15 500 -- -- - good

2/0 vpnmgr 2 0.2% 100% 5.4M 12.0M 136 1000 -- -- - good

2/0 zebos 2 0.0% 50% 7.2M 12.0M 17 500 -- -- - good

2/0 sitreap 2000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

2/0 msgd 2000507 0.5% 5% 0.8M 5.0M 6 100 -- -- - good

2/0 inetd 2000554 0.0% 5% 0.1M 5.0M 4 100 -- -- - good

2/0 nscontrol 2030377 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

2/1 sitmain 21 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

2/1 sitparent 21 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

2/1 hatcpu 21 0.1% 10% 1.6M 5.0M 13 500 -- -- - good

2/1 rmmgr 21 0.7% 10% 3.3M 7.0M 20 500 -- -- - good

2/1 dhmgr 21 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good

2/1 l2tpdemux 1 0.0% 94% 2.3M 20.0M 14 500 -- -- - good

2/1 l2tpmgr 1 0.1% 100% 12.7M 266.0M 16 1500 0 200k - good

2/1 sitreap 2100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

2/1 msgd 2100470 0.5% 5% 0.8M 5.0M 6 100 -- -- - good

2/1 nscontrol 2130198 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

2/2 sitmain 22 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

2/2 sitparent 22 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

2/2 hatcpu 22 0.1% 10% 1.6M 5.0M 13 500 -- -- - good

2/2 rmmgr 22 1.3% 10% 3.2M 7.0M 21 500 -- -- - good

2/2 dhmgr 22 0.0% 5% 1.7M 6.0M 14 2000 -- -- - good

2/2 vpnmgr 3 7.4% 100% 9.1M 41.4M 241 1000 -- -- - good

2/2 zebos 3 0.0% 50% 7.9M 15.0M 17 500 -- -- - good

2/2 famgr 1 9.9% 100% 34.7M 96.0M 236 1000 63k 200k - good

2/2 vpnmgr 4 0.3% 100% 4.9M 12.0M 28 1000 -- -- - good

2/2 zebos 4 0.0% 50% 7.4M 12.0M 17 500 -- -- - good

2/2 l2tpmgr 2 0.6% 100% 14.2M 150.0M 196 1500 144 130k - good

2/2 cdrmod 2 0.3% 94% 42.2M 315.9M 87 1024 -- -- - good

2/2 sitreap 2200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

2/2 msgd 2200470 0.8% 5% 0.8M 5.0M 6 100 -- -- - good

2/2 inetd 2200515 0.0% 5% 0.1M 5.0M 4 100 -- -- - good

2/2 inetd 2200526 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

2/2 sshd 2200527 0.0% 30% 0.2M 5.0M 4 100 -- -- - good

2/2 sshd 2200577 0.0% 30% 0.3M 5.0M 8 100 -- -- - good

2/2 sftp-serv 2200578 0.0% 10% 0.3M 5.0M 5 100 -- -- - good

2/2 sshd 2200579 0.0% 30% 0.3M 5.0M 8 100 -- -- - good

2/2 sftp-serv 2200586 0.0% 10% 0.4M 5.0M 5 100 -- -- - good

2/2 nscontrol 2203737 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

2/3 sitmain 23 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good

2/3 sitparent 23 0.0% 20% 1.4M 4.0M 12 500 -- -- - good

2/3 hatcpu 23 0.1% 10% 1.6M 5.0M 13 500 -- -- - good

2/3 rmmgr 23 0.8% 10% 3.1M 7.0M 20 500 -- -- - good

2/3 dhmgr 23 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good

2/3 a11mgr 1 26% 100% 50.9M 171.0M 243 1000 79k 200k - good

2/3 sitreap 2300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good

2/3 msgd 2300470 0.8% 5% 0.8M 5.0M 6 100 -- -- - good

2/3 nscontrol 2330179 0.0% 5% 1.0M 5.0M 10 500 -- -- - good

The command “show session recovery verbose” also indicates which card is the demux card:

******** show session recovery status verbose *******

Session Recovery Status:

Overall Status : Ready For Recovery

Last Status Update : 7 seconds ago

----sessmgr--- ----aaamgr---- demux

cpu state active standby active standby active status

---- ------- ------ ------- ------ ------- ------ -------------------------

2/0 Active 0 0 0 0 1 Good (Demux)

2/1 Active 0 0 0 0 2 Good (Demux)

2/2 Active 0 0 0 0 5 Good (Demux)

2/3 Active 0 0 0 0 1 Good (Demux)

3/0 Active 3 1 4 1 0 Good

3/1 Active 6 1 5 1 0 Good

3/2 Active 6 1 5 1 0 Good

3/3 Active 6 1 5 1 0 Good

4/0 Active 3 1 4 1 0 Good

4/1 Active 6 1 5 1 0 Good

4/2 Active 6 1 5 1 0 Good

4/3 Active 6 1 6 1 0 Good

6/0 Active 3 1 4 1 0 Good

6/1 Active 6 1 6 1 0 Good

6/2 Active 6 1 6 1 0 Good

6/3 Active 6 1 6 1 0 Good

11/0 Active 3 1 4 1 0 Good

11/1 Active 6 1 5 1 0 Good

11/2 Active 6 1 6 1 0 Good

11/3 Active 6 1 7 1 0 Good

13/0 Active 3 1 4 1 0 Good

13/1 Active 6 1 5 1 0 Good

13/2 Active 6 1 6 1 0 Good

13/3 Active 6 1 7 1 0 Good

15/0 Standby 0 3 0 4 0 Good

15/1 Standby 0 6 0 6 0 Good

15/2 Standby 0 6 0 6 0 Good

15/3 Standby 0 6 0 7 0 Good

Another command will allow examining what card CPUs are devoted to what kind of activity. Note the demux card is 2, the sessmgr cards are 3, 4, 6, 11, 13. That leaves the remaining cards as standby (only one card, 15, in this case, but there could be multiple standbys)

******** show resources *******

Active CPUs:

Total CPUs : 24

Session CPUs : 20 [3/0 3/1 3/2 3/3 4/0 4/1 4/2 4/3 6/0 6/1 6/2 6/3 11/0 11/1 11/2 11/3 13/0 13/1 13/2 13/3]

Demux CPUs : 4 [2/0 2/1 2/2 2/3]

Highest Load : 2.09 (CPU 3/2)

Total Memory : 48.0G

Total Used : 18.7G

Least Free : 1059M (CPU 13/2)

Total Temporary Files : 5M

Most Temporary Files : 2M (CPU 2/3)

Standby CPUs:

Total CPUs : 4

Highest Load : 0.07 (CPU 15/2)

Total Memory : 8.0G

Total Used : 1.2G

Least Free : 1642M (CPU 15/0)

Total Temporary Files : 0M

Most Temporary Files : 0M (CPU 15/0)

Session Information:

In-Use Session Managers:

Number of Managers : 105

Capacity : 173250 min / 346500 typical / 693000 max

Usage : 79347

Busy-Out Session Managers:

Number of Managers : 0

Capacity : 0 min / 0 typical / 0 max

Usage : 0

Standby Session Managers:

Number of Managers : 41

PDSN Service:

In Use : 79347

Max Used : 86135

Limit : 200000

License Status : Within Acceptable Limits

As always the case with STxx commands, you can see that there are multiple views with overlapping information, each serving a specific purpose in the information gathering, monitoring, and troubleshooting process.

Task Warnings and Alarms

This is a good opportunity to mention the status column of show task resources. If the amount of memory that has been allocated to the process has been exceeded, then the status will indicate “warn” or “alarm”. If this occurs, contact Starent support so that the appropriate action can be taken. This might involve resetting the task, which only Starent can do, but also collecting data from the chassis that might indicate how it got into that state, and then possibly reporting this to engineering as a problem report. If a task does get into this state, it is likely that an SNMP trap and/or error log entry has been saved, and possibly an alarm triggered:

show snmp trap history (or view your SNMP trap server for more detailed information)

show logs

show alarm outstanding

Here is an example of multiple logs for a bulkstats process exceeding the allocated memory limit.

show task resources
8/0 bulkstat 0 0.3% 89% 52.2M 35.0M 332 1500 -- -- - alarm

[local]BRHOALTBPN3-HA# show logs | grep bulk
2009-Feb-21+20:18:09.342 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 57588K
2009-Feb-18+17:50:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 55636K
2009-Feb-16+11:56:09.351 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 53812K
2009-Feb-13+05:34:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 51948K

NON-critical Task Failures

Besides a task getting into the warning or alarm state, a task could also crash (unrelated to warning or alarm). Task monitoring is handled by the hatcpu processes on each card, and if the failure is not critical, then the parent task (name ends in main, ctrl, or parent) is notified, and the task can simply be respawned without any other action (such as a migration).

If session recovery is enabled, then data from any lost sessmgr, aaamgr, (or acsmgr for pre-v8.1) process(es) is transferred to standby process(es) which become active, and then those process(es) are then re-assigned the instance number of failed process(es). This re-assignment is done for convenience because other tasks in the system know about the instance number of the failed task number, not the one that took over. Also, the task that takes over has a value greater than 5000 which by definition is standby, but the task is now active. Finally, a new standby task is spawned to replace the one that became active, and that task’s instance number is usually the previous standby task instance number + 1 (see the task table for evidence of this)

The SNMP trap history and associated logs may look like this (SR 5898)

Mon Feb 16 04:06:50 2009 Internal trap notification 73 (ManagerFailure) facility acsmgr instance 50 card 13 cpu 0

Mon Feb 16 04:06:50 2009 Internal trap notification 150 (TaskFailed) facility acsmgr instance 50 on card 13 cpu 0

2009-Feb-16+04:06:51.313 [sitmain 4075 trace] [8/0/912 <sitparent:80> crashd.c:579] [software internal system critical-info] Crash handler file transfer ended (type=1 size=127535 child_ct=0 core_ct=0 pid=2257 status=1 elapsed=1s)

2009-Feb-16+04:06:50.444 [sitmain 4023 info] [13/0/718 <sitmain:130> sittask.c:4282] [software internal system critical-info] Readdress successful for facility acsmgr instance 6009 to instance 50

2009-Feb-16+04:06:50.247 [sitmain 4074 trace] [8/0/912 <sitparent:80> crashd.c:851] [software internal system critical-info] Crash handler file transfer starting (type=1 size=127535 child_ct=1 core_ct=0 pid=2257)

2009-Feb-16+04:06:50.245 [sitmain 4080 info] [8/0/912 <sitparent:80> crashd.c:812] [software internal system critical-info] Core file transfer to SPC complete, received 127535/127535 bytes

2009-Feb-16+04:06:50.238 [sitmain 4022 info] [13/0/718 <sitmain:130> sittask.c:4128] [software internal system critical-info] Readdress requested for facility acsmgr instance 6009 to instance 50

2009-Feb-16+04:06:50.207 [rct 13038 info] [8/0/918 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task acsmgr/50 on 13/0 sent to parent task acsctrl/0 with failover of acsmgr/6009 on 13/0

2009-Feb-16+04:06:50.205 [sitmain 4017 critical] [13/0/718 <sitmain:130> sitproc.c:1471] [software internal system] Process acsmgr pid 7846 died on card 13 cpu 0 signal=11 wstatus=0x8b

2009-Feb-16+04:06:50.204 [sitmain 4027 critical] [13/0/718 <sitmain:130> crash_mini.c:795] [software internal system] Process Crash Info: time 2009-Feb-16+04:06:47(hex time 4998e657) card 13 cpu 00 pid 07846 procname acsmgr crash_details

Fatal Signal 11: Segmentation fault

PC: [ a280732/X] acs_control_charge()

Faulty address: (nil)

Signal detail: address not mapped to object

Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr

Crash time: 2009-Feb-15+23:06:47

Recent errno: 11 Resource temporarily unavailable

........

In this case, the backed up acsmgr task data from the failed task 50 was populated into standby task 6009 which took over and became active, and then was re-assigned the task instance number 50 of the failed task. The term “failover” means that session recovery was used to failover to standby task 6009. Note the logs use process id (pid) 7846 and task instance 6009 interchangeably.

Though the log output is intended for engineering to analyze, if you read through the logs, some of it may actually make some sense, especially in light of reading this document. Logs are in reverse chronological order, so you should read in the up direction. The SNMP trap history is the best place to START most of your troubleshooting, because it lays out a timeline of the most significant events. When you see ManagerFailure and/or TaskFailed, you can then check the logs for a matching timestamp and you should be able to find significant details on what occurred, even if the logging level for the chassis is set to error level for all facilities (default behavior). That is because event data associated with process crashes are often either error level (second to highest level) or critical level (highest level). In addition, many log events associated with important activities are set to critical-info (vs. non-critical-info), so even if the log event is less than error level and the facility in question is set to error level, it will still be saved to logs because it is critical-info.

In the above example, the full crash details have not been included here to save space (........), so the crash itself is considered one log event that spans many lines. The full crash itself will also be found in the crash list which can be retrieved with the command “show crash list”. Note the card, CPU and Process ID, software version and SPC/SMC and PAC/PSC card serial numbers are saved. The list itself can be useful for quick scanning to note any patterns of tasks or cards or timestamps of when crashes have occurred. Up to 120 crashes can be saved in the list, after which it needs to be cleared with “clear crash list” in order for any new crashes to be saved.

Finally note mention of crash files being transferred. If the crash enable feature is set in the configuration, then any crash dump files will be sent to the URL specified (it is encrypted when viewing the config), and Starent may or may not request such files (one per crash) depending on whether a root cause can be determined from the logs alone.

******** show crash list *******

== ==== ======= =========== ========== ================

# Time Process Card/CPU/ SW HW_SER_NUM

PID VERSION SMC / Crash Card

== ==== ======= =========== ========== ================

1 2009-Feb-16+04:06:47 acsmgr 13/00/07846 7.1(21029) PLB19082829/PLB24084573

********************* CRASH #01 ***********************

Fatal Signal 11: Segmentation fault

PC: [ a280732/X] acs_control_charge()

Faulty address: (nil)

Signal detail: address not mapped to object

Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr

Crash time: 2009-Feb-15+23:06:47

Recent errno: 11 Resource temporarily unavailable

........

Here’s another example with just a tiny snippet to make the point, though there were actually hundreds of log entries associated with these crashes. In this case both sessmgr processes were respawned. (SR 6032)

2009-Mar-06+08:40:10.297 [sitmain 4023 info] [16/0/1146 <sitmain:160> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5022 to instance 22

2009-Mar-06+08:40:10.272 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5022 to instance 22

2009-Mar-06+08:40:10.244 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/22 on 16/0 sent to parent task sessctrl/0 with failover of sessmgr/5022 on 16/0

2009-Mar-06+08:40:10.243 [sitmain 4017 critical] [16/0/1146 <sitmain:160> sitproc.c:1514] [software internal system] Process sessmgr pid 1292 died on card 16 cpu 0 signal=6 wstatus=0x86

2009-Mar-06+08:40:09.889 [sitmain 4023 info] [3/0/1146 <sitmain:30> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5002 to instance 50

2009-Mar-06+08:40:09.859 [sitmain 4022 info] [3/0/1146 <sitmain:30> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5002 to instance 50

2009-Mar-06+08:40:09.829 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/50 on 3/0 sent to parent task sessctrl/0 with failover of sessmgr/5002 on 3/0

2009-Mar-06+08:40:09.786 [sitmain 4017 critical] [3/0/1146 <sitmain:30> sitproc.c:1514] [software internal system] Process sessmgr pid 1336 died on card 3 cpu 0 signal=6 wstatus=0x86

******** show crash list *******

== ==== ======= =========== ========== ================

# Time Process Card/CPU/ SW HW_SER_NUM

PID VERSION SMC / Crash Card

== ==== ======= =========== ========== ================

1 2009-Mar-06+08:40:07 sessmgr 16/00/01292 8.1(22707) PLB27085483/PLB46075509

2 2009-Mar-06+08:40:09 sessmgr 03/00/01336 8.1(22707) PLB27085483/PLB46075497

********************* CRASH #01 ***********************

Assertion failure at sess/smgr/sessmgr_ipv4.c:11833

Function: acsmgr_send_packet_merge_mode()

Expression: clp

Proclet: sessmgr (f=66000,i=22)

Process: card=16 cpu=0 arch=X pid=1292 cpu=~5% argv0=sessmgr

Crash time: 2009-Mar-06+03:40:07

Recent errno: 115 Operation now in progress

********************* CRASH #02 ***********************

Assertion failure at sess/smgr/sessmgr_ipv4.c:11833

Function: acsmgr_send_packet_merge_mode()

Expression: clp

Proclet: sessmgr (f=66000,i=50)

Process: card=3 cpu=0 arch=X pid=1336 cpu=~5% argv0=sessmgr

Crash time: 2009-Mar-06+03:40:09

Recent errno: 115 Operation now in progress

Migrations/Switchovers

The information presented thus far is important in developing an understanding of the various steps that take place when a PAC/PSC migration occurs.

The system will always try to maintain the same number of active cards as have been specified in the configuration with:

card x

mode active pac

#end

This is due to the fact that it is not trivial to decrease the number of certain tasks in the system without putting it into an unstable state.

So, whenever an active card goes into the offline or standby state, there should always be another standby card ready to become active and take its place. Whether an active card failure occurs or whether an active card is purposely made inactive (see below), the system is designed to ensure another card will become active automatically (without user intervention).

First let’s cover the cases where the migration was intentional or “planned”, that is, it was initiated by an administrator. These are the various ways that will cause a migration from one card to another to occur. This is assuming that the card being migrated from is an active card. In all cases, the migration happens FIRST before the reset action for the card in order to minimize impact.

Executive mode:

card pac migrate X to X – unlike the other commands below, this ensures which card will become active (if there are more than one standby card)

card restart – re-initializes card without reboot first (faster – it just restarts the software tasks, it doesn’t reboot the kernel)

card reboot – does both a reboot and re-initialization

card halt – card goes offline instead of standby (cannot apply this command to an offline card)

Restart, reboot, and halt all trigger a CardRebootRequest SNMP trap.

card configuration mode (for a specific card):

shutdown – card goes offline

Note: executing “mode standby” in card config mode will NOT result in a migration and so should not be used to accomplish this task. (It will also immediately disable the line card behind the card (if there is one)).

Certainly you can do a card reboot, restart, or halt to a standby card, but nothing will be migrated because it is standby.

All successful migrations have the following traps:

Internal trap notification 16 (PACMigrateStart) from card 6 to card 5

Internal trap notification 55 (CardActive) card 5

Internal trap notification 60 (CardDown) card 6

Internal trap notification 17 (PACMigrateComplete) from card 6 to card 5

And if the card being migrated from is destined for the standby state (migrate, reboot, or restart), then additionally:

Internal trap notification 5 (CardUp) card 6

Internal trap notification 93 (CardStandby) card 6

The state changes show up in the card table during the migration process:

5: PAC/TAC Packet Accelerator Card Migrate To - - -

6: PAC/TAC Packet Accelerator Card Migrate From - 20 -

There is a command that gets run when running “show support details” that shows all recovery attempts (you cannot run this command at the command line).

******** show rct stats *******

RCT stats Details (Last 1 Actions)

Action Type From To Start Time Duration

-------------- --------- ---- ---- ------------------------ ----------

PAC migration Planned 6 5 2009-May-24+12:49:19.969 8.852 sec

Whether or not session recovery is enabled, the above approaches should all result in a graceful recreation of tasks to a standby card with very minimal subscriber impact. This is because nothing has gone wrong with the existing tasks, there is no lost information, and so the tasks and data have sufficient time to simply be recreated on another card, without relying on a backup/mirrored card (which would be the case for session recovery).

Note: Another way to force a graceful migration is to lift the latch on the card. If you do this, do not pull out the card from the slot while the migration is taking place. If the card is physical pulled out, then migration of tasks is not possible without session recovery because all the tasks and data are lost when the card is pulled and loses power.

SPC/SMC switchovers work the same way as PAC/PSC migrations. Note that SPC/SMC cards are not “configured” to be active in the configuration like PAC/PSC cards are. The commands to run are:

card spc switchover – no need to specify the card to switch to because there are only two cards

card reboot (8 or 9)

card restart (8 or 9)

There is no version of the halt command as there is for PAC/PSC cards.

For more information on the impact of various failover scenarios in table format, see the System Overview manual, Chapter 4 Redundancy and Availability Features, Maintenance and Failure Scenarios section.

Critical Task Failures with or without migration

A critical task failure often is the symptom of a hardware failure on a card and the card must be failed/reset. The system uses heartbeat or ping UDP-based messages every second to determine if the system is running normally. On the Active SPC/SMC, the hatsystem task executes heartbeat messages that are sent to the heartbeat CPU tasks (hatcpu) located on all CPU-based cards (PAC/PSC/SPC/SMC) in the system. If two consecutive messages are lost (not responded to), plus some other criteria (too detailed/low-level to cover here), then a critical task failure will be declared and will need to be reset. Depending on the severity, a migration of whatever tasks are migratable may or may not be attempted first, but in either case, an available standby card will be made active immediately. If session recovery is enabled, then minimal subscriber impact will be felt as those tasks are populated with data from backup processes from the various other active cards.

It is often asked if it is necessary or recommended to put all the cards in the chassis back into the state that they are configured for in the configuration, so that all the active cards are active, and the standby card(s) are standby. This is not necessary unless you like to keep things “tidy and neat”; the system will function fine without doing so.

No Migration and failed card fails to reboot successfully and remains offline

In this example, no migration is attempted and the failed card did not boot back to standby status but goes offline (some traps removed): (SR 7095). Note that the logs do not mention anything about the new card (16) becoming active (because they are “non-critical-info”, “info” level logs and the logging level for this capture was “error” level), but the SNMP traps do.

Sun May 24 17:18:36 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0

Sun May 24 17:18:36 2009 Internal trap notification 55 (CardActive) card 16

...

Sun May 24 17:18:37 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0

...

Sun May 24 17:18:38 2009 Internal trap notification 60 (CardDown) card 4

...

Sun May 24 17:21:40 2009 Internal trap notification 9 (CardBootFailed) card 4

...

Sun May 24 17:24:42 2009 Internal trap notification 9 (CardBootFailed) card 4

Sun May 24 17:27:45 2009 Internal trap notification 9 (CardBootFailed) card 4

Sun May 24 17:30:53 2009 Internal trap notification 9 (CardBootFailed) card 4

The logs show the card being booted multiple times but not responding to pings as it should, and finally no more attempts are made after the fourth attempt. Note “Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0” indicating that pings to both CPUs 0 and 1 (PSC cards have 2 CPUs) failed to be responded to.

2009-May-24+17:30:53.360 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and kept down. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:30:53.353 [alarmctrl 65201 info] [8/0/1378 <evlogd:0> alarmctrl.c:189] [software internal system critical-info] Alarm condition: id 50056e19844d0000 (Critical): Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [hat 3018 critical] [8/0/1380 <hatsystem:0> atsystem_fail.c:719] [hardware internal system] Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:30:53.351 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:27:45.054 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be power down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:27:45.037 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:27:45.029 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:24:42.794 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:24:42.789 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:24:42.789 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:21:40.578 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40

2009-May-24+17:21:40.573 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-May-24+17:21:40.572 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-May-24+17:18:37.130 [hat 3017 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:935] [software internal system] Ignoring multiple failure on card 4 cpu 0 reason CPU_CRITICAL_TASK_FAILURE

2009-May-24+17:18:36.834 [rct 13014 info] [8/0/1382 <rct:0> rct_pac.c:1105] [software internal system critical-info] Card 4 shutdown completed, took 0.479 sec

2009-May-24+17:18:36.436 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility aaamgr instance 5031 to instance 1

2009-May-24+17:18:36.361 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41

2009-May-24+17:18:36.356 [rct 13013 info] [8/0/1382 <rct:0> rct_pac.c:978] [software internal system critical-info] Card 4 shutdown started

2009-May-24+17:18:36.210 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1610] [software internal system critical-info] finish taking over card 4

2009-May-24+17:18:36.204 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1584] [software internal system critical-info] start taking over card 4

2009-May-24+17:18:36.136 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: CC], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)

2009-May-24+17:18:36.129 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-May-24+17:18:35.208 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0

2009-May-24+17:18:35.208 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond

2009-May-24+17:18:35.168 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/41 failed on cpu 4/1. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192349.

2009-May-24+17:18:35.167 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0

2009-May-24+17:18:35.167 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond

2009-May-24+17:18:35.127 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/40 failed on cpu 4/0. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192343.

In this case, no migration is even attempted:

******** show rct stats *******

RCT stats Details (Last 1 Actions)

Action Type From To Start Time Duration

-------------- --------- ---- ---- ------------------------ ----------

PSC shutdown N/A 4 16 2009-May-24+17:18:36.355 0.478 sec

RCT stats Summary

-----------------

PSC Migrations = 0

SMC switchover = 0

To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

No Migration and failed card reboots successfully

This example is similar to the previous one except the failed card reset successfully (“Boxer starting …”) to standby status. (SR 6304)

Fri Mar 27 11:16:22 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 110 card 9 cpu 0
Fri Mar 27 11:16:22 2009 Internal trap notification 55 (CardActive) card 4
Fri Mar 27 11:16:25 2009 Internal trap notification 60 (CardDown) card 11
...
Fri Mar 27 11:18:43 2009 Internal trap notification 5 (CardUp) card 11
Fri Mar 27 11:18:43 2009 Internal trap notification 93 (CardStandby) card 11

2009-Mar-27+11:18:32.136 [sitmain 4003 info] [11/3/370 <sitmain:113> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 3 (PAC)

2009-Mar-27+11:18:32.105 [sitmain 4003 info] [11/2/370 <sitmain:112> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 2 (PAC)

2009-Mar-27+11:18:32.085 [sitmain 4003 info] [11/1/370 <sitmain:111> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 1 (PAC)

2009-Mar-27+11:18:31.425 [sitmain 4003 info] [11/0/397 <sitmain:110> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 0 (PAC)

2009-Mar-27+11:16:23.363 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 11 shutdown completed, took 0.616 sec

2009-Mar-27+11:16:22.747 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 11 shutdown started

2009-Mar-27+11:16:22.391 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW05061700 in slot 11 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: F3], Boot Done [hb: E3], Boot Done [hb: F3], Boot Done [hb: E3])

2009-Mar-27+11:16:22.346 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 11 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-27+11:16:22.252 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 11: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms

2009-Mar-27+11:16:22.250 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/110 failed on cpu 11/0. Missed heartbeat sequence 5794530 ns_age 4 bounce_code N/A.

Migration with successful boot of failed card

Here is another example where the system determined that a migration of tasks was possible, as specified in the line “Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13”. Note the reboot of card 4 starts AFTER the migration is complete. A few task (facility) migration logs are displayed below, but most have been removed to save space. (SR 6304)

2009-Mar-08+04:46:35.767 [sitmain 4003 info] [4/2/370 <sitmain:42> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 2 (PAC)

2009-Mar-08+04:46:35.707 [sitmain 4003 info] [4/3/370 <sitmain:43> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 3 (PAC)

2009-Mar-08+04:46:35.687 [sitmain 4003 info] [4/1/370 <sitmain:41> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 1 (PAC)

2009-Mar-08+04:46:35.059 [sitmain 4003 info] [4/0/397 <sitmain:40> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 0 (PAC)

…

2009-Mar-08+04:44:30.069 [csp 7022 info] [9/0/915 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 4 to card 13 is complete.

2009-Mar-08+04:44:29.186 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 4 shutdown completed, took 0.071 sec

2009-Mar-08+04:44:29.115 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 4 shutdown started

2009-Mar-08+04:44:26.958 [rct 13002 info] [9/0/570 <rct:0> rct_pac.c:464] [software internal system critical-info] PAC task migration completed from card 4 to card 13

…

2009-Mar-08+04:44:15.364 [sitmain 4025 info] [4/1/370 <sitmain:41> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 208 from card 4 cpu 1 to card 13 cpu 1

2009-Mar-08+04:44:15.363 [sitmain 4025 info] [4/0/397 <sitmain:40> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 5154 from card 4 cpu 0 to card 13 cpu 0

2009-Mar-08+04:44:14.356 [rct 13001 info] [9/0/570 <rct:0> rct_pac.c:148] [software internal system critical-info] PAC task migration begin from card 4 to card 13

2009-Mar-08+04:44:14.181 [csp 7009 info] [9/0/915 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13.

2009-Mar-08+04:44:13.980 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW02061482 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_3, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0])

2009-Mar-08+04:44:13.850 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 4 device CPU_3 reason CPU_CRITICAL_TASK_FAILURE

…

2009-Mar-08+04:44:13.808 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 4: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms

2009-Mar-08+04:44:13.806 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/43 failed on cpu 4/3. Missed heartbeat sequence 4130607 ns_age 4 bounce_code N/A.

Note the migration type is “Unplanned”.

******** show rct stats *******

RCT stats Details (Last 6 Actions)

Action Type From To Start Time Duration

-------------- --------- ---- ---- ------------------------ ----------

PAC migration Unplanned 15 11 2009-Feb-06+14:29:30.229 10.594 sec

SPC switchover Unplanned 8 9 2009-Feb-11+21:56:08.687 65.313 sec

SPC switchover Unplanned 9 8 2009-Feb-22+08:24:07.776 71.215 sec

SPC switchover Unplanned 8 9 2009-Feb-23+11:48:10.135 74.927 sec

PAC migration Unplanned 13 15 2009-Mar-02+09:42:09.621 12.253 sec

PAC migration Unplanned 4 13 2009-Mar-08+04:44:14.357 12.601 sec

Migration with failed boot of failed card

A similar example as the previous one except the failed card fails to boot after migration:

Sun Mar 22 21:28:51 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 151 card 8 cpu 0

Sun Mar 22 21:28:52 2009 Internal trap notification 16 (PACMigrateStart) from card 15 to card 13

...

Sun Mar 22 21:29:36 2009 Internal trap notification 60 (CardDown) card 15

Sun Mar 22 21:29:38 2009 Internal trap notification 55 (CardActive) card 13

Sun Mar 22 21:29:38 2009 Internal trap notification 17 (PACMigrateComplete) from card 15 to card 13

Sun Mar 22 21:29:49 2009 Internal trap notification 139 (CardSPOFAlarm) card 13

Sun Mar 22 21:32:38 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:35:41 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:38:43 2009 Internal trap notification 9 (CardBootFailed) card 15

Sun Mar 22 21:41:51 2009 Internal trap notification 9 (CardBootFailed) card 15

...

2009-Mar-22+21:32:38.782 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-Mar-22+21:32:38.782 [csp 7005 critical] [8/0/930 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The Packet Services Card in slot 15 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.

2009-Mar-22+21:29:38.943 [csp 7022 info] [8/0/930 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 15 to card 13 is complete.

2009-Mar-22+21:29:36.820 [sitmain 4019 warning] [8/0/927 <npuctrl:0> sit_api.c:1219] [software internal system critical-info] Message bounced to facility sitmain instance 150

2009-Mar-22+21:29:36.343 [rct 13014 info] [8/0/923 <rct:0> rct_pac.c:1072] [software internal system critical-info] PAC Card 15 shutdown completed, took 0.018 sec

2009-Mar-22+21:29:36.325 [rct 13013 info] [8/0/923 <rct:0> rct_pac.c:952] [software internal system critical-info] PAC Card 15 shutdown started

2009-Mar-22+21:29:36.316 [rct 13002 info] [8/0/923 <rct:0> rct_pac.c:466] [software internal system critical-info] PAC task migration completed from card 15 to card 13

...

2009-Mar-22+21:28:54.786 [sitmain 4026 info] [15/0/712 <sitmain:150> sittask.c:2800] [software internal system critical-info] Task migration completed facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0 took 2.054931 sec

2009-Mar-22+21:28:52.731 [sitmain 4025 info] [15/0/712 <sitmain:150> sittask.c:3860] [software internal system critical-info] Task migration started facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0

2009-Mar-22+21:28:52.179 [rct 13001 info] [8/0/923 <rct:0> rct_pac.c:149] [software internal system critical-info] PAC task migration begin from card 15 to card 13

2009-Mar-22+21:28:52.154 [csp 7009 info] [8/0/930 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Services Card in slot 15 to the one in slot 13.

2009-Mar-22+21:28:51.935 [csp 7019 critical] [8/0/930 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The Packet Services Card with serial number PLB48076369 in slot 15 has failed and will be brought down and brought back online. (Device=CPU_1, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E1], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 00,ff,ff,ff)

2009-Mar-22+21:28:51.928 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_1 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-22+21:28:50.464 [hat 3050 error] [8/0/921 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 15: cpu 1

2009-Mar-22+21:28:50.464 [hat 3051 warning] [8/0/921 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 15: rtt(0,0)=0ms

2009-Mar-22+21:28:50.424 [hat 3014 critical] [8/0/921 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/151 failed on cpu 15/1. Missed heartbeat sequence 12400479 ns_age 4 bounce_code N/A.

To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

Standby SMC reboot

Conceptually, a SPC/SMC switchover works the same way as a PAC migration. In this example, the standby SMC rebooted, so there is no need for a switchover, just a reboot. (SR 6208)

Fri Mar 20 12:46:39 2009 Internal trap notification 73 (ManagerFailure) facility hatsystem instance 1 card 8 cpu 0

Fri Mar 20 12:46:39 2009 Internal trap notification 60 (CardDown) card 9

Fri Mar 20 12:46:42 2009 Internal trap notification 139 (CardSPOFAlarm) card 8

Fri Mar 20 12:48:09 2009 Internal trap notification 5 (CardUp) card 9

Fri Mar 20 12:48:09 2009 Internal trap notification 93 (CardStandby) card 9

2009-Mar-20+12:47:52.448 [sitmain 4003 info] [9/0/852 <sitmain:90> sitmain.c:469] [software internal system critical-info] Boxer starting on card 9 cpu 0 (Standby)

2009-Mar-20+12:46:57.555 [resmgr 14571 warning] [8/0/916 <rmctrl:0> ource_license.c:1109] [software internal system critical-info] License status changed to: good. It only matches one card. License is valid but is not redundant.

2009-Mar-20+12:46:39.732 [rct 13013 info] [8/0/913 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 9 shutdown started

2009-Mar-20+12:46:39.707 [csp 7019 critical] [8/0/920 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number PLB32071119 in slot 9 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_C

RITICAL_TASK_FAILURE, Status=Read Fail)

2009-Mar-20+12:46:39.698 [hat 3033 error] [8/0/911 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 9 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE

2009-Mar-20+12:46:39.698 [hat 3050 error] [8/0/911 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 9: cpu 0

2009-Mar-20+12:46:39.698 [hat 3051 warning] [8/0/911 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 9: All CPUs failed to respond

2009-Mar-20+12:46:39.657 [hat 3014 critical] [8/0/911 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatsystem/1 failed on cpu 9/0. Missed heartbeat sequence 12459150 ns_age 5 bounce_code N/A.

SMC Switchover

In this example, SMC 8 encountered a missed heartbeat and switched over to SMC 9, and then failed to boot properly the first time but did so on the second attempt (SR 6183)

Note that when an SMC switchover occurs, the SNMP traps and logs up to the point of the switchover are lost (though they would be available on your trap and syslog servers if configured).

2009-Mar-15+12:14:38.759 [sitmain 4003 info] [8/0/854 <sitmain:80> sitmain.c:469] [software internal system critical-info] Boxer starting on card 8 cpu 0 (Standby)

2009-Mar-15+12:13:25.392 [rct 13013 info] [9/0/956 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 8 shutdown started

2009-Mar-15+12:13:25.392 [csp 7019 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number Unknown in slot 8 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=Read Fail)

2009-Mar-15+12:13:25.391 [hat 3005 warning] [9/0/954 <hatsystem:0> hatsystem_msg.c:3147] [software internal system critical-info] message bounced from facility hatcpu instance 80

2009-Mar-15+12:13:25.391 [hat 3033 error] [9/0/954 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 8 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED

2009-Mar-15+12:13:25.391 [csp 7005 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The System Management Card in slot 8 did not boot in the allowed time. CPU 0 did not boot.

2009-Mar-15+12:13:16.368 [system 1000 critical] [9/0/10282 <evlogd:0> evlgd_syslogd.c:146] [software internal system] CPU[8/0]: Gave up waiting for kernel, booting...

2009-Mar-15+12:09:31.552 [rct 13027 critical] [9/0/956 <rct:0> rct_spc.c:525] [software internal system critical-info] RCT readdress success instance 1 to instance 0

2009-Mar-15+12:09:31.447 [rct 13010 info] [9/0/956 <rct:1> rct_spc.c:431] [software internal system critical-info] SPC card 9 upgrading to Primary SPC

2009-Mar-15+12:09:31.409 [rct 13007 info] [9/0/956 <rct:1> rct_spc.c:245] [software internal system critical-info] Unplanned SPC switchover started from card 8 to card 9

2009-Mar-15+12:09:31.395 [hat 3050 error] [9/0/954 <hatsystem:1> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 8: cpu 0

2009-Mar-15+12:09:31.395 [hat 3051 warning] [9/0/954 <hatsystem:1> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 8: All CPUs failed to respond

2009-Mar-15+12:09:31.347 [hat 3014 critical] [9/0/954 <hatsystem:1> hat_hb_lib.c:611] [software internal system] HAT instance 1 found Critical task hatsystem/0 failed on cpu 8/0. Missed heartbeat sequence 12460328 ns_age 5 bounce_code N/A.

******** show rct stats *******

RCT stats Details (Last 2 Actions)

Action Type From To Start Time Duration

-------------- --------- ---- ---- ------------------------ ----------

PSC migration Planned 11 6 2009-Mar-05+09:54:16.824 34.811 sec

SMC switchover Unplanned 8 9 2009-Mar-15+12:09:31.447 62.395 sec

Because the card did not boot properly the first time, it would make sense to provide the logs to Starent for further analysis. If the card remained offline, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.

Imported from Starent Networks Knowledgebase Article # 10440

Troubleshooting and understanding Task/Process failures, Session Recovery, and PAC/PSC migrations and SPC/SMC switchovers

Overview

Basic hardware and software architecture

Session Recovery

Task Warnings and Alarms

NON-critical Task Failures

Migrations/Switchovers

Critical Task Failures with or without migration

No Migration and failed card fails to reboot successfully and remains offline

No Migration and failed card reboots successfully

Migration with successful boot of failed card

Migration with failed boot of failed card

Standby SMC reboot

SMC Switchover