01-24-2012 03:23 PM - edited 07-05-2021 12:11 PM
Overview.. 1
Basic hardware and software architecture. 1
Session Recovery. 8
Task Warnings and Alarms. 11
NON-critical Task Failures. 12
Migrations/Switchovers. 14
Critical Task Failures with or without migration. 16
No Migration and failed card fails to reboot successfully and remains offline. 16
No Migration and failed card reboots successfully. 19
Migration with successful boot of failed card. 20
Migration with failed boot of failed card. 21
Standby SMC reboot 22
SMC Switchover 23
This article explains the basics of PAC/PSC migrations and SPC/SMC switchovers, along with session/task failure recovery when hardware or software failures occur, how to recognize when such actions take place, and what actions can/should be taken as a result. It includes all the relevant CLI commands that can be run, what to look for in the output and how to interpret it, and what to report to Starent.
The STxx uses a redundant architecture, so that if the hardware and/or software fails for any reason, the system can continue to run with minimal impact. A failure could be software or hardware based, or even a combination, and the ultimate action to be taken as a result could be the issuance of an RMA for hardware or a new software build containing a fix for a software bug. The steps of detecting the failure, understanding the recovery action taken and what the impact was, and determining whether or not it needs further addressing are covered here.
The following is a very basic discussion of the system architecture, including card types, tasks,
The STxx chassis houses slots numbered 1 through 48, containing cards types as follows:
PAC/PSC: 1 – 7, 10 – 16
SPC/SMC: 8, 9
Linecards: 17 – 23, 26 – 39, 42 – 48 (two rows, with redundant cards located directly underneath active ones)
SPIO cards: 24, 25
RCC: 40, 41
See the System Overview manual for a much more in depth description of the chassis components, including diagrams, as this is not the focus of this article.
To see a list of all the cards in the system, run “show card table all”. Note the output below indicates an ST16 because the card types are PAC and SPC instead of PSC and SMC. Note the Attach column indicates which Linecard is “attached” to a given PAC card. Normally the linecard directly behind a PAC card (PAC card number + 16) will be attached by default, and that linecard’s redundant card will be numbered + 16. Note all the PAC cards are active except for one that is standby. A standby card is available in case there is a failure of an active card, and it will be made the active card in case one of the active cards fail – this will be discussed further in this article.
******** show card table all *******
Slot Card Type Oper State SPOF Attach
----------- -------------------------------- ------------- ---- ------
1: PAC/TAC None - - - -
2: PAC/TAC Packet Accelerator Card Active No 18 34
3: PAC/TAC Packet Accelerator Card Active No - -
4: PAC/TAC Packet Accelerator Card Active No - -
5: PAC/TAC None - - - -
6: PAC/TAC Packet Accelerator Card Active No - -
7: PAC/TAC None - - - -
8: SPC Switch Processor Card Active No 24 25
9: SPC Switch Processor Card Standby - - -
10: PAC/TAC None - - - -
11: PAC/TAC Packet Accelerator Card Active No - -
12: PAC/TAC None - - - -
13: PAC/TAC Packet Accelerator Card Active No - -
14: PAC/TAC None - - - -
15: PAC/TAC Packet Accelerator Card Standby - - -
16: PAC/TAC None - - - -
17: LC None - - -
18: LC 1000 Ethernet Line Card Active No 2
19: LC None - - -
20: LC None - - -
21: LC None - - -
22: LC None - - -
23: LC None - - -
24: SPIO Switch Processor I/O Card Active No 8
25: SPIO Switch Processor I/O Card Standby - 8
26: LC None - - -
27: LC None - - -
28: LC None - - -
29: LC None - - -
30: LC None - - -
31: LC None - - -
32: LC None - - -
33: LC None - - -
34: LC 1000 Ethernet Line Card Standby - 2
35: LC None - - -
36: LC None - - -
37: LC None - - -
38: LC None - - -
39: LC None - - -
40: RCC Redundancy Crossbar Card Standby -
41: RCC Redundancy Crossbar Card Standby -
42: LC None - - -
43: LC None - - -
44: LC None - - -
45: LC None - - -
46: LC None - - -
47: LC None - - -
48: LC None - - -
Other useful commands for troubleshooting and collecting information about cards are:
show card hardware
show card info
show card diag
The SMC (System Management Card, ST40) or SPC (System Processor Card, ST16) serve as the primary controller for the chassis and is responsible for initializing the entire system and loading the system configuration into other cards in the chassis. For redundancy, there are two of these cards in a system, one active and the other standby, and they reside in slots 8 and 9.
The PSC (Packet Services Card, ST40) or PAC (Packet Accelerator Card, ST16) is the guts of the system that does the subscriber packet processing and forwarding, implements the various protocols required to setup, teardown, and maintain calls, manages statistics, and forwards data traffic to the destination.
The ST16 contains 4 CPUs (control processors) while the ST40 contains 2 CPUs, and numbering starts at 0 (not 1). Both contain an NPU (network processor).
Tasks or Processes (interchangeable terms) run on PSC/PAC and SMC/SPC cards and can be viewed with the command “show task table”, while the amount of resources (CPU and memory) used by those tasks can be viewed with “show task resources”. An explanation of all the possible tasks that can be running, their function, how they are distributed across all cards, and all of their interdependencies is complex and beyond the scope of this document, but there is some basic information worth being aware of. Chapter 3, Software Architecture, in the System Overview manual, has an example high level software architecture block diagram.
Significant Tasks, some of which may be mentioned further are:
vpnmgr – Virtual Private Network – controls contexts
hatcpu – High Availability Task
sitmain – System Initialization Task
a11mgr – A11 protocol for call setup between PDSN and PCF
aaamgr – Radius protocol for authentication and accounting between PDSN and Radius servers
sessmgr – maintaining knowledge of subscriber sessions
acsmgr – processing for active charging service (merged into sessmgr starting in v8.1)
npumgr – processing of inbound and outbound packets and intra chassis packet routing
famgr – foreign agent service processing
hamgr – home agent service processing
The following only run on SPC/SMC:
csp – Card/Slot/Port – coordinates events occurring when cards inserted/locked/removed/shutdown/migrated
rct – Recovery Control Task – executes a defined recovery action for a failure
hatsystem – monitors all hatcpu tasks on PACs
rmmgr – Recource Manager – assigns resources to every task
cli – one started for each CLI session
bulkstat – processes bulkstat data
snmp – processes SNMP traps
orbs – communicates with EMS server
Each sessmgr task is responsible for a specific number of subscriber sessions, with a maximum number (6600 in example below) depending on the chassis type and licensing. Sessmgr tasks are spawned across all cards and sessions are load balanced across them. Here is an example from “show task resources” showing this on an ST16 (4 CPUs/card). Note the sessmgr processes on CPUs 1 - 3 are housing around 800 sessions each, while CPU 0 sessmgrs are housing slightly less. This same allocation pattern would be seen across all active PAC cards, as can be seen with the included output from CPU 4/1 (SR 7047)
******** show task resources *******
task cputime memory files sessions
cpu facility inst used allc used alloc used allc used allc S status
----------------------- --------- ------------- --------- ------------- ------
3/0 sitmain 30 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
3/0 sitparent 30 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
3/0 hatcpu 30 0.1% 10% 1.6M 5.0M 13 500 -- -- - good
3/0 rmmgr 30 1.7% 10% 3.3M 7.0M 23 500 -- -- - good
3/0 dhmgr 30 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good
3/0 dcardmgr 3 0.0% 60% 12.7M 20.0M 13 500 -- -- - good
3/0 npumgr 3 7.5% 100% 117.5M 225.3M 35 1000 -- -- - good
3/0 sft 300 6.9% 50% 2.8M 12.0M 43 500 -- -- - good
3/0 sessmgr 5013 0.1% 50% 18.1M 85.0M 18 500 -- -- S good
3/0 aaamgr 5014 0.0% 50% 6.5M 20.0M 16 500 -- -- - good
3/0 sessmgr 4 5.1% 100% 81.4M 155.0M 28 500 505 6600 I good
3/0 sessmgr 31 6.0% 100% 79.4M 155.0M 27 500 519 6600 I good
3/0 sessmgr 51 6.2% 100% 80.0M 155.0M 29 500 537 6600 I good
3/0 aaamgr 74 2.0% 94% 39.6M 77.0M 18 500 -- -- - good
3/0 aaamgr 79 2.0% 94% 39.4M 77.0M 18 500 -- -- - good
3/0 aaamgr 85 1.6% 94% 39.6M 77.0M 19 500 -- -- - good
3/0 aaamgr 94 1.9% 94% 39.4M 77.0M 18 500 -- -- - good
3/0 sitreap 3000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
3/0 msgd 3000507 0.9% 5% 0.8M 5.0M 6 100 -- -- - good
3/0 nscontrol 3030563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
3/1 sitmain 31 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good
3/1 sitparent 31 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
3/1 hatcpu 31 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
3/1 rmmgr 31 2.1% 10% 3.2M 7.0M 26 500 -- -- - good
3/1 dhmgr 31 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good
3/1 sessmgr 5002 0.2% 50% 18.0M 85.0M 18 500 -- -- S good
3/1 aaamgr 5003 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
3/1 aaamgr 1 1.7% 94% 38.8M 77.0M 19 500 -- -- - good
3/1 aaamgr 2 1.3% 94% 35.9M 77.0M 19 500 -- -- - good
3/1 aaamgr 3 1.8% 94% 39.3M 77.0M 18 500 -- -- - good
3/1 sessmgr 18 10% 100% 93.2M 155.0M 29 500 799 6600 I good
3/1 sessmgr 23 7.8% 100% 93.3M 155.0M 29 500 786 6600 I good
3/1 aaamgr 52 1.4% 94% 36.0M 77.0M 19 500 -- -- - good
3/1 sessmgr 56 7.6% 100% 91.0M 155.0M 28 500 787 6600 I good
3/1 sessmgr 61 9.2% 100% 90.4M 155.0M 29 500 798 6600 I good
3/1 sessmgr 81 8.9% 100% 91.1M 155.0M 28 500 802 6600 I good
3/1 sessmgr 93 9.5% 100% 92.0M 155.0M 28 500 783 6600 I good
3/1 aaamgr 102 2.1% 94% 38.7M 77.0M 19 500 -- -- - good
3/1 sitreap 3100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
3/1 msgd 3100470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good
3/1 nscontrol 3130563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
3/2 sitmain 32 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
3/2 sitparent 32 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
3/2 hatcpu 32 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
3/2 rmmgr 32 2.2% 10% 3.4M 7.0M 25 500 -- -- - good
3/2 dhmgr 32 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good
3/2 sessmgr 5021 0.2% 50% 17.9M 85.0M 18 500 -- -- S good
3/2 aaamgr 5022 0.0% 50% 6.5M 20.0M 16 500 -- -- - good
3/2 aaamgr 5 1.5% 94% 36.1M 77.0M 18 500 -- -- - good
3/2 sessmgr 7 10% 100% 93.1M 155.0M 29 500 813 6600 I good
3/2 aaamgr 25 1.6% 94% 38.5M 77.0M 19 500 -- -- - good
3/2 sessmgr 35 10% 100% 95.2M 155.0M 26 500 810 6600 I good
3/2 sessmgr 41 10% 100% 95.5M 155.0M 28 500 803 6600 I good
3/2 aaamgr 55 1.5% 94% 36.2M 77.0M 19 500 -- -- - good
3/2 aaamgr 62 1.8% 94% 39.4M 77.0M 20 500 -- -- - good
3/2 sessmgr 66 8.3% 100% 93.1M 155.0M 28 500 804 6600 I good
3/2 aaamgr 84 2.0% 94% 38.6M 77.0M 19 500 -- -- - good
3/2 sessmgr 86 7.5% 100% 93.3M 155.0M 28 500 804 6600 I good
3/2 sessmgr 98 8.6% 100% 94.1M 155.0M 27 500 784 6600 I good
3/2 sitreap 3200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
3/2 msgd 3200470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good
3/2 nscontrol 3230563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
3/3 sitmain 33 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good
3/3 sitparent 33 0.0% 20% 1.3M 4.0M 12 500 -- -- - good
3/3 hatcpu 33 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
3/3 rmmgr 33 2.1% 10% 3.4M 7.0M 25 500 -- -- - good
3/3 dhmgr 33 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good
3/3 sessmgr 5026 0.2% 50% 18.1M 85.0M 18 500 -- -- S good
3/3 aaamgr 5027 0.0% 50% 6.5M 20.0M 16 500 -- -- - good
3/3 aaamgr 11 1.8% 94% 39.6M 77.0M 19 500 -- -- - good
3/3 sessmgr 13 9.4% 100% 92.3M 155.0M 28 500 802 6600 I good
3/3 aaamgr 21 1.3% 94% 35.6M 77.0M 18 500 -- -- - good
3/3 aaamgr 34 2.0% 94% 39.5M 77.0M 19 500 -- -- - good
3/3 aaamgr 38 2.2% 94% 38.6M 77.0M 19 500 -- -- - good
3/3 sessmgr 40 10% 100% 91.3M 155.0M 29 500 802 6600 I good
3/3 sessmgr 46 10% 100% 91.7M 155.0M 30 500 814 6600 I good
3/3 aaamgr 70 2.0% 94% 39.3M 77.0M 19 500 -- -- - good
3/3 sessmgr 71 8.3% 100% 90.9M 155.0M 29 500 808 6600 I good
3/3 sessmgr 80 7.4% 100% 91.5M 155.0M 29 500 793 6600 I good
3/3 sessmgr 90 8.6% 100% 91.3M 155.0M 28 500 801 6600 I good
3/3 sitreap 3300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
3/3 msgd 3300470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good
3/3 nscontrol 3330557 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
......
4/1 sitmain 41 0.1% 15% 1.8M 6.0M 14 1000 -- -- - good
4/1 sitparent 41 0.0% 20% 1.5M 4.0M 12 500 -- -- - good
4/1 hatcpu 41 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
4/1 rmmgr 41 2.1% 10% 3.3M 7.0M 26 500 -- -- - good
4/1 dhmgr 41 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good
4/1 sessmgr 5004 0.2% 50% 18.0M 85.0M 18 500 -- -- S good
4/1 aaamgr 5005 0.1% 50% 6.5M 20.0M 16 500 -- -- - good
4/1 sessmgr 1 7.3% 100% 93.1M 155.0M 30 500 793 6600 I good
4/1 sessmgr 24 10% 100% 91.5M 155.0M 27 500 804 6600 I good
4/1 aaamgr 37 2.2% 94% 38.7M 77.0M 19 500 -- -- - good
4/1 aaamgr 41 2.2% 94% 39.5M 77.0M 19 500 -- -- - good
4/1 aaamgr 44 2.2% 94% 39.8M 77.0M 20 500 -- -- - good
4/1 aaamgr 49 1.7% 94% 38.6M 77.0M 19 500 -- -- - good
4/1 sessmgr 57 11% 100% 93.1M 155.0M 29 500 807 6600 I good
4/1 sessmgr 62 7.9% 100% 95.3M 155.0M 30 500 791 6600 I good
4/1 sessmgr 82 9.5% 100% 91.9M 155.0M 28 500 803 6600 I good
4/1 sessmgr 94 8.2% 100% 94.1M 155.0M 29 500 791 6600 I good
4/1 aaamgr 101 2.2% 94% 38.8M 77.0M 19 500 -- -- - good
4/1 sitreap 4100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
4/1 msgd 4100470 1.0% 5% 0.8M 5.0M 6 100 -- -- - good
4/1 nscontrol 4130563 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
The task table maps the task instance number with the process id (pid), as well as with the parent task instance and pid, for example:
******** show task table *******
task parent
cpu facility inst pid pri facility inst pid
---- ----------------------------- -------------------------
3/0 aaamgr 94 639 0 sessctrl 0 582
3/0 aaamgr 88 632 0 sessctrl 0 582
3/0 aaamgr 79 625 0 sessctrl 0 582
3/0 aaamgr 70 624 0 sessctrl 0 582
3/0 sessmgr 52 617 0 sessctrl 0 582
3/0 sessmgr 37 610 0 sessctrl 0 582
3/0 aaamgr 5026 585 0 sessctrl 0 582
3/0 sessmgr 3 572 0 sessctrl 0 582
3/0 sessmgr 5015 571 0 sessctrl 0 582
3/0 sft 300 528 0 npuctrl 0 564
3/0 npumgr 3 521 -8 npuctrl 0 564
3/0 dcardmgr 3 520 0 dcardctrl 0 567
3/0 dhmgr 30 519 -4 sitparent 30 510
3/0 rmmgr 30 512 0 sitparent 30 510
3/0 hatcpu 30 511 -12 sitparent 30 510
3/0 sitparent 30 510 -12 sitmain 30 420
3/0 sitmain 30 420 -12 sitmain 30 420
3/0 sitreap 3000509 509 0 sitmain 30 420
3/0 msgd 3000507 507 0 - 0 0
3/0 nscontrol 3024701 24701 0 - 0 0
Note that on PAC 15, the standby card, all the task instances are numbered greater than 5000, which is the signature of a standby task. Note above that there are also standby sessmgr and aaamgr tasks on active cards ready to take over for task failures (more on this discussed below), but most are active.
15/0 sitmain 150 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
15/0 sitparent 150 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
15/0 hatcpu 150 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
15/0 rmmgr 150 1.3% 10% 3.2M 7.0M 22 500 -- -- - good
15/0 dhmgr 150 0.0% 5% 1.7M 6.0M 18 2000 -- -- - good
15/0 dcardmgr 15 0.0% 60% 12.7M 20.0M 13 500 -- -- - good
15/0 npumgr 15 5.1% 100% 117.5M 225.3M 34 1000 -- -- - good
15/0 sft 1500 0.3% 50% 2.5M 12.0M 15 500 -- -- - good
15/0 sessmgr 5045 0.1% 50% 18.0M 85.0M 17 500 -- -- S good
15/0 sessmgr 5046 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/0 sessmgr 5047 0.1% 50% 18.1M 85.0M 16 500 -- -- S good
15/0 aaamgr 5048 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/0 aaamgr 5049 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/0 aaamgr 5050 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/0 aaamgr 5051 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/0 sitreap 15000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
15/0 msgd 15000507 0.7% 5% 0.8M 5.0M 6 100 -- -- - good
15/0 nscontrol 15030489 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
15/1 sitmain 151 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
15/1 sitparent 151 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
15/1 hatcpu 151 0.2% 10% 1.5M 5.0M 13 500 -- -- - good
15/1 rmmgr 151 1.2% 10% 3.2M 7.0M 25 500 -- -- - good
15/1 dhmgr 151 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good
15/1 sessmgr 5040 0.1% 50% 17.9M 85.0M 17 500 -- -- S good
15/1 sessmgr 5041 0.0% 50% 18.1M 85.0M 17 500 -- -- S good
15/1 sessmgr 5042 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/1 sessmgr 5043 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/1 sessmgr 5044 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/1 sessmgr 5052 0.1% 50% 18.1M 85.0M 16 500 -- -- S good
15/1 aaamgr 5053 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/1 aaamgr 5054 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/1 aaamgr 5055 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/1 aaamgr 5056 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/1 aaamgr 5057 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/1 aaamgr 5058 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/1 sitreap 15100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
15/1 msgd 15100470 0.7% 5% 0.8M 5.0M 6 100 -- -- - good
15/1 nscontrol 15130514 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
15/2 sitmain 152 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
15/2 sitparent 152 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
15/2 hatcpu 152 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
15/2 rmmgr 152 1.1% 10% 3.3M 7.0M 25 500 -- -- - good
15/2 dhmgr 152 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good
15/2 sessmgr 5059 0.1% 50% 17.9M 85.0M 17 500 -- -- S good
15/2 sessmgr 5060 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/2 sessmgr 5061 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/2 sessmgr 5062 0.1% 50% 18.1M 85.0M 16 500 -- -- S good
15/2 sessmgr 5069 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/2 sessmgr 5070 0.1% 50% 18.1M 85.0M 16 500 -- -- S good
15/2 aaamgr 5071 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/2 aaamgr 5072 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/2 aaamgr 5073 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/2 aaamgr 5074 0.0% 50% 6.0M 20.0M 16 500 -- -- - good
15/2 aaamgr 5075 0.0% 50% 6.5M 20.0M 16 500 -- -- - good
15/2 aaamgr 5076 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/2 sitreap 15200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
15/2 msgd 15200470 0.5% 5% 0.8M 5.0M 6 100 -- -- - good
15/2 nscontrol 15230514 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
15/3 sitmain 153 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
15/3 sitparent 153 0.0% 20% 1.5M 4.0M 12 500 -- -- - good
15/3 hatcpu 153 0.2% 10% 1.6M 5.0M 13 500 -- -- - good
15/3 rmmgr 153 1.2% 10% 3.2M 7.0M 25 500 -- -- - good
15/3 dhmgr 153 0.0% 5% 1.8M 6.0M 18 2000 -- -- - good
15/3 sessmgr 5063 0.1% 50% 17.9M 85.0M 17 500 -- -- S good
15/3 sessmgr 5064 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/3 sessmgr 5065 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/3 sessmgr 5066 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/3 sessmgr 5067 0.1% 50% 18.1M 85.0M 17 500 -- -- S good
15/3 sessmgr 5068 0.1% 50% 18.1M 85.0M 16 500 -- -- S good
15/3 aaamgr 5077 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/3 aaamgr 5078 0.0% 50% 6.2M 20.0M 16 500 -- -- - good
15/3 aaamgr 5079 0.0% 50% 6.4M 20.0M 16 500 -- -- - good
15/3 aaamgr 5080 0.0% 50% 6.5M 20.0M 16 500 -- -- - good
15/3 aaamgr 5081 0.0% 50% 6.2M 20.0M 16 500 -- -- - good
15/3 aaamgr 5082 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/3 aaamgr 5083 0.0% 50% 6.1M 20.0M 16 500 -- -- - good
15/3 sitreap 15300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
15/3 msgd 15300470 0.6% 5% 0.8M 5.0M 6 100 -- -- - good
15/3 nscontrol 15330545 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
To round out the picture, here is output for the tasks running on the SPC:
******** show task resources *******
task cputime memory files sessions
cpu facility inst used allc used alloc used allc used allc S status
----------------------- --------- ------------- --------- ------------- ------
8/0 sitmain 80 0.0% 15% 1.9M 6.0M 15 1000 -- -- - good
8/0 sitparent 80 0.0% 20% 1.5M 4.0M 13 500 -- -- - good
8/0 evlogd 0 18% 94% 29.7M 60.0M 134 4000 -- -- - good
8/0 drvctrl 0 1.4% 15% 3.3M 5.0M 15 500 -- -- - good
8/0 hatsystem 0 1.1% 10% 2.0M 5.0M 18 500 -- -- - good
8/0 hatcpu 80 0.2% 10% 1.6M 5.0M 16 500 -- -- - good
8/0 rct 0 0.0% 5% 1.4M 5.0M 13 500 -- -- - good
8/0 sct 0 4.0% 50% 9.4M 65.0M 18 500 -- -- - good
8/0 rmmgr 80 2.5% 10% 3.5M 7.0M 19 500 -- -- - good
8/0 npumgr 8 2.6% 100% 46.0M 90.5M 24 1000 -- -- - good
8/0 vpnmgr 1 0.6% 30% 8.4M 12.0M 34 1000 -- -- - good
8/0 zebos 1 0.1% 50% 7.9M 12.0M 18 500 -- -- - good
8/0 aaamgr 231 0.1% 30% 11.7M 25.0M 20 500 -- -- - good
8/0 dhmgr 80 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good
8/0 rmctrl 0 1.4% 20% 7.5M 22.0M 45 1000 -- -- - good
8/0 npuctrl 0 0.6% 10% 2.7M 10.0M 27 500 -- -- - good
8/0 dcardctrl 0 0.0% 10% 1.5M 8.0M 12 500 -- -- - good
8/0 cspctrl 0 1.7% 50% 3.5M 12.0M 16 500 -- -- - good
8/0 vpnctrl 0 0.1% 15% 3.5M 7.0M 16 1500 -- -- - good
8/0 cssctrl 0 0.0% 15% 1.4M 7.0M 12 500 -- -- - good
8/0 sessctrl 0 0.6% 89% 7.9M 57.0M 35 2000 -- -- - good
8/0 vmgctrl 0 0.0% 7% 1.5M 20.0M 13 500 -- -- - good
8/0 snmp 0 6.6% 34% 11.9M 12.0M 360 1500 -- -- - warn
8/0 orbs 0 0.0% 50% 9.2M 28.0M 16 1500 -- -- - good
8/0 bulkstat 0 20% 89% 16.1M 35.0M 316 1500 -- -- - good
8/0 threshold 0 0.5% 89% 4.2M 23.0M 12 1500 -- -- - good
8/0 ipsecctrl 0 0.1% 15% 2.2M 15.0M 23 500 -- -- - good
8/0 orbns 0 0.0% 50% 2.6M 12.0M 17 1500 -- -- - good
8/0 cli 8017052 0.0% 50% 4.0M 20.0M 11 2000 -- -- - good
8/0 cli 8017526 31% 50% 9.8M 20.0M 385 2000 -- -- - good
8/0 sitreap 8000546 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
8/0 msgd 8000542 1.0% 5% 0.4M 5.0M 6 100 -- -- - good
8/0 msgd 8000543 0.0% 5% 0.1M 5.0M 7 100 -- -- - good
8/0 inetd 8000593 0.0% 5% 0.1M 5.0M 4 100 -- -- - good
8/0 sshd 8001310 0.0% 30% 0.2M 5.0M 4 100 -- -- - good
8/0 ntpd 8001320 0.0% 5% 0.2M 5.0M 8 100 -- -- - good
8/0 orbnsd 8001323 0.0% 5% 0.7M 10.0M 8 100 -- -- - good
8/0 sshd 8017050 0.0% 30% 0.3M 5.0M 9 100 -- -- - good
8/0 nscontrol 8017481 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
8/0 sshd 8017522 2.1% 30% 0.3M 5.0M 9 100 -- -- - good
Each sessmgr process is paired with a aaamgr process that is handling the aaa requests for the subscribers on that sessmgr, and that aaamgr process is located on a different card. The task instance ids for each pair is the same. The process pairings are across various cards, so that the processes are not paired just between two given cards.
Also, if session recovery feature is enabled, then that aaamgr also contains the state information of the sessions managed by the sessmgr process it is paired with. As well, pre-v8.1, acsmgr processes are also mirrored (acsmgr functionality has been combined into the sessmgr process with v8.1). In the case of a failed PAC, the sessions can be recreated on the standby card that has become active seamlessly based on the redundant information. Also, an additional PAC card is largely used for the recovery process (in addition to the standby card that takes over for the failed card). This PAC card is called the Demux card, and there are no session related tasks running on this card, and it is always in the active state. So, the session recovery feature requires at least three active cards and one standby card. Here is example output for the demux card (2) taken from the same chassis as the above output did.
2/0 sitmain 20 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
2/0 sitparent 20 0.0% 20% 1.3M 4.0M 12 500 -- -- - good
2/0 hatcpu 20 0.1% 10% 1.6M 5.0M 13 500 -- -- - good
2/0 rmmgr 20 1.0% 10% 3.2M 7.0M 19 500 -- -- - good
2/0 dhmgr 20 0.0% 5% 1.7M 6.0M 16 2000 -- -- - good
2/0 dcardmgr 2 0.0% 60% 12.5M 20.0M 13 500 -- -- - good
2/0 npumgr 2 8.1% 100% 117.5M 225.3M 34 1000 -- -- - good
2/0 sft 200 0.3% 50% 2.5M 12.0M 15 500 -- -- - good
2/0 vpnmgr 2 0.2% 100% 5.4M 12.0M 136 1000 -- -- - good
2/0 zebos 2 0.0% 50% 7.2M 12.0M 17 500 -- -- - good
2/0 sitreap 2000509 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
2/0 msgd 2000507 0.5% 5% 0.8M 5.0M 6 100 -- -- - good
2/0 inetd 2000554 0.0% 5% 0.1M 5.0M 4 100 -- -- - good
2/0 nscontrol 2030377 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
2/1 sitmain 21 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
2/1 sitparent 21 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
2/1 hatcpu 21 0.1% 10% 1.6M 5.0M 13 500 -- -- - good
2/1 rmmgr 21 0.7% 10% 3.3M 7.0M 20 500 -- -- - good
2/1 dhmgr 21 0.0% 5% 1.9M 6.0M 18 2000 -- -- - good
2/1 l2tpdemux 1 0.0% 94% 2.3M 20.0M 14 500 -- -- - good
2/1 l2tpmgr 1 0.1% 100% 12.7M 266.0M 16 1500 0 200k - good
2/1 sitreap 2100472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
2/1 msgd 2100470 0.5% 5% 0.8M 5.0M 6 100 -- -- - good
2/1 nscontrol 2130198 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
2/2 sitmain 22 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
2/2 sitparent 22 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
2/2 hatcpu 22 0.1% 10% 1.6M 5.0M 13 500 -- -- - good
2/2 rmmgr 22 1.3% 10% 3.2M 7.0M 21 500 -- -- - good
2/2 dhmgr 22 0.0% 5% 1.7M 6.0M 14 2000 -- -- - good
2/2 vpnmgr 3 7.4% 100% 9.1M 41.4M 241 1000 -- -- - good
2/2 zebos 3 0.0% 50% 7.9M 15.0M 17 500 -- -- - good
2/2 famgr 1 9.9% 100% 34.7M 96.0M 236 1000 63k 200k - good
2/2 vpnmgr 4 0.3% 100% 4.9M 12.0M 28 1000 -- -- - good
2/2 zebos 4 0.0% 50% 7.4M 12.0M 17 500 -- -- - good
2/2 l2tpmgr 2 0.6% 100% 14.2M 150.0M 196 1500 144 130k - good
2/2 cdrmod 2 0.3% 94% 42.2M 315.9M 87 1024 -- -- - good
2/2 sitreap 2200472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
2/2 msgd 2200470 0.8% 5% 0.8M 5.0M 6 100 -- -- - good
2/2 inetd 2200515 0.0% 5% 0.1M 5.0M 4 100 -- -- - good
2/2 inetd 2200526 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
2/2 sshd 2200527 0.0% 30% 0.2M 5.0M 4 100 -- -- - good
2/2 sshd 2200577 0.0% 30% 0.3M 5.0M 8 100 -- -- - good
2/2 sftp-serv 2200578 0.0% 10% 0.3M 5.0M 5 100 -- -- - good
2/2 sshd 2200579 0.0% 30% 0.3M 5.0M 8 100 -- -- - good
2/2 sftp-serv 2200586 0.0% 10% 0.4M 5.0M 5 100 -- -- - good
2/2 nscontrol 2203737 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
2/3 sitmain 23 0.0% 15% 1.8M 6.0M 14 1000 -- -- - good
2/3 sitparent 23 0.0% 20% 1.4M 4.0M 12 500 -- -- - good
2/3 hatcpu 23 0.1% 10% 1.6M 5.0M 13 500 -- -- - good
2/3 rmmgr 23 0.8% 10% 3.1M 7.0M 20 500 -- -- - good
2/3 dhmgr 23 0.0% 5% 2.0M 6.0M 18 2000 -- -- - good
2/3 a11mgr 1 26% 100% 50.9M 171.0M 243 1000 79k 200k - good
2/3 sitreap 2300472 0.0% 5% 0.1M 5.0M 5 100 -- -- - good
2/3 msgd 2300470 0.8% 5% 0.8M 5.0M 6 100 -- -- - good
2/3 nscontrol 2330179 0.0% 5% 1.0M 5.0M 10 500 -- -- - good
The command “show session recovery verbose” also indicates which card is the demux card:
******** show session recovery status verbose *******
Session Recovery Status:
Overall Status : Ready For Recovery
Last Status Update : 7 seconds ago
----sessmgr--- ----aaamgr---- demux
cpu state active standby active standby active status
---- ------- ------ ------- ------ ------- ------ -------------------------
2/0 Active 0 0 0 0 1 Good (Demux)
2/1 Active 0 0 0 0 2 Good (Demux)
2/2 Active 0 0 0 0 5 Good (Demux)
2/3 Active 0 0 0 0 1 Good (Demux)
3/0 Active 3 1 4 1 0 Good
3/1 Active 6 1 5 1 0 Good
3/2 Active 6 1 5 1 0 Good
3/3 Active 6 1 5 1 0 Good
4/0 Active 3 1 4 1 0 Good
4/1 Active 6 1 5 1 0 Good
4/2 Active 6 1 5 1 0 Good
4/3 Active 6 1 6 1 0 Good
6/0 Active 3 1 4 1 0 Good
6/1 Active 6 1 6 1 0 Good
6/2 Active 6 1 6 1 0 Good
6/3 Active 6 1 6 1 0 Good
11/0 Active 3 1 4 1 0 Good
11/1 Active 6 1 5 1 0 Good
11/2 Active 6 1 6 1 0 Good
11/3 Active 6 1 7 1 0 Good
13/0 Active 3 1 4 1 0 Good
13/1 Active 6 1 5 1 0 Good
13/2 Active 6 1 6 1 0 Good
13/3 Active 6 1 7 1 0 Good
15/0 Standby 0 3 0 4 0 Good
15/1 Standby 0 6 0 6 0 Good
15/2 Standby 0 6 0 6 0 Good
15/3 Standby 0 6 0 7 0 Good
Another command will allow examining what card CPUs are devoted to what kind of activity. Note the demux card is 2, the sessmgr cards are 3, 4, 6, 11, 13. That leaves the remaining cards as standby (only one card, 15, in this case, but there could be multiple standbys)
******** show resources *******
Active CPUs:
Total CPUs : 24
Session CPUs : 20 [3/0 3/1 3/2 3/3 4/0 4/1 4/2 4/3 6/0 6/1 6/2 6/3 11/0 11/1 11/2 11/3 13/0 13/1 13/2 13/3]
Demux CPUs : 4 [2/0 2/1 2/2 2/3]
Highest Load : 2.09 (CPU 3/2)
Total Memory : 48.0G
Total Used : 18.7G
Least Free : 1059M (CPU 13/2)
Total Temporary Files : 5M
Most Temporary Files : 2M (CPU 2/3)
Standby CPUs:
Total CPUs : 4
Highest Load : 0.07 (CPU 15/2)
Total Memory : 8.0G
Total Used : 1.2G
Least Free : 1642M (CPU 15/0)
Total Temporary Files : 0M
Most Temporary Files : 0M (CPU 15/0)
Session Information:
In-Use Session Managers:
Number of Managers : 105
Capacity : 173250 min / 346500 typical / 693000 max
Usage : 79347
Busy-Out Session Managers:
Number of Managers : 0
Capacity : 0 min / 0 typical / 0 max
Usage : 0
Standby Session Managers:
Number of Managers : 41
PDSN Service:
In Use : 79347
Max Used : 86135
Limit : 200000
License Status : Within Acceptable Limits
As always the case with STxx commands, you can see that there are multiple views with overlapping information, each serving a specific purpose in the information gathering, monitoring, and troubleshooting process.
This is a good opportunity to mention the status column of show task resources. If the amount of memory that has been allocated to the process has been exceeded, then the status will indicate “warn” or “alarm”. If this occurs, contact Starent support so that the appropriate action can be taken. This might involve resetting the task, which only Starent can do, but also collecting data from the chassis that might indicate how it got into that state, and then possibly reporting this to engineering as a problem report. If a task does get into this state, it is likely that an SNMP trap and/or error log entry has been saved, and possibly an alarm triggered:
show snmp trap history (or view your SNMP trap server for more detailed information)
show logs
show alarm outstanding
Here is an example of multiple logs for a bulkstats process exceeding the allocated memory limit.
show task resources
8/0 bulkstat 0 0.3% 89% 52.2M 35.0M 332 1500 -- -- - alarm
[local]BRHOALTBPN3-HA# show logs | grep bulk
2009-Feb-21+20:18:09.342 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 57588K
2009-Feb-18+17:50:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 55636K
2009-Feb-16+11:56:09.351 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 53812K
2009-Feb-13+05:34:09.348 [resmgr 14508 error] [8/0/920 <rmmgr:80> _resource_cpu.c:2292] [software internal system critical-info] The task bulkstat-0 is way over it's memory limit! Allocated 35840K, Using 51948K
Besides a task getting into the warning or alarm state, a task could also crash (unrelated to warning or alarm). Task monitoring is handled by the hatcpu processes on each card, and if the failure is not critical, then the parent task (name ends in main, ctrl, or parent) is notified, and the task can simply be respawned without any other action (such as a migration).
If session recovery is enabled, then data from any lost sessmgr, aaamgr, (or acsmgr for pre-v8.1) process(es) is transferred to standby process(es) which become active, and then those process(es) are then re-assigned the instance number of failed process(es). This re-assignment is done for convenience because other tasks in the system know about the instance number of the failed task number, not the one that took over. Also, the task that takes over has a value greater than 5000 which by definition is standby, but the task is now active. Finally, a new standby task is spawned to replace the one that became active, and that task’s instance number is usually the previous standby task instance number + 1 (see the task table for evidence of this)
The SNMP trap history and associated logs may look like this (SR 5898)
Mon Feb 16 04:06:50 2009 Internal trap notification 73 (ManagerFailure) facility acsmgr instance 50 card 13 cpu 0
Mon Feb 16 04:06:50 2009 Internal trap notification 150 (TaskFailed) facility acsmgr instance 50 on card 13 cpu 0
2009-Feb-16+04:06:51.313 [sitmain 4075 trace] [8/0/912 <sitparent:80> crashd.c:579] [software internal system critical-info] Crash handler file transfer ended (type=1 size=127535 child_ct=0 core_ct=0 pid=2257 status=1 elapsed=1s)
2009-Feb-16+04:06:50.444 [sitmain 4023 info] [13/0/718 <sitmain:130> sittask.c:4282] [software internal system critical-info] Readdress successful for facility acsmgr instance 6009 to instance 50
2009-Feb-16+04:06:50.247 [sitmain 4074 trace] [8/0/912 <sitparent:80> crashd.c:851] [software internal system critical-info] Crash handler file transfer starting (type=1 size=127535 child_ct=1 core_ct=0 pid=2257)
2009-Feb-16+04:06:50.245 [sitmain 4080 info] [8/0/912 <sitparent:80> crashd.c:812] [software internal system critical-info] Core file transfer to SPC complete, received 127535/127535 bytes
2009-Feb-16+04:06:50.238 [sitmain 4022 info] [13/0/718 <sitmain:130> sittask.c:4128] [software internal system critical-info] Readdress requested for facility acsmgr instance 6009 to instance 50
2009-Feb-16+04:06:50.207 [rct 13038 info] [8/0/918 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task acsmgr/50 on 13/0 sent to parent task acsctrl/0 with failover of acsmgr/6009 on 13/0
2009-Feb-16+04:06:50.205 [sitmain 4017 critical] [13/0/718 <sitmain:130> sitproc.c:1471] [software internal system] Process acsmgr pid 7846 died on card 13 cpu 0 signal=11 wstatus=0x8b
2009-Feb-16+04:06:50.204 [sitmain 4027 critical] [13/0/718 <sitmain:130> crash_mini.c:795] [software internal system] Process Crash Info: time 2009-Feb-16+04:06:47(hex time 4998e657) card 13 cpu 00 pid 07846 procname acsmgr crash_details
Fatal Signal 11: Segmentation fault
PC: [ a280732/X] acs_control_charge()
Faulty address: (nil)
Signal detail: address not mapped to object
Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr
Crash time: 2009-Feb-15+23:06:47
Recent errno: 11 Resource temporarily unavailable
........
In this case, the backed up acsmgr task data from the failed task 50 was populated into standby task 6009 which took over and became active, and then was re-assigned the task instance number 50 of the failed task. The term “failover” means that session recovery was used to failover to standby task 6009. Note the logs use process id (pid) 7846 and task instance 6009 interchangeably.
Though the log output is intended for engineering to analyze, if you read through the logs, some of it may actually make some sense, especially in light of reading this document. Logs are in reverse chronological order, so you should read in the up direction. The SNMP trap history is the best place to START most of your troubleshooting, because it lays out a timeline of the most significant events. When you see ManagerFailure and/or TaskFailed, you can then check the logs for a matching timestamp and you should be able to find significant details on what occurred, even if the logging level for the chassis is set to error level for all facilities (default behavior). That is because event data associated with process crashes are often either error level (second to highest level) or critical level (highest level). In addition, many log events associated with important activities are set to critical-info (vs. non-critical-info), so even if the log event is less than error level and the facility in question is set to error level, it will still be saved to logs because it is critical-info.
In the above example, the full crash details have not been included here to save space (........), so the crash itself is considered one log event that spans many lines. The full crash itself will also be found in the crash list which can be retrieved with the command “show crash list”. Note the card, CPU and Process ID, software version and SPC/SMC and PAC/PSC card serial numbers are saved. The list itself can be useful for quick scanning to note any patterns of tasks or cards or timestamps of when crashes have occurred. Up to 120 crashes can be saved in the list, after which it needs to be cleared with “clear crash list” in order for any new crashes to be saved.
Finally note mention of crash files being transferred. If the crash enable feature is set in the configuration, then any crash dump files will be sent to the URL specified (it is encrypted when viewing the config), and Starent may or may not request such files (one per crash) depending on whether a root cause can be determined from the logs alone.
******** show crash list *******
== ==== ======= =========== ========== ================
# Time Process Card/CPU/ SW HW_SER_NUM
PID VERSION SMC / Crash Card
== ==== ======= =========== ========== ================
1 2009-Feb-16+04:06:47 acsmgr 13/00/07846 7.1(21029) PLB19082829/PLB24084573
********************* CRASH #01 ***********************
Fatal Signal 11: Segmentation fault
PC: [ a280732/X] acs_control_charge()
Faulty address: (nil)
Signal detail: address not mapped to object
Process: card=13 cpu=0 arch=X pid=7846 cpu=~2% argv0=acsmgr
Crash time: 2009-Feb-15+23:06:47
Recent errno: 11 Resource temporarily unavailable
........
Here’s another example with just a tiny snippet to make the point, though there were actually hundreds of log entries associated with these crashes. In this case both sessmgr processes were respawned. (SR 6032)
2009-Mar-06+08:40:10.297 [sitmain 4023 info] [16/0/1146 <sitmain:160> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5022 to instance 22
2009-Mar-06+08:40:10.272 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5022 to instance 22
2009-Mar-06+08:40:10.244 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/22 on 16/0 sent to parent task sessctrl/0 with failover of sessmgr/5022 on 16/0
2009-Mar-06+08:40:10.243 [sitmain 4017 critical] [16/0/1146 <sitmain:160> sitproc.c:1514] [software internal system] Process sessmgr pid 1292 died on card 16 cpu 0 signal=6 wstatus=0x86
2009-Mar-06+08:40:09.889 [sitmain 4023 info] [3/0/1146 <sitmain:30> sittask.c:4321] [software internal system critical-info] Readdress successful for facility sessmgr instance 5002 to instance 50
2009-Mar-06+08:40:09.859 [sitmain 4022 info] [3/0/1146 <sitmain:30> sittask.c:4167] [software internal system critical-info] Readdress requested for facility sessmgr instance 5002 to instance 50
2009-Mar-06+08:40:09.829 [rct 13038 info] [8/0/1383 <rct:0> rct_task.c:275] [software internal system critical-info] Death notification of task sessmgr/50 on 3/0 sent to parent task sessctrl/0 with failover of sessmgr/5002 on 3/0
2009-Mar-06+08:40:09.786 [sitmain 4017 critical] [3/0/1146 <sitmain:30> sitproc.c:1514] [software internal system] Process sessmgr pid 1336 died on card 3 cpu 0 signal=6 wstatus=0x86
******** show crash list *******
== ==== ======= =========== ========== ================
# Time Process Card/CPU/ SW HW_SER_NUM
PID VERSION SMC / Crash Card
== ==== ======= =========== ========== ================
1 2009-Mar-06+08:40:07 sessmgr 16/00/01292 8.1(22707) PLB27085483/PLB46075509
2 2009-Mar-06+08:40:09 sessmgr 03/00/01336 8.1(22707) PLB27085483/PLB46075497
********************* CRASH #01 ***********************
Assertion failure at sess/smgr/sessmgr_ipv4.c:11833
Function: acsmgr_send_packet_merge_mode()
Expression: clp
Proclet: sessmgr (f=66000,i=22)
Process: card=16 cpu=0 arch=X pid=1292 cpu=~5% argv0=sessmgr
Crash time: 2009-Mar-06+03:40:07
Recent errno: 115 Operation now in progress
********************* CRASH #02 ***********************
Assertion failure at sess/smgr/sessmgr_ipv4.c:11833
Function: acsmgr_send_packet_merge_mode()
Expression: clp
Proclet: sessmgr (f=66000,i=50)
Process: card=3 cpu=0 arch=X pid=1336 cpu=~5% argv0=sessmgr
Crash time: 2009-Mar-06+03:40:09
Recent errno: 115 Operation now in progress
The information presented thus far is important in developing an understanding of the various steps that take place when a PAC/PSC migration occurs.
The system will always try to maintain the same number of active cards as have been specified in the configuration with:
card x
mode active pac
#end
This is due to the fact that it is not trivial to decrease the number of certain tasks in the system without putting it into an unstable state.
So, whenever an active card goes into the offline or standby state, there should always be another standby card ready to become active and take its place. Whether an active card failure occurs or whether an active card is purposely made inactive (see below), the system is designed to ensure another card will become active automatically (without user intervention).
First let’s cover the cases where the migration was intentional or “planned”, that is, it was initiated by an administrator. These are the various ways that will cause a migration from one card to another to occur. This is assuming that the card being migrated from is an active card. In all cases, the migration happens FIRST before the reset action for the card in order to minimize impact.
Executive mode:
card pac migrate X to X – unlike the other commands below, this ensures which card will become active (if there are more than one standby card)
card restart – re-initializes card without reboot first (faster – it just restarts the software tasks, it doesn’t reboot the kernel)
card reboot – does both a reboot and re-initialization
card halt – card goes offline instead of standby (cannot apply this command to an offline card)
Restart, reboot, and halt all trigger a CardRebootRequest SNMP trap.
card configuration mode (for a specific card):
shutdown – card goes offline
Note: executing “mode standby” in card config mode will NOT result in a migration and so should not be used to accomplish this task. (It will also immediately disable the line card behind the card (if there is one)).
Certainly you can do a card reboot, restart, or halt to a standby card, but nothing will be migrated because it is standby.
All successful migrations have the following traps:
Internal trap notification 16 (PACMigrateStart) from card 6 to card 5
Internal trap notification 55 (CardActive) card 5
Internal trap notification 60 (CardDown) card 6
Internal trap notification 17 (PACMigrateComplete) from card 6 to card 5
And if the card being migrated from is destined for the standby state (migrate, reboot, or restart), then additionally:
Internal trap notification 5 (CardUp) card 6
Internal trap notification 93 (CardStandby) card 6
The state changes show up in the card table during the migration process:
5: PAC/TAC Packet Accelerator Card Migrate To - - -
6: PAC/TAC Packet Accelerator Card Migrate From - 20 -
There is a command that gets run when running “show support details” that shows all recovery attempts (you cannot run this command at the command line).
******** show rct stats *******
RCT stats Details (Last 1 Actions)
Action Type From To Start Time Duration
-------------- --------- ---- ---- ------------------------ ----------
PAC migration Planned 6 5 2009-May-24+12:49:19.969 8.852 sec
Whether or not session recovery is enabled, the above approaches should all result in a graceful recreation of tasks to a standby card with very minimal subscriber impact. This is because nothing has gone wrong with the existing tasks, there is no lost information, and so the tasks and data have sufficient time to simply be recreated on another card, without relying on a backup/mirrored card (which would be the case for session recovery).
Note: Another way to force a graceful migration is to lift the latch on the card. If you do this, do not pull out the card from the slot while the migration is taking place. If the card is physical pulled out, then migration of tasks is not possible without session recovery because all the tasks and data are lost when the card is pulled and loses power.
SPC/SMC switchovers work the same way as PAC/PSC migrations. Note that SPC/SMC cards are not “configured” to be active in the configuration like PAC/PSC cards are. The commands to run are:
card spc switchover – no need to specify the card to switch to because there are only two cards
card reboot (8 or 9)
card restart (8 or 9)
There is no version of the halt command as there is for PAC/PSC cards.
For more information on the impact of various failover scenarios in table format, see the System Overview manual, Chapter 4 Redundancy and Availability Features, Maintenance and Failure Scenarios section.
A critical task failure often is the symptom of a hardware failure on a card and the card must be failed/reset. The system uses heartbeat or ping UDP-based messages every second to determine if the system is running normally. On the Active SPC/SMC, the hatsystem task executes heartbeat messages that are sent to the heartbeat CPU tasks (hatcpu) located on all CPU-based cards (PAC/PSC/SPC/SMC) in the system. If two consecutive messages are lost (not responded to), plus some other criteria (too detailed/low-level to cover here), then a critical task failure will be declared and will need to be reset. Depending on the severity, a migration of whatever tasks are migratable may or may not be attempted first, but in either case, an available standby card will be made active immediately. If session recovery is enabled, then minimal subscriber impact will be felt as those tasks are populated with data from backup processes from the various other active cards.
It is often asked if it is necessary or recommended to put all the cards in the chassis back into the state that they are configured for in the configuration, so that all the active cards are active, and the standby card(s) are standby. This is not necessary unless you like to keep things “tidy and neat”; the system will function fine without doing so.
In this example, no migration is attempted and the failed card did not boot back to standby status but goes offline (some traps removed): (SR 7095). Note that the logs do not mention anything about the new card (16) becoming active (because they are “non-critical-info”, “info” level logs and the logging level for this capture was “error” level), but the SNMP traps do.
Sun May 24 17:18:36 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0
Sun May 24 17:18:36 2009 Internal trap notification 55 (CardActive) card 16
...
Sun May 24 17:18:37 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 40 card 8 cpu 0
...
Sun May 24 17:18:38 2009 Internal trap notification 60 (CardDown) card 4
...
Sun May 24 17:21:40 2009 Internal trap notification 9 (CardBootFailed) card 4
...
Sun May 24 17:24:42 2009 Internal trap notification 9 (CardBootFailed) card 4
Sun May 24 17:27:45 2009 Internal trap notification 9 (CardBootFailed) card 4
Sun May 24 17:30:53 2009 Internal trap notification 9 (CardBootFailed) card 4
The logs show the card being booted multiple times but not responding to pings as it should, and finally no more attempts are made after the fourth attempt. Note “Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0” indicating that pings to both CPUs 0 and 1 (PSC cards have 2 CPUs) failed to be responded to.
2009-May-24+17:30:53.360 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and kept down. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)
2009-May-24+17:30:53.353 [alarmctrl 65201 info] [8/0/1378 <evlogd:0> alarmctrl.c:189] [software internal system critical-info] Alarm condition: id 50056e19844d0000 (Critical): Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:30:53.351 [hat 3018 critical] [8/0/1380 <hatsystem:0> atsystem_fail.c:719] [hardware internal system] Card Failure Detected on card 4 Device CPU_0 failure reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:30:53.351 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:30:53.351 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.
2009-May-24+17:27:45.054 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be power down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)
2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41
2009-May-24+17:27:45.038 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40
2009-May-24+17:27:45.037 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:27:45.029 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.
2009-May-24+17:24:42.794 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)
2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41
2009-May-24+17:24:42.790 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40
2009-May-24+17:24:42.789 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:24:42.789 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.
2009-May-24+17:21:40.578 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=CFE_ROM: Step 2 [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)
2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41
2009-May-24+17:21:40.574 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 40
2009-May-24+17:21:40.573 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-May-24+17:21:40.572 [csp 7005 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:2473] [hardware internal system] The Packet Services Card in slot 4 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.
2009-May-24+17:18:37.130 [hat 3017 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:935] [software internal system] Ignoring multiple failure on card 4 cpu 0 reason CPU_CRITICAL_TASK_FAILURE
2009-May-24+17:18:36.834 [rct 13014 info] [8/0/1382 <rct:0> rct_pac.c:1105] [software internal system critical-info] Card 4 shutdown completed, took 0.479 sec
2009-May-24+17:18:36.436 [sitmain 4022 info] [16/0/1146 <sitmain:160> sittask.c:4167] [software internal system critical-info] Readdress requested for facility aaamgr instance 5031 to instance 1
2009-May-24+17:18:36.361 [hat 3005 warning] [8/0/1380 <hatsystem:0> hatsystem_msg.c:3204] [software internal system critical-info] message bounced from facility hatcpu instance 41
2009-May-24+17:18:36.356 [rct 13013 info] [8/0/1382 <rct:0> rct_pac.c:978] [software internal system critical-info] Card 4 shutdown started
2009-May-24+17:18:36.210 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1610] [software internal system critical-info] finish taking over card 4
2009-May-24+17:18:36.204 [dcardmgr 57099 info] [16/0/1244 <dcardmgr:16> dmgr_handlers.c:1584] [software internal system critical-info] start taking over card 4
2009-May-24+17:18:36.136 [csp 7019 critical] [8/0/1396 <cspctrl:0> spctrl_events.c:3078] [hardware internal system] The Packet Services Card with serial number PLB50076807 in slot 4 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: CC], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 02,ff,ff,ff)
2009-May-24+17:18:36.129 [hat 3033 error] [8/0/1380 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 4 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE
2009-May-24+17:18:35.208 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0
2009-May-24+17:18:35.208 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond
2009-May-24+17:18:35.168 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/41 failed on cpu 4/1. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192349.
2009-May-24+17:18:35.167 [hat 3050 error] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1297] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 4: cpu:0 if:0, cpu:1 if:0
2009-May-24+17:18:35.167 [hat 3051 warning] [8/0/1380 <hatsystem:0> hat_hb_lib.c:1272] [software internal system critical-info] ICMP ping results from card 4: All CPUs failed to respond
2009-May-24+17:18:35.127 [hat 3014 critical] [8/0/1380 <hatsystem:0> hat_hb_lib.c:727] [software internal system] HAT instance 0 found Critical task hatcpu/40 failed on cpu 4/0. Missed heartbeat sequence 2190818 ns_age 4 bounce_code N/A uptime 2192343.
In this case, no migration is even attempted:
******** show rct stats *******
RCT stats Details (Last 1 Actions)
Action Type From To Start Time Duration
-------------- --------- ---- ---- ------------------------ ----------
PSC shutdown N/A 4 16 2009-May-24+17:18:36.355 0.478 sec
RCT stats Summary
-----------------
PSC Migrations = 0
SMC switchover = 0
To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.
This example is similar to the previous one except the failed card reset successfully (“Boxer starting …”) to standby status. (SR 6304)
Fri Mar 27 11:16:22 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 110 card 9 cpu 0
Fri Mar 27 11:16:22 2009 Internal trap notification 55 (CardActive) card 4
Fri Mar 27 11:16:25 2009 Internal trap notification 60 (CardDown) card 11
...
Fri Mar 27 11:18:43 2009 Internal trap notification 5 (CardUp) card 11
Fri Mar 27 11:18:43 2009 Internal trap notification 93 (CardStandby) card 11
2009-Mar-27+11:18:32.136 [sitmain 4003 info] [11/3/370 <sitmain:113> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 3 (PAC)
2009-Mar-27+11:18:32.105 [sitmain 4003 info] [11/2/370 <sitmain:112> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 2 (PAC)
2009-Mar-27+11:18:32.085 [sitmain 4003 info] [11/1/370 <sitmain:111> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 1 (PAC)
2009-Mar-27+11:18:31.425 [sitmain 4003 info] [11/0/397 <sitmain:110> sitmain.c:549] [software internal system critical-info] Boxer starting on card 11 cpu 0 (PAC)
2009-Mar-27+11:16:23.363 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 11 shutdown completed, took 0.616 sec
2009-Mar-27+11:16:22.747 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 11 shutdown started
2009-Mar-27+11:16:22.391 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW05061700 in slot 11 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: F3], Boot Done [hb: E3], Boot Done [hb: F3], Boot Done [hb: E3])
2009-Mar-27+11:16:22.346 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 11 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE
2009-Mar-27+11:16:22.252 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 11: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms
2009-Mar-27+11:16:22.250 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/110 failed on cpu 11/0. Missed heartbeat sequence 5794530 ns_age 4 bounce_code N/A.
Here is another example where the system determined that a migration of tasks was possible, as specified in the line “Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13”. Note the reboot of card 4 starts AFTER the migration is complete. A few task (facility) migration logs are displayed below, but most have been removed to save space. (SR 6304)
2009-Mar-08+04:46:35.767 [sitmain 4003 info] [4/2/370 <sitmain:42> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 2 (PAC)
2009-Mar-08+04:46:35.707 [sitmain 4003 info] [4/3/370 <sitmain:43> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 3 (PAC)
2009-Mar-08+04:46:35.687 [sitmain 4003 info] [4/1/370 <sitmain:41> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 1 (PAC)
2009-Mar-08+04:46:35.059 [sitmain 4003 info] [4/0/397 <sitmain:40> sitmain.c:549] [software internal system critical-info] Boxer starting on card 4 cpu 0 (PAC)
…
2009-Mar-08+04:44:30.069 [csp 7022 info] [9/0/915 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 4 to card 13 is complete.
2009-Mar-08+04:44:29.186 [rct 13014 info] [9/0/570 <rct:0> rct_pac.c:1071] [software internal system critical-info] PAC Card 4 shutdown completed, took 0.071 sec
2009-Mar-08+04:44:29.115 [rct 13013 info] [9/0/570 <rct:0> rct_pac.c:951] [software internal system critical-info] PAC Card 4 shutdown started
2009-Mar-08+04:44:26.958 [rct 13002 info] [9/0/570 <rct:0> rct_pac.c:464] [software internal system critical-info] PAC task migration completed from card 4 to card 13
…
2009-Mar-08+04:44:15.364 [sitmain 4025 info] [4/1/370 <sitmain:41> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 208 from card 4 cpu 1 to card 13 cpu 1
2009-Mar-08+04:44:15.363 [sitmain 4025 info] [4/0/397 <sitmain:40> sittask.c:3853] [software internal system critical-info] Task migration started facility sessmgr instance 5154 from card 4 cpu 0 to card 13 cpu 0
2009-Mar-08+04:44:14.356 [rct 13001 info] [9/0/570 <rct:0> rct_pac.c:148] [software internal system critical-info] PAC task migration begin from card 4 to card 13
2009-Mar-08+04:44:14.181 [csp 7009 info] [9/0/915 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Accelerator Card in slot 4 to the one in slot 13.
2009-Mar-08+04:44:13.980 [csp 7019 critical] [9/0/915 <cspctrl:0> spctrl_events.c:3045] [hardware internal system] The Packet Accelerator Card with serial number PLW02061482 in slot 4 has failed and will be brought down and brought back online. (Device=CPU_3, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0], Boot Done [hb: E0])
2009-Mar-08+04:44:13.850 [hat 3033 error] [9/0/568 <hatsystem:0> atsystem_fail.c:957] [hardware internal system critical-info] Card error detected on card 4 device CPU_3 reason CPU_CRITICAL_TASK_FAILURE
…
2009-Mar-08+04:44:13.808 [hat 3051 warning] [9/0/568 <hatsystem:0> hat_hb_lib.c:1120] [software internal system critical-info] ICMP ping results from card 4: rtt(0,0)=0ms rtt(1,0)=0ms rtt(2,0)=0ms rtt(3,0)=0ms
2009-Mar-08+04:44:13.806 [hat 3014 critical] [9/0/568 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/43 failed on cpu 4/3. Missed heartbeat sequence 4130607 ns_age 4 bounce_code N/A.
Note the migration type is “Unplanned”.
******** show rct stats *******
RCT stats Details (Last 6 Actions)
Action Type From To Start Time Duration
-------------- --------- ---- ---- ------------------------ ----------
PAC migration Unplanned 15 11 2009-Feb-06+14:29:30.229 10.594 sec
SPC switchover Unplanned 8 9 2009-Feb-11+21:56:08.687 65.313 sec
SPC switchover Unplanned 9 8 2009-Feb-22+08:24:07.776 71.215 sec
SPC switchover Unplanned 8 9 2009-Feb-23+11:48:10.135 74.927 sec
PAC migration Unplanned 13 15 2009-Mar-02+09:42:09.621 12.253 sec
PAC migration Unplanned 4 13 2009-Mar-08+04:44:14.357 12.601 sec
A similar example as the previous one except the failed card fails to boot after migration:
Sun Mar 22 21:28:51 2009 Internal trap notification 73 (ManagerFailure) facility hatcpu instance 151 card 8 cpu 0
Sun Mar 22 21:28:52 2009 Internal trap notification 16 (PACMigrateStart) from card 15 to card 13
...
Sun Mar 22 21:29:36 2009 Internal trap notification 60 (CardDown) card 15
Sun Mar 22 21:29:38 2009 Internal trap notification 55 (CardActive) card 13
Sun Mar 22 21:29:38 2009 Internal trap notification 17 (PACMigrateComplete) from card 15 to card 13
Sun Mar 22 21:29:49 2009 Internal trap notification 139 (CardSPOFAlarm) card 13
Sun Mar 22 21:32:38 2009 Internal trap notification 9 (CardBootFailed) card 15
Sun Mar 22 21:35:41 2009 Internal trap notification 9 (CardBootFailed) card 15
Sun Mar 22 21:38:43 2009 Internal trap notification 9 (CardBootFailed) card 15
Sun Mar 22 21:41:51 2009 Internal trap notification 9 (CardBootFailed) card 15
...
2009-Mar-22+21:32:38.782 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-Mar-22+21:32:38.782 [csp 7005 critical] [8/0/930 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The Packet Services Card in slot 15 did not boot in the allowed time. CPU 0 did not boot. CPU 1 did not boot.
2009-Mar-22+21:29:38.943 [csp 7022 info] [8/0/930 <cspctrl:0> spctrl_events.c:3391] [hardware internal system critical-info] The migration of tasks on card 15 to card 13 is complete.
2009-Mar-22+21:29:36.820 [sitmain 4019 warning] [8/0/927 <npuctrl:0> sit_api.c:1219] [software internal system critical-info] Message bounced to facility sitmain instance 150
2009-Mar-22+21:29:36.343 [rct 13014 info] [8/0/923 <rct:0> rct_pac.c:1072] [software internal system critical-info] PAC Card 15 shutdown completed, took 0.018 sec
2009-Mar-22+21:29:36.325 [rct 13013 info] [8/0/923 <rct:0> rct_pac.c:952] [software internal system critical-info] PAC Card 15 shutdown started
2009-Mar-22+21:29:36.316 [rct 13002 info] [8/0/923 <rct:0> rct_pac.c:466] [software internal system critical-info] PAC task migration completed from card 15 to card 13
...
2009-Mar-22+21:28:54.786 [sitmain 4026 info] [15/0/712 <sitmain:150> sittask.c:2800] [software internal system critical-info] Task migration completed facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0 took 2.054931 sec
2009-Mar-22+21:28:52.731 [sitmain 4025 info] [15/0/712 <sitmain:150> sittask.c:3860] [software internal system critical-info] Task migration started facility sessmgr instance 75 from card 15 cpu 0 to card 13 cpu 0
2009-Mar-22+21:28:52.179 [rct 13001 info] [8/0/923 <rct:0> rct_pac.c:149] [software internal system critical-info] PAC task migration begin from card 15 to card 13
2009-Mar-22+21:28:52.154 [csp 7009 info] [8/0/930 <cspctrl:0> pctrl_helpers.c:2219] [hardware internal system critical-info] Migrating all tasks on the Packet Services Card in slot 15 to the one in slot 13.
2009-Mar-22+21:28:51.935 [csp 7019 critical] [8/0/930 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The Packet Services Card with serial number PLB48076369 in slot 15 has failed and will be brought down and brought back online. (Device=CPU_1, Reason=CPU_CRITICAL_TASK_FAILURE, Status=Boot Done [hb: E1], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], CFE_ROM: Start [hb: 00], GPIO_IN: 00,ff,ff,ff)
2009-Mar-22+21:28:51.928 [hat 3033 error] [8/0/921 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 15 device CPU_1 reason CPU_CRITICAL_TASK_FAILURE
2009-Mar-22+21:28:50.464 [hat 3050 error] [8/0/921 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 15: cpu 1
2009-Mar-22+21:28:50.464 [hat 3051 warning] [8/0/921 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 15: rtt(0,0)=0ms
2009-Mar-22+21:28:50.424 [hat 3014 critical] [8/0/921 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatcpu/151 failed on cpu 15/1. Missed heartbeat sequence 12400479 ns_age 4 bounce_code N/A.
To troubleshoot an offline card, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.
Conceptually, a SPC/SMC switchover works the same way as a PAC migration. In this example, the standby SMC rebooted, so there is no need for a switchover, just a reboot. (SR 6208)
Fri Mar 20 12:46:39 2009 Internal trap notification 73 (ManagerFailure) facility hatsystem instance 1 card 8 cpu 0
Fri Mar 20 12:46:39 2009 Internal trap notification 60 (CardDown) card 9
Fri Mar 20 12:46:42 2009 Internal trap notification 139 (CardSPOFAlarm) card 8
Fri Mar 20 12:48:09 2009 Internal trap notification 5 (CardUp) card 9
Fri Mar 20 12:48:09 2009 Internal trap notification 93 (CardStandby) card 9
2009-Mar-20+12:47:52.448 [sitmain 4003 info] [9/0/852 <sitmain:90> sitmain.c:469] [software internal system critical-info] Boxer starting on card 9 cpu 0 (Standby)
2009-Mar-20+12:46:57.555 [resmgr 14571 warning] [8/0/916 <rmctrl:0> ource_license.c:1109] [software internal system critical-info] License status changed to: good. It only matches one card. License is valid but is not redundant.
2009-Mar-20+12:46:39.732 [rct 13013 info] [8/0/913 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 9 shutdown started
2009-Mar-20+12:46:39.707 [csp 7019 critical] [8/0/920 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number PLB32071119 in slot 9 has failed and will be reset and brought back online. (Device=CPU_0, Reason=CPU_C
RITICAL_TASK_FAILURE, Status=Read Fail)
2009-Mar-20+12:46:39.698 [hat 3033 error] [8/0/911 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 9 device CPU_0 reason CPU_CRITICAL_TASK_FAILURE
2009-Mar-20+12:46:39.698 [hat 3050 error] [8/0/911 <hatsystem:0> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 9: cpu 0
2009-Mar-20+12:46:39.698 [hat 3051 warning] [8/0/911 <hatsystem:0> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 9: All CPUs failed to respond
2009-Mar-20+12:46:39.657 [hat 3014 critical] [8/0/911 <hatsystem:0> hat_hb_lib.c:611] [software internal system] HAT instance 0 found Critical task hatsystem/1 failed on cpu 9/0. Missed heartbeat sequence 12459150 ns_age 5 bounce_code N/A.
In this example, SMC 8 encountered a missed heartbeat and switched over to SMC 9, and then failed to boot properly the first time but did so on the second attempt (SR 6183)
Note that when an SMC switchover occurs, the SNMP traps and logs up to the point of the switchover are lost (though they would be available on your trap and syslog servers if configured).
2009-Mar-15+12:14:38.759 [sitmain 4003 info] [8/0/854 <sitmain:80> sitmain.c:469] [software internal system critical-info] Boxer starting on card 8 cpu 0 (Standby)
2009-Mar-15+12:13:25.392 [rct 13013 info] [9/0/956 <rct:0> rct_spc.c:284] [software internal system critical-info] SPC Card 8 shutdown started
2009-Mar-15+12:13:25.392 [csp 7019 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:3052] [hardware internal system] The System Management Card with serial number Unknown in slot 8 has failed and will be brought down and brought back online. (Device=CPU_0, Reason=CARD_BOOT_TIMEOUT_EXPIRED, Status=Read Fail)
2009-Mar-15+12:13:25.391 [hat 3005 warning] [9/0/954 <hatsystem:0> hatsystem_msg.c:3147] [software internal system critical-info] message bounced from facility hatcpu instance 80
2009-Mar-15+12:13:25.391 [hat 3033 error] [9/0/954 <hatsystem:0> atsystem_fail.c:959] [hardware internal system critical-info] Card error detected on card 8 device CPU_0 reason CARD_BOOT_TIMEOUT_EXPIRED
2009-Mar-15+12:13:25.391 [csp 7005 critical] [9/0/10286 <cspctrl:0> spctrl_events.c:2447] [hardware internal system] The System Management Card in slot 8 did not boot in the allowed time. CPU 0 did not boot.
2009-Mar-15+12:13:16.368 [system 1000 critical] [9/0/10282 <evlogd:0> evlgd_syslogd.c:146] [software internal system] CPU[8/0]: Gave up waiting for kernel, booting...
2009-Mar-15+12:09:31.552 [rct 13027 critical] [9/0/956 <rct:0> rct_spc.c:525] [software internal system critical-info] RCT readdress success instance 1 to instance 0
2009-Mar-15+12:09:31.447 [rct 13010 info] [9/0/956 <rct:1> rct_spc.c:431] [software internal system critical-info] SPC card 9 upgrading to Primary SPC
2009-Mar-15+12:09:31.409 [rct 13007 info] [9/0/956 <rct:1> rct_spc.c:245] [software internal system critical-info] Unplanned SPC switchover started from card 8 to card 9
2009-Mar-15+12:09:31.395 [hat 3050 error] [9/0/954 <hatsystem:1> hat_hb_lib.c:1139] [software internal system critical-info] Kernel missed ICMP echo request(s) from card 8: cpu 0
2009-Mar-15+12:09:31.395 [hat 3051 warning] [9/0/954 <hatsystem:1> hat_hb_lib.c:1121] [software internal system critical-info] ICMP ping results from card 8: All CPUs failed to respond
2009-Mar-15+12:09:31.347 [hat 3014 critical] [9/0/954 <hatsystem:1> hat_hb_lib.c:611] [software internal system] HAT instance 1 found Critical task hatsystem/0 failed on cpu 8/0. Missed heartbeat sequence 12460328 ns_age 5 bounce_code N/A.
******** show rct stats *******
RCT stats Details (Last 2 Actions)
Action Type From To Start Time Duration
-------------- --------- ---- ---- ------------------------ ----------
PSC migration Planned 11 6 2009-Mar-05+09:54:16.824 34.811 sec
SMC switchover Unplanned 8 9 2009-Mar-15+12:09:31.447 62.395 sec
Because the card did not boot properly the first time, it would make sense to provide the logs to Starent for further analysis. If the card remained offline, it should be (firmly) reseated physically in the slot to determine if it will come back up (into standby mode). If it remains offline after doing this, then the card should be RMA’d.
Imported from Starent Networks Knowledgebase Article # 10440
Find answers to your questions by entering keywords or phrases in the Search bar above. New here? Use these resources to familiarize yourself with the community: