Troubleshooting CPU/Memory Utilization on StarOS

Tomonobu Okada · ‎2014-10-24

How Resource Management Works
Possible Causes
Syslog/SNMP Traps
Logs to Collect

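Overview

On StarOS running on the ASR5000 and ASR5500, in addition to tracking the utilization of the physical CPU/Memory resources, there is a mechanism that monitors each process individually, and in most cases problems are reported by this monitoring function. This document explains how that mechanism works so that you can correctly interpret the reported problems and collect the logs needed for troubleshooting.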

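How Resource Management Works

Many processes, such as sessmgr and aaamgr, run on StarOS. Each process has its own CPU/Memory usage thresholds, and a Syslog message or SNMP trap is generated when these thresholds are exceeded.

The thresholds and the current CPU/Memory usage of each process can be checked with the show task resources command.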

[local]asr5500-1# show task resources                               

                   task   cputime        memory     files      sessions
 cpu facility      inst used allc   used  alloc used allc  used  allc S status
----------------------- --------- ------------- --------- ------------- ------
 2/0 sitmain         20 0.1%  15%  8.79M 24.00M   13 1000    --    -- -   good
 2/0 sitparent       20 0.1%  20%  9.52M 14.00M   10  500    --    -- -   good
 2/0 hatcpu          20 0.2%  10%  9.74M 24.00M   11  500    --    -- -   good
 2/0 afmgr           20 0.1%  10% 11.88M 20.00M   13  500    --    -- -   good
 2/0 rmmgr           20 1.1%  15% 13.51M 23.00M  231  500    --    -- -   good
 2/0 hwmgr           20 0.1%  15%  9.84M 15.00M   12  500    --    -- -   good
 2/0 dhmgr           20 0.0%  15% 10.86M 35.00M   19 6000    --    -- -   good
 2/0 connproxy       20 0.0%  90% 10.73M 35.00M   11 1000    --    -- -   good
 2/0 npumgr          20 0.4% 100% 481.4M  2.27G   23 1000    --    -- -   good
 2/0 dcardmgr        20 0.1%  60% 41.51M 600.0M   13  500    --    -- -   good
 2/0 npusim         201 0.1% 100% 14.43M 60.00M   12  500    --    -- -   good
 2/0 sft            200 0.2%  50% 13.97M 30.00M   30  500    --    -- -   good
 2/0 sessmgr       5343 0.2%  50% 105.3M 230.0M   12  500    --    -- S   good
 2/0 sessmgr       5344 0.2%  50% 105.3M 230.0M   12  500    --    -- S   good
 2/0 sessmgr       5345 0.2%  50% 105.3M 230.0M   13  500    --    -- S   good
 2/0 sessmgr       5346 0.2%  50% 105.3M 230.0M   13  500    --    -- S   good
 2/0 sessmgr       5347 0.2%  50% 105.3M 230.0M   13  500    --    -- S   good

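<Field descriptions>

cputime used	CPU utilization of the process
cputime allc	Threshold for the CPU utilization of the process
memory used	Memory usage of the process
memory alloc	Threshold for the Memory usage of the process
status	Threshold state of the process: good / warn / over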

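This feature is strictly a monitoring mechanism. A process can keep using CPU/Memory after it exceeds its threshold (the threshold is not an enforced limit), so exceeding it does not normally cause an immediate problem. The thresholds of each process may differ between StarOS versions, and they cannot be changed from the CLI.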
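To compare used values against their thresholds across many processes, the rows of show task resources can be parsed mechanically. The following is a minimal sketch, assuming each row has the whitespace-separated column layout shown in the sample output above and that the K/M/G suffixes are binary multiples (an assumption; the source does not state the unit base). The names to_mib, TaskRow and parse_row are hypothetical helpers, not part of StarOS.

```python
from typing import NamedTuple

# Multipliers to convert a suffixed size to MiB (binary units are an assumption).
_TO_MIB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def to_mib(text: str) -> float:
    """Convert a size such as '105.3M' or '2.27G' to MiB."""
    return float(text[:-1]) * _TO_MIB[text[-1]]


class TaskRow(NamedTuple):
    cpu: str              # card/cpu, e.g. '2/0'
    facility: str         # process name, e.g. 'sessmgr'
    instance: int
    cpu_used_pct: float   # cputime used
    cpu_alloc_pct: float  # cputime allc (threshold)
    mem_used_mib: float   # memory used
    mem_alloc_mib: float  # memory alloc (threshold)
    status: str           # good / warn / over


def parse_row(line: str) -> TaskRow:
    """Parse one data row of 'show task resources' (layout as in the sample above)."""
    f = line.split()
    return TaskRow(
        cpu=f[0],
        facility=f[1],
        instance=int(f[2]),
        cpu_used_pct=float(f[3].rstrip("%")),
        cpu_alloc_pct=float(f[4].rstrip("%")),
        mem_used_mib=to_mib(f[5]),
        mem_alloc_mib=to_mib(f[6]),
        status=f[-1],
    )


row = parse_row(" 2/0 sessmgr       5343 0.2%  50% 105.3M 230.0M   12  500    --    -- S   good")
ratio = row.mem_used_mib / row.mem_alloc_mib
print(f"{row.facility}-{row.instance}: memory at {ratio:.0%} of threshold, status {row.status}")
# prints: sessmgr-5343: memory at 46% of threshold, status good
```

Because the threshold is only a monitoring value, a ratio above 100% indicates that a warn/over report is expected, not that the process has been stopped from allocating.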

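Possible Causes

In general, a temporary increase is not a problem. However, if CPU utilization stays at 100%, or if Memory usage keeps growing without ever decreasing, the cause may be a problem such as an infinite loop or a memory leak, and investigation is required.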

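Main causes of a temporary increase

Temporary increase caused by running a command that produces a large amount of output (cli process)
Increase caused by growth of the logs held internally (evlogd process)
There are many other causes besides these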

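Events that require investigation

Increase in CPU utilization caused by an infinite loop (for example, staying at 100%)
Continuous increase in Memory usage caused by a memory leak or fragmentation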
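The distinction between a temporary increase and an event that needs investigation can be expressed as a simple check over repeated samples. The sketch below is only a heuristic illustrating that distinction, not a method used by StarOS or TAC. The function names, the 99% cutoff and the sample values are all hypothetical.

```python
from typing import Sequence


def keeps_growing(samples: Sequence[float], min_growth: float = 0.0) -> bool:
    """True if usage never decreases between samples and ends higher than it started."""
    if len(samples) < 2:
        return False  # a single sample cannot show a trend
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) > min_growth


def stays_pegged(cpu_pct: Sequence[float], pegged_at: float = 99.0) -> bool:
    """True if every CPU sample is at or above the (assumed) pegged level."""
    return len(cpu_pct) > 0 and all(s >= pegged_at for s in cpu_pct)


# Hypothetical memory samples in MB, taken at regular intervals.
spike = [105.3, 180.2, 110.0, 106.1]    # rises, then falls back: temporary
growth = [105.3, 131.0, 158.7, 186.2]   # never decreases: worth investigating

print(keeps_growing(spike), keeps_growing(growth))   # False True
print(stays_pegged([100.0, 100.0, 100.0, 100.0]))    # True
```

A result of True is only a prompt to collect the logs listed later in this document; fragmentation and legitimate load growth can produce the same pattern.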

Syslog/SNMP Traps

CPU utilization

When CPU utilization approaches or exceeds the threshold, the following Syslog messages and SNMP traps are generated.

SNMP Trap

Internal trap notification 1215 (CPUWarn) facility sct instance 0 card 8 cpu 0 allocated 500 used 451

Internal trap notification 1219 (CPUOver) facility cli instance 5010046 card 5 cpu 0 allocated 600 used 609

Note: In the CPUOver example above, the cli process with instance number 5010046 reached 60.9% (used 609) against a threshold of 60% (allocated 600).
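The trap text can be decoded programmatically. The following sketch assumes the trap line has exactly the format of the samples above, and that the allocated and used values are in tenths of a percent, which is inferred from the example (600 corresponds to 60%). TRAP_RE and parse_cpu_trap are hypothetical names.

```python
import re

# Pattern for the trap text as it appears in the samples above.
TRAP_RE = re.compile(
    r"Internal trap notification (?P<id>\d+) \((?P<name>\w+)\) "
    r"facility (?P<facility>\S+) instance (?P<instance>\d+) "
    r"card (?P<card>\d+) cpu (?P<cpu>\d+) "
    r"allocated (?P<allocated>\d+) used (?P<used>\d+)"
)


def parse_cpu_trap(line: str) -> dict:
    """Extract the process identity and CPU percentages from a CPUWarn/CPUOver trap."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("line does not look like a resource trap")
    return {
        "name": m["name"],
        "facility": m["facility"],
        "instance": int(m["instance"]),
        # Values are assumed to be tenths of a percent (600 -> 60.0%).
        "allocated_pct": int(m["allocated"]) / 10,
        "used_pct": int(m["used"]) / 10,
    }


trap = parse_cpu_trap(
    "Internal trap notification 1219 (CPUOver) facility cli instance 5010046 "
    "card 5 cpu 0 allocated 600 used 609"
)
print(f'{trap["facility"]}-{trap["instance"]}: {trap["used_pct"]}% used, '
      f'threshold {trap["allocated_pct"]}%')
# prints: cli-5010046: 60.9% used, threshold 60.0%
```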

Syslog

[resmgr 14502 warning] [2/0/2352 <rmmgr:20> _resource_cpu.c:2876] [software internal system] The task ipsecmgr-202 is over it's cputime limit. Allocated 50.0%, Using 51.8%

Note: This Syslog message is at the warning level, so it is not output with the default logging configuration. To have it output, set the logging level for resmgr to warning.

Memory usage

When Memory usage approaches or exceeds the threshold, the following Syslog messages and SNMP traps are generated.

SNMP Trap

Internal trap notification 1217 (MemoryWarn) facility cli instance 5005588 card 5 cpu 0 allocated 66560 used 70212

Internal trap notification 1221 (MemoryOver) facility cli instance 5010046 card 5 cpu 0 allocated 66560 used 89940

Note: In the MemoryOver example above, the cli process with instance number 5010046 reached 89940 against a threshold of 66560.
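The amount by which the threshold was exceeded can be worked out directly from the trap. This sketch reads the keyword/value pairs that follow the trap name and assumes the memory values are in kilobytes; that unit is an assumption based on the Syslog samples in this document, which show the same cli threshold as 66560K. The helper trap_fields is hypothetical.

```python
def trap_fields(line: str) -> dict:
    """Return the 'keyword value' pairs of a trap line, starting at 'facility'."""
    tokens = line.split()
    pairs = tokens[tokens.index("facility"):]
    # Even positions are keywords, odd positions are their values.
    return dict(zip(pairs[0::2], pairs[1::2]))


line = ("Internal trap notification 1221 (MemoryOver) facility cli instance 5010046 "
        "card 5 cpu 0 allocated 66560 used 89940")
f = trap_fields(line)

allocated_k = int(f["allocated"])  # assumed to be in KB
used_k = int(f["used"])            # assumed to be in KB
over_pct = (used_k - allocated_k) / allocated_k * 100

print(f'{f["facility"]}-{f["instance"]}: {used_k / 1024:.1f}M used, '
      f'threshold {allocated_k / 1024:.1f}M ({over_pct:.1f}% over)')
# prints: cli-5010046: 87.8M used, threshold 65.0M (35.1% over)
```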

Syslog

[resmgr 14500 warning] [8/0/4054 <rmmgr:80> _resource_cpu.c:3622] [software internal system syslog] The task bulkstat-0 is over its memory limit. Allocated 46080K, Using 48120K

Note: This Syslog message is at the warning level, so it is not output with the default logging configuration. To have it output, set the logging level for resmgr to warning.

[resmgr 14508 error] [5/0/6036 <rmmgr:50> _resource_cpu.c:3629] [software internal system critical-info syslog] The task cli-5013640 is way over its memory limit! Allocated 66560K, Using 98148K
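When many of these messages have to be reviewed, they can be filtered and summarized with a small parser. The sketch below is written against the three resmgr samples shown in this document only; other message wordings may exist and would not match. RESMGR_RE and parse_resmgr are hypothetical names.

```python
import re

# Matches the "over its limit" messages as they appear in the samples above,
# including the "it's" spelling used in the cputime message.
RESMGR_RE = re.compile(
    r"\[resmgr (?P<event_id>\d+) (?P<severity>\w+)\].*"
    r"The task (?P<facility>\S+)-(?P<instance>\d+) is (?:way )?over it'?s "
    r"(?P<resource>cputime|memory) limit[.!] "
    r"Allocated (?P<allocated>[\d.]+)(?P<unit>[K%]), Using (?P<used>[\d.]+)[K%]"
)


def parse_resmgr(line: str):
    """Return a summary dict for a resmgr limit message, or None if it does not match."""
    m = RESMGR_RE.search(line)
    if m is None:
        return None
    return {
        "event_id": int(m["event_id"]),
        "severity": m["severity"],
        "task": f'{m["facility"]}-{m["instance"]}',
        "resource": m["resource"],
        "unit": m["unit"],
        "used_vs_allocated": float(m["used"]) / float(m["allocated"]),
    }


sample = (
    "[resmgr 14508 error] [5/0/6036 <rmmgr:50> _resource_cpu.c:3629] "
    "[software internal system critical-info syslog] The task cli-5013640 is way over "
    "its memory limit! Allocated 66560K, Using 98148K"
)
info = parse_resmgr(sample)
print(f'{info["severity"]}: {info["task"]} {info["resource"]} '
      f'at {info["used_vs_allocated"]:.0%} of threshold')
# prints: error: cli-5013640 memory at 147% of threshold
```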

Logs to Collect

If investigation is required, the following logs are needed. Some of the commands must be run in hidden mode. As a rule, use hidden mode only when instructed to do so by TAC.

CPU utilization

show task resources
show task resource max
show snmp trap history
show profile facility <process name> instance <instance#> depth 4 (hidden mode)

Memory usage

show task resources
show task resource max
show snmp trap history

Collect the following two commands several times at intervals (for example, 4 times at 15-minute intervals).

show messenger proclet facility <process name> instance <instance#> heap (hidden mode)
show messenger proclet facility <process name> instance <instance#> system heap (hidden mode)

Common to CPU/Memory

show support detail

Example of the show profile command

[local]asr5500-1# show profile facility sessmgr instance 1 depth 4                                                                                                      
 100.0%          1  libc.so.6/writev()
                   [0ab2b705/X] xtcp_conn_write()
                   [0ab41220/X] sn_msg_xtcp_send_reply()
                   [0ab136eb/X] sn_msg_handle_issue_response()
Total samples collected: 1

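An example of the show messenger proclet command is omitted because the output is long, but collect it in the same way by specifying the process name and instance number. The process name and instance number can be obtained from sources such as the SNMP trap.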

Internal trap notification 1221 (MemoryOver) facility cli instance 5010046 card 5 cpu 0 allocated 66560 used 89940
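As an illustration of how the facility and instance in a trap map onto the commands listed in this section, the following sketch fills in the command templates from a trap line. It only builds strings and does not connect to or run anything on the chassis. The function commands_for and its kind argument are hypothetical, and the hidden-mode commands it produces should still be run only when TAC asks for them.

```python
def commands_for(trap_line: str, kind: str) -> list:
    """Build the log-collection commands for the process named in a trap line.

    kind is 'cpu' or 'memory' (a hypothetical selector for the two lists above).
    """
    tokens = trap_line.split()
    facility = tokens[tokens.index("facility") + 1]
    instance = tokens[tokens.index("instance") + 1]

    common = ["show task resources", "show task resource max", "show snmp trap history"]
    if kind == "cpu":
        specific = [f"show profile facility {facility} instance {instance} depth 4"]
    elif kind == "memory":
        # These two are the ones to repeat at intervals, e.g. 4 times, 15 minutes apart.
        specific = [
            f"show messenger proclet facility {facility} instance {instance} heap",
            f"show messenger proclet facility {facility} instance {instance} system heap",
        ]
    else:
        raise ValueError("kind must be 'cpu' or 'memory'")
    return common + specific + ["show support detail"]


trap_line = ("Internal trap notification 1221 (MemoryOver) facility cli instance 5010046 "
             "card 5 cpu 0 allocated 66560 used 89940")
for command in commands_for(trap_line, "memory"):
    print(command)
```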