
IOM Crash and Coldstart

Walter Dey
VIP Alumni

I have a customer with 3 different domains where, from time to time (months apart!), IOMs crash and create a dump file.

Versions are 2.1.1b and 2.0.4b

No TAC case is open yet. Is this a known bug?

Thanks for any clarification

Walter.

 

Apr 23 07:17:15 192.168.15.16 : 2014 Apr 23 07:17:15 CEST: %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-3 Off-line (Serial Number FCH172570Q6)
Apr 23 07:17:15 192.168.15.16 : 2014 Apr 23 07:17:15 CEST: %PFMA-2-FEX_STATUS: Fex 3 is offline
Apr 23 07:17:37 192.168.15.16 : 2014 Apr 23 07:17:37 CEST: %UCSM-2-EQUIPMENT_INACCESSIBLE: [F0478][critical][equipment-inaccessible][sys/chassis-3/slot-1] left IOM 3/1 (A) is inaccessible
Apr 23 07:20:37 192.168.15.16 : 2014 Apr 23 07:20:37 CEST: %SATCTRL-FEX3  -2-SATCTRL: IOM-0   Module 1: Cold boot
Apr 23 07:20:44 192.168.15.16 : 2014 Apr 23 07:20:44 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
Apr 23 07:20:44 192.168.15.16 : 2014 Apr 23 07:20:44 CEST: %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-3 On-line
Apr 23 07:20:44 192.168.15.16 : 2014 Apr 23 07:20:44 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
Apr 23 07:20:45 192.168.15.16 : 2014 Apr 23 07:20:45 CEST: %UCSM-2-EQUIPMENT_INACCESSIBLE: [F0478][cleared][equipment-inaccessible][sys/chassis-3/slot-1] left IOM 3/1 (A) is inaccessible
Apr 23 08:21:14 192.168.15.17 : 2014 Apr 23 08:21:14 CEST: %PFMA-2-FEX_STATUS: Fex 3 is offline
Apr 23 08:21:14 192.168.15.17 : 2014 Apr 23 08:21:14 CEST: %NOHMS-2-NOHMS_ENV_FEX_OFFLINE: FEX-3 Off-line (Serial Number FCH1725J3LS)
Apr 23 08:21:38 192.168.15.16 : 2014 Apr 23 08:21:37 CEST: %UCSM-2-EQUIPMENT_INACCESSIBLE: [F0478][critical][equipment-inaccessible][sys/chassis-3/slot-2] right IOM 3/2 (B) is inaccessible
Apr 23 08:24:32 192.168.15.17 : 2014 Apr 23 08:24:32 CEST: %SATCTRL-FEX3  -2-SATCTRL: IOM-0   Module 1: Cold boot
Apr 23 08:24:40 192.168.15.17 : 2014 Apr 23 08:24:40 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
Apr 23 08:24:40 192.168.15.17 : 2014 Apr 23 08:24:40 CEST: %NOHMS-2-NOHMS_ENV_FEX_ONLINE: FEX-3 On-line
Apr 23 08:24:40 192.168.15.17 : 2014 Apr 23 08:24:40 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
Apr 23 08:25:08 192.168.15.16 : 2014 Apr 23 08:25:08 CEST: %UCSM-2-EQUIPMENT_INACCESSIBLE: [F0478][cleared][equipment-inaccessible][sys/chassis-3/slot-2] right IOM 3/2 (B) is inaccessible
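
To put a number on how long each FEX stayed down, the offline/online pairs in the log above can be matched per reporting host. This is only a rough sketch: the regular expression and the function name fex_outages are my own, written against the line format shown in this post, and only the %PFMA-2-FEX_STATUS lines are used.

```python
import re
from datetime import datetime

# Four lines copied verbatim from the log above (one offline/online pair per host).
LOG = """\
Apr 23 07:17:15 192.168.15.16 : 2014 Apr 23 07:17:15 CEST: %PFMA-2-FEX_STATUS: Fex 3 is offline
Apr 23 07:20:44 192.168.15.16 : 2014 Apr 23 07:20:44 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
Apr 23 08:21:14 192.168.15.17 : 2014 Apr 23 08:21:14 CEST: %PFMA-2-FEX_STATUS: Fex 3 is offline
Apr 23 08:24:40 192.168.15.17 : 2014 Apr 23 08:24:40 CEST: %PFMA-2-FEX_STATUS: Fex 3 is online
"""

# Assumed line layout: syslog prefix, reporting host, device timestamp, message.
PATTERN = re.compile(
    r"^\w{3} +\d+ [\d:]+ (?P<host>\S+) : "
    r"(?P<ts>\d{4} \w{3} +\d+ [\d:]+) \w+: "
    r"%PFMA-2-FEX_STATUS: Fex (?P<fex>\d+) is (?P<state>offline|online)$"
)


def fex_outages(log_text):
    """Return (host, fex, seconds_offline) for each offline/online pair."""
    went_offline = {}  # (host, fex) -> time the FEX was reported offline
    outages = []
    for line in log_text.splitlines():
        m = PATTERN.match(line.strip())
        if not m:
            continue  # ignore every other message type
        key = (m["host"], int(m["fex"]))
        ts = datetime.strptime(m["ts"], "%Y %b %d %H:%M:%S")
        if m["state"] == "offline":
            went_offline[key] = ts
        elif key in went_offline:
            # First "online" after an "offline" closes the outage;
            # repeated "online" lines are ignored.
            start = went_offline.pop(key)
            outages.append((key[0], key[1], int((ts - start).total_seconds())))
    return outages


for host, fex, seconds in fex_outages(LOG):
    print(f"{host}: FEX {fex} offline for {seconds} s")
```

On these four lines, both outages come out at roughly three and a half minutes, which matches the cold boot visible in the log.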

1 Accepted Solution


Those two bugs are pretty much the same, did you have any question about them?

-Kenny


6 Replies

Walter Dey
VIP Alumni

Found this:

IOM-2208 / 2204 : Unexpected Power down with sequencer FAULT
CSCuf18380

Symptom:
A Cisco 2204XP or 2208XP Fabric Extender (FEX) experiences a power-down condition while the rest of the chassis is still receiving power.

Conditions:
In very rare instances a Cisco UCSM chassis during normal system operation, regardless of the load on the blade or the system, can experience a power-down condition on one of its IO-Modules (IOM).

Workaround:
There is no work-around for this issue. Affected customers should call technical support to replace the part that failed.
 

Walter,

Have you already considered this bug? https://tools.cisco.com/bugsearch/bug/CSCtz27298/?reffering_site=dumpcr (check it regardless of the UCSM version you mentioned)

Gather a chassis show tech, go to the IOM in question > bmc, and look for a file called "mem_low_critical" (that name is from memory, so it may not be exact).

Let me know if you find something similar.

 

-Kenny

UCS IOM bmcd memory leak can generate kernel core and crashes IOM
CSCuf61116
Symptom:
IOM crash due to memory leak.

Additionally upgrade process may be impacted due to low memory available.

Conditions:
This has been experienced in a UCS environment running 2.1(1a). The decodes of the crash point to a memory leak in the BMCD process:

USER  PID  %CPU  %MEM  VSZ     RSS    TTY  STAT  START  TIME    COMMAND
root  458  0.0   38.2  118240  98052  ?    Sl    2012   481:44  bmcd -F

Workaround:
It's best to upgrade to get past this issue.

It's possible to restart the bmcd process to temporarily clear the memory leak. This can buy days, weeks, or possibly months before the out-of-memory condition recurs.
Contact TAC to do this. This can be a strategy to avoid any operational impact until a software upgrade window.
Note: If free memory is too low it will impact software update.

Further Problem Description:
Information about CPU and memory usage can be obtained from UCS CLI via:
connect IOM X (where X is the chassis number)
show platform software cmcctrl process info
show platform software cmcctrl process meminfo

Specifically for this bug:
show platform software cmcctrl process info | i bmc
Memory utilization (4th column) should show below 15% in a healthy environment.

show platform software cmcctrl process meminfo | i MemFree
Should show above 16MB of memory free (~16384 kB) in a healthy environment.
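
If the output of those two commands is saved to text, the two thresholds above can be checked with a few lines of script. This is a minimal sketch, not a Cisco tool: it assumes the process info output looks like the ps-style line quoted in this bug and that the meminfo output contains a "MemFree: <n> kB" line. The function names and the MemFree sample value are made up for illustration.

```python
BMCD_MEM_LIMIT_PCT = 15.0  # "below 15%" from the bug note
MEMFREE_MIN_KB = 16384     # "~16384 kB" from the bug note

# Process line quoted in the bug description.
SAMPLE_PROCESS_INFO = "root 458 0.0 38.2 118240 98052 ? Sl 2012 481:44 bmcd -F"
# Hypothetical meminfo line; the value is invented for this example.
SAMPLE_MEMINFO = "MemFree:         12000 kB"


def bmcd_mem_percent(process_info):
    """Return %MEM (4th column) of the first line mentioning bmcd, else None."""
    for line in process_info.splitlines():
        fields = line.split()
        if len(fields) >= 4 and "bmcd" in line:
            return float(fields[3])
    return None


def memfree_kb(meminfo):
    """Return the MemFree value in kB, else None."""
    for line in meminfo.splitlines():
        if line.strip().startswith("MemFree"):
            return int(line.split()[1])
    return None


def looks_healthy(process_info, meminfo):
    """True only if both values were found and are within the stated limits."""
    pct = bmcd_mem_percent(process_info)
    free = memfree_kb(meminfo)
    if pct is None or free is None:
        return False
    return pct < BMCD_MEM_LIMIT_PCT and free > MEMFREE_MIN_KB


print(bmcd_mem_percent(SAMPLE_PROCESS_INFO))
print(memfree_kb(SAMPLE_MEMINFO))
print(looks_healthy(SAMPLE_PROCESS_INFO, SAMPLE_MEMINFO))
```

The bmcd line from the bug description sits at 38.2 %MEM, well above the 15% limit, so the check reports it as unhealthy regardless of the free-memory figure.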

Those two bugs are pretty much the same, did you have any question about them?

-Kenny

Thanks Kenny !

The customer opened a TAC case. For me, the important point is that this is a software bug that has been fixed, so no hardware RMA is necessary.

Cool.

It is the bmcd process in the IOM consuming memory that is never released.

Glad you have it figured out.

-Kenny
