04-14-2014 08:20 AM
Hi,
Hopefully somebody can help?
We were upgrading our UCS cluster; one of the 6248s upgraded okay, but the other one crashed at 51% and now boots to a bash-2.05b# prompt, accompanied by the following messages:
bash-2.05b# 2014 Apr 14 15:12:36 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_extvmmAG crashed with crash type:134
2014 Apr 14 15:12:36 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:12:38 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_statsAG crashed with crash type:134
2014 Apr 14 15:12:38 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:12:39 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_portAG crashed with crash type:134
2014 Apr 14 15:12:39 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:15:34 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_dcosAG crashed with crash type:134
2014 Apr 14 15:15:34 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:17:30 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_extvmmAG crashed with crash type:134
2014 Apr 14 15:17:30 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:17:32 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_portAG crashed with crash type:134
2014 Apr 14 15:17:32 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
2014 Apr 14 15:17:33 %$ VDC-1 %$ %CALLHOME-2-EVENT: svc_sam_statsAG crashed with crash type:134
2014 Apr 14 15:17:33 %$ VDC-1 %$ %CALLHOME-2-EVENT: SW_CRASH
I cannot find how to recover from this, as most of the documentation I have found does not quite cover this error.
Regards
Scott.
04-14-2014 11:11 AM
From which UCS version did you upgrade, and what was the target version?
Did you do a manual upgrade, or use Auto Install for the infrastructure firmware?
Disable Call Home before Upgrading to Avoid Unnecessary Alerts (Optional)
When you upgrade a Cisco UCS domain, Cisco UCS Manager restarts the components to complete the upgrade process. This restart causes events that are identical to service disruptions and component failures that trigger Call Home alerts to be sent. If you do not disable Call Home before you begin the upgrade, you can ignore the alerts generated by the upgrade-related component restarts.
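For reference, Call Home can be disabled from the UCS Manager CLI before an upgrade along these lines (a sketch from memory of the UCS Manager CLI; exact prompts may differ by version, so check the configuration guide for your release):

```
UCS-A# scope monitoring
UCS-A /monitoring # scope callhome
UCS-A /monitoring/callhome # disable
UCS-A /monitoring/callhome* # commit-buffer
```

Re-enable it the same way (`enable` followed by `commit-buffer`) once the upgrade has completed, so that genuine failures still generate alerts.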
04-15-2014 01:20 AM
Hi,
The upgrade was manual via the GUI, not Auto Install. We didn't disable Call Home :-(.
I think the original version was 2.0.2 and we were upgrading to 2.2.1b.
Any idea how we can recover the failed node? Here is an additional (hopefully helpful) extract from the log:
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: mts_acquire_q_space() failing - no space in dst sap 28, uuid 26, src sap 980, opcode 3176 - kernel
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: mts_acquire_q_space() failing - no space in sap 28, uuid 26 send_opc 3176, pid 6441, proc_name stats_client - kernel
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: node=4 sap=66 rq=0(0) lq=0(0) pq=801(2229984) nq=0(0) sq=0(0) buf_in_transit=0, bytes_in_transit=0 - kernel
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: node=4 sap=980 rq=0(0) lq=0(0) pq=0(0) nq=1(0) sq=0(0) buf_in_transit=204, bytes_in_transit=19496688 - kernel
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: node=4 sap=28 rq=209(19392060) lq=0(0) pq=0(0) nq=0(0) sq=0(0) buf_in_transit=0, bytes_in_transit=0 - kernel
2014 Apr 15 07:28:02 %$ VDC-1 %$ Apr 15 07:28:02 %KERN-2-SYSTEM_MSG: mts_deliver_local_atomic:mts_acquire_q_space failed for opcode 3176, src_sap = 980, num_dst 1, erro -16 - kernel
2014 Apr 15 07:28:53 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 6440) hasn't caught signal 6 (core will be saved).
04-15-2014 08:02 AM
Did you upgrade UCS Manager to 2.2.1 as the first step?
Maybe these help a bit:
http://jeffsaidso.com/2013/01/when-disaster-strikes/
https://supportforums.cisco.com/discussion/11964846/6248u-fabric-interconnect-bootloader-prompt
However, I would recommend opening a TAC case.
04-23-2014 12:06 PM
Jeff has mentioned a similar case, see
http://jeffsaidso.com/2014/04/fabric-interconnect-booting-to-bash/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+JeffSaidSo+%28Jeff+Said+So%29
His solution was very simple:
bash# erase configuration
That should do it. The FI will reboot and come back up as if it were brand new, asking you to create or join a cluster.
It goes without saying that this should not happen under normal circumstances, but I’ve heard rumblings of people seeing it here and there after upgrading to 2.2.x.
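For anyone landing in the same state, the recovery flow looks roughly like this (a sketch based on the thread; the setup prompts are paraphrased and may vary by UCS version):

```
bash-2.05b# erase configuration
! The FI reboots and runs the initial setup dialog on the console.
! With the healthy peer still up, it should detect the existing cluster:
Installer has detected the presence of a peer Fabric Interconnect.
This Fabric Interconnect will be added to the cluster. Continue (y/n)? y
Enter the admin password of the peer Fabric Interconnect:
```

After joining, the subordinate FI syncs its configuration from the surviving peer, so the domain configuration itself is not lost.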