CUCM hangs and is rebooted by ASR after 6.1.2 upgrade

astanislaus · ‎01-29-2009

We upgraded a cluster of three CUCM servers from software version 6.0.2 to 6.1.2.

These are HP DL380 G5 boxes.

After the upgrade, ASR (Automatic System Recovery) caused two of the three servers (one Publisher and one Subscriber) to reboot once or twice a day. We then upgraded to 6.1.3 hoping that would resolve it, but the problem persisted.

The two servers with the reboot problem (one Pub and one Sub, or one First Node and one Subsequent Node in the new terminology) are in Site 1; the stable Sub is in Site 2, about 20 km away. AC power was eliminated as a cause because this data centre has UPS and hosts a large number of core Cat 6500 switches and many other Windows and Linux servers, none of which have had problems.

We disabled ASR, but that only prevented the automatic restart; the servers hung instead. When they hung we could still reach them through iLO and click Reset to restart them.

The Pub and the Sub never hang at the same time, only at different times.

Fix:

====

Site 1 also had a CUPS Presence Server running 6.0.2.

The interesting thing is that we rebooted this CUPS box and we haven't seen the problem for 1 week now.

I wonder if anyone else has seen this symptom.

02-04-2009

Since the server itself reboots, the issue might not be related to CallManager but more of a platform or hardware issue.

There is an issue with the HP ASR (Automatic System Recovery) agent that causes the server to reboot randomly.

Check the bug: CSCsi75567

astanislaus · ‎02-04-2009

CSCsi75567 was initially thought to be the problem, and TAC applied the ciscocm.disable-hpasm.cop.sgn package, but we still had the same problem. Rebooting the CUPS server fixed it. It seems to have something to do with CUPS pushing policies to the CUCM servers.

Johann Aicher · ‎02-06-2009

Be aware of the following bug in CUCM 6.1(2): CSCsv49493

7828-H3 server goes down with Journal Aborted error

Symptom:

Phone services will go down, and server will only be semi-responsive. Local console access will show the following error constantly scrolling across the screen.

EXT3-fs error (device sd(8,6)) in start_transaction: Journal has aborted
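If you have the console or log output saved as text, a quick scan for this message can confirm whether a hang matches this bug. The following is only a minimal sketch: the function name find_journal_aborts and the sample lines are made up for illustration, and how you capture the console text (for example by copying it from the iLO remote console) is left to you.

```python
import re

# Matches lines such as:
#   EXT3-fs error (device sd(8,6)) in start_transaction: Journal has aborted
JOURNAL_ABORT = re.compile(
    r"EXT3-fs error \(device (?P<device>.+?)\) in (?P<function>\w+): "
    r"Journal has aborted"
)


def find_journal_aborts(lines):
    """Return (line_number, device, function) for every matching line.

    `lines` is any iterable of strings, e.g. an open text file holding
    captured console output (hypothetical input, not a CUCM API).
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = JOURNAL_ABORT.search(line)
        if match:
            hits.append((number, match.group("device"), match.group("function")))
    return hits


if __name__ == "__main__":
    # Illustrative sample only; the first line is invented filler.
    sample = [
        "some unrelated console line",
        "EXT3-fs error (device sd(8,6)) in start_transaction: Journal has aborted",
    ]
    for number, device, function in find_journal_aborts(sample):
        print(f"line {number}: journal aborted on {device} in {function}")
```

If the scan finds nothing, the hang is probably not this particular bug and the ASR or CUPS angle discussed earlier in the thread is worth another look.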

Conditions:

During normal operation services will go down. Reboot will bring services back up for a while, anywhere between a couple of hours and a couple of days. Seen most frequently on MCS7828-H3-K9/BE but has been reported on MCS7825-H2-IPC1 and MCS7825-H3.

Workaround:

Shut down the server, and remove the first hard drive until a final fix is available.

If server still fails, try switching to the other drive. Watch during boot up for any errors which might indicate hardware failure (SMART errors in particular).

If the server still fails on the second drive, leave one drive in and reinstall CUCM.