Cisco Support Community

Initial actions to take when CUCM server failure occurs

 

Introduction:

This document describes the initial response measures to try before
rebooting the server when a failure occurs on a CUCM server.

Core Issue:

 

A detailed procedure is needed for the initial actions to take
when a CUCM server failure occurs.

Resolution:

 

The failure mainly assumed here is one where all services
are down and a hardware failure is suspected.
(This is the operational procedure used in our internal IT department.
Suggestions for improving the procedure and any other comments are welcome.)

 

0. Preparation

To prepare for failures, take the following actions during normal operations.

Periodic backup -- Use the scheduled backup feature and perform a backup at least once a week.

Setting information and change history management -- Document the CUCM settings, in particular the number of GWs/phone sets, the services running on each server, and the external numbers.

Baseline management -- Use RTMT to record CPU, memory and HDD usage in normal use. Also keep track of the number of registered GWs and phone sets.

Alarm setting -- E-mail alerts for events such as a Critical service going down can be set in RTMT.

Application of maintenance patch -- Develop a periodic server maintenance plan so that the latest service pack, ES patch and FWUCD firmware update are applied during planned downtime.

Trace level setting -- For the Cisco CallManager service, set the trace level to Detailed so that analysis goes smoothly if a failure occurs.

Posting of emergency contact information -- So that the right people can be reached promptly in an emergency, post the necessary information (administrator name, model number, serial number, support contract number, TAC contact information, etc.) in a noticeable location.

Preparation of recovery disc -- Prepare a recovery disc that matches the installed CUCM version for use in the event of a CUCM failure. For details of the recovery disc, see the separate document:
https://supportforums.cisco.com/docs/DOC-12782

 

1. Failure occurrence and assessment of the situation

When a failure is reported, assess the situation accurately. Confirm the following:

Which function(s) are interrupted, and whether all phone sets or only some services are down

Whether calls between local extensions are possible, whether outgoing external calls are possible, and whether incoming calls can be received

Whether the affected phone sets belong to a specific department or to the entire organization

What the phone set LCD screen displays, for example whether it shows a CM service down message or is attempting to obtain a DHCP address again

How long the failure has existed and what time it occurred

Whether any related action was taken before the failure occurred

Who found the failure first

The number of affected phone sets, and their models and MAC addresses

Whether the failure is intermittent or constant
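
The assessment items above could be captured in a structured record so that nothing is forgotten while the failure is ongoing. The following is only a minimal sketch: the class name FailureReport and all of its field names are hypothetical and not part of any Cisco tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FailureReport:
    """Hypothetical record of the assessment items listed in section 1."""
    interrupted_functions: List[str]
    internal_calls_ok: Optional[bool] = None      # None means "not yet checked"
    outgoing_external_calls_ok: Optional[bool] = None
    incoming_calls_ok: Optional[bool] = None
    scope: str = ""                # e.g. "specific department", "entire organization"
    lcd_display: str = ""          # what the phone set LCD screen shows
    started_at: Optional[datetime] = None
    preceding_action: str = ""     # related action before the failure, if any
    first_reporter: str = ""
    affected_phones: List[dict] = field(default_factory=list)  # model and MAC address
    intermittent: Optional[bool] = None

    def open_items(self) -> List[str]:
        """Return the names of assessment items that are still empty."""
        return [
            name for name, value in vars(self).items()
            if value is None or value == "" or value == []
        ]
```

Calling open_items() on a partially filled report lists the questions that still need an answer before moving on to the server status check.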

 

2. Server status check

Check the server status with various management tools.

Connect with RTMT and, if the connection succeeds, check on the Alert screen
which failures are reported

At the same time, check the CPU, memory and HDD performance. If possible,
retrieve the following logs

  • Eventlog-Application
  • Eventlog-System
  • Cisco CallManager
  • Cisco AMC Service
  • Cisco AMC Service DeviceLog
  • Cisco RIS Data Collector PerfMonLog
  • Logs necessary for each failure

Check whether the Cisco Unified CM Administration screen and the Cisco Unified Serviceability screen are accessible

If there are multiple nodes (servers) within the cluster, check from the Cisco Unified Reporting screen whether DB replication is running.

Check whether you can log into the administrator CLI via SSH. If the login succeeds, issue the following commands to retrieve information

  • show hardware
  • show status
  • show tech all
  • utils service list
  • utils create report hardware

Check whether the server responds to ping

Visually check the server status

Check whether the server LEDs, HDD LEDs, etc. show any warning that indicates an abnormal condition.

Check whether any error message is displayed on the server console.
If an error message is displayed, record it as accurately as possible.

Taking a picture of the message with a cell phone or compact camera is recommended.
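
The reachability checks in this section (SSH and the Web screens) can be scripted for a quick first look. The sketch below is an illustration only: it uses a plain TCP connect rather than ICMP ping, the port numbers 22 (SSH) and 443 (HTTPS) are assumptions to adjust for your deployment, and the function names are hypothetical.

```python
import socket
from typing import Dict, Optional


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_access(host: str, ports: Optional[Dict[str, int]] = None) -> Dict[str, bool]:
    """Check each named port and return {name: reachable}."""
    if ports is None:
        ports = {"ssh": 22, "https": 443}  # assumed ports, not taken from the source
    return {name: tcp_port_open(host, port) for name, port in ports.items()}
```

A result such as "ssh" reachable but "https" not reachable would point toward procedure B in the next section, while both unreachable would point toward procedure C.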

 

3. Initiation of server restoration

If all services are down, take the restoration steps below
in ascending order of impact.

A. If the Cisco Unified Serviceability screen is accessible
Check the service status under Tools > Control Center. If any service
that should be running is down, attempt to restart it.
If all services are down, as a rule restart the following services:
Control Center - Network Services - Cisco Database Layer Monitor
(restarts the DB monitor, which monitors database change operations)
Control Center - Feature Services - Cisco CallManager
(restarts the CallManager service, which is the core of call processing)

For each, select the service and click the Restart button.
If any other service that should be running is stopped, attempt to restart it as well.
Once the services have restarted successfully, check whether phone system operation is restored.

 

B. If the Cisco Unified Serviceability screen is inaccessible,
or services cannot be restarted from the Serviceability screen

Connect via SSH and attempt to restore the services by issuing commands from the CLI.
After logging in, use the utils service list command to see which services are currently running.

For each service, [STARTED] indicates that it is running and [STOPPED] indicates that it is stopped. However, [STOPPED] Service Not Activated means that the administrator never activated the service in the first place, so it is not a problem.

Pay attention to services whose status is [STARTING...] or [STOPPING].

They may genuinely be starting or stopping, but they may also have hung during the start or stop process.
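
When the service list is long, the states described above can be sorted mechanically from saved CLI output. The following is a minimal sketch, assuming each line has the shape "service name[STATE] optional note"; the sample text is made up for illustration and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumed line shape: "<service name>[<STATE>]  <optional note>"
LINE_RE = re.compile(r"^(?P<name>.+?)\s*\[(?P<state>[^\]]+)\]\s*(?P<note>.*)$")

# Illustrative sample only, not captured from a real server.
SAMPLE = """\
Cisco CallManager[STARTED]
Cisco Tomcat[STARTING...]
Cisco Tftp[STOPPED]
Cisco CAR Web Service[STOPPED]  Service Not Activated
"""


def parse_service_list(output: str) -> List[Dict[str, str]]:
    """Extract name, state and note from each line that matches the assumed shape."""
    services = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines without a [STATE] marker are skipped
            services.append({k: v.strip() for k, v in match.groupdict().items()})
    return services


def classify(services: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group service names by how much attention they need."""
    groups: Dict[str, List[str]] = {
        "running": [], "stopped": [], "not_activated": [], "suspect": [],
    }
    for svc in services:
        state = svc["state"].upper()
        if state == "STARTED":
            groups["running"].append(svc["name"])
        elif state == "STOPPED" and "not activated" in svc["note"].lower():
            groups["not_activated"].append(svc["name"])
        elif state == "STOPPED":
            groups["stopped"].append(svc["name"])
        else:
            # STARTING..., STOPPING or anything unexpected: possibly hung
            groups["suspect"].append(svc["name"])
    return groups
```

Services in the "suspect" and "stopped" groups are the ones to look at first; "not_activated" can be ignored.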

 

Restart the services as follows.
utils service restart Cisco Tomcat
(Restarts Tomcat, the servlet engine of the Web server)

If this service restarts successfully, the Web server functions may be restored.
In that case, go back to procedure A and restart the services from there.
If the Web screens remain inaccessible, restart the DB monitor from the CLI:
utils service restart Cisco Database Layer Monitor
(Restarts the DB monitor, which monitors database change operations)

If the services do not recover after the above, issue the
following command to restart all services within CUCM.
utils service restart Service Manager
Because all services are restarted, this may take about 10 minutes.

Restarting is complete when the admin: prompt returns.
Use the utils service list command to check the service status again.

If restarting the services as above does not restore the phone services,
restart the server from the CLI.

Issue utils system restart and enter yes to confirm the restart.

After the server has restarted, check whether the phone services are restored.
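
The escalation order of procedure B can be written down as a simple ladder so that the operator always knows the next, more disruptive step. This is only a sketch: the command strings are the ones given above, while the list name and the helper function are hypothetical, and the code only looks up the next command without running anything.

```python
from typing import Optional

# CLI steps of procedure B, in ascending order of impact.
ESCALATION_STEPS = [
    "utils service restart Cisco Tomcat",
    "utils service restart Cisco Database Layer Monitor",
    "utils service restart Service Manager",
    "utils system restart",
]


def next_step(last_tried: Optional[str]) -> Optional[str]:
    """Return the command to try next, or None when procedure B is exhausted.

    Pass None when nothing has been tried yet. An unknown command raises
    ValueError so that typos are noticed instead of silently skipped.
    """
    if last_tried is None:
        return ESCALATION_STEPS[0]
    index = ESCALATION_STEPS.index(last_tried)
    if index + 1 < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[index + 1]
    return None
```

After each step, check with utils service list whether the services have recovered before moving to the next one.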

 

C. If access to CLI via SSH fails

If SSH access to the CLI fails, you need physical access to the server console.
If you can access the CLI from the server console, first run
utils service restart System SSH

to check whether the SSH service can be restarted.

If this restart makes SSH access possible, return to procedure B to restart
the services. If it does not, try restarting the services from the server
console in the same way as in procedure B.

 

D. If access to the console CLI also fails

If the server cannot be accessed even after procedures A, B and C,
the server must be forcibly restarted. However, a forced restart may
corrupt the OS file system, so follow the procedure below.

Visually check the situation and record any error(s) displayed on the console,
as described in item 2

If a hardware failure can be confirmed, for example from an orange HDD LED,
arrange for replacement parts

Prepare the CUCM recovery disc

Press the server reset button to perform forced restart

After restarting, insert the recovery disc into the DVD drive

When the recovery disc is started, execute the file system check (FSCK)

After FSCK, remove the disc and restart the server

Collect logs and run diagnostic tests, using HP Smart Start on HP servers
and DSA on IBM servers

Apply the latest FWUCD

Restart the server

 

4. After booting the server, check that the phone services are resumed

Use utils service list to check the service status

Check the service status from Cisco Unified Serviceability

Check the number of registered GW/phone sets from RTMT

Check basic calls between local extensions

Check outgoing and incoming external calls (for each GW line number)

Check various services, etc.
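
The registration check above is most useful when compared against the baseline recorded in item 0. The sketch below assumes the counts have been read from RTMT by hand and typed into dictionaries; the function name, the keys and the numbers in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def compare_to_baseline(
    baseline: Dict[str, int],
    current: Dict[str, int],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[int, int]]:
    """Return {name: (expected, actual)} for every count below the baseline.

    tolerance is the accepted fractional shortfall, e.g. 0.05 accepts up to
    5 percent fewer registered devices than in the baseline.
    """
    shortfalls = {}
    for name, expected in baseline.items():
        actual = current.get(name, 0)  # a missing entry counts as zero
        if actual < expected * (1.0 - tolerance):
            shortfalls[name] = (expected, actual)
    return shortfalls
```

For example, with a baseline of {"phones": 250, "gateways": 4} and current counts of {"phones": 180, "gateways": 4}, the function returns {"phones": (250, 180)}, which shows that some phone sets have not yet re-registered.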

 

5. If necessary, retrieve logs and request analysis from TAC

The logs to retrieve are those described in item 2, plus the HP Smart Start or IBM DSA log

Also provide a detailed description of how the failure occurred, as assessed in item 1

 

(Additional note) Effect of service restarts

(1) Restart of the DB management service (Cisco Database Layer Monitor): Several minutes

During this restart, changes to the DB are not possible, but call processing continues
and is not affected.

(2) Restart of Cisco CallManager: Several minutes

During this restart, call processing (new outgoing, incoming and forwarded calls)
is not possible. Currently active calls are retained.

(3) Restart of Cisco Tomcat: Several minutes

The Web services (Cisco Unified CM Administration, Cisco Unified Serviceability,
Cisco Unified User, Extension Mobility, Web Dialer, and Click to Call)
are unavailable during this restart.
Call processing continues and is not affected.

(4) Restart of System SSH: Several minutes

New SSH sessions cannot be created. Existing SSH sessions are not affected.

Call processing continues and is not affected.

(5) Restart of Service Manager and all services: About 10 to 15 minutes

Because all services are restarted, call processing is also affected:
the Cisco CallManager service, which controls call processing, is among them.
However, currently active calls are retained.
Compared with a full server restart, the only time saved is the OS boot time.

 

Related Information

Original Document: Cisco Support Community Japan DOC-12739

Author: Shigeomi Shibata

Posted on August 26, 2010

https://supportforums.cisco.com/docs/DOC-12739

Version history
Revision #:
2 of 2
Last update:
‎08-29-2017 02:52 PM