This document describes the initial response procedure to try before rebooting the server when a failure occurs on a CUCM server.
It was written because a detailed procedure was needed for the initial actions to take when a CUCM server fails.
The failure primarily assumed here is one where all services are down and a hardware failure is suspected. (This is essentially the operational procedure used by our internal IT. Suggestions for improving the procedure and any other comments are welcome.)
To prepare for failures, take the following actions during normal operations.
・ Periodic backup: Use the scheduled backup feature and take a backup at least once a week.
・ Configuration and change history management: Document the CUCM settings, in particular the number of GWs and phone sets, the services running on each server, and the external numbers.
・ Baseline management: Use RTMT to record CPU, memory and HDD usage under normal load. Also record the number of registered GWs and phone sets.
・ Alarm setting: RTMT can be set to send e-mail alerts for events such as a critical service going down.
・ Application of maintenance patches: Develop a periodic server maintenance plan so that the latest service pack, ES patch and FWUCD firmware update are applied during planned downtime.
・ Trace level setting: Set the trace level of the Cisco CallManager service to Detailed so that analysis goes smoothly if a failure occurs.
・ Posting of emergency contact information: So that the right people can be reached promptly in an emergency, post the necessary information (administrator name, model number, serial number, support contract number, TAC contact information, etc.) in a clearly visible location.
・ Preparation of recovery disc: Prepare a recovery disc for the relevant CUCM version, to be used in the event of a CUCM failure. For details of the recovery disc,
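The weekly backup rule in the preparation list can be checked mechanically. The following is a minimal sketch, assuming the completion time of the last successful backup is taken from your own operations records. The function name backup_is_stale and its max_age_days parameter are hypothetical and are not part of CUCM.

```python
from datetime import datetime, timedelta

# Hypothetical helper, not a CUCM API. The timestamp of the last successful
# backup is assumed to come from your own operations log.
def backup_is_stale(last_backup, now, max_age_days=7):
    """Return True if the last backup is older than the allowed age."""
    return now - last_backup > timedelta(days=max_age_days)

now = datetime(2024, 1, 15, 9, 0)
print(backup_is_stale(datetime(2024, 1, 10, 2, 0), now))  # backup about 5 days old
print(backup_is_stale(datetime(2024, 1, 1, 2, 0), now))   # backup about 14 days old
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to verify.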
1. Failure occurrence and assessment of the situation
If an occurrence of any failure is reported, accurately assess the status of the failure.
・ Interrupted function(s), and whether all phone sets or only some services are disconnected
・ Whether calls between local extensions are possible, whether calls to external numbers are possible, and whether incoming calls can be received
・ Whether the affected phone sets are of a specific department or for the entire organization
・ The status of the phone set LCD screen display, for example whether it shows "CM service down" or is attempting to obtain a DHCP address again
・ How long the failure has existed and what time it occurred
・ Whether there is any related action before the occurrence of the failure
・ Who found the failure first
・ The number of affected phone sets, and their models and MAC addresses
・ Whether the failure is intermittent or constant
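The assessment items above can be kept as one structured record so that nothing is forgotten while the failure is being handled. This is only a sketch: the FailureReport class and all of its field names are hypothetical and simply mirror the checklist.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical record that mirrors the assessment checklist.
@dataclass
class FailureReport:
    interrupted_functions: List[str] = field(default_factory=list)
    internal_calls_ok: Optional[bool] = None
    outgoing_external_ok: Optional[bool] = None
    incoming_external_ok: Optional[bool] = None
    scope: str = ""                # specific department or entire organization
    phone_display: str = ""        # what the phone LCD shows
    started_at: str = ""           # when the failure began
    preceding_actions: str = ""    # related work done before the failure
    first_reporter: str = ""
    affected_phones: List[str] = field(default_factory=list)  # "model / MAC"
    intermittent: Optional[bool] = None

    def missing_items(self):
        """Return the names of checklist items that are still unanswered."""
        return [name for name, value in vars(self).items()
                if value in (None, "", [])]

report = FailureReport(interrupted_functions=["all phones unregistered"],
                       internal_calls_ok=False)
print(report.missing_items())
```

An answer of False (for example, internal calls are not possible) counts as filled in; only items left at their empty defaults are reported as missing.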
2. Server status check
Check the server status with various management tools.
・ Connect with RTMT and, if the connection succeeds, check which failures are reported on the Alert screen
At the same time, check CPU, memory and HDD performance. If possible, retrieve the following logs
Cisco AMC Service
Cisco AMC Service DeviceLog
Cisco RIS Data Collector PerfMonLog
Logs necessary for each failure
・ Check whether the Cisco Unified CM Administration screen and
the Cisco Unified Serviceability screen are accessible
・ If there are multiple nodes (servers) within the cluster,
check if the DB replication is running from the Cisco Unified Reporting screen.
・ Check whether you can log in to the administrator CLI via SSH. If login succeeds, issue the following commands to retrieve information
show tech all
utils service list
utils create report hardware
・ Check whether the server responds to ping
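As a complement to ping, the SSH and Web access checks above can be approximated from a management PC by testing whether the TCP ports accept connections. This is a minimal sketch using only the Python standard library. Ports 22 (SSH) and 443 (HTTPS) are assumed defaults, so adjust them for your environment. An open port does not prove that login or the administration pages work.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

def reachability(host, ports=(22, 443)):
    """Map each port to True/False. Ports 22 and 443 are assumed defaults."""
    return {port: port_open(host, port) for port in ports}

# Usage (the hostname is a placeholder for your own server):
# print(reachability("cucm-pub.example.com"))
```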
Visually check the server status
・ Check if there is any warning that indicates an abnormal condition in server LEDs and HDD LEDs, etc.
・ Check whether any error message is displayed on the server console. If an error message is displayed, record it as accurately as possible.
It is recommended to take a picture of the message with your cell phone or compact camera.
3. Initiation of server restoration
If all services are down, take the restoration steps below in
ascending order of impact.
A. If the Cisco Unified Serviceability screen is accessible
Check the service status in Tools > Control Center. If any service that should be running is down, attempt to restart it. If all services are down, the basic procedure is to restart the following:
Control Center - Network Services - Cisco Database Layer Monitor (restarts the DB monitor that monitors database change operations)
Control Center - Feature Services - Cisco CallManager (restarts the CallManager service, which is the core of call processing)
Select each service in turn and click the Restart button. If any other service that should be running is stopped, attempt to restart it as well. Once the services have restarted successfully, check whether phone system operation is restored.
B. If the Cisco Unified Serviceability screen is inaccessible, or services cannot be rebooted from the Serviceability screen
Attempt to restore the services by issuing commands from the CLI over an SSH connection. After logging in, use the utils service list command to see which services are currently running.
For each service, [STARTED] indicates the running state, while [STOPPED] indicates the stopped state. However, [STOPPED] Service Not Activated indicates a service
that the administrator never activated, so it is not a problem.
Focus on services whose status is [STARTING...] or [STOPPING].
They may genuinely be starting or stopping, but they may also
have hung during the start or stop process.
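The status rules above can be applied to saved output of utils service list. The sketch below assumes each line has the form "Service Name[STATE]" with an optional trailing note; the sample text is illustrative and is not real output. The function names are hypothetical.

```python
import re

# Assumed line layout: "<service name>[<STATE>]  <optional note>".
LINE_RE = re.compile(r"^(?P<name>.+?)\s*\[(?P<state>[A-Z]+)[. ]*\]\s*(?P<note>.*)$")

def parse_service_list(text):
    """Return (name, state, note) tuples for every line that matches."""
    services = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            services.append((m.group("name"), m.group("state"),
                             m.group("note").strip()))
    return services

def needs_attention(services):
    """Names of services that may be hung or are stopped although activated."""
    flagged = []
    for name, state, note in services:
        if state in ("STARTING", "STOPPING"):
            flagged.append(name)
        elif state == "STOPPED" and "not activated" not in note.lower():
            flagged.append(name)
    return flagged

# Illustrative sample, not captured from a real server.
SAMPLE = """\
Cisco CallManager[STARTING...]
Cisco Tomcat[STARTED]
Cisco Tftp[STOPPED]
Cisco Extension Mobility[STOPPED]  Service Not Activated
"""

print(needs_attention(parse_service_list(SAMPLE)))
# prints ['Cisco CallManager', 'Cisco Tftp']
```

A service that stays in STARTING or STOPPING across repeated runs of the command is the more likely candidate for a hang.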
Restart the services using the procedure below.
utils service restart Cisco Tomcat
(Restarts Tomcat, the servlet engine of the Web server)
If this service restarts, the Web server functions may be restored. In that case, follow procedure A to restart the services.
utils service restart Cisco Database Layer Monitor
(Restarts the DB monitor that monitors database change operations)
If the services do not recover after the above steps, issue the following command to restart all services within CUCM.
utils service restart Service Manager
All services are restarted, so this may take about 10 minutes.
When the admin: prompt returns, the restart is complete. Use the utils service list command to check the service status again.
If the phone services cannot be restored by restarting the services as described above, restart the server with a command.
Issue utils system restart, then enter yes to execute the restart.
After the server has restarted, check whether the phone services are restored.
C. If access to CLI via SSH fails
If access to the CLI via SSH fails, you need to access the server console physically. If you can access the CLI from the server console, first use utils service restart System SSH
to check whether the SSH service can be restarted.
If SSH access becomes possible after this restart, return to procedure B
to restart the services. If it does not, restart the services from
the server console in the same way as in procedure B.
D. If access to the console CLI also fails
If access to the server still fails after procedures A, B and C, the server must be forcibly restarted. However, a forced restart may
corrupt the OS file system, so follow the procedure below.
・ Visually check the situation and record any errors displayed on the console,
as described in item 2
・ If a hardware failure can be confirmed, for example from an HDD's orange LED,
arrange for replacement parts
・ Prepare the CUCM recovery disc
・ Press the server reset button to force a restart
・ After restarting, insert the recovery disc into the DVD drive
・ When the recovery disc has booted, run the file system check (FSCK)
・ After FSCK, remove the disc and restart the server
・ Collect logs and run diagnostic tests using HP Smart Start on an HP server, or DSA on an IBM server
・ Apply the latest FWUCD
・ Restart the server
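The escalation order of procedures A to D can be summarized as a small decision function. This is a sketch of the logic described in this section only. The function and parameter names are hypothetical, and the access checks themselves are done by hand as in item 2.

```python
def choose_procedure(serviceability_ok, ssh_ok, console_ok,
                     gui_restart_failed=False):
    """Pick the least disruptive procedure (A to D) that is still available."""
    if serviceability_ok and not gui_restart_failed:
        return "A"  # restart services from Cisco Unified Serviceability
    if ssh_ok:
        return "B"  # restart services from the CLI over SSH
    if console_ok:
        return "C"  # restart System SSH or the services from the physical console
    return "D"      # forced restart, then FSCK from the recovery disc

print(choose_procedure(serviceability_ok=False, ssh_ok=True, console_ok=True))
# prints B
```

Procedure D is reached only when every other access path has failed, which matches the warning that a forced restart may corrupt the file system.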
4. After booting the server, check that the phone services are resumed
・ Use utils service list to check the service status
・ Check the service status from Cisco Unified Serviceability
・ Check the number of registered GW/phone sets from RTMT
・ Check basic calls between local extensions
・ Check outgoing and incoming external calls (for each GW line number)
・ Check various services, etc.
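The registered device counts checked in RTMT are easiest to judge against the baseline recorded during normal operations. The following sketch compares the two. The function name and the example numbers are hypothetical, and the counts are assumed to be read manually from RTMT.

```python
def registration_shortfall(baseline, current):
    """Return {device_type: missing_count} for types below their baseline."""
    return {kind: expected - current.get(kind, 0)
            for kind, expected in baseline.items()
            if current.get(kind, 0) < expected}

# Example values are invented for illustration.
baseline = {"phones": 250, "gateways": 4}
current = {"phones": 247, "gateways": 4}
print(registration_shortfall(baseline, current))
# prints {'phones': 3}
```

An empty result means every device type is back at or above its baseline count. Phones may take some time to re-register after a restart, so repeat the comparison before concluding that devices are missing.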
5. If necessary, retrieve logs and request analysis from TAC
・ The logs to retrieve are those described in item 2 and the HP Smart Start or IBM DSA log
・ Detailed description of the circumstances of the failure
As described above.
(Additional note) Effect of service restarts
(1) Restart of the DB management service: Several minutes
During this restart, changes to the DB are not possible, but call processing continues and is not affected.
(2) Restart of Cisco CallManager: Several minutes
During this restart, call processing (new outgoing, incoming and forwarding calls) is not possible. Currently active calls are retained.
(3) Restart of Cisco Tomcat: Several minutes
Access to the Web services (Cisco Unified CM Administration, Cisco Unified Serviceability, Cisco Unified User, Extension Mobility, Web Dialer, and Click to Call) is unavailable. Call processing continues and is not affected.
(4) Restart of System SSH: Several minutes
New SSH sessions cannot be created. Existing SSH sessions are not affected.
Call processing continues and is not affected.
(5) Restart of Service Manager and all services: About 10 to 15 minutes
As all services are restarted, call processing is also affected, because the Cisco CallManager service, which controls call processing, is among them. However, currently active calls are retained. Compared with a server restart, only the OS boot time is saved.
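The effects listed in this note can be kept as a lookup table so that an operator can see, before issuing a restart, whether new calls will be interrupted. The sketch below only restates the note. The table and function names are hypothetical, and "Cisco Database Layer Monitor" is assumed to be the DB management service of item (1).

```python
# (approximate duration, whether new calls are interrupted), taken from the note.
RESTART_IMPACT = {
    "Cisco Database Layer Monitor": ("several minutes", False),
    "Cisco CallManager": ("several minutes", True),
    "Cisco Tomcat": ("several minutes", False),
    "System SSH": ("several minutes", False),
    "Service Manager": ("about 10 to 15 minutes", True),
}

def interrupts_new_calls(service):
    """Return True if restarting the service stops new call processing."""
    return RESTART_IMPACT[service][1]

def restarts_without_call_impact():
    """Services whose restart leaves call processing unaffected."""
    return [name for name, (_, blocks) in RESTART_IMPACT.items() if not blocks]

print(restarts_without_call_impact())
# prints ['Cisco Database Layer Monitor', 'Cisco Tomcat', 'System SSH']
```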
Original Document: Cisco Support Community Japan DOC-12739