I have an MCS-7845-H3 running CUCM 6.1.4 as a subscriber, which has been performing Automated System Resets about twice a day. It looks like a hardware problem, possibly with the memory, as the other servers in the same cluster are stable. We don't want to swap the entire server, as the cluster is running in secure mode and that would require a complete reboot of all devices.
How do I investigate this problem in more detail? Can you use the HP Diagnostics CD directly on the MCS? Can you access the Linux OS logs?
All suggestions gratefully received!
You can also try running 'utils diagnose test' from an SSH session
and upload the result output. If displayed, please provide me with the log in
Yes, you can use the HP Diagnostics CD to verify all hardware components.
ASR (Automatic Server Recovery) is a feature set up in the BIOS to allow
the box to recover from a hang situation (e.g. running out of resources
when an application does not release memory as it should).
The quickest way to discover if the issue is with ASR is to either set
the ASR timeout really high or to just disable ASR. This can be done
through the RBSU. While the server is booting, enter RBSU using the F9 key, then
select "Server Availability", then select "ASR Status" to toggle ASR
enabled/disabled. Alternatively, select "ASR Timeout" and increase the
timeout value to 30 minutes.
That way, if the server shutdown/reboot is being triggered by ASR, we will get more details.
The way ASR works is that if it doesn't get a response from the system for a certain period of time, it resets the server. This can also get
in the way of identifying the root cause. In this scenario, a memory dump could have happened (which would have helped identify the root cause), but
ASR kicked in and prevented the memory dump. Since you are interested in root cause analysis, my recommendation would be to disable ASR and make
sure CUCM logs are grabbed when the issue occurs.
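To illustrate why a short ASR timeout can hide the root cause, here is a rough Python sketch of a generic watchdog timer. It is only a toy model, not HP's actual implementation: the class and function names are hypothetical, and the 10-minute timeout and 20-minute dump duration are assumed values chosen for illustration (only the 30-minute setting comes from the advice above).

```python
class AsrWatchdog:
    """Toy model of a hardware watchdog such as ASR (hypothetical, for illustration)."""

    def __init__(self, timeout_s, enabled=True):
        self.timeout_s = timeout_s
        self.enabled = enabled
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        # A healthy system periodically signals that it is still alive.
        self.last_heartbeat = now

    def should_reset(self, now):
        # The watchdog fires only if it is enabled and the timer has expired.
        return self.enabled and (now - self.last_heartbeat) > self.timeout_s


def simulate(watchdog, hang_at, dump_duration, tick=1.0):
    """Return 'reset' if the watchdog fires before the dump finishes,
    otherwise 'dump_complete'. All times are in seconds."""
    now = 0.0
    dump_done_at = hang_at + dump_duration
    while now < dump_done_at:
        if now < hang_at:
            watchdog.heartbeat(now)  # heartbeats stop once the system hangs
        if watchdog.should_reset(now):
            return "reset"
        now += tick
    return "dump_complete"


# Assumed numbers: the system hangs at t=100 s and a dump would need 20 minutes.
print(simulate(AsrWatchdog(timeout_s=600), hang_at=100, dump_duration=1200))
print(simulate(AsrWatchdog(timeout_s=1800), hang_at=100, dump_duration=1200))
print(simulate(AsrWatchdog(timeout_s=600, enabled=False), hang_at=100, dump_duration=1200))
```

With the assumed 10-minute timeout the reset wins the race and the dump is lost; with the timeout raised to 30 minutes, or with the watchdog disabled, the dump has time to complete.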
To disable ASR:
During POST, press the F9 key.
In the BIOS, scroll down to the SERVER AVAILABILITY option.
ASR will show ENABLE or DISABLE.
Select DISABLE.
Reboot the server and monitor.
With ASR disabled, if the server crashes it will lock up instead of rebooting.
Things to check from a hardware perspective are POST errors at boot and
amber or red LEDs on the front of the box. The top-most LED is the
internal health LED (it looks like a jagged sine wave inside a monitor). If
this is red or amber, it indicates a problem with something internal. You
would then pull off the cover; on the underside of the cover is a
decal which identifies the "internal health" LEDs on the system board. One of
these may be red, pinpointing the problem component.