RME server must be rebooted

kathegaj2 · ‎01-15-2009

When I run large Netconfig jobs, my RME server freezes. I can still ping it, but I cannot RDP or browse to it. I can't use the application at all until I hard-reboot the server.

I have heard other people say I have to break my jobs into batches of no more than 150 devices and run them sequentially, but I am hoping that is not the case.

Is there any way to check what is going on and correct this?

Joe Clarke · ‎01-15-2009

Netconfig jobs can contain much more than 150 devices. Testing has been done out to the thousands. I have not heard of any reports of Netconfig hanging the server. What version of LMS is installed? How many devices are in your job? How much memory is in this server? What are its CPU specs? Can you collect CPU statistics from the server's console when this problem is occurring?

kathegaj2 · ‎01-15-2009

1. LMS 3.0.1

2. The first job that took down the server had 363 devices. The second job (which never ran) had over 3000. I had to reboot the server before I could even get back in on the console.

3. CPU: Quad 2.6 GHz Dual-Core AMD Opteron Processor 2218 HE

Memory: 4094 MB

Volumes:

C:\ 12.00 GB NTFS

D:\ 56.33 GB NTFS

Free Space:

C:\ 2.17 GB NTFS

D:\ 18.81 GB NTFS

OS Edition / SP: Microsoft(R) Windows(R) Server 2003, Standard Edition / Service Pack 2

System Type: Server, Stand-alone, Terminal Server

4. Yes, I can turn on Performance Monitoring. What specific CPU statistics should I look for while I run another large job?

Joe Clarke · ‎01-15-2009

What process is taking up the most CPU at the time? How much CPU time is it taking? Use the PID from Process Monitor to find the CiscoWorks daemon in the pdshow output.

Also, what version of DFM do you have installed?

kathegaj2 · ‎01-15-2009

I am not running DFM.

What is the CiscoWorks daemon called in pshow?

Joe Clarke · ‎01-15-2009

No, use pdshow to map the PID from Process Monitor to the daemon name. Pdshow will list about 70 daemons. Each one that is running will have a PID.
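For anyone doing this lookup by hand across roughly 70 daemons, a small script can take over the matching. The following is only a minimal sketch: it assumes the pdshow output has been saved as text and that each daemon block contains a `Process=` line followed by a `Pid =` line. The daemon names and the sample text below are made up for illustration, so adjust the patterns to whatever your own pdshow output looks like.

```python
import re


def parse_pdshow(text):
    """Return {pid: daemon_name} for every daemon block that reports a nonzero PID.

    Assumes (not verified against every LMS release) that each block has a
    'Process= <name>' line followed later by a 'Pid = <number>' line.
    """
    pid_to_daemon = {}
    current = None
    for line in text.splitlines():
        name_match = re.match(r"\s*Process\s*=\s*(\S+)", line)
        if name_match:
            current = name_match.group(1)
            continue
        pid_match = re.match(r"\s*Pid\s*=\s*(\d+)", line)
        if pid_match and current is not None:
            pid = int(pid_match.group(1))
            if pid != 0:  # a PID of 0 is treated here as "not running"
                pid_to_daemon[pid] = current
    return pid_to_daemon


# Hypothetical sample; the daemon names are placeholders, not real LMS daemons.
SAMPLE = """\
    Process= DbEngineExample
    State  = Running normally
    Pid    = 3952
    Process= StoppedExample
    State  = Not running
    Pid    = 0
"""

if __name__ == "__main__":
    mapping = parse_pdshow(SAMPLE)
    # Look up a PID seen in Task Manager or Process Monitor.
    print(mapping.get(3952, "not a CiscoWorks daemon"))
```

Save the pdshow output to a file, feed its contents to `parse_pdshow`, and look up each busy PID. A PID that is missing from the mapping belongs to something other than a CiscoWorks daemon.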

kathegaj2 · ‎01-15-2009

dbsrv9.exe 469,460 K PID 3952

tomcat.exe 349,972 K PID 9376

cwjava.exe 84,120 K PIDs 6580, 9676, 10096, 9812, 9648, 9752, 5944, 3636, 8792, 5304

(cwjava.exe appears 20 times in total, ranging down to:)

cwjava.exe 10,160 K PID 7852

The order has not changed since I started the job. I have monitored for about 10 minutes. So far the server hasn't failed and CPU never breaks a sweat at 1%.

The devices that fail (and all of them did this when the server went down last night) report: "Deploy Succeeded. Synchronization of RME archive with Device Config failed."

Joe Clarke · ‎01-15-2009

When the server does fail, you need to map those PIDs to daemon names using pdshow. Without that, I cannot say for certain what the problem daemon is. For example, PID 3952 will map to one of the DbEngine daemons in the pdshow output.

kathegaj2 · ‎01-16-2009

I have confirmed this server crashes every time I run a large job.

I requested that performance monitoring be enabled, turned it on before I logged in to RME, and then replicated the failure.

The attached log is what I got. I am confused because it only shows two PIDs, and both of them are system processes: 1208 is services.exe and 544 is explorer.exe. The server person said it was more "sinister" than the network interface jumping off track. He called it a bugcheck that is destabilizing the OS and forcing a reboot. He referenced a multitude of potential problems, including a bug in the application, a conflict between applications on the server, or bad memory, but he retracted the last one, saying that the memory checks rule it out. He also stated that the RAM caps at 2 GB although it says 4 GB. He called it a Standard Edition 2 GB barrier on utilizing extended memory services, a software limit rather than a hardware one.

I have gone beyond my current skill limitations on this one but of course am willing to learn. I really need to get this fixed.

Attached is the "perfmon" capture that the server guy configured.

Joe Clarke · ‎01-16-2009

Nothing LMS does should crash the operating system. We do not operate at such a low level. However, system resources could become exhausted. But as you said, the problem processes here are system processes.

Windows 2003 Standard Edition can only support 4 GB of RAM. This is fine for managing up to 5000 devices. Each process will have a 2 GB memory limitation, but our individual processes don't come close to that.
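As a rough sanity check of that last point, the memory figures posted earlier in the thread can be compared against the 2 GB per-process limit. This is only a sketch: the posted numbers are presumably Task Manager memory usage rather than virtual address space, so treat the percentages as a ballpark.

```python
# 2 GB per-process limit, expressed in KB to match the posted figures.
PER_PROCESS_LIMIT_KB = 2 * 1024 * 1024

# Memory figures (in KB) as posted earlier in this thread.
posted_usage_kb = {
    "dbsrv9.exe": 469_460,
    "tomcat.exe": 349_972,
    "cwjava.exe (largest)": 84_120,
}


def share_of_limit(usage_kb, limit_kb=PER_PROCESS_LIMIT_KB):
    """Return usage as a percentage of the per-process limit."""
    return 100.0 * usage_kb / limit_kb


if __name__ == "__main__":
    for name, usage in posted_usage_kb.items():
        print(f"{name}: {share_of_limit(usage):.1f}% of 2 GB")
    # dbsrv9.exe: 22.4% of 2 GB
    # tomcat.exe: 16.7% of 2 GB
    # cwjava.exe (largest): 4.0% of 2 GB
```

Even the largest process (the database engine) sits under a quarter of the limit in that snapshot.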

Since I am not aware of any other stability problems with LMS on Windows (nothing like this), my gut says there is bad hardware (e.g. motherboard, CPU, cache, etc.). I suggest your sysadmin contact Microsoft and have them help debug where the root of the instability lies. If the problem does turn out to be related to LMS, then you can engage Cisco TAC, who can work with MS to come to a resolution.

Joe Clarke · ‎01-16-2009

I should also ask, when you say you're not running DFM does that mean it's not installed, or you're not using it? Having it installed could definitely cause server instability unless it is at version 3.0.4. The problem will be an exhaustion of non-paged memory resulting in loss of network access as well as process crashes.
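If you want to watch for that kind of non-paged memory exhaustion, one option is to log the `\Memory\Pool Nonpaged Bytes` performance counter while a job runs and check whether it keeps climbing. The sketch below assumes the counter was exported to a two-column CSV (timestamp, value), which is the layout perfmon and typeperf CSV logs normally use. The sample values and the host name are made up for illustration.

```python
import csv
import io


def nonpaged_growth(csv_text):
    """Return (first, last, delta) in bytes from a two-column counter log.

    The first row is treated as a header; rows whose value is blank or
    non-numeric (missed samples) are skipped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    values = []
    for row in rows[1:]:
        if len(row) < 2:
            continue
        try:
            values.append(float(row[1]))
        except ValueError:
            continue
    if not values:
        raise ValueError("no samples found")
    return values[0], values[-1], values[-1] - values[0]


# Hypothetical log excerpt; SERVER and all values are placeholders.
SAMPLE = r'''"(PDH-CSV 4.0)","\\SERVER\Memory\Pool Nonpaged Bytes"
"01/16/2009 10:00:00.000","52428800.000000"
"01/16/2009 10:00:05.000","73400320.000000"
"01/16/2009 10:00:10.000","104857600.000000"
'''

if __name__ == "__main__":
    first, last, delta = nonpaged_growth(SAMPLE)
    mb = 1024 * 1024
    print(f"start {first / mb:.1f} MB, end {last / mb:.1f} MB, growth {delta / mb:.1f} MB")
```

A counter that grows steadily during the job and never comes back down would point at a leak in kernel memory (typically a driver) rather than at an LMS user-mode process.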

kathegaj2 · ‎01-19-2009

It is not installed.

I was talking to another CiscoWorks admin who previously worked at IBM, and he said this is a known problem:

"We have encountered this situation before, where we run the job which has more than 150 devices to push.

What happens is that once you run a job with more than 150 devices, the job tends to tie up server resources and never release them. So if you run 6-10 jobs, all the resources will be consumed and never released, resulting in loss of communication with the CiscoWorks server.

The solution recommended by Cisco was to break the devices into groups of no more than 150. So if you have 700 devices, you need to make 5 groups, and configure the 5th group so that any device added to DCA is automatically added to the 5th group. When you run the jobs, run them in sequential mode (not in parallel mode).

On top of that, you need to get with the server guy to schedule a job that reboots the CiscoWorks server every week; this will help free up the Cisco resources. (Cisco will never agree with this, but this is the truth.)

We followed the above practice at IBM for 120 CiscoWorks servers."

But you say you have never had this issue before so there must be a solution that doesn't involve breaking my large network into separate work orders...
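For reference, the batching workaround quoted above boils down to splitting the device list into chunks of at most 150 and running one job per chunk, one after another. A minimal sketch of the split, using a made-up list of 700 device names (nothing here talks to RME or creates Netconfig jobs):

```python
def batch_devices(devices, max_per_job=150):
    """Split a device list into consecutive batches of at most max_per_job."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    return [
        devices[start:start + max_per_job]
        for start in range(0, len(devices), max_per_job)
    ]


if __name__ == "__main__":
    # Placeholder device names for illustration only.
    devices = [f"device-{n:03d}" for n in range(1, 701)]
    batches = batch_devices(devices)
    print(len(batches), [len(batch) for batch in batches])
    # 5 [150, 150, 150, 150, 100]
```

With 700 devices this gives the 5 groups mentioned in the quote, the last one holding the remaining 100.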

Joe Clarke · ‎01-19-2009

This may have been the case for RME 3.5, but not for 4.0. In 4.0, Netconfig jobs are run as separate JVM processes. When the job completes, the process dies, and all resources are returned to the system (as with any other process that terminates).

As I said, it would be helpful to get Microsoft's insight into this so that the actual cause (i.e. the specific resource or resources being drained) can be identified. Since I'm not hearing about a rash of such problems, I strongly suspect a hardware issue that manifests itself under network or disk load.

kathegaj2 · ‎01-19-2009

Thanks JC, the server guys are looking into it now. I will open TAC for the next step as needed...