Solved: LMS 3.2 (solaris 10); RME 4.3.1 - ArchivePurge Job continuously failing

Martin Ermel · ‎07-27-2010

Currently there are 2134 devices in DCR. The customer originally had problems with UT reports and the error message "ogs_server_urn not found". During troubleshooting I noticed that there were around 30000 instances in the job history. After cleaning up the majority of these entries, one problem with the ArchivePurge job is left over:
This job always ends with a status of "Failed" and no files are purged. Around 20 - 30 mins after its start time a /opt/CSCOpx/java_pidxxxx.hprof file is generated, but the job stays in the running state for the next 27 hours. Then it ends with little information in the job log.

I attached some info which I collected during troubleshooting. If necessary, I also have a truss of the PID from the very end, starting before it writes to ResultSummary.obj and running until the process finishes.
The job was also deleted and re-added; the information collected is from this new job.

Does this point to a memory leak, or is this just a problem with the value of ConfigJobManager.heapsize=512 in
/opt/CSCOpx/MDC/tomcat/webapps/rme/WEB-INF/classes/JobManager.properties?

Joe Clarke · ‎07-29-2010

Most likely, this is a memory exhaustion problem. Double the heap size for ConfigJobManager, then reschedule your purge job, and see if it completes successfully.
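For reference, the heap size change could be scripted roughly as below. This is only a sketch: it assumes the property appears on a line of its own as ConfigJobManager.heapsize=<value>, and the function name set_heapsize is made up for illustration. It keeps a .bak copy and avoids sed -i, which not every sed supports.

```shell
# Sketch: rewrite the ConfigJobManager.heapsize line in a properties file.
# set_heapsize is a hypothetical helper name, not part of LMS.
set_heapsize() {
    props=$1       # path to JobManager.properties
    new_size=$2    # new heap size value, e.g. 1024

    # Keep a backup, then rewrite the original file in place from the backup
    # so that the file keeps its owner and permissions.
    cp "$props" "$props.bak" || return 1
    sed "s/^ConfigJobManager\.heapsize=.*/ConfigJobManager.heapsize=$new_size/" \
        "$props.bak" > "$props"
}
```

Doubling the value from the first post would then be set_heapsize /opt/CSCOpx/MDC/tomcat/webapps/rme/WEB-INF/classes/JobManager.properties 1024.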

Martin Ermel · ‎07-30-2010

The change just doubled the time until the java_pidxxxx.hprof file was generated...
I collected some Java thread dumps of the running PID and also a truss on that PID until the hprof file was generated. I opened SR615037705 and provided all the collected information.
The customer will follow up on this issue as I am on vacation for the next 3 weeks.. :-))

Joe Clarke · ‎07-30-2010

That seems to confirm it's memory related. Bumping the heap to 1536 may get the job done, but you could also try 1700 as a maximum.

Martin Ermel · ‎07-30-2010

Is it necessary to delete the ArchivePurge job and recreate it, or is changing the heap size sufficient?

Joe Clarke · ‎07-30-2010

You will need to delete and recreate the job.

Martin Ermel · ‎09-03-2010

In the end, changing the heap size did not resolve the issue, but investigating this a little further showed why...
The archive files for around 2100 devices were never purged in the past, and due to restores of the databases across several LMS releases (i.e. years) there were over 1.2 million files.
Finally, a wrapper script that purged the archive for the devices one by one for a specific time range dramatically reduced the number of files:
/opt/CSCOpx/bin/cwcli config delete -u admin -l doing$host.log -device $host -date 01/01/2000 01/01/2010
To give a feeling for the amount of work involved: the script ran for 12 days (good that this installation is running on Solaris).
But now it is solved!
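The wrapper script itself was not posted. A minimal sketch of such a per-device loop might look like the following, assuming a POSIX shell (for example ksh) and a hypothetical text file with one device name per line. The function name purge_device_archives and the CWCLI override are assumptions for illustration; the cwcli invocation is the one quoted above.

```shell
# Sketch of a per-device purge loop around the cwcli command from the post.
# CWCLI can be overridden, e.g. with a stub script for a dry run.
CWCLI=${CWCLI:-/opt/CSCOpx/bin/cwcli}

purge_device_archives() {
    device_list=$1   # hypothetical file: one device name per line
    from_date=$2     # start of the purge range, e.g. 01/01/2000
    to_date=$3       # end of the purge range, e.g. 01/01/2010
    failed=0

    # Read the list on fd 3 so that cwcli keeps its normal stdin.
    while read -r host <&3; do
        if [ -z "$host" ]; then
            continue   # skip blank lines
        fi
        # One device per invocation, with a separate log file per device.
        if ! "$CWCLI" config delete -u admin -l "doing$host.log" \
                -device "$host" -date "$from_date" "$to_date"; then
            echo "purge failed for $host" >&2
            failed=$((failed + 1))
        fi
    done 3< "$device_list"

    # Return success only if every device was purged without error.
    [ "$failed" -eq 0 ]
}
```

It would be called as purge_device_archives devices.txt 01/01/2000 01/01/2010. Handling one device per call keeps each cwcli run small, at the cost of a long total runtime.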