System Inventory collection and Inventory Collection jobs failed

yjdabear · ‎02-08-2008

I just noticed that Inventory Changes is 0, which never happens. It looks like the two inventory collection jobs have not been working. What's wrong with them?

System Inventory Collection

/var/adm/CSCOpx/files/rme/jobs/ICServer/1391/

Inventory Collection

/var/adm/CSCOpx/files/rme/jobs/ICServer/1594/

Joe Clarke · ‎02-08-2008

Looks like some of your RME daemons may have crashed. The daemons to check are RMECSTMServer and ICServer.

yjdabear · ‎02-12-2008

Seems fine in pdshow. Should I restart ICServer?

Process= RMECSTMServer

State = Running normally

Pid = 20654

RC = 0

Signo = 0

Start = 01/25/08 15:26:26

Stop = Not applicable

Core = Not applicable

Info = RMECSTMServer started.

Process= ICServer

State = Administrator has shut down this server

Pid = 0

RC = 1

Signo = 0

Start = 01/25/08 15:26:30

Stop = 01/26/08 01:23:52

Core = Not applicable

Info = ICServer started.
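For anyone who wants to check this automatically, here is a minimal sketch that parses pdshow-style output and lists the processes whose State does not begin with "Running". It assumes the "Key = value" layout shown in the output above; the function names `parse_pdshow` and `not_running` are made up for this example and are not part of CiscoWorks.

```python
# Sample copied from the pdshow output quoted in the post above.
SAMPLE = """\
Process= RMECSTMServer
State = Running normally
Pid = 20654
RC = 0
Signo = 0
Start = 01/25/08 15:26:26
Stop = Not applicable
Core = Not applicable
Info = RMECSTMServer started.

Process= ICServer
State = Administrator has shut down this server
Pid = 0
RC = 1
Signo = 0
Start = 01/25/08 15:26:30
Stop = 01/26/08 01:23:52
Core = Not applicable
Info = ICServer started.
"""


def parse_pdshow(text):
    """Split pdshow-style text into one dict per 'Process=' block.

    Assumption: every field is a single 'Key = value' line and each
    block starts with a 'Process' key, as in the sample above.
    """
    processes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue  # skip blank lines and anything that is not a field
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "Process":
            current = {"Process": value}
            processes.append(current)
        elif current is not None:
            current[key] = value
    return processes


def not_running(processes):
    """Return the blocks whose State does not start with 'Running'."""
    return [p for p in processes
            if not p.get("State", "").startswith("Running")]


if __name__ == "__main__":
    for proc in not_running(parse_pdshow(SAMPLE)):
        print(proc["Process"], "|", proc["State"], "| stopped:", proc["Stop"])
```

Run against the output above, the only process it reports is ICServer, whose Stop time (01/26/08 01:23:52) is the timestamp to look for in the log.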

Joe Clarke · ‎02-12-2008

Yes, but you should check the daemons.log (ICServer.log on Windows) for any indication of why it crashed in the first place. Note: a pdexec might not fix this. You may have to restart dmgtd.

yjdabear · ‎02-12-2008

Seems a number of tables were locked.

Joe Clarke · ‎02-12-2008

While these errors would prevent inventory from being successfully collected, they would not crash ICServer. Additionally, if you have a process which is stuck, and holding the locks on these tables, you will definitely need to restart dmgtd to recover.

Martin Ermel · ‎02-12-2008

There is a 'java.lang.OutOfMemoryError' at the very end of the log, on line 7987, which I think forced ICServer to exit:

[ Sat Jan 26 01:23:49 EST 2008 ],FATAL,[Thread-18],com.cisco.nm.rmeng.inventory.ics.server.InvDataProcessor,481,Fatal Error has Occured, exiting ICServer java.lang.OutOfMemoryError

But why did it occur? Could the process that locks the tables be the cause?
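To find entries like this without scrolling through thousands of lines, a rough sketch along these lines could help. The field layout (timestamp, level, thread, class, source line, message) is inferred only from the single log line quoted above, so other entries in the real log, such as multi-line stack traces, may not match; the names `parse_log_line` and `fatal_entries` are hypothetical.

```python
import re

# Layout inferred from the one quoted entry:
# [ timestamp ],LEVEL,[thread],class,source-line,message
LOG_RE = re.compile(
    r"^\[\s*(?P<timestamp>[^\]]+?)\s*\],"
    r"(?P<level>[A-Z]+),"
    r"\[(?P<thread>[^\]]+)\],"
    r"(?P<cls>[^,]+),"
    r"(?P<line>\d+),"
    r"(?P<message>.*)$"
)


def parse_log_line(text):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_RE.match(text.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["line"] = int(entry["line"])
    return entry


def fatal_entries(lines):
    """Yield (line_number, entry) for every FATAL entry, counting from 1."""
    for number, text in enumerate(lines, start=1):
        entry = parse_log_line(text)
        if entry is not None and entry["level"] == "FATAL":
            yield number, entry


# The entry quoted in the post above, copied verbatim.
QUOTED = ("[ Sat Jan 26 01:23:49 EST 2008 ],FATAL,[Thread-18],"
          "com.cisco.nm.rmeng.inventory.ics.server.InvDataProcessor,481,"
          "Fatal Error has Occured, exiting ICServer java.lang.OutOfMemoryError")

if __name__ == "__main__":
    for number, entry in fatal_entries([QUOTED]):
        is_oom = "OutOfMemoryError" in entry["message"]
        print(number, entry["thread"], entry["timestamp"], "OOM" if is_oom else "")
```

Knowing the thread name (Thread-18 here) makes it possible to filter the rest of the log for the same thread and see which device it was processing before the error.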

Joe Clarke · ‎02-12-2008

Yeah, one of the threads hit that error, then it exited. I doubt the locks caused this. If you look, the thread that encountered the OOME did not encounter the lock problem. But there does appear to be an issue with the 192.168.8.44 device. It takes 355 seconds to process this device, and there could be a problem in the CISCO-STACK-MIB implementation. It would be beneficial to look at a sniffer trace of the inventory collection for this device to rule out any bugs on the device side.

Martin Ermel · ‎02-13-2008

Yes, the thread that encountered the OOME did not encounter the lock problem, but if I interpret the log correctly, it had all but finished processing and was just reporting the final information about its runtime.

Perhaps it is a more widespread problem... :-(

If you say that 355 seconds is a long time for processing a device, there are several devices for which it takes longer (up to 876 seconds). But as far as I can see, they all (except the one with the OOME) finished processing (a few also show the lock problem). Could it be that for some of these devices the memory does not get properly freed?

It would be interesting to know whether they are all of the same device type...

yjdabear, perhaps this list is of some use to you.

It contains the IPs with a processing time > 300s:

172.19.10.1

172.19.10.74

172.19.20.102

172.19.20.111

172.19.20.212

172.19.20.232

172.19.25.1 (842s)

172.19.26.2

172.19.29.1

172.19.3.1

172.19.32.1

172.19.42.3

192.168.11.28

192.168.110.71

192.168.116.28

192.168.254.29

192.168.254.30

192.168.26.20

192.168.26.36

192.168.26.44

192.168.28.4 (DP time: 863s, Total time: 876s)

192.168.29.12

192.168.29.36

192.168.3.36

192.168.32.44

192.168.37.36 (DP time: 793s, Total time: 854s)

192.168.4.28

192.168.5.12

192.168.52.36

192.168.53.76

192.168.8.36

192.168.8.44

Joe Clarke · ‎02-13-2008

The reason this device was interesting to me is that it also had an SNMP access error in its log output. However, given network latency, the size of the device, etc., 355 seconds may not be that long. That's why I suggested a sniffer trace to rule out a problem with device instrumentation.

All that said, it could be that there is a memory leak that is encountered by this thread. This would not be the first time that we've seen an ICServer leak. Profiling ICServer is not an easy task, though, so it would be good to rule out obvious problems first.