System Inventory Collection and Inventory Collection jobs failed

Unanswered Question
Feb 8th, 2008

Just noticed the Inventory Changes count being 0, which never happens. It looks like the two inventory collection jobs haven't been working. What's wrong with them?


System Inventory Collection

/var/adm/CSCOpx/files/rme/jobs/ICServer/1391/


Inventory Collection

/var/adm/CSCOpx/files/rme/jobs/ICServer/1594/
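
A quick way to eyeball those job directories from the CLI (a sketch only; the file names inside a job directory vary by RME version):

ls -lt /var/adm/CSCOpx/files/rme/jobs/ICServer/1594/
# tail whatever log file is newest to see where the job stopped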




Joe Clarke Fri, 02/08/2008 - 22:29

Looks like some of your RME daemons may have crashed. The daemons to check are RMECSTMServer and ICServer.
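
A minimal status check, assuming the default /opt/CSCOpx install root (running pdshow with no argument lists every daemon):

/opt/CSCOpx/bin/pdshow RMECSTMServer
/opt/CSCOpx/bin/pdshow ICServer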

yjdabear Tue, 02/12/2008 - 06:04

RMECSTMServer seems fine in pdshow, but ICServer shows as shut down. Should I restart ICServer?


Process= RMECSTMServer
State = Running normally
Pid = 20654
RC = 0
Signo = 0
Start = 01/25/08 15:26:26
Stop = Not applicable
Core = Not applicable
Info = RMECSTMServer started.

Process= ICServer
State = Administrator has shut down this server
Pid = 0
RC = 1
Signo = 0
Start = 01/25/08 15:26:30
Stop = 01/26/08 01:23:52
Core = Not applicable
Info = ICServer started.


Joe Clarke Tue, 02/12/2008 - 08:21

Yes, but you should check the daemons.log (ICServer.log on Windows) for any indication of why it crashed in the first place. Note: a pdexec might not fix this. You may have to restart dmgtd.
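
A sketch of that sequence on Solaris (log and install paths are the defaults; adjust for your install):

# look for the reason ICServer died before restarting anything
grep -n ICServer /var/adm/CSCOpx/log/daemons.log | tail -20
# then try restarting just that daemon
/opt/CSCOpx/bin/pdexec ICServer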

Joe Clarke Tue, 02/12/2008 - 08:23

While these errors would prevent inventory from being collected successfully, they would not crash ICServer. Additionally, if you have a process that is stuck and holding the locks on these tables, you will definitely need to restart dmgtd to recover.
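
If a full restart does turn out to be necessary, the usual Solaris sequence is the dmgtd init script; note that this stops every CiscoWorks daemon, so plan for an outage:

/etc/init.d/dmgtd stop
# wait for all daemons to shut down cleanly, then
/etc/init.d/dmgtd start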

Martin Ermel Tue, 02/12/2008 - 09:13

There is a 'java.lang.OutOfMemoryError' at the very end, in line 7987, which I think forced ICServer to exit:


[ Sat Jan 26 01:23:49 EST 2008 ],FATAL,[Thread-18],com.cisco.nm.rmeng.inventory.ics.server.InvDataProcessor,481,Fatal Error has Occured, exiting ICServer java.lang.OutOfMemoryError


But why did it occur? Could it be caused by the process that is holding the locks on the tables?
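
To locate that entry and its context in the log (a sketch; the daemons.log path is the Solaris default and an assumption here):

grep -n OutOfMemoryError /var/adm/CSCOpx/log/daemons.log
# the excerpt above puts it at line 7987; print a window around it
sed -n '7960,7990p' /var/adm/CSCOpx/log/daemons.log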


Joe Clarke Tue, 02/12/2008 - 10:07

Yeah, one of the threads hit that error, then it exited. I doubt the locks caused this. If you look, the thread that encountered the OOME did not encounter the lock problem. But there does appear to be an issue with the 192.168.8.44 device. It takes 355 seconds to process this device, and there could be a problem in the CISCO-STACK-MIB implementation. It would be beneficial to look at a sniffer trace of the inventory collection for this device to rule out any bugs on the device side.
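
One way to test that outside of RME is to time a walk of the CISCO-STACK-MIB subtree with the net-snmp tools (the community string is a placeholder, and the OID root is my assumption for ciscoStackMIB):

time snmpwalk -v 2c -c public 192.168.8.44 1.3.6.1.4.1.9.5.1
# a walk that stalls, loops, or returns errors here points at the device rather than at RME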

Martin Ermel Wed, 02/13/2008 - 03:36

Yes, the thread that encountered the OOME did not encounter the lock problem, but if I interpret the log correctly, it had already finished processing and was just reporting the last information about its runtime.

Perhaps it is a more widespread problem... :-(

If you say that 355 s is a long time for processing a device, there are several devices that take even longer (up to 876 s). But as far as I can see, they all (except the one with the OOME) finished processing (a few show the lock problem as well). Could it be that for some of these devices the memory does not get freed properly?

It would be interesting to know whether they are all of the same device type...


yjdabear, perhaps this list is somewhat useful for you. It contains the IPs with a processing time > 300 s (a sketch for pulling them out of the log follows the list):


172.19.10.1
172.19.10.74
172.19.20.102
172.19.20.111
172.19.20.212
172.19.20.232
172.19.25.1 (842s)
172.19.26.2
172.19.29.1
172.19.3.1
172.19.32.1
172.19.42.3
192.168.11.28
192.168.110.71
192.168.116.28
192.168.254.29
192.168.254.30
192.168.26.20
192.168.26.36
192.168.26.44
192.168.28.4 (DP time: 863s, Total time: 876s)
192.168.29.12
192.168.29.36
192.168.3.36
192.168.32.44
192.168.37.36 (DP time: 793s, Total time: 854s)
192.168.4.28
192.168.5.12
192.168.52.36
192.168.53.76
192.168.8.36
192.168.8.44
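
A hedged sketch for extracting these from daemons.log, assuming the timing lines carry the same "DP time:863s, Total time: 876s" wording quoted above (the exact message format is an assumption):

grep 'Total time' /var/adm/CSCOpx/log/daemons.log | awk -F 'Total time:' '$2 + 0 > 300'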



Joe Clarke Wed, 02/13/2008 - 09:53

The reason this device was interesting to me is that it also had an SNMP access error in it. However, given network latency, the size of the device, etc., 355 seconds may not be that long. That's why I suggested a sniffer trace to rule out a problem with the device instrumentation.


All that said, it could be that there is a memory leak that is encountered by this thread. This would not be the first time that we've seen an ICServer leak. Profiling ICServer is not an easy task, though, so it would be good to rule out obvious problems first.
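
For the sniffer trace itself, the native Solaris tool is snoop (the output file name is a placeholder; add -d <interface> if needed):

/usr/sbin/snoop -o ic-192.168.8.44.cap host 192.168.8.44 and port 161
# or with tcpdump from another box in the path:
# tcpdump -w ic-192.168.8.44.pcap host 192.168.8.44 and port 161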
