Every Inventory Collection job has failed since LMS 3.2 upgrade

Unanswered Question
Apr 15th, 2010

I'm wondering if this is a known or new bug with LMS 3.2.

It started off with investigating a number of ASAs that remain stuck in unknown device type state (question mark) in DCR. Their sysObjectID can be walked just fine in Device Center using DCR credentials. Config archive works fine for them too, according to dcmaservice.log. I even re-added an existing ASA and a lab router that had been known in DCR/RME, and they too now show up as unknown device types. TAC notes the scheduled Inventory Collection has been running for a long time. I went back and realized every IC job had failed the exact same way since after upgrading to LMS 3.2 with assorted patches applied: All devices stay "not attempted" for ~9 hours, then the entire job is declared "failed".

dcmaservice.log upon device addition:

...

[ Thu Apr 15  10:12:11 EDT 2010 ],WARN ,[Thread-5],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,setSysObjectId,455,syObjectId from DCR is NULL or Unknown
[ Thu Apr 15  10:12:12 EDT 2010 ],INFO ,[Thread-11],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchive,498,Number of devices in fetch Q = 0
[ Thu Apr 15  10:12:12 EDT 2010 ],INFO ,[Thread-11],com.cisco.nm.rmeng.dcma.configmanager.CfgThreadManager,compareDeviceWithDevicesinRunningThreads,59,inside compareDeviceWithDevicesinRunningThreads method
[ Thu Apr 15  10:12:12 EDT 2010 ],INFO ,[Thread-11],com.cisco.nm.rmeng.dcma.configmanager.CfgThreadManager,compareDeviceWithDevicesinRunningThreads,60,Total running threads:5
[ Thu Apr 15  10:12:12 EDT 2010 ],INFO ,[Thread-11],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveIfRequired,1028,Compared the device with running thread devices.Adding to Fetch Q
[ Thu Apr 15  10:12:12 EDT 2010 ],INFO ,[Thread-11],com.cisco.nm.rmeng.dcma.configmanager.CfgThreadManager,triggerConfigFetch,52,#### Start of Sweep Thu Apr 15 10:12:12 EDT 2010 ####
[ Thu Apr 15  10:12:12 EDT 2010 ],WARN ,[Thread-27075],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,setSysObjectId,455,syObjectId from DCR is NULL or Unknown
...

I have this problem too.
0 votes
  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 0 (0 ratings)
Loading.
Michel Hegeraat Thu, 04/15/2010 - 09:37

I have see a couple of times now, the first phenomena you mention, that devices that I add to the DCR remain in this unknown state, eventhough I can do an snmpwalk from the devicecenter without any problem.

But so far putting in the right devicetype manually, resolves the issue. It can be a hassle to match the sysobjectID to the type though

Once this is the done the devices appear to work fine in RME, Campus anf DFM

I believe however the RME doesn't act upon a device if the sysobectid in the DCR is empty, but you do see it making config backup?

That I haven't yet noticed so far,

Have you considered removing all devices and doing a bulk import?

Cheers,

Michel

Joe Clarke Thu, 04/15/2010 - 13:57

Restart Daemon Manager, and reproduce the problem.  Run "kill -QUIT" on the PID of ICServer.  Post the daemons.log.

yjdabear Fri, 04/23/2010 - 07:21

Do I need to "reproduce" all the symptoms above and wait out the natural failure (i.e., not manually stopping the IC job), or is there a single one among those that should be sufficient?

TAC has pointed firs to RME 4.3.1 and then CSCtb16925, but both patches have been in place since the LMS 3.2 upgrade.

I finally got to restart dmgtd last night, thinking I could kill two birds with one stone:

stopped dmgtd

reset the dbpassword on rmeng (in order to manually remove an Alias Device in dbreader)

started dmgtd

retried one of the ad-hoc IC jobs against two devices (one known, one not IC'ed); again the job sat in "pending" with the following in IC_Server.log, until I stopped it.

Then this morning, I got users reporting all the devices were grayed out while attempting to create NetConfig jobs.

[ Fri Apr 23  09:04:56 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.Request,75,Adding device 648 to request 1271990558368
[ Fri Apr 23  09:04:56 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,189,Returning from startInvCollection
[ Fri Apr 23  09:07:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:12:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:17:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:22:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:27:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:32:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:37:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:42:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-27],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,257,Entering method status
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-27],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,258,Received Request to stop Request with ID : 1271990558368
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-27],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,281,Entering method getRequest
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-27],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,264,Result of status is : true
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-27],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,265,Returning from status
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,244,Entering method stopInvCollection
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,245,Stop for Request ID: 1271990558368
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,281,Entering method getRequest
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,163,Removing, 858
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,171,got device, 858
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,174,Removed dev, 858from  Priority of the ENUM is : 0
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.Request,115,Removing device 858from request 1271990558368
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,163,Removing, 648
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,171,got device, 648
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.Device,51,Removing request 1271990558368 for device 648
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.PriorityQueue,177,Removed2 dev, 648from  Priority of the ENUM is : 0
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.Request,115,Removing device 648from request 1271990558368
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,272,Entering method requestID
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,273,Request to remove :1271990558368
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,276,Leaving method requestID
[ Fri Apr 23  09:44:25 EDT 2010 ],DEBUG,[Thread-25],com.cisco.nm.rmeng.inventory.ics.core.ICSCore,252,Returning from stopInvCollection
[ Fri Apr 23  09:44:26 EDT 2010 ],DEBUG,[ICServer],com.cisco.nm.rmeng.util.logger.RMELogger,734,com.cisco.nm.rmeng.inventory.ics.server.DMgtRunner,run,456,Got Dmgt message and type is :1111708500
[ Fri Apr 23  09:47:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:52:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  09:57:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  10:02:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  10:07:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.
[ Fri Apr 23  10:12:38 EDT 2010 ],DEBUG,[Timer-0],com.cisco.nm.rmeng.inventory.ics.threadpool.ThreadPool,69, Timer Tasks Run method is called.

yjdabear Fri, 04/23/2010 - 08:23

I issued "kill -QUIT" against the ICserver anyway and attached the daemons.log to the case notes of SR614139283. Attached is an abridged version.

Attachment: 
Joe Clarke Sun, 04/25/2010 - 15:50

There is nothing obviously wrong here.  I recently saw another similar thread dump, and the problem there appeared to be a bad RME device package.  There was an additional error seen in the log which pointed to that.  This daemons.log looks truncated.  Can you restart Daemon Manager, then immediately reproduce the problem, and post the full daemons.log?

yjdabear Tue, 04/27/2010 - 13:12

Joe, regarding this other thread dump you mentioned, would it be from the case in which the customer was running LMS 3.1 on Windows, with symptoms such as all devices being grayed out in Netconfig, and a reboot of the physical server "fixed" it?

Joe Clarke Tue, 04/27/2010 - 13:15

No, this was a tangible package problem.  A clean up-to-date package database was provided, and the problem was resolved.

yjdabear Wed, 04/28/2010 - 17:44

Apparently rebooting the server is the fix: Not only do IC jobs complete fine, those grayed out devices are again selectable too, whereas merely restarting dmgtd alone made the problem worse (introducing the symptom of grayed out devices). If the other customer was really on LMS 3.1 on Windows, this is a cross-platform issue still present in the latest LMS 3.2, but according to TAC there's no bug filed since this is fixable by rebooting the host server.

Joe Clarke Wed, 04/28/2010 - 23:48

There is no bug filed because we have yet to see a reproducible problem that wasn't tracked back to a bad package repository.  Certainly something which can be reproduced yet "fixed" with a reboot is still a bug.  The fact that a reboot changes the behavior vs. a Daemon Manager restart leads me to believe that either a daemon isn't being terminated properly, or there is a resource issue on the server.

If you can reproduce this, open a TAC service request BEFORE restarting anything.  It would also be good to have ICServer debugging enabled before trying to reproduce so the logs will have some good meat to them at the time the problem occurs.

Actions

This Discussion