12-09-2008 06:45 AM
Since upgrading from LMS 2.6, all NetConfig jobs remain in a perpetual "Running" status despite having completed the config changes on the target devices. Each job's "Job Summary" shows "Status: Job Running", while "Device Details" shows "Update Pending". Each job has to be stopped manually, which transitions it into the "Stop initiated" state. The exact same type of jobs exhibited no such behavior before the upgrade. The job settings are:
Failure Policy: Ignore failure and continue
E-mail Notification: [blah@fqdn.com]
Execution Policy: Parallel
Fail on Mismatch of Config Versions: Disabled
Copy Running Config to Startup: Enabled
Sync Archive before Job Execution: Disabled
Job Password: Disabled
The jobs are just adding a bunch of interface config commands. I wonder if this behavior might be because RME 4.2's NetConfig now expects an explicit "exit" from config mode in the job itself.
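For illustration, this is the kind of command set I mean, with an explicit exit back out of config mode appended at the end (the interface and commands here are placeholders, not the actual job contents):

```
interface GigabitEthernet0/1
 description configured by NetConfig
 no shutdown
 exit
exit
```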
12-09-2008 08:36 AM
This was reported before, and was found to be protocol-specific: if the user switched from SSH to Telnet, the problem went away. Can you try the same here, just to see whether that is the problem? Since you're on Solaris, you could also try sending a SIGQUIT to the job's JVM process. This should dump a stack trace from each thread into daemons.log or the job's log, which would indicate why the job has hung.
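As a rough sketch, something like the following could locate the JVM process for one job and send it SIGQUIT. It assumes the job processes appear in `ps -ef` tagged as CSCO.&lt;jobid&gt; (as in the listing below); the job ID used is hypothetical:

```shell
# Extract the PID (2nd column of `ps -ef` output) of the process tagged CSCO.<jobid>.
job_pid() {
  ps -ef | grep "CSCO\.$1 " | grep -v grep | awk '{print $2}'
}

# Send SIGQUIT to that job's JVM. SIGQUIT makes a HotSpot JVM write a full
# thread dump to its stdout/log without terminating the process.
dump_job_threads() {
  pid=$(job_pid "$1")
  if [ -n "$pid" ]; then
    kill -QUIT "$pid"
    echo "Sent SIGQUIT to PID $pid (job $1)"
  else
    echo "No process found for job $1" >&2
    return 1
  fi
}

# Usage (hypothetical job ID):
# dump_job_threads 2019
```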
12-09-2008 08:48 AM
Telnet is no longer allowed, so I can't test that. Are these the per-job JVM processes for the NetConfig jobs? Do I need to enable debugging in any log files?
casuser 4022 7333 0 Dec 06 ? 13:42 CSCO.2005 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 5344 7333 0 Dec 06 ? 3:27 CSCO.2007 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 3411 7333 0 Dec 06 ? 3:30 CSCO.2003 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 6145 7333 0 Dec 06 ? 12:01 CSCO.2012 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 11897 7333 0 Dec 06 ? 12:00 CSCO.2019 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 11246 7333 0 Dec 06 ? 3:24 CSCO.2017 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 7047 7333 0 Dec 06 ? 14:04 CSCO.2015 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 6685 7333 0 Dec 06 ? 12:24 CSCO.2014 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 6143 7333 0 Dec 06 ? 22:45 CSCO.2013 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
casuser 23545 7333 0 Dec 05 ? 4:27 CSCO.2010 -cw:jre lib/jre -cp /opt/CSCOpx/MDC/tomcat/shared/lib/MICE.jar:/opt/C
Of particular note, job ID 2010 has a "Stop initiated" status, yet its process is still in the list above.
12-09-2008 08:55 AM
Yes, these are the job processes. No, debugging does not need to be enabled.
12-09-2008 11:55 AM
The user also reported two additional issues:
1) When more than 10 NetConfig jobs were scheduled for the same time, only 10 kicked off. Since those 10 remained "running", the rest did not start until some of the original 10 were manually stopped. Is there an implicit limit of 10 concurrent NetConfig jobs?
2) On devices where a large number of interfaces were configured, the "copy running to startup config" checkbox in the NetConfig scheduler GUI did not consistently take effect, so a manual "copy run start" was required on those devices. The problem does not appear on devices where NetConfig configures fewer interfaces.
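(For clarity, the manual workaround in full IOS syntax, run on each affected device:)

```
router# copy running-config startup-config
Destination filename [startup-config]?
```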
12-09-2008 12:47 PM
I sent SIGQUIT to PID 23545 (job ID 2010, in "Stop initiated" state), PID 11897 (job ID 2019, in "Running" state), and the parent PID 7333. The thread dumps seem to have gone into daemons.log, since tail -f /var/adm/CSCOpx/files/rme/jobs/NetConfigJob/2019/log does not reveal any change:
[ Sat Dec 06 19:25:22 EST 2008 ],INFO ,[Thread-9],com.cisco.nm.xms.xdi.pkgs.LibDcma.persistor.ExecuteCmdOperation,performOperation,96,Trying to executeCmds using the Protocol SSH
[ Sat Dec 06 19:25:22 EST 2008 ],INFO ,[Thread-9],com.cisco.nm.xms.xdi.pkgs.LibDcma.persistor.ExecuteCmdOperation,performOperation,103,Trying to execute CmdSet No.0 on device 11.22.33.44
[ Sat Dec 06 19:41:06 EST 2008 ],ERROR,[Thread-12],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,error,19,Unexpected Ssh2Exception stacktrace:
[ Sat Dec 06 19:41:06 EST 2008 ],ERROR,[Thread-12],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,error,19,Terminating Ssh2Engine.run()...
[ Sat Dec 06 20:30:58 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Sat Dec 06 20:31:23 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Sat Dec 06 20:31:51 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Sat Dec 06 20:32:45 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Mon Dec 08 14:14:21 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Mon Dec 08 14:14:47 EST 2008 ],INFO ,[ActiveMQ Session Task],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
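If the dumps did land in daemons.log, something like this could confirm it by finding the dump banners (the daemons.log path shown is an assumption based on the usual CiscoWorks log directory; adjust to the actual install):

```shell
# List "Full thread dump" banners (with line numbers) from a log file.
# A HotSpot JVM prints this banner when it receives SIGQUIT.
find_thread_dumps() {
  grep -n "Full thread dump" "$1" | tail -5
}

# Assumed location on this Solaris install:
# find_thread_dumps /var/adm/CSCOpx/log/daemons.log
```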
12-09-2008 12:53 PM
This is what I thought. I filed a bug on this for the WLC platform, but my fix will apply to other platforms that use SSHv2 and send an initial remote window size of 0. The bug is CSCsv95235. A fix can be made available from TAC.
12-09-2008 01:02 PM
Thanks. Would the fix address the two additional issues above?
12-09-2008 01:21 PM
This would only allow the jobs to complete; that's all.
01-15-2009 08:06 AM
Is this debug log indicative of anything? That port-channel does not exist on that router, but one would think NetConfig would be able to handle something like that.
[ Wed Jan 14 22:16:26 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.pkgs.LibDcma.persistor.CliOperator,executeCommandSet,606,Current device prompt is= ios-rtr(config)#
[ Wed Jan 14 22:16:26 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.pkgs.LibDcma.persistor.CliOperator,executeCommandSet,614,Command-1 : interface Port-channel55
[ Wed Jan 14 22:16:26 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.pkgs.LibDcma.persistor.CliOperator,executeCommandSet,636,Calling send
[ Wed Jan 14 22:16:26 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31, > In LocalProxy.process( OpSend )
[ Wed Jan 14 22:16:26 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31,SessionContext(1) found
[ Wed Jan 14 22:16:27 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31,<1>_session._stack.size() == 31>
[ Wed Jan 14 22:16:27 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31,<1>(local)_stack.size() == 31>
[ Wed Jan 14 22:16:27 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31,<1>leveling stacks to index == 21>
[ Wed Jan 14 22:16:27 EST 2009 ],DEBUG,[Thread-9],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,debug,31,Exiting levelStacks()
[ Wed Jan 14 22:31:28 EST 2009 ],ERROR,[Thread-13],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,error,19,Unexpected Ssh2Exception stacktrace:
[ Wed Jan 14 22:31:28 EST 2009 ],DEBUG,[Thread-13],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,printStackTrace,51,stacktracecom.cisco.nm.lib.cmdsvc.ssh2.Ssh2Exception: Disconnected from remote host
at com.cisco.nm.lib.cmdsvc.ssh2.StreamPair.readBytes(StreamPair.java:332)
at com.cisco.nm.lib.cmdsvc.ssh2.StreamPair.readPacket(StreamPair.java:183)
at com.cisco.nm.lib.cmdsvc.ssh2.Ssh2Engine.run(Ssh2Engine.java:217)
[ Wed Jan 14 22:31:28 EST 2009 ],ERROR,[Thread-13],com.cisco.nm.xms.xdi.transport.cmdsvc.LogAdapter,error,19,Terminating Ssh2Engine.run()...
01-15-2009 09:58 AM
This looks like CSCsw39641, which is believed to be related to my new bug, CSCsw88378. I have written a patch that definitely fixes my bug, but I am waiting on feedback as to whether it also fixes the RME bug. If you contact TAC, you can get the patch for CSCsw88378.
01-15-2009 11:04 AM
Thanks for the info. Are these two bugs related to CSCsv95235? Are all three bugs new to LMS 3.1/RME 4.2? We had deployed these configs fine with RME 4.0.6.
01-15-2009 11:05 AM
The symptoms are similar, but the code paths are very different. The bugs could affect RME 4.0.6 through 4.2. However, the introduction of newer device support and newer device code can newly expose buggy code.
01-15-2009 11:55 AM
Another piece of info just came to my attention: after the CSCsv95235 patch was installed, a test job configuring those port-channels added the previously non-existent port-channels on that router on Jan 12. Those port-channels were then manually removed from the running config the same day. Last night, Jan 14, with debug enabled, I had this exact same job scheduled again. It did NOT create the port-channels on the router. The debug log I posted is from the second run, on Jan 14. Were the two runs (of the same config) hitting two different issues, such that one run's debug might not be illustrative of the other?
01-15-2009 12:00 PM
If you've already patched for CSCsv95235, then it could have exposed the new bug. The log you showed me is definitely from this new bug.