Solved: RME: Internal error appearing when retrieving Startup Config - Revisited

Jeff Law · 03-04-2008

Hi,

I have asked about this once before but didn't get a chance to follow up on the questions in the responses. My apologies for that; please bear with me as I ask again.

I have RME 4.0.5 running on a Windows 2003 platform.

Over the last couple of days, a device that we manage has been appearing on the RME->Config Management->Archive Management Partially Successful list. Prior to this, its configuration was being archived successfully.

The message that comes up for the Primary Startup is: Internal error.

Now, I have gone through the suggestions made last time, and this is what I am now seeing in the dcmaservice.log when I do a manual sync archive on this device:

----------------------------

[ Wed Mar 05 14:26:30 NZDT 2008 ],ERROR,[Thread-113],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,deleteStartupConfig,1817,CM0003: Version $1 does not exist in archive $2 Cause: Version may have been deleted
[ Wed Mar 05 14:26:30 NZDT 2008 ],ERROR,[Thread-113],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,1419,Error archiving config for akrtr1
[ Wed Mar 05 14:26:31 NZDT 2008 ],INFO ,[Thread-113],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getSysObjectID,420,SYS OID = .1.3.6.1.4.1.9.1.45
[ Wed Mar 05 14:26:31 NZDT 2008 ],ERROR,[Thread-113],com.cisco.nm.rmeng.genericarchive.ArchiveUtils,close,268,Exception during closejava.sql.SQLException: JZ0S2: Statement object has already been closed.
    at com.sybase.jdbc2.jdbc.ErrorMessage.raiseError(ErrorMessage.java:549)
    at com.sybase.jdbc2.jdbc.SybStatement.checkDead(SybStatement.java:1949)
    at com.sybase.jdbc2.jdbc.SybStatement.close(SybStatement.java:451)
    at com.sybase.jdbc2.jdbc.SybStatement.close(SybStatement.java:439)
    at com.cisco.nm.rmeng.genericarchive.ArchiveUtils.close(ArchiveUtils.java:266)
    at com.cisco.nm.rmeng.genericarchive.Version.createVersion(Version.java:647)
    at com.cisco.nm.rmeng.genericarchive.Branch.checkIn(Branch.java:116)
    at com.cisco.nm.rmeng.config.archive.ConfigFileTypeBranch.checkIn(ConfigFileTypeBranch.java:223)
    at com.cisco.nm.rmeng.config.archive.ConfigFileTypeBranch.addNewConfigFileVersion(ConfigFileTypeBranch.java:150)
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.addNewConfigFileVersion(DeviceArchiveManager.java:984)
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.archiveNewVersionIfNeeded(DeviceArchiveManager.java:1178)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.updateArchiveForDevice(ConfigManager.java:1331)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.performCollection(ConfigManager.java:3057)
    at com.cisco.nm.rmeng.dcma.configmanager.CfgUpdateThread.run(CfgUpdateThread.java:29)
[ Wed Mar 05 14:26:31 NZDT 2008 ],ERROR,[Thread-113],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getLatestConfigFileVersion,163,CM0021: Version does not exist in archive $1 Cause: Version may have been deleted
[ Wed Mar 05 14:26:31 NZDT 2008 ],INFO ,[Thread-113],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,1379,CM0060 PRIMARY RUNNING Config fetch SUCCESS for akrtr1, version number 9 archived.

----------------------------
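
For anyone sifting through a larger dcmaservice.log, the entries above appear to follow a comma-separated layout: bracketed timestamp, level, bracketed thread, class, method, source line number, message. The following is a minimal Python sketch for pulling out only the ERROR entries. The field layout is inferred purely from the lines quoted in this post, and the names `LogEntry`, `parse_entry` and `error_entries` are made up for illustration; they are not part of RME.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    thread: str
    logger: str
    method: str
    line: int
    message: str


# Layout assumed from the quoted log lines:
# [ timestamp ],LEVEL,[thread],class,method,line,message
_ENTRY = re.compile(
    r"^\[\s*(?P<timestamp>[^\]]+?)\s*\],"
    r"(?P<level>[A-Z]+)\s*,"
    r"\[(?P<thread>[^\]]+)\],"
    r"(?P<logger>[^,]+),"
    r"(?P<method>[^,]+),"
    r"(?P<line>\d+),"
    r"(?P<message>.*)$"
)


def parse_entry(text: str) -> Optional[LogEntry]:
    """Parse one log line; return None for lines that are not entries
    (for example stack-trace frames beginning with 'at')."""
    match = _ENTRY.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return LogEntry(
        fields["timestamp"],
        fields["level"],
        fields["thread"],
        fields["logger"],
        fields["method"],
        int(fields["line"]),
        fields["message"],
    )


def error_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only the entries logged at ERROR level."""
    parsed = (parse_entry(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.level == "ERROR"]
```

Feeding the file in with `error_entries(open(path))` would then list the CM0003 and CM0021 messages without the surrounding INFO lines and stack frames.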

It looks like a previous configuration file is missing, but I don't understand why or how.

In the past, and judging from previous posts about this, deleting and re-adding the device has been the only way to fix this. That does not go down too well here, as the change history is required.

I would like to see if there is anything that can be done to prevent this from happening in the first place.

I am scheduling an outage for an update to RME from 4.0.5 to 4.0.6 shortly. Will this fix the problem?

Regards

Jeff

Joe Clarke · 03-04-2008

Typically these messages are seen when the file system is manipulated without a corresponding database update (i.e. direct user interaction with the archive directory structure). Of course, this error is an after-the-fact error which doesn't say whether or not the problem had something to do with a bug in RME. Yes, removing and re-adding the device in RME is the only way to get the archive working again (since that re-syncs the database and the file system).

RME 4.0.6 does have a few archive management bug fixes, but nothing specifically addressing this error as we have not found an actual RME cause for the archive directory becoming out-of-sync with the database.

Jeff Law · 03-04-2008

Thanks for your response.

Thinking about things that might have happened to cause this, I can't think of any manual intervention. However, the servers that CiscoWorks runs on did have a set of MS updates installed recently. I am wondering whether the server was restarted while an automatic archive update was running for this device, and maybe lost the file somewhere.

Clutching at straws, I know.

I will check on the process being used during the updates. As a best practice, should CiscoWorks be shut down manually before MS updates are installed and the servers rebooted?

I have a feeling that it is not at the moment, and that the server shutdown process is assumed to shut down CiscoWorks.

Thanks

Jeff

Martin Ermel · 03-05-2008

I do not think that the filesystem and DB are out of sync.

I just had the same failure. The problem was that for this particular device there was no longer an entry in /var/adm/CSCOpx/files/rme/dcma/devfiles//PRIMARY/STARTUP

i.e. there was no directory /1 and no directory /1/assoc.

Interestingly, the corresponding entry in Config_Device_Version of the RME DB with Version_ID = 1 was also missing. I only found an entry with Version_ID = 0 for that device.

That said, the DB and filesystem weren't out of sync; it is just that the SQL query that fetches the entry for that device with Version_ID = 1 returns nothing (running the same query manually and changing the Version_ID to '0' in the WHERE clause gives a result).

I rebuilt the filesystem (/1 and /1/assoc and the startup-config file) for that device and added the corresponding row (with Version_ID = 1) to Config_Device_Version in the RME DB, and the job completed successfully.

This points to an internal process producing the failure (as DB and filesystem are in sync), such as a job that gets interrupted at some point (e.g. a reboot or the stopping of dmgtd). Perhaps a solution would be a test that checks whether /1 and /1/assoc have been removed when the query returns no result, and then forces the update for that device. But as always, sometimes it sounds easier than it is ...
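
The kind of check described in this post could be sketched roughly as follows in Python. This is only an illustration under assumptions: it supposes that each archived version lives in a numerically named directory with an assoc subdirectory (as with /1 and /1/assoc above), and that the caller has already obtained the list of Version_ID values for the device from Config_Device_Version by some other means. The function names are hypothetical, and nothing here queries the RME database or modifies the archive.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


def versions_on_disk(startup_dir: str) -> Set[int]:
    """Return version numbers that have both <n>/ and <n>/assoc under
    the given STARTUP directory (layout assumed from the post)."""
    root = Path(startup_dir)
    found: Set[int] = set()
    if not root.is_dir():
        return found
    for child in root.iterdir():
        if child.is_dir() and child.name.isdigit() and (child / "assoc").is_dir():
            found.add(int(child.name))
    return found


def compare_versions(db_versions: Iterable[int], startup_dir: str) -> Dict[str, List[int]]:
    """Report versions known to only one side.

    db_versions: Version_ID values the caller read from the database.
    """
    on_disk = versions_on_disk(startup_dir)
    in_db = set(db_versions)
    return {
        "missing_on_disk": sorted(in_db - on_disk),
        "missing_in_db": sorted(on_disk - in_db),
    }
```

Two empty lists would mean the database and filesystem agree with each other, which is the situation described above: both sides were consistent, and version 1 was simply absent from both.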

Joe Clarke · 03-05-2008

Yes, LMS should be shut down manually before rebooting or shutting down the server. This will give the databases the time they need to stop properly.

Windows doesn't wait long enough for Daemon Manager to stop before cycling the CPU. On Solaris, this is easily fixed by adding a proper kill script. On Windows, there does not appear to be a way to tell the reboot/shutdown process to wait for our processes to fully terminate.
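
One way to work around this on Windows is to wrap the reboot in a script that stops Daemon Manager first and only restarts the machine once the service reports that it has stopped. The Python sketch below illustrates the idea and is not a Cisco-supplied procedure. It assumes the Daemon Manager service is registered as `crmdmgtd` (verify the name on your own server), relies on the standard Windows `net stop`, `sc query` and `shutdown` commands, and uses an arbitrary 30-minute timeout. The function names are made up for illustration.

```python
import subprocess
import time
from typing import Callable

# Assumption: service name of the CiscoWorks Daemon Manager on this server.
SERVICE = "crmdmgtd"


def query_state(service: str = SERVICE) -> str:
    """Return the raw text printed by 'sc query <service>'."""
    result = subprocess.run(["sc", "query", service], capture_output=True, text=True)
    return result.stdout


def wait_until_stopped(
    query: Callable[[], str] = query_state,
    timeout: float = 1800.0,
    poll: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the service state reads STOPPED or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if "STOPPED" in query():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll)


def stop_lms_then_reboot() -> None:
    """Stop Daemon Manager, wait for it to finish, then reboot."""
    # 'net stop' can return before every child process has exited,
    # so the service state is polled afterwards.
    subprocess.run(["net", "stop", SERVICE], check=False)
    if not wait_until_stopped():
        raise RuntimeError("Daemon Manager did not stop in time; not rebooting")
    subprocess.run(["shutdown", "/r", "/t", "0"], check=True)
```

Calling `stop_lms_then_reboot()` from an administrative prompt in place of a plain restart would give the databases time to stop before Windows goes down. If the service never reaches STOPPED, the script refuses to reboot rather than risk interrupting it.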

Jeff Law · 03-05-2008

Thanks for the responses. I will make sure that the server shutdown process includes the stopping of the CiscoWorks daemons BEFORE doing the shutdown/restart.

Martin, I'm glad your solution works for you, but the steps are a bit too complicated for me. I'll see if the manual shutdown prevents the problem from occurring again.

Many thanks