09-12-2007 01:30 PM
100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:
PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive
VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses(or timeout) for command: copy const_nvram:vlan.dat tftp:.VLAN Config fetch is not supported using TFTP.VLAN Config fetch is not supported using SCP.TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.
PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive
Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.
I'm not sure the Inventory job was really mostly "success" either, since I get this error when clicking on a Diff in the Out-of-Sync Summary for a device whose Startup and Running configs carry yesterday afternoon's timestamp:
CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.
Edit: Is this because of this permission issue?
(/var/adm/CSCOpx/files/rme)->ls -ltr
total 16
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig
drwxrwx--- 2 casuser casusers 96 Apr 23 23:35 changeaudit
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:36 archive
drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:19 NetShow
drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:39 cri
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 cfgedit
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma
drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog
How did this come about? Did running restoredb.pl as root blow away the original dcma directory?
09-12-2007 02:50 PM
Does look like permissions are messed up. dcma is the key subdirectory here. You might try doing a:
chown -R casuser:casusers /var/adm/CSCOpx/files/rme/dcma
And see if that helps.
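Before or after the chown, it can help to check whether anything else under the rme directory has the wrong owner. A minimal sketch, assuming a find that supports `-user` (the function name `list_wrong_owner` is made up for illustration; the path and account come from the listing above):

```shell
#!/bin/sh
# List every entry under a directory that is NOT owned by the expected user.
# Usage: list_wrong_owner DIR USER
list_wrong_owner() {
    dir=$1
    owner=$2
    # -user accepts a login name or a numeric UID; "!" negates the match.
    find "$dir" ! -user "$owner" -print
}

# Example call (not executed here):
#   list_wrong_owner /var/adm/CSCOpx/files/rme casuser
```

No output means everything under the directory is owned by the given user.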
09-13-2007 07:51 AM
Today, after fixing the permission on dcma, one device succeeded, the rest still failed with these errors:
IOS
PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name
VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for device-name
PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
CatOS
PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name
09-13-2007 08:49 AM
It sounds like the RME database and the file system are out-of-sync. You said you restored this system from a backup?
09-13-2007 10:09 AM
It's actually a db backup from another server (host B), which had a db backup from this server (host A) restored to it, so I presume they should be in sync. I don't see why the file system should be out of sync when the dcma directory is empty. Can I solve this issue by reinitializing RME?
09-13-2007 10:12 AM
If the dcma directory is empty, then the database is the only other place where the stale config entries could be. So, yes, reinitializing the rmeng database should clear this up.
09-13-2007 05:51 PM
It looks like a bug of some sort with restorebackup.pl (though I don't think I've seen this behavior before this series of events): I just finished restoring that same backup on host A after initializing the RME db, and dcma has become owned by root again. I looked on host B, where the backup came from, and dcma is owned by casuser over there.
hostA-(/var/adm/CSCOpx/files/rme)->ls -lr
total 16
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp
drwxrwx--- 2 casuser casusers 7168 Sep 13 20:15 syslog
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig
drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs
drwxr-xr-x 2 root other 96 Sep 13 20:55 dcma
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig
09-13-2007 06:26 PM
What are the permissions within the filebackup.tar for this backup? It may be an issue with the backup, and not the restore.
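One way to look at ownership inside the archive rather than on the tar file itself is to list its members verbosely and filter on the owner column. This is only a sketch: it assumes GNU-style `tar -tvf` output, where the second field is `owner/group`, and `tar_wrong_owner` is a made-up helper name. Some tar implementations (Solaris tar, for example) print numeric uid/gid in that column, so the expected value would have to be numeric there.

```shell
#!/bin/sh
# Print archive members whose owner/group column differs from the expected value.
# Usage: tar_wrong_owner ARCHIVE OWNER/GROUP
tar_wrong_owner() {
    archive=$1
    expected=$2
    # In "tar -tvf" output, field 1 is the mode and field 2 is owner/group.
    tar -tvf "$archive" | awk -v want="$expected" '$2 != want'
}

# Example call (not executed here):
#   tar_wrong_owner filebackup.tar casuser/casusers
```

Any line printed is a member that would be worth a closer look before blaming the restore script.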
09-17-2007 08:17 AM
The permissions on filebackup.tar look correct. Everything inside looks fine too. In fact, I can't find anything not owned by casuser.
-rw-r----- 1 casuser casusers 67072 Sep 11 17:07 filebackup.tar
09-17-2007 08:20 AM
The Config Archive issue is still there after initializing RME: 1 success, the rest failure. Most devices are "normal" in Device Mgmt.
CatOS
PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
IOS
PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error
VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509device1
PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
The ArchiveUpdate job has today's scheduled timestamp on the RME homepage, but actually shows the timestamp from the restored backup's last run in the job summary.
dcmaservice.log:
[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,deleteStartupConfig,1817,CM0003: Version $1 does not exist in archive $2 Cause: Version may have been deleted
[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for cat6509device1
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getCmdSvc,763,Inside RMEDeviceContext's getCmdSvc
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getCmdSvc,768,Protocol and Platforms passed = SSH , RMEIOS
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,791,inside getSshCmdSvc
[ Mon Sep 17 05:07:41 EDT 2007 ],INFO ,[Thread-7395],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running
[ Mon Sep 17 05:07:47 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running
[ Mon Sep 17 05:07:50 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getLatestConfigFileVersion,163,CM0021: Version does not exist in archive $1 Cause: Version may have been deleted
[ Mon Sep 17 05:07:50 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getSysObjectID,420,SYS OID = .1.3.6.1.4.1.9.1.283
[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,archiveNewVersionIfNeeded,1226,CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.CM0005: Archive does not exist for cat6509device2
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.addNewConfigFileVersion(DeviceArchiveManager.java:996)
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.archiveNewVersionIfNeeded(DeviceArchiveManager.java:1178)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.updateArchiveForDevice(ConfigManager.java:666)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.performCollection(ConfigManager.java:1529)
    at com.cisco.nm.rmeng.dcma.configmanager.CfgUpdateThread.run(CfgUpdateThread.java:29)
[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for cat6509device2
[ Mon Sep 17 05:07:51 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.util.rm
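For a log this noisy, a quick tally of the CMxxxx codes can show which failure dominates. A rough sketch, assuming a grep that supports `-o` (GNU and BSD grep do; older Solaris grep may not) and using a made-up function name, `count_cm_codes`:

```shell
#!/bin/sh
# Count how often each CMxxxx error code appears in a log file,
# most frequent first.
# Usage: count_cm_codes LOGFILE
count_cm_codes() {
    # -o prints only the matching part of each line, one match per output line.
    grep -o 'CM[0-9][0-9][0-9][0-9]' "$1" | sort | uniq -c | sort -rn
}

# Example call (not executed here):
#   count_cm_codes dcmaservice.log
```

Each output line is a count followed by the code.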
09-17-2007 08:28 AM
Please post your NMSROOT/MDC/etc/regdaemon.xml file. Note, if the data in this file is too sensitive for an open forum, please open a TAC service request with the same information.
09-17-2007 10:19 AM
09-17-2007 10:35 AM
There's nothing wrong here, so the problem must be with the backup. If you reinitialize the RME database, clean out /var/adm/CSCOpx/files/rme/dcma, then test a few devices, do they work?
09-17-2007 11:07 AM
What I posted two posts ago was the outcome after I had reinitialized RME and restored. "dcma" became owned by "root" and empty after restoring, which was successful every time (twice so far). I cleaned up after each restore by changing the ownership of "dcma" to casuser:casusers too.
The difference reinitializing RME made was:
before:
CatOS
PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name
IOS
PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name
VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for cat6509ios1
PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
after:
CatOS
PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
IOS
PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error
VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509ios1
PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.
09-17-2007 11:19 AM
I didn't ask you to restore the backup. I specifically want you to test WITHOUT restoring the backup AFTER reinitializing the RME database. I want to rule out everything but the backup.