100% Config Archive failure

yjdabear
VIP Alumni

100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:

PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive

VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses (or timeout) for command: copy const_nvram:vlan.dat tftp:. VLAN Config fetch is not supported using TFTP. VLAN Config fetch is not supported using SCP. TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.

PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive

Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.
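Since the VLAN fetch above ends with a TELNET "Connection refused", one quick sanity check is whether the device accepts TCP connections on the SSH and TELNET ports from the RME server at all. The following is a minimal Python sketch, not anything shipped with CiscoWorks; the function names are made up for illustration, and the default ports 22 and 23 are assumptions that may not match a customized device. TFTP is UDP-based and SCP runs over the SSH port, so only two TCP ports are probed.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_transports(host, ports=None, timeout=3.0):
    """Probe the TCP-based transports; the port numbers are assumed defaults."""
    ports = ports or {"SSH": 22, "TELNET": 23}
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}
```

Calling check_transports() with the address of a failing device returns a dictionary such as {"SSH": ..., "TELNET": ...}. A False for TELNET would be consistent with the "Connection refused" message, but it says nothing about credentials or prompts.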

I'm not sure the Inventory job was really a "success" for most devices either: I get this error when clicking a Diff in the Out-of-Sync Summary for a device whose Startup and Running configs carry yesterday afternoon's timestamp:

CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.

Edit: Is this because of this permission issue?

(/var/adm/CSCOpx/files/rme)->ls -ltr
total 16
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig
drwxrwx--- 2 casuser casusers 96 Apr 23 23:35 changeaudit
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:36 archive
drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:19 NetShow
drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:39 cri
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 cfgedit
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma
drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog

How did this come about? Did running restoredb.pl as root blow away the original dcma directory?

21 Replies

Joe Clarke
Cisco Employee

Does look like permissions are messed up. dcma is the key subdirectory here. You might try doing a:

chown -R casuser:casusers /var/adm/CSCOpx/files/rme/dcma

And see if that helps.
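To see whether dcma is the only entry with the wrong owner, the whole rme tree can be scanned rather than checking directories one at a time. This is a rough Python sketch for a Unix host (it relies on the pwd module); the function names are hypothetical and not part of CiscoWorks, and the expected owner casuser is taken from the listing above.

```python
import os
import pwd


def owner_name(path):
    """Return the owning user name of path, or the numeric uid if unnamed."""
    uid = os.lstat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # The uid has no passwd entry on this host.
        return str(uid)


def find_unexpected_owners(root, expected_user):
    """List (path, owner) for every entry below root not owned by expected_user."""
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            owner = owner_name(path)
            if owner != expected_user:
                mismatches.append((path, owner))
    return sorted(mismatches)
```

For example, find_unexpected_owners("/var/adm/CSCOpx/files/rme", "casuser") should flag dcma as owned by root on a tree like the one listed earlier. Directories the calling user cannot read are silently skipped by os.walk, so running it as root gives the most complete picture.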

Today, after fixing the permission on dcma, one device succeeded, the rest still failed with these errors:

IOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for device-name

PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

CatOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

It sounds like the RME database and the file system are out-of-sync. You said you restored this system from a backup?

It's actually a db backup from another server (host B), which had a db backup from this server (host A) restored to it, so I presume they should be in sync. I don't see why the file system should be out of sync when the dcma directory is empty. Can I solve this issue by reinitializing RME?

If the dcma directory is empty, then the database is the only other place where the stale config entries could be. So, yes, reinitializing the rmeng database should clear this up.

It looks like a bug of some sort with restorebackup.pl (though I don't think I've seen this behavior before this series of events): I just finished restoring that same copy of backup on host A after initializing RME db, and dcma has become owned by root again. I looked on host B where the backup came from--dcma is owned by casuser over there.

hostA-(/var/adm/CSCOpx/files/rme)->ls -lr
total 16
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp
drwxrwx--- 2 casuser casusers 7168 Sep 13 20:15 syslog
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig
drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs
drwxr-xr-x 2 root other 96 Sep 13 20:55 dcma
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig

What are the permissions within the filebackup.tar for this backup? It may be an issue with the backup, and not the restore.

The permissions on filebackup.tar look correct. Everything inside looks fine too. In fact, I can't find anything not owned by casuser.

-rw-r----- 1 casuser casusers 67072 Sep 11 17:07 filebackup.tar
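The ls output above shows the owner of the tar file itself; the owners recorded for the members inside the archive are stored in the tar headers and can be summarized separately. Here is a small Python sketch using the standard tarfile module. The function name is made up for illustration, and it assumes filebackup.tar is a plain tar archive that tarfile can open.

```python
import tarfile


def tar_owner_summary(tar_path):
    """Count archive members per (user, group) pair recorded in the tar headers."""
    counts = {}
    with tarfile.open(tar_path) as tar:
        for member in tar:
            # Fall back to numeric ids when the header carries no names.
            key = (member.uname or str(member.uid), member.gname or str(member.gid))
            counts[key] = counts.get(key, 0) + 1
    return counts
```

If tar_owner_summary("filebackup.tar") returns a single key such as ("casuser", "casusers"), every member is recorded with that owner, which would match the observation that nothing in the archive is owned by anyone else.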

The Config Archive issue is still there after initializing RME: 1 success, the rest failure. Most devices are "normal" in Device Mgmt.

CatOS

PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

IOS

PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error

VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509device1

PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

The ArchiveUpdate job has today's scheduled timestamp on the RME homepage, but actually shows the timestamp from the restored backup's last run in the job summary.

dcmaservice.log:

[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,deleteStartupConfig,1817,CM0003: Version $1 does not exist in archive $2 Cause: Version may have been deleted
[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for cat6509device1
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getCmdSvc,763,Inside RMEDeviceContext's getCmdSvc
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getCmdSvc,768,Protocol and Platforms passed = SSH , RMEIOS
[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,791,inside getSshCmdSvc
[ Mon Sep 17 05:07:41 EDT 2007 ],INFO ,[Thread-7395],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running
[ Mon Sep 17 05:07:47 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rmedaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running
[ Mon Sep 17 05:07:50 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getLatestConfigFileVersion,163,CM0021: Version does not exist in archive $1 Cause: Version may have been deleted
[ Mon Sep 17 05:07:50 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,getSysObjectID,420,SYS OID = .1.3.6.1.4.1.9.1.283
[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager,archiveNewVersionIfNeeded,1226,CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required. CM0005: Archive does not exist for cat6509device2
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.addNewConfigFileVersion(DeviceArchiveManager.java:996)
    at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.archiveNewVersionIfNeeded(DeviceArchiveManager.java:1178)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.updateArchiveForDevice(ConfigManager.java:666)
    at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.performCollection(ConfigManager.java:1529)
    at com.cisco.nm.rmeng.dcma.configmanager.CfgUpdateThread.run(CfgUpdateThread.java:29)
[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.configmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for cat6509device2
[ Mon Sep 17 05:07:51 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.util.rm
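With hundreds of devices failing, it may be easier to tally the CM error codes in dcmaservice.log than to read entries one by one. The sketch below is Python and purely illustrative: the seven comma-separated fields (timestamp, level, thread, class, method, source line, message) are inferred from the excerpt above rather than from any documented format, the function names are hypothetical, and it assumes each entry sits on a single line with the 80-column wrapping undone. Lines that do not start with "[", such as stack trace frames, are ignored.

```python
import re
from collections import Counter

# CM codes as they appear in the excerpt, e.g. CM0002 or CM0021.
CODE_RE = re.compile(r"\bCM\d{4}\b")


def parse_entry(line):
    """Split one log line into fields; return None for continuation lines."""
    parts = line.strip().split(",", 6)  # the message itself may contain commas
    if len(parts) != 7 or not parts[0].startswith("["):
        return None
    timestamp, level, thread, cls, method, lineno, message = parts
    return {
        "timestamp": timestamp.strip("[] "),
        "level": level.strip(),
        "thread": thread.strip("[]"),
        "class": cls,
        "method": method,
        "line": lineno,
        "message": message,
    }


def count_error_codes(lines):
    """Count CM codes that occur in ERROR-level entries."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry and entry["level"] == "ERROR":
            counts.update(CODE_RE.findall(entry["message"]))
    return counts
```

Passing an open file object for the log to count_error_codes() gives a Counter keyed by code, which makes it easier to tell whether CM0002, CM0005 or CM0021 dominates after each change.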

Please post your NMSROOT/MDC/etc/regdaemon.xml file. Note, if the data in this file is too sensitive for an open forum, please open a TAC service request with the same information.

sanitized for posting

There's nothing wrong here, so the problem must be with the backup. If you reinitialize the RME database, clean out /var/adm/CSCOpx/files/rme/dcma, then test a few devices, do they work?

What I posted two posts ago was the outcome after I had reinitialized RME and restored the backup. The restore itself was successful every time (twice so far), but afterwards "dcma" was empty and owned by "root". I cleaned up after each restore by changing the ownership of "dcma" back to casuser:casusers.

The difference reinitializing RME made was:

before:

CatOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

IOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for cat6509ios1

PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

after:

CatOS

PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

IOS

PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error

VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509ios1

PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

I didn't ask you to restore the backup. I specifically want you to test WITHOUT restoring the backup AFTER reinitializing the RME database. I want to rule out everything but the backup.
