100% Config Archive failure

Unanswered Question
Sep 12th, 2007

100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:

PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive

VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses(or timeout) for command: copy const_nvram:vlan.dat tftp:.VLAN Config fetch is not supported using TFTP.VLAN Config fetch is not supported using SCP.TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.

PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive

Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.

I'm not sure the Inventory job was really mostly "success" either, since I'm getting this error when clicking on a Diff in the Out-of-Sync Summary, of a device that has Startup and Running configs with yesterday afternoon's timestamp:

CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.

Edit: Is this because of this permission issue?

(/var/adm/CSCOpx/files/rme)->ls -ltr

total 16

drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig

drwxrwx--- 2 casuser casusers 96 Apr 23 23:35 changeaudit

drwxrwxr-x 2 casuser casusers 96 Apr 23 23:36 archive

drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig

drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:19 NetShow

drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:39 cri

drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 cfgedit

drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository

drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma

drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog

How did it come about? Did the restoredb.pl run under root blow away the original dcma directory?

I have this problem too.
0 votes
  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 4 (3 ratings)
Loading.
Joe Clarke Wed, 09/12/2007 - 14:50

Does look like permissions are messed up. dcma is the key subdirectory here. You might try doing a:

chown -R casuser:casusers /var/adm/CSCOpx/files/rme/dcma

And see if that helps.

yjdabear Thu, 09/13/2007 - 07:51

Today, after fixing the permission on dcma, one device succeeded, the rest still failed with these errors:

IOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for device-name

PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

CatOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

Joe Clarke Thu, 09/13/2007 - 08:49

It sounds like the RME database and the file system are out-of-sync. You said you restored this system from a backup?

yjdabear Thu, 09/13/2007 - 10:09

It's actually a db backup from another server (host B), who had a db backup from this server (host A) restored to it, so I resume they should be in sync. I don't see why the file system should be out of sync, when the dcma directory is empty. Can I solve this issue by reinitializing RME?

Joe Clarke Thu, 09/13/2007 - 10:12

If the dcma directory is empty, then the database is the only other place where the stale config entries could be. So, yes, reinitializing the rmeng database should clear this up.

yjdabear Thu, 09/13/2007 - 17:51

It looks like a bug of some sort with restorebackup.pl (though I don't think I've seen this behavior before this series of events): I just finished restoring that same copy of backup on host A after initializing RME db, and dcma has become owned by root again. I looked on host B where the backup came from--dcma is owned by casuser over there.

hostA-(/var/adm/CSCOpx/files/rme)->ls -lr

total 16

drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp

drwxrwx--- 2 casuser casusers 7168 Sep 13 20:15 syslog

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim

drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository

drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig

drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs

drwxr-xr-x 2 root other 96 Sep 13 20:55 dcma

drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig

Joe Clarke Thu, 09/13/2007 - 18:26

What are the permissions within the filebackup.tar for this backup? It may be an issue with the backup, and not the restore.

yjdabear Mon, 09/17/2007 - 08:17

The permissions on filebackup.tar look correct. Everything inside looks fine too. In fact, I can't find anything not owned by casuser.

-rw-r----- 1 casuser casusers 67072 Sep 11 17:07 filebackup.tar

yjdabear Mon, 09/17/2007 - 08:20

The Config Archive issue is still there after initializing RME: 1 success, the rest failure. Most devices are "normal" in Device Mgmt.

CatOS

PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

IOS

PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error

VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509device1

PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

The ArchiveUpdate job has today's scheduled timestamp on the RME homepage, but actually shows the timestamp from the restored backup's last run in the job summary.

dcmaservice.log:

[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.co

nfigmanager.DeviceArchiveManager,deleteStartupConfig,1817,CM0003: Version $1 does not exist in archive $2 Cause: Version may have been deleted

[ Mon Sep 17 05:07:39 EDT 2007 ],ERROR,[Thread-7396],com.cisco.nm.rmeng.dcma.co

nfigmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for cat6509device1

[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rm

edaa.RMEDeviceContext,getCmdSvc,763,Inside RMEDeviceContext's getCmdSvc

[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rm

edaa.RMEDeviceContext,getCmdSvc,768,Protocol and Platforms passed = SSH , RMEIOS

[ Mon Sep 17 05:07:39 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rm

edaa.RMEDeviceContext,getSshCmdSvc,791,inside getSshCmdSvc

[ Mon Sep 17 05:07:41 EDT 2007 ],INFO ,[Thread-7395],com.cisco.nm.rmeng.util.rm

edaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running

[ Mon Sep 17 05:07:47 EDT 2007 ],INFO ,[Thread-7396],com.cisco.nm.rmeng.util.rm

edaa.RMEDeviceContext,getSshCmdSvc,798,SSH2 is running

[ Mon Sep 17 05:07:50 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.co

nfigmanager.DeviceArchiveManager,getLatestConfigFileVersion,163,CM0021: Version

does not exist in archive $1 Cause: Version may have been deleted

[ Mon Sep 17 05:07:50 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.dcma.co

nfigmanager.DeviceArchiveManager,getSysObjectID,420,SYS OID = .1.3.6.1.4.1.9.1.2

83

[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.co

dcmaservice.log

[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.co

nfigmanager.DeviceArchiveManager,archiveNewVersionIfNeeded,1226,CM0002: Could no

t archive config Cause: Device may not be reachable, may be in suspended state o

r credentials may be incorrect. Action: Verify that device is managed, credentia

ls are correct and file system has correct permissions. Increase timeout value,

if required.CM0005: Archive does not exist for cat6509device2 at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.addNewConf

igFileVersion(DeviceArchiveManager.java:996) at com.cisco.nm.rmeng.dcma.configmanager.DeviceArchiveManager.archiveNew

VersionIfNeeded(DeviceArchiveManager.java:1178) at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.updateArchiveForD

evice(ConfigManager.java:666) at com.cisco.nm.rmeng.dcma.configmanager.ConfigManager.performCollection

(ConfigManager.java:1529) at com.cisco.nm.rmeng.dcma.configmanager.CfgUpdateThread.run(CfgUpdateThread.java:29)

[ Mon Sep 17 05:07:51 EDT 2007 ],ERROR,[Thread-7399],com.cisco.nm.rmeng.dcma.co

nfigmanager.ConfigManager,updateArchiveForDevice,710,Error archiving config for

cat6509device2

[ Mon Sep 17 05:07:51 EDT 2007 ],INFO ,[Thread-7399],com.cisco.nm.rmeng.util.rm

Joe Clarke Mon, 09/17/2007 - 08:28

Please post your NMSROOT/MDC/etc/regdaemon.xml file. Note, if the data in this file is too sensitive for an open forum, please open a TAC service request with the same information.

Joe Clarke Mon, 09/17/2007 - 10:35

There's nothing wrong here, so the problem must be with the backup. If you reinitialize the RME database, clean out /var/adm/CSCOpx/files/rme/dcma, then test a few devices, do they work?

yjdabear Mon, 09/17/2007 - 11:07

What I posted two posts ago was the outcome, after I had reinitialized RME and restored. "dcma" became owned by "root" and empty after restoring, which was successful everytime (twice so far). I cleaned up after restore by chaning the ownership of "dcma" to casuser:casusers every time too.

The difference reinitializing RME made was:

before:

CatOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

IOS

PRIMARY STARTUP Sep 13 2007 06:14:42 CM0022: Archive already exits Cause: Archive names should be unique Action: Please provide a different name

VLAN RUNNING Sep 13 2007 06:15:23 CM0005: Archive does not exist for cat6509ios1

PRIMARY RUNNING Sep 13 2007 06:15:05 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

after:

CatOS

PRIMARY RUNNING Sep 17 2007 06:13:49 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

IOS

PRIMARY STARTUP Sep 17 2007 06:07:54 Internal error

VLAN RUNNING Sep 17 2007 10:31:13 CM0005: Archive does not exist for cat6509ios1

PRIMARY RUNNING Sep 17 2007 10:30:54 CM0002: Could not archive config Cause: Device may not be reachable, may be in suspended state or credentials may be incorrect. Action: Verify that device is managed, credentials are correct and file system has correct permissions. Increase timeout value, if required.

Joe Clarke Mon, 09/17/2007 - 11:19

I didn't ask you to restore the backup. I specifically want you to test WITHOUT restoring the backup AFTER reinitializing the RME database. I want to rule out everything but the backup.

yjdabear Wed, 09/19/2007 - 10:12

Reinitializing RME without a db restore got about 95% of the devices polled/collected. The failures were strangely exclusively CatOS devices at every remote WAN site, even though there was no known connectivity issues at the time.

majority of CatOS failures:

Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times

the rest (minority) of CatOS failures

PRIMARY RUNNING Sep 18 2007 21:34:02 PRIMARY-RUNNING config Fetch Operation failed for TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times

The transport setting is as follows:

tftp - ssh - telnet

RME didn't report attempting SSH to the failed CatOS devices, both in the Config Archive drill-down above, and the following:

device detail in job summary:

rme - admin - config mgmt - archive mgmt- collection - periodic polling

CM0062 Polling cat6509catos3 for changes to configuration. CM0149 PRIMARY RUNNING config changed. CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos3 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.

device detail in job summary:

rme - admin - config mgmt - archive mgmt- collection - periodic collection

CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos1 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.

CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos2 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Connection reset by peer Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.

In addition, I find the results of RME initialization inconsistent. For example, RME > Admin > Config Mgmt > Transport Settings surprisingly survived the reinit, syslog filters and job purge jobs did not.

Joe Clarke Wed, 09/19/2007 - 10:33

The TFTP problem with CatOS is most likely a timing issue. You should try increasing the device TFTP timeout (up to 90 seconds or maybe even higher) to see if that fixes the problem. If not, a sniffer trace will likely be required to see what is going on.

The reinit is completely consistent. Some settings are stored in the RME database while others (like the transport settings) are stored in regdaemon.xml.

yjdabear Wed, 09/19/2007 - 11:12

This is really not an ideal situation (reinit'ing RME and not restore) because now I have to reconstruct many RME configs. How do I find a truly usable db backup, without having to actually load it up successfully first AND then putting it through its paces for at least 24 hrs before finding out it's "good" or "bad"? Every failed attempt results in RME collections that have to be discarded, because the restore is "all-or-nothing". One would assume when a db is backed up successfully or loaded successfully, it's "good".

I didn't see timeout being an issue in this scale, specifically to CatOS on WAN sites, before this issue cropped up. I'm still mystified by why RME doesn't indicate it tried SSH. The only difference is it used to be SSH-TFTP-TELNET, where now it's TFTP-SSH-TELNET.

Joe Clarke Wed, 09/19/2007 - 11:18

The backup log attempts to weed out all of the obvious problems that could occur with the backup. But there is no guaranteed way to know a backup is good except to restore it to a test machine, and put it through its paces.

That said, you might have better luck with a restoration from another server if you first reinit the RME database AND clean out /var/adm/CSCOpx/files/rme/dcma before doing the restore.

As for the config archive problem, a sniffer trace and the dcmaservice.log with ArchiveMgmt Service debugging will be required to isolate the cause of the problem.

yjdabear Wed, 09/19/2007 - 11:40

Silly me--having SSH first before must have masked the latency issue that TFTP is so sensitive to over WAN links now that it's first in line. Would DCMAService pick up the new order of transports, if I change the setting while a Periodic Polling or Collection is going on?

Would the "all-or-nothing" restore of LMS backup be changed any time soon?

Joe Clarke Wed, 09/19/2007 - 15:24

Yes, the new collection order would be used.

I have heard of no plans to restore the LMS 2.2 partial restore framework. The problem is that all apps are now tied inextricably to DCR.

Actions

This Discussion