Re: 100% Config Archive failure

yjdabear · ‎09-12-2007

100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:

PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive

VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses(or timeout) for command: copy const_nvram:vlan.dat tftp:. VLAN Config fetch is not supported using TFTP. VLAN Config fetch is not supported using SCP. TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.

PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive

Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.

I'm not sure the Inventory job was really mostly a "success" either, since I get this error when clicking on a Diff in the Out-of-Sync Summary for a device whose Startup and Running configs both carry yesterday afternoon's timestamp:

CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.

Edit: Is this because of this permission issue?

(/var/adm/CSCOpx/files/rme)->ls -ltr

total 16

drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig

drwxrwx--- 2 casuser casusers 96 Apr 23 23:35 changeaudit

drwxrwxr-x 2 casuser casusers 96 Apr 23 23:36 archive

drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig

drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:19 NetShow

drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs

drwxr-xr-x 3 casuser casusers 96 Sep 11 08:39 cri

drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 cfgedit

drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository

drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma

drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog

How did dcma end up owned by root:other? Did running restoredb.pl as root blow away the original dcma directory?
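For anyone wanting to spot this kind of ownership mismatch quickly, here is a minimal sketch that scans `ls -l` output for entries not owned by the expected user and group. The expected owner `casuser:casusers` is an assumption inferred from the other rows in the listing above, and `find_misowned` is a hypothetical helper, not part of any CiscoWorks tool. It only flags the odd entries; it does not show that the ownership is what caused the archive failures.

```python
def find_misowned(listing, owner="casuser", group="casusers"):
    """Return names of entries in `ls -l` output not owned by owner:group.

    The default owner/group are an assumption taken from the listing in
    this thread, not a documented CiscoWorks requirement.
    """
    misowned = []
    for line in listing.splitlines():
        fields = line.split(None, 8)
        # Skip blank lines, the "total N" line, and anything that is not
        # shaped like a full ls -l row (mode, links, user, group, size,
        # month, day, time, name).
        if len(fields) < 9:
            continue
        _mode, _links, user, grp, _size, _mon, _day, _time, name = fields
        if (user, grp) != (owner, group):
            misowned.append(name)
    return misowned


sample = """\
total 16
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma
drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog
"""

print(find_misowned(sample))  # ['dcma']
```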

yjdabear · ‎09-17-2007

Got it. I'll test that out tonight.

yjdabear · ‎09-19-2007

Reinitializing RME without a db restore got about 95% of the devices polled/collected. Strangely, the failures were exclusively CatOS devices at remote WAN sites (every one of them), even though there were no known connectivity issues at the time.

Majority of CatOS failures:

Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times

The rest (a minority) of the CatOS failures:

PRIMARY RUNNING Sep 18 2007 21:34:02 PRIMARY-RUNNING config Fetch Operation failed for TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times

The transport setting is as follows:

tftp - ssh - telnet

RME didn't report attempting SSH to the failed CatOS devices, either in the Config Archive drill-down above or in the following:

Device detail in job summary:

RME > Admin > Config Mgmt > Archive Mgmt > Collection > Periodic Polling

CM0062 Polling cat6509catos3 for changes to configuration. CM0149 PRIMARY RUNNING config changed. CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos3 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.

Device detail in job summary:

RME > Admin > Config Mgmt > Archive Mgmt > Collection > Periodic Collection

CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos1 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.

CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos2 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Connection reset by peer Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.
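To sort the "majority" from the "minority" failures without reading every line, a rough sketch like the one below can tally the final cause in each CM0151 message. The regular expression is an assumption based only on the message layout quoted in this thread (device name, then `Cause:`, then `Action:`); `tally` and `final_cause` are hypothetical names, and other RME versions may format these messages differently.

```python
import re
from collections import Counter

# Assumed layout, taken from the CM0151 lines quoted in this thread:
#   CM0151 ... Config fetch failed for <device> Cause: <text> Action: <text>
CM0151 = re.compile(
    r"CM0151 .*?Config fetch failed for (?P<device>\S+) "
    r"Cause: (?P<cause>.*?) Action:"
)


def final_cause(cause):
    """Return the text after the last 'Cause:' in the nested cause string.

    In the quoted messages this is the error from the last transport tried.
    If there is no nested 'Cause:', the whole string is returned.
    """
    return cause.rsplit("Cause:", 1)[-1].strip()


def tally(log_text):
    """Count CM0151 failures per final cause and list the devices for each."""
    counts = Counter()
    devices = {}
    for match in CM0151.finditer(log_text):
        reason = final_cause(match.group("cause"))
        counts[reason] += 1
        devices.setdefault(reason, []).append(match.group("device"))
    return counts, devices


sample = (
    "CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos1 Cause: "
    "Failed to fetch config using TFTP. Failed to get the start tag-begin in "
    "the configuration Failed to establish TELNET connection to xx.xx.xx.xx "
    "-Cause: Authentication failed on device 3 times Action: Check device "
    "credentials.\n"
    "CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos2 Cause: "
    "Failed to fetch config using TFTP. Failed to get the start tag-begin in "
    "the configuration Failed to establish TELNET connection to xx.xx.xx.xx "
    "-Cause: Connection reset by peer Action: Check device credentials.\n"
)

counts, devices = tally(sample)
for reason, n in counts.most_common():
    print(n, reason, devices[reason])
```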

In addition, I find the results of RME initialization inconsistent. For example, the settings under RME > Admin > Config Mgmt > Transport Settings surprisingly survived the reinit, but the syslog filters and job purge jobs did not.

Joe Clarke · ‎09-19-2007

The TFTP problem with CatOS is most likely a timing issue. You should try increasing the device TFTP timeout (up to 90 seconds or maybe even higher) to see if that fixes the problem. If not, a sniffer trace will likely be required to see what is going on.

The reinit is completely consistent. Some settings are stored in the RME database while others (like the transport settings) are stored in regdaemon.xml.

yjdabear · ‎09-19-2007

This is really not an ideal situation (reinitializing RME instead of restoring) because now I have to reconstruct many RME configs. How do I find a truly usable db backup without first having to load it successfully AND then put it through its paces for at least 24 hours before finding out whether it's "good" or "bad"? Every failed attempt results in RME collections that have to be discarded, because the restore is "all-or-nothing". One would assume that when a db is backed up successfully or loaded successfully, it's "good".

Before this issue cropped up, I never saw timeouts on this scale, and never specifically on CatOS devices at WAN sites. I'm still mystified as to why RME doesn't indicate that it tried SSH. The only difference is that the transport order used to be SSH-TFTP-TELNET, whereas now it's TFTP-SSH-TELNET.

Joe Clarke · ‎09-19-2007

The backup log attempts to weed out all of the obvious problems that could occur with the backup. But there is no guaranteed way to know a backup is good except to restore it to a test machine, and put it through its paces.

That said, you might have better luck with a restoration from another server if you first reinit the RME database AND clean out /var/adm/CSCOpx/files/rme/dcma before doing the restore.

As for the config archive problem, a sniffer trace and the dcmaservice.log with ArchiveMgmt Service debugging will be required to isolate the cause of the problem.

yjdabear · ‎09-19-2007

Silly me: having SSH first in the old order must have masked the latency issue. TFTP is very sensitive to latency over WAN links, and now that it's first in line the problem shows up. Would DCMAService pick up the new transport order if I change the setting while a Periodic Polling or Collection job is running?
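The masking effect can be illustrated with a generic ordered-fallback sketch. This is not how DCMAService is implemented; `fetch_config` and the stand-in fetchers are hypothetical, and the simulated TFTP timeout is an assumption standing in for a slow WAN link. The point is only that a transport which succeeds early in the list hides problems with the transports behind it. It does not explain why RME reported no SSH attempt.

```python
def fetch_config(device, order, fetchers):
    """Try each transport in `order`; stop at the first one that succeeds.

    Returns (transport_used, config, attempts). `attempts` records every
    transport that was actually tried, so untried ones leave no trace.
    """
    attempts = []
    for name in order:
        fetcher = fetchers.get(name)
        if fetcher is None:
            attempts.append((name, "not configured"))
            continue
        try:
            config = fetcher(device)
        except Exception as exc:  # a real tool would catch narrower errors
            attempts.append((name, f"failed: {exc}"))
            continue
        attempts.append((name, "ok"))
        return name, config, attempts
    return None, None, attempts


# Hypothetical stand-ins for real transports.
def slow_tftp(device):
    raise TimeoutError("timed out")  # simulates TFTP over a high-latency link


def working_ssh(device):
    return f"! config for {device}"


fetchers = {"tftp": slow_tftp, "ssh": working_ssh}

# Old order: SSH succeeds first, so TFTP is never exercised.
old = fetch_config("cat6509catos1", ["ssh", "tftp", "telnet"], fetchers)
# New order: the TFTP timeout is hit before SSH gets a turn.
new = fetch_config("cat6509catos1", ["tftp", "ssh", "telnet"], fetchers)

print(old[2])
print(new[2])
```

With the old order the attempts list contains only the SSH entry, which is why a TFTP latency problem could have gone unnoticed until the order changed.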

Would the "all-or-nothing" restore of LMS backup be changed any time soon?

Joe Clarke · ‎09-19-2007

Yes, the new collection order would be used.

I have heard of no plans to restore the LMS 2.2 partial restore framework. The problem is that all apps are now tied inextricably to DCR.