09-12-2007 01:30 PM
100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:
PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive
VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses(or timeout) for command: copy const_nvram:vlan.dat tftp:. VLAN Config fetch is not supported using TFTP. VLAN Config fetch is not supported using SCP. TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.
PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive
Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.
I'm not sure the Inventory job was really mostly a "success" either: I get this error when I click a Diff in the Out-of-Sync Summary for a device whose Startup and Running configs both carry yesterday afternoon's timestamp:
CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.
Edit: Is this because of this permission issue?
(/var/adm/CSCOpx/files/rme)->ls -ltr
total 16
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig
drwxrwx--- 2 casuser casusers 96 Apr 23 23:35 changeaudit
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:36 archive
drwxr-xr-x 3 casuser casusers 96 Apr 23 23:36 netconfig
drwxrwxr-x 2 casuser casusers 96 Apr 23 23:40 temp
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:17 swim
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:19 NetShow
drwxr-xr-x 19 casuser casusers 1024 Sep 11 08:39 jobs
drwxr-xr-x 3 casuser casusers 96 Sep 11 08:39 cri
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 cfgedit
drwxr-xr-x 2 casuser casusers 96 Sep 11 17:41 repository
drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma
drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog
How did that come about? Did running restoredb.pl as root blow away the original dcma directory?
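A quick way to spot this kind of ownership drift is to scan the long listing for entries whose owner or group differs from the rest. The sketch below is only an illustration: the function name is made up, and treating casuser/casusers as the "expected" owner is an assumption based on the other entries in the listing above.

```python
def find_unexpected_owners(ls_output, expected_user="casuser", expected_group="casusers"):
    """Return (name, user, group) for entries not owned by the expected user/group.

    Assumes `ls -l` style rows: perms links user group size month day time name.
    """
    mismatches = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip the "total N" header and anything that is not a long-listing row.
        if len(fields) < 9 or fields[0][0] not in "-dlbcps":
            continue
        user, group, name = fields[2], fields[3], " ".join(fields[8:])
        if user != expected_user or group != expected_group:
            mismatches.append((name, user, group))
    return mismatches


# A few rows copied from the listing above.
LISTING = """\
total 16
drwxrwx--- 2 casuser casusers 96 Apr 23 23:32 cwconfig
drwxr-xr-x 2 root other 96 Sep 11 18:53 dcma
drwxrwx--- 2 casuser casusers 7168 Sep 12 01:00 syslog
"""

print(find_unexpected_owners(LISTING))
```

Run against the rows above, only dcma is flagged, which matches what stands out when reading the listing by eye.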
09-17-2007 11:32 AM
Got it. I'll test that out tonight.
09-19-2007 10:12 AM
Reinitializing RME without a db restore got about 95% of the devices polled/collected. Strangely, the failures were exclusively CatOS devices, at every remote WAN site, even though there were no known connectivity issues at the time.
Majority of CatOS failures:
Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times
The rest (a minority) of CatOS failures:
PRIMARY RUNNING Sep 18 2007 21:34:02 PRIMARY-RUNNING config Fetch Operation failed for TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times
The transport setting is as follows:
tftp - ssh - telnet
RME didn't report attempting SSH to the failed CatOS devices, either in the Config Archive drill-down above or in the following:
device detail in job summary:
rme - admin - config mgmt - archive mgmt- collection - periodic polling
CM0062 Polling cat6509catos3 for changes to configuration. CM0149 PRIMARY RUNNING config changed. CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos3 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.
device detail in job summary:
rme - admin - config mgmt - archive mgmt- collection - periodic collection
CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos1 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Authentication failed on device 3 times Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.
CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos2 Cause: Failed to fetch config using TFTP. Failed to get the start tag-begin in the configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: Connection reset by peer Action: Check if protocol is supported by device and required device package is installed. Check device credentials. Increase timeout value, if required.
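With many devices failing, it can help to reduce the CM0151 messages to one final cause per device. The following is a rough sketch, assuming the job detail text has been pasted into a string; the regular expression and function name are illustrative, not anything RME provides.

```python
import re
from collections import Counter

# Matches one CM0151 message: device name, then everything between "Cause:" and "Action:".
FETCH_FAILED = re.compile(r"CM0151 .*?Config fetch failed for (\S+) Cause: (.*?) Action:")


def summarize_failures(log_text):
    """Map each device to the text after the last '-Cause:' in its CM0151 message."""
    reasons = {}
    for device, cause in FETCH_FAILED.findall(log_text):
        # The last "-Cause:" segment is the reason given for the final transport tried.
        reasons[device] = cause.rsplit("-Cause:", 1)[-1].strip()
    return reasons


# The two Periodic Collection messages quoted above.
LOG = (
    "CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos1 Cause: "
    "Failed to fetch config using TFTP. Failed to get the start tag-begin in the "
    "configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: "
    "Authentication failed on device 3 times Action: Check device credentials.\n"
    "CM0151 PRIMARY RUNNING Config fetch failed for cat6509catos2 Cause: "
    "Failed to fetch config using TFTP. Failed to get the start tag-begin in the "
    "configuration Failed to establish TELNET connection to xx.xx.xx.xx -Cause: "
    "Connection reset by peer Action: Check device credentials.\n"
)

reasons = summarize_failures(LOG)
for device, reason in reasons.items():
    print(device, "->", reason)
print(Counter(reasons.values()))
```

Grouping by the final cause makes it easier to see whether the Telnet fallback is failing on authentication everywhere or only on some devices.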
In addition, I find the results of RME initialization inconsistent. For example, RME > Admin > Config Mgmt > Transport Settings surprisingly survived the reinit, but the syslog filters and job purge jobs did not.
09-19-2007 10:33 AM
The TFTP problem with CatOS is most likely a timing issue. You should try increasing the device TFTP timeout (up to 90 seconds or maybe even higher) to see if that fixes the problem. If not, a sniffer trace will likely be required to see what is going on.
The reinit is completely consistent. Some settings are stored in the RME database while others (like the transport settings) are stored in regdaemon.xml.
09-19-2007 11:12 AM
This is really not an ideal situation (reinitializing RME rather than restoring) because now I have to reconstruct many RME configs. How do I find a truly usable db backup without first having to load it successfully AND then put it through its paces for at least 24 hrs before finding out whether it's "good" or "bad"? Every failed attempt results in RME collections that have to be discarded, because the restore is "all-or-nothing". One would assume that when a db is backed up successfully or loaded successfully, it's "good".
I never saw timeouts being an issue on this scale, and specifically for CatOS devices at WAN sites, before this problem cropped up. I'm still mystified as to why RME doesn't indicate that it tried SSH. The only difference is that the transport order used to be SSH-TFTP-TELNET, whereas now it's TFTP-SSH-TELNET.
09-19-2007 11:18 AM
The backup log attempts to weed out all of the obvious problems that could occur with the backup. But there is no guaranteed way to know a backup is good except to restore it to a test machine, and put it through its paces.
That said, you might have better luck with a restoration from another server if you first reinit the RME database AND clean out /var/adm/CSCOpx/files/rme/dcma before doing the restore.
As for the config archive problem, a sniffer trace and the dcmaservice.log with ArchiveMgmt Service debugging will be required to isolate the cause of the problem.
09-19-2007 11:40 AM
Silly me--having SSH first must have masked the latency that TFTP is so sensitive to over WAN links, and which shows up now that TFTP is first in line. Would DCMAService pick up the new transport order if I change the setting while a Periodic Polling or Collection job is running?
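To illustrate why the order matters, here is a generic sketch of ordered transport fallback. It is not how DCMAService is implemented; the function names and the simulated TFTP timeout are hypothetical and only show that a transport further down the list is never exercised while an earlier one succeeds.

```python
def fetch_config(device, transports):
    """Try each (name, fetch_fn) in order.

    Returns (transport_used, config, errors_from_earlier_transports).
    """
    errors = []
    for name, fetch in transports:
        try:
            return name, fetch(device), errors
        except Exception as exc:
            # Record the failure and fall through to the next transport.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all transports failed: " + "; ".join(errors))


def slow_tftp(device):
    # Hypothetical stand-in for a TFTP fetch that times out over a WAN link.
    raise TimeoutError("timed out")


def working_ssh(device):
    # Hypothetical stand-in for an SSH fetch that succeeds.
    return f"config of {device}"


# SSH first: TFTP is never attempted, so its latency problem stays hidden.
print(fetch_config("wan-switch", [("ssh", working_ssh), ("tftp", slow_tftp)]))

# TFTP first: the timeout is hit before SSH gets its turn.
print(fetch_config("wan-switch", [("tftp", slow_tftp), ("ssh", working_ssh)]))
```

In the second call the device is still collected, but only after the first transport has failed, which is where a long per-transport timeout would be spent.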
Is the "all-or-nothing" restore of an LMS backup likely to change any time soon?
09-19-2007 03:24 PM
Yes, the new collection order would be used.
I have heard of no plans to restore the LMS 2.2 partial restore framework. The problem is that all apps are now tied inextricably to DCR.