cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
2048
Views
5
Helpful
13
Replies

RME 4.0.x RMEDbEngine crash

yjdabear
VIP Alumni
VIP Alumni

Looks like RMEDbEngine crashed at 5:27am yesterday morning. I see in dmgtd.log the following bits of interest:

Sep 10 05:24:16 nms SQLAnywhere(anieng): [ID 702911 local1.notice] Database

server started at Mon Sep 10 2007 05:24

Sep 10 05:24:16 nms SQLAnywhere(anieng): [ID 702911 local1.notice] Trying t

o start SharedMemory link ...

Sep 10 05:24:17 nms SQLAnywhere(anieng): [ID 702911 local1.notice] Shar

edMemory link started successfully

Sep 10 05:26:12 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: Du

plicate index "mykey" for table "CampusUserGroupAssociationTable" in database "c

mfDb"

Sep 10 05:26:12 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: Du

plicate index "mykey" for table "CampusOgsTagTable" in database "cmfDb"

Sep 10 05:26:13 nms dmgt[20552]: [ID 804273 local1.info] #3007:TYPE=INFO:St

arted application(RmeGatekeeper) "/opt/CSCOpx/lib/vbroker/bin/rungk.sh 42342" pi

d=20883.

Sep 10 05:26:15 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Connecti

on terminated abnormally

Sep 10 05:26:17 nms dmgt[20552]: [ID 587862 local1.info] #3007:TYPE=INFO:St

arted application(EDS) "/opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx:/product/CSCO/C

SCOpx/lib/classpath:/product/CSCO/CSCOpx/www/classpath:/opt/CSCOpx/campus/lib/cl

asspath:/opt/CSCOpx/campus/www/classpath com.cisco.nm.cmf.eds.system.EDSMain" pi

d=20884.

Sep 10 05:26:30 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: Du

plicate index "Index1" for table "OgsDcrDeviceTable" in database "cmfDb"

Sep 10 05:26:30 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: Du

plicate index "mykey" for table "OgsDcrDeviceTable" in database "cmfDb"

Sep 10 05:26:30 nms last message repeated 1 time

Sep 10 05:26:33 nms dmgt[20552]: [ID 228802 local1.info] #3007:TYPE=INFO:St

arted application(diskWatcher) "/opt/CSCOpx/objects/db/bin/diskWatcher diskWatch

er" pid=20962.

Sep 10 05:26:33 nms dmgt[20552]: [ID 438444 local1.info] #3007:TYPE=INFO:St

arted application(RMEDbEngine) "/opt/CSCOpx/objects/db/bin/dbsrv9 -x tcpip{HOST=

localhost;DOBROADCAST=NO;ServerPort=43455} -q -m -ti 0 -gm 100 -gn 50 -ch 50P -s

local1 -c 16M -n rmengEng /opt/CSCOpx/databases/rmeng/rmeng.db -n rmengdb" pid=

20963.

Sep 10 05:26:33 nms SQLAnywhere(rmengeng): [ID 702911 local1.info] Running

on SunOS 5.8 Generic_117350-46

Sep 10 05:26:34 nms SQLAnywhere(rmengeng): [ID 702911 local1.error] Unknown

error (-29968)

Sep 10 05:26:34 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: No

unique index or primary key for table "CSUserGroupAssociationTable" in database

"cmfDb"

Sep 10 05:26:36 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Connecti

on terminated abnormally

One user trying to configure a NetShow job reported:

NS0033 : Error occured while processing.

This may be because of URN_NOT_FOUND : urn "JRM" : Not found !!

Users trying to use RME reported 500 Internal Server error.

The "unknown error" appears repeatedly after restarting dmgtd, but then stops. Currently:

tail -f dmgtd.log

Sep 11 00:22:09 pssva030 SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 477632K

Sep 11 00:22:10 pssva030 SQLAnywhere(cmfeng): [ID 702911 local1.notice] Connection terminated abnormally

Sep 11 00:23:09 pssva030 SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 452456K

Sep 11 00:24:10 pssva030 SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 412304K

Sep 11 00:25:10 pssva030 SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 399280K

tail -f daemons.log

Cache size adjusted to 412304K

Cache size adjusted to 399280K

Cache size adjusted to 393072K

Cache size adjusted to 388088K

The attached pdshow output reflect the current process status. Looks like DCR and RME are both out of service.

1 Accepted Solution

Accepted Solutions

Joe Clarke
Cisco Employee
Cisco Employee

It looks like more than just the RME database crashed. Looks like everything restarted. The RME database is in bad shape, though. It may be completely corrupt, or it may just be a bad transaction log. If the latter, you can attempt a rollback by deleting /opt/CSCOpx/databases/rmeng/rmeng.log (if it exists), then run:

# csh

# setenv LD_LIBRARY_PATH /opt/CSCOpx/lib:/opt/CSCOpx/objects/db/lib

# setenv ASANY /opt/CSCOpx/objects/db

# /opt/CSCOpx/objects/db/bin/dbsrv9 -f /opt/CSCOpx/databases/rmeng/rmeng.db

Then restart dmgtd.

If the problem is a corrupt database, you will either have to restore LMS from a known good backup, or reinitialize the RME database.

View solution in original post

13 Replies 13

Joe Clarke
Cisco Employee
Cisco Employee

It looks like more than just the RME database crashed. Looks like everything restarted. The RME database is in bad shape, though. It may be completely corrupt, or it may just be a bad transaction log. If the latter, you can attempt a rollback by deleting /opt/CSCOpx/databases/rmeng/rmeng.log (if it exists), then run:

# csh

# setenv LD_LIBRARY_PATH /opt/CSCOpx/lib:/opt/CSCOpx/objects/db/lib

# setenv ASANY /opt/CSCOpx/objects/db

# /opt/CSCOpx/objects/db/bin/dbsrv9 -f /opt/CSCOpx/databases/rmeng/rmeng.db

Then restart dmgtd.

If the problem is a corrupt database, you will either have to restore LMS from a known good backup, or reinitialize the RME database.

Thanks for your prompt reply. I went ahead with a db restore. I do have a few questions though:

1. In situations like this, can I restore just the RME and CS portions while bypassing the ANI/CM? Would it be any more beneficial to reinit the RME or even the CS db, before proceeding with the restore? Does the CS db need reinit/restore, since you didn't mention it as a reinit candidate?

2. What caused the corruption, and in which db (RME or CS)? Is there anything that can be done to prevent that?

3. I left out the -force parameter I had used with LMS 2.2 while issuing the restorebackup.pl. I don't see it mentioned in "-help" anymore. Is it no longer available in LMS 2.6? If it's there, does it force-overwrite the server certificate as it used to?

Update: the restoredb failed. It was restored without reinit RME or CS.

[Tue Sep 11 09:05:25 2007] Executing /opt/CSCOpx/bin/perl /opt/CSCOpx/dbupdate/

dbupdate.pl /opt/CSCOpx/rigel/scripts/rme/dbupgrade/createDb.txt

Starting database engine rmengEng

Running on SunOS 5.8 Generic_117350-46

Note: The size of '/opt/CSCOpx/databases/rmeng/rmeng.db' is larger than expected

Fatal error: database error

Abort - core dumped

Opening /opt/CSCOpx/dbupdate/CSCOdbupdate.log for logging

[Tue Sep 11 09:13:10 2007] Failed to update the db tables

[Tue Sep 11 09:13:10 2007] -----------------------------------------------------

-------------------------

[Tue Sep 11 09:13:10 2007] Execution of doRestore rmeng :: restore.pm, completed

.

[Tue Sep 11 09:13:10 2007] Unloading Campus Manager restore.pm's functions

[Tue Sep 11 09:13:10 2007]

ERROR: doRestore of [Resource Manager Essentials] had some errors.

[Tue Sep 11 09:13:10 2007] Check the log file for details

I see in /opt/CSCOpx/dbupdate/CSCOdbupdate.log:

Translating environment variable "NMSROOT" to "/opt/CSCOpx": At /opt/CSCOpx/dbup

date/dbupdate.pl, #103: Time - 2007/07/17 23:02:38

Launching "/opt/CSCOpx/bin/perl /opt/CSCOpx/rigel/scripts/rme/dbupgrade/main.pl

rmeng dsn=rmeng,$NMSROOT/rigel/scripts/rme/dbupgrade/schemachange,schema": At /o

pt/CSCOpx/dbupdate/dbupdate.pl, #105: Time - 2007/07/17 23:02:38

Beginning DB Update

Start /opt/CSCOpx/dbupdate/dbupdate.pl : At /opt/CSCOpx/dbupdate/dbupdate.pl, #4

8: Time - 2007/09/11 09:05:25

Starting DB ...: Time - 2007/09/11 09:05:25

DSN PRoperties: rmengEng: At /opt/CSCOpx/dbupdate/dbupdateutils.pl, #119: Time -

2007/09/11 09:05:25

Executing - StartDb(DSN=rmeng): At /opt/CSCOpx/dbupdate/dbupdateutils.pl, #132:

Time - 2007/09/11 09:05:25

ERROR: Unable to start DB for dsn=rmeng - Could not connect to engine 'rmengEng'

on 'rmengdb' in Bulk mode. Check user id and passwords: At /opt/CSCOpx/dbupdate

/dbupdateutils.pl, #135: Time - 2007/09/11 09:13:10

1. You must restore everything or nothing. Reinitializing databases prior to a restore is not required since the data will be overwritten anyway.

2. No idea. From the look of pdshow, everything restarted, so maybe the server was improperly rebooted. Regular good backups are the only way to truly prevent catastrophic failure due to corruption.

3. Looks like your backup may be bad.

At 5:05am I have a cronjob calling:

"/etc/init.d/dmgtd stop

sleep for 10 mins

/opt/CSCOpx/bin/perl /opt/CSCOpx/cgi-bin/admin/logBackup.pl -force -dir /path/to/backup

sleep for 3 mins

/etc/init.d/dmgtd start"

The LMS restart seems to correspond to that. The sleep 10 mins thing was because I've had problem with SyslogCollector not collecting if "dmgtd start" does not get ample time between itself and "dmgtd stop". But if the backup is bad, the dmgtd restart seems to just precipitate the problem that's going to surface anyway.

You should consider moving to logrot.pl which does not require a dmgtd restart for the majority of its operations.

Yes, if the database was already bad, the dmgtd restart is not going to make it any worse. However, it won't corrupt a database unless the stop was aborted, and the database engines were forcibly terminated.

I first restored a copy of db backup that seems to be the skeleton of an empty LMS 2.6 install. Then I restore a copy from another LMS 2.6 server. Both seems to generate the following dmgtd.log. Should I be concerned about these?

Sep 11 19:04:12 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Connection terminated abnormally

Sep 11 19:04:15 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "CSUserGroupAssociationTable" in database "cmfDb"

These messages are safe to ignore.

Also, I imported an up-to-date list of devices into DCR. The upshot was about 70% duplicates and 30% conflicting (DCA not changed). It doesn't seem to honor the default option of using the import source for conflict resolution. How do I resolve these conflicting dupes? RME doesn't seem to recognize any new ones imported.

Duplicates cannot be added to DCR. DCR will either update the device attributes, or ignore the duplicates. RME won't notice any changes unless genuinely new devices are added.

Would these messages be safe to ignore as well?

ING:An application that is not registered sent message ChangeAuditService to the daemon manager. The message was ignored.

Sep 11 19:04:58 nms dmgt[25782]: [ID 743478 local1.warning] #2053:TYPE=WARNING:An application that is not registered sent a status message SyslogAnalyzerService to the daemon manager. The message was ignored.

Sep 11 19:04:58 nms dmgt[25782]: [ID 797257 local1.warning] #2055:TYPE=WARNING:An application that is not registered sent message SyslogAnalyzerService to the daemon manager. The message was ignored.

Sep 11 19:05:33 nms SQLAnywhere(anieng): [ID 702911 local1.notice] Cache size adjusted to 45896K

Sep 11 19:07:33 nms SQLAnywhere(anieng): [ID 702911 local1.notice] Cache size adjusted to 55280K

Sep 11 19:09:48 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "DCR_transaction_log" in database "cmfDb"

Sep 11 19:09:49 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "RMEUserGroupAssociationTable" in database "rmengdb"

Sep 11 19:15:48 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Connection terminated abnormally

Sep 11 19:15:52 nms SQLAnywhere(cmfeng): [ID 702911 local1.notice] Note: Duplicate index "mykey" for table "CampusOgsGroupCacheTable" in database "cmfDb"

Sep 11 19:19:54 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Connection terminated abnormally

Sep 11 19:19:54 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Connection terminated abnormally

Sep 11 19:19:54 nms last message repeated 2 times

Sep 11 19:25:44 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "DM_Dev_State" in database "rmengdb"

Sep 11 19:25:46 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "NetConfig_Device_Template" in database "rmengdb"

Sep 11 19:25:49 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Note: No unique index or primary key for table "SoftwareElementRequirements" in database "rmengdb"

Yes these can be safely ignored as well.

It looks like I'm not totally out of the woods yet: 100% of my RME devices failed during Config Archive (Periodic Collection) this morning, yet strangely most of them didn't fail during the Inventory (Periodic Polling) job at 4pm yesterday. Most Config Archive failures look like this:

PRIMARY STARTUP Sep 12 2007 06:17:58 CM0020: Error creating archive

VLAN RUNNING Sep 12 2007 06:18:43 Insufficient no. of interactive responses(or timeout) for command: copy const_nvram:vlan.dat tftp:.VLAN Config fetch is not supported using TFTP.VLAN Config fetch is not supported using SCP.TELNET: Failed to establish TELNET connection to 10.xx.xx.xx - Cause: Connection refused.

PRIMARY RUNNING Sep 12 2007 06:18:23 CM0020: Error creating archive

Transport setting for Archive Mgmt is: SSH, TFTP, SCP, TELNET.

I'm not sure the Inventory job was really mostly "success" either, since I'm getting this error when clicking on a Diff in the Out-of-Sync Summary, of a device that has Startup and Running configs with yesterday afternoon's timestamp:

CM0021: Version does not exist in archive device-name Cause: Version may have been deleted.

Getting Started

Find answers to your questions by entering keywords or phrases in the Search bar above. New here? Use these resources to familiarize yourself with the community: