Solved: Re: LMS 2.6: restoredb.pl taking too long? - Page 2

yjdabear · ‎05-04-2007

It's been almost 75 mins since the last update in the restore log. Is there something going wrong with this restore (from one LMS 2.6 server to another)?

tail -f /var/adm/CSCOpx/log/migration.log

04/May/2007 15:01:29:695 208911 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 135 - Unmapping 717@nms of RME from PIDM

04/May/2007 15:01:29:700 208916 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 160 - UpdatePIDMEntries are done

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 161 - UpdatePIDMEntries are done got to

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 70 - Exiting now... from do mig

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 71 - Exiting doMigration

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.DcmaDataMigrator@1ff7a1e

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.sa.migration.SaDataMigrator@d19bc8

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain 291 - SaDataMigrator::postMigration() - No post migration to be done.

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.SWIMDataMigrator@1630ab9

04/May/2007 15:01:29:704 208920 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: com.cisco.nm.rmeng.migration.RMEDataHandler@1551f60

Joe Clarke · ‎05-09-2007

Okay, I think I've isolated where the problem is occurring. It looks like a problem with /etc/init.d/dmgtd stop. This points to a problem with my previous patch, but I have not been able to reproduce it. First, capture a current ps -efl, and attach to the SR. Then run exactly the following command:

/etc/init.d/dmgtd stop

Capture any output, and let me know if the restore continues.

yjdabear · ‎05-09-2007

That seems to be the jolt it needed:

Restoration of the RME data is in progress...

See the log file /var/adm/CSCOpx/log/migration.log for progress...

[insert "/etc/init.d/dmgtd stop" here]

doRestore of [Resource Manager Essentials] has completed.

postRestore of [Common Services] has started.

postRestore of [Common Services] has completed.

postRestore of [Campus Manager] has started.

postRestore of [Campus Manager] has completed.

postRestore of [Resource Manager Essentials] has started.

postRestore of [Resource Manager Essentials] has completed.

I attached ps-efl2.txt, the "new" restorebackup.log and migration.log.zip to the Case.

Joe Clarke · ‎05-09-2007

And just to be absolutely clear, you ran "/etc/init.d/dmgtd stop" to shutdown dmgtd?

yjdabear · ‎05-09-2007

Yes.

Joe Clarke · ‎05-09-2007

Okay, so this is what's happening, and this is why I'm confused. The RME restore framework tries to clean up after itself. Part of this involves unregistering the old RME 3.x database. In order to do this, it needs to wait for dmgtd to stop. So it issues the command:

/etc/init.d/dmgtd stop

And then waits forever for it to stop. Based on your previous logs, it looks like this execution of dmgtd reports that dmgtd is not running when it really is. BUT, when YOU run the SAME command manually, it stops.

So, what is different between the session where you run "/etc/init.d/dmgtd stop" and the session that is running the restorebackup.pl script?

Joe Clarke · ‎05-09-2007

If you can't find anything different between the two sessions, you can change line 605 of /opt/CSCOpx/bin/RestoreTools/rmeng/rmengRestoreLib.pm to:

my $sol_script = "sh -x /etc/init.d/dmgtd stop";

Then reproduce the problem. The output should be much more verbose.

yjdabear · ‎05-09-2007

My turn to get confused: Why is RME 3.x still involved in this RME 4.x to 4.x restore? As far as I can think of, there's no difference. Both are (were) SSH sessions via Putty. I was under the non-root UID 0 account when I issued the "restoredb.pl" or "dmgtd stop" command.

If "restoredb.pl" calls "dmgtd stop" after all, why does it terminate and prompt for manually performing "dmgtd stop"?

Joe Clarke · ‎05-09-2007

Sorry. The restore framework will stop dmgtd after the RME restoration regardless. Stopping it before unconfiguring the RME 3.x database is just another time it will call this function.

The restore doesn't terminate, it loops until dmgtd stop is called again. This is what is confusing me. The fact that "dmgtd stop" reports that dmgtd is not running the first time doesn't make any sense. I've been over the code numerous times, and I just can't see why that would happen. I am hoping the sh -x output sheds more light on this.

yjdabear · ‎05-10-2007

The third restore has stalled. Restorebackup.log seems to be the verbose one. Attached below is the tailend of

restorebackup.log. I attached the full restorebackup.log, migration.log, and a current ps -efl out to Case 605960929.

-rw-r--r-- 1 root other 20319015 May 10 15:23 migration.log

-rw-r--r-- 1 casuser casusers 1855820 May 10 15:23 restorebackup.log

-rw-rw---- 1 casuser casusers 1731073 May 10 15:28 pidm.log

-rw-r----- 1 casuser casusers 101469 May 10 15:28 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 449038 May 10 15:28 license.log

-rw-rw---- 1 casuser casusers 798 May 10 15:48 diskWatcher.log

-rw-r----- 1 casuser casusers 3002 May 10 15:54 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 10 15:54 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 36707 May 10 15:54 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 405463 May 10 15:54 NCTemplateMgr.log

-rw-rw---- 1 root casusers 31753 May 10 15:54 rme_base.log

-rw-rw---- 1 root casusers 8222720 May 10 15:54 rme_ctm.log

-rw-rw---- 1 casuser casusers 5442462 May 10 15:54 IC_Server.log

-rw-rw---- 1 casuser casusers 1720983 May 10 15:54 dmgtd.log

-rw-rw---- 1 casuser casusers 75387 May 10 15:54 daemons.log

-rw-rw---- 1 casuser casusers 174526 May 10 15:54 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 39813665 May 10 15:57 dmgtDbg.log

tail -f daemons.log

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;jrm;1;0

TomcatMonitor: non-handled Msg from dmgtd:1111708500

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;EDS-GCF;0;0

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;CampusOGSServer;0;0

TomcatMonitor: non-handled Msg from dmgtd:1111708500

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;CMFOGSServer;0;0

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;DCRServer;0;0

Joe Clarke · ‎05-10-2007

Ah-ha! Found it. NMSROOT is reset by the fact that restorebackup.pl uses Perl. Instead of being the non-canonical /opt/CSCOpx, it is /product/CSCO/CSCOpx. Since dmgtd.sol always runs as /opt/CSCOpx/objects/dmgt/dmgtd.sol, the dmgtd script can't find it.

If you modify /etc/init.d/dmgtd and change line 30 to:

if echo ${i} | /usr/bin/cut -d, -f2 | /usr/bin/egrep "^/opt/CSCOpx/objects/dmgt/${DMGRFILE}$" >/dev/null 2>&1 ; then

This problem will go away. I don't like using hardcoded values like, but I think it's safe in this particular case. Sorry for the mess.

yjdabear · ‎05-10-2007

Thanks. I'll give that a try.

yjdabear · ‎05-11-2007

Yeah, that did it. The fourth restore finished in under 50 mins. Is there no better way to make it cope with the $NMSROOT variable?

Joe Clarke · ‎05-11-2007

The other solution would be to add:

unset PKG

Before obtaining NMSROOT in the dmgtd script. However, I think we're okay with the hardcoded /opt/CSCOpx here since dmgtd.sol will always be run as /opt/CSCOpx/objects/dmgt/dmgtd.sol.