LMS 2.6: restoredb.pl taking too long?

Answered Question
May 4th, 2007

It's been almost 75 mins since the last update in the restore log. Is there something going wrong with this restore (from one LMS 2.6 server to another)?

tail -f /var/adm/CSCOpx/log/migration.log

04/May/2007 15:01:29:695 208911 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 135 - Unmapping [email protected] of RME from PIDM

04/May/2007 15:01:29:700 208916 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 160 - UpdatePIDMEntries are done

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 161 - UpdatePIDMEntries are done got to

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 70 - Exiting now... from do mig

04/May/2007 15:01:29:701 208917 [main] INFO com.cisco.nm.rmeng.migration.MigrMain 71 - Exiting doMigration

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: [email protected]

04/May/2007 15:01:29:702 208918 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: [email protected]

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain 291 - SaDataMigrator::postMigration() - No post migration to be done.

04/May/2007 15:01:29:703 208919 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: [email protected]

04/May/2007 15:01:29:704 208920 [main] DEBUG com.cisco.nm.rmeng.migration.MigrMain ? - Calling postMigration() of class: [email protected]

I have this problem too.
0 votes
Correct Answer by Joe Clarke about 9 years 7 months ago

Ah-ha! Found it. NMSROOT is reset by the fact that restorebackup.pl uses Perl. Instead of being the non-canonical /opt/CSCOpx, it is /product/CSCO/CSCOpx. Since dmgtd.sol always runs as /opt/CSCOpx/objects/dmgt/dmgtd.sol, the dmgtd script can't find it.

If you modify /etc/init.d/dmgtd and change line 30 to:

if echo ${i} | /usr/bin/cut -d, -f2 | /usr/bin/egrep "^/opt/CSCOpx/objects/dmgt/${DMGRFILE}$" >/dev/null 2>&1 ; then

This problem will go away. I don't like using hardcoded values like, but I think it's safe in this particular case. Sorry for the mess.

  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 5 (3 ratings)
Loading.
Joe Clarke Fri, 05/04/2007 - 12:26

This is the migration.log. Check the restore log to see if there's anything happening in there. At this point, I wouldn't be too worried. The migration will take time (could be multiple hours). The major bugs with migration taking days should all be fixed in LMS 2.6, and I see no indication of problems in this log snippet.

yjdabear Fri, 05/04/2007 - 13:40

tail -f restorebackup.log

[Fri May 4 14:57:57 2007] Value of PX_DATA_DIR is: /var/adm/CSCOpx/files

[Fri May 4 14:57:57 2007] CMF Version is : 3.0.5

[Fri May 4 14:57:57 2007] No need to update the incremental ccr

[Fri May 4 14:57:57 2007] No need to update the ccr

[Fri May 4 14:57:57 2007] This is Restore

[Fri May 4 14:57:57 2007] Restoration of the RME data is in progress...

[Fri May 4 14:57:57 2007]

[Fri May 4 14:57:57 2007] See the log file /var/adm/CSCOpx/log/migration.log for progress...

Daemon Management is not running.

INFO : Stopping DBEngine processes registered to Daemon Manager

Just curious: Does LMS 2.5+ implement more/better compression for the backup archive? My LMS 2.2 backup used to eat up >7GB, while LMS 2.6 now only consumes ~1.9GB, though the latter does need >6GB temp space to perform the restore.

Joe Clarke Fri, 05/04/2007 - 14:10

Nothing is different in terms of compression. After a migration, the resulting data may actually be smaller since the RME database is not restored as-is, but rather the data is transfered into the new database.

yjdabear Sat, 05/05/2007 - 07:35

It started at 14:14 yesterday and is still not finished:

ps -ef | grep restore

casuser 4743 4382 0 11:29:53 pts/2 0:00 grep restore

root 23562 384 0 14:14:03 pts/9 0:07 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/restorebackup.pl -d /product/

-rw-r--r-- 1 root other 6102085 May 4 15:01 migration.log

-rw-r--r-- 1 casuser casusers 1560530 May 4 15:01 restorebackup.log

-rw-r----- 1 casuser casusers 36062 May 4 15:07 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 1730241 May 4 15:07 pidm.log

-rw-rw---- 1 casuser casusers 444056 May 4 15:08 license.log

-rw-r----- 1 casuser casusers 3002 May 4 15:33 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 4 15:33 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 34329 May 4 15:33 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 402895 May 4 15:34 NCTemplateMgr.log

-rw-rw---- 1 root casusers 30787 May 4 15:34 rme_base.log

-rw-rw---- 1 root casusers 8021736 May 4 15:34 rme_ctm.log

-rw-rw---- 1 casuser casusers 5438912 May 4 15:34 IC_Server.log

-rw-rw---- 1 casuser casusers 74597 May 4 15:34 daemons.log

-rw-rw---- 1 casuser casusers 161318 May 4 15:34 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 1650290 May 4 15:34 dmgtd.log

-rw-rw---- 1 casuser casusers 1839918 May 5 03:04 dbbackup.log

-rw-rw---- 1 casuser casusers 11458 May 5 11:27 diskWatcher.log

-rw-rw---- 1 casuser casusers 9197825 May 5 11:28 dmgtDbg.log

Two points of curiosity: 1) The nightly backup ran as scheduled at 3:05 this morning. The archive comes out to 280MB, compared to ~1.9GB on the other server. I don't know it backed up what's originally on this server, or the other one's db which I was trying to restore here. 2) dmgtDbg.log has gotten a lot more chatty ever since 15:01 yesterday, logging every minute, right after migration.log stoped updating:

[Fri May 04 14:58:40 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

[Fri May 04 14:58:40 EDT 2007][dmgrDbg]##### INFO ##### re-evaluate DbgLevel=0x0

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_DEBUG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_TRACE)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg]##### INFO ##### re-evaluate DbgLevel=0x0

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_DEBUG)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_MY_TRACE)=NULL

[Fri May 04 15:01:24 EDT 2007][dmgrDbg][dmgrDbg] getenv(PX_DBG_LEVEL)=NULL

Joe Clarke Sat, 05/05/2007 - 10:06

Your migration may be failing due to the symlinks you said you were using previously. Doing a backup of this partially migrated data will not yield any usable results, so I wouldn't trust that backup. The full migration.log will be needed to see if using symlinks is the problem.

Joe Clarke Sat, 05/05/2007 - 10:14

I'd also like to know if dmgtd is running at this point? The restore log says it is not, but it looks like the script may be stuck in a loop trying to shutdown the database servers. If dmgtd is still running, it may be trying to respawn these processes.

Please provide the full output of ps -efl.

yjdabear Tue, 05/08/2007 - 07:32

Could you take a look at TAC case 605960929 that I just opened for this issue? Thanks.

Joe Clarke Tue, 05/08/2007 - 07:55

I assume this is still running now, and this ps output is current. What is the cksum on /etc/init.d/dmgtd? I want to make sure you have the correct version of my patch. It looks like the dmgtd script is not seeing the dmgtd.sol process. Also, if you run /etc/init.d/dmgtd stop now, does it also report that dmgtd is not running?

yjdabear Tue, 05/08/2007 - 08:01

cksum dmgtd

333832757 23182 dmgtd

/etc/init.d/dmgtd stop

Daemon Management stopping. This may take a few minutes.

BTW, the only symlink in use is NMSROOT (/opt/CSCOpx/) pointing to /product/CSCO/CSCOpx/, but I think that's supported.

Joe Clarke Tue, 05/08/2007 - 08:05

That's the right patch, and that's what you should see when dmgtd goes down. With dmgtd down, can you try the restore again? If it hangs again, I'd like to see the new ps output, and we may need to get a truss.

yjdabear Tue, 05/08/2007 - 08:12

Actually the restorebackup.pl script did prompt me to perform "/etc/init.d/dmgtd stop" before it could start, so I suppose it'd probably just hang again. I'll go ahead give it another try and see what happens.

yjdabear Tue, 05/08/2007 - 15:11

Looks like this second restore is going down the exact same path as the first one. I should note that the restore target server has two LMS patches the restore source does not: the patched dmgtd for CSCsi74887 and the fix for CSCsg26840 user must be Network Admin to use CiscoView. Would these discrepancies cause this restore failure?

ps -ef | grep restore

casuser 18973 16548 0 19:03:08 pts/2 0:00 grep restore

root 28868 27578 0 14:37:38 pts/4 0:11 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/restorebackup.pl -d /product/

-rw-r--r-- 1 root other 2693888 May 8 14:37 restorebackup.log.old

-rw-rw---- 1 root casusers 68234 May 8 15:19 CCRImport.log

-rw-r----- 1 casuser casusers 36059 May 8 15:23 ess.log

-rw-r----- 1 casuser casusers 981749 May 8 15:23 SyslogCollector.log

-rw-rw---- 1 casuser casusers 131257 May 8 15:25 cri.log

-rw-r----- 1 casuser casusers 4874 May 8 15:26 stpeng.log

-rw-rw---- 1 casuser casusers 1740435 May 8 15:26 EssentialsDM_Server.log

-rw-rw---- 1 casuser casusers 2580742 May 8 15:26 dcmaservice.log

-rw-rw---- 1 casuser casusers 15486 May 8 15:26 ca.log

-rw-r----- 1 casuser casusers 9572 May 8 15:27 CMFOGSServer.log

-rw-rw---- 1 casuser casusers 25736 May 8 15:27 AnalyzerDebug.log

-rw-r----- 1 casuser casusers 49719 May 8 15:27 lwms.log

-rw-rw---- 1 casuser casusers 5419488 May 8 15:29 NetShowServer.log

-rw-r--r-- 1 root other 13007803 May 8 15:31 migration.log

-rw-r--r-- 1 casuser casusers 1755759 May 8 15:31 restorebackup.log

-rw-r----- 1 casuser casusers 67085 May 8 15:36 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 1730657 May 8 15:36 pidm.log

-rw-rw---- 1 casuser casusers 446547 May 8 15:36 license.log

-rw-r----- 1 casuser casusers 3002 May 8 16:02 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 8 16:02 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 35518 May 8 16:02 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 404179 May 8 16:02 NCTemplateMgr.log

-rw-rw---- 1 root casusers 31270 May 8 16:02 rme_base.log

-rw-rw---- 1 root casusers 8124634 May 8 16:02 rme_ctm.log

-rw-rw---- 1 casuser casusers 5440687 May 8 16:02 IC_Server.log

-rw-rw---- 1 casuser casusers 1686819 May 8 16:03 dmgtd.log

-rw-rw---- 1 casuser casusers 75698 May 8 16:03 daemons.log

-rw-rw---- 1 casuser casusers 167922 May 8 16:03 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 2321 May 8 18:41 diskWatcher.log

-rw-rw---- 1 casuser casusers 31417435 May 8 18:56 dmgtDbg.log

Joe Clarke Tue, 05/08/2007 - 16:43

These patches should not be causing this problem. Assuming the migration.log looks the same as before, the migration is complete, so things should start happening in the restore log. In that case, please get the output of:

lsof -p 28868

And the /tmp/restore.truss output of the following command (let it run for a minute, then do control+C):

truss -a -f -vall -rall -wall -o /tmp/restore.truss -p 28868

yjdabear Wed, 05/09/2007 - 05:28

Please find the outputs in the two new attachments to TAC case 605960929. Thanks.

lsof-p.txt

restore.truss

Joe Clarke Wed, 05/09/2007 - 09:38

Okay, I think I've isolated where the problem is occurring. It looks like a problem with /etc/init.d/dmgtd stop. This points to a problem with my previous patch, but I have not been able to reproduce it. First, capture a current ps -efl, and attach to the SR. Then run exactly the following command:

/etc/init.d/dmgtd stop

Capture any output, and let me know if the restore continues.

yjdabear Wed, 05/09/2007 - 10:16

That seems to be the jolt it needed:

Restoration of the RME data is in progress...

See the log file /var/adm/CSCOpx/log/migration.log for progress...

[insert "/etc/init.d/dmgtd stop" here]

doRestore of [Resource Manager Essentials] has completed.

postRestore of [Common Services] has started.

postRestore of [Common Services] has completed.

postRestore of [Campus Manager] has started.

postRestore of [Campus Manager] has completed.

postRestore of [Resource Manager Essentials] has started.

postRestore of [Resource Manager Essentials] has completed.

I attached ps-efl2.txt, the "new" restorebackup.log and migration.log.zip to the Case.

Joe Clarke Wed, 05/09/2007 - 10:21

And just to be absolutely clear, you ran "/etc/init.d/dmgtd stop" to shutdown dmgtd?

Joe Clarke Wed, 05/09/2007 - 12:14

Okay, so this is what's happening, and this is why I'm confused. The RME restore framework tries to clean up after itself. Part of this involves unregistering the old RME 3.x database. In order to do this, it needs to wait for dmgtd to stop. So it issues the command:

/etc/init.d/dmgtd stop

And then waits forever for it to stop. Based on your previous logs, it looks like this execution of dmgtd reports that dmgtd is not running when it really is. BUT, when YOU run the SAME command manually, it stops.

So, what is different between the session where you run "/etc/init.d/dmgtd stop" and the session that is running the restorebackup.pl script?

Joe Clarke Wed, 05/09/2007 - 12:48

If you can't find anything different between the two sessions, you can change line 605 of /opt/CSCOpx/bin/RestoreTools/rmeng/rmengRestoreLib.pm to:

my $sol_script = "sh -x /etc/init.d/dmgtd stop";

Then reproduce the problem. The output should be much more verbose.

yjdabear Wed, 05/09/2007 - 15:54

My turn to get confused: Why is RME 3.x still involved in this RME 4.x to 4.x restore? As far as I can think of, there's no difference. Both are (were) SSH sessions via Putty. I was under the non-root UID 0 account when I issued the "restoredb.pl" or "dmgtd stop" command.

If "restoredb.pl" calls "dmgtd stop" after all, why does it terminate and prompt for manually performing "dmgtd stop"?

Joe Clarke Wed, 05/09/2007 - 17:13

Sorry. The restore framework will stop dmgtd after the RME restoration regardless. Stopping it before unconfiguring the RME 3.x database is just another time it will call this function.

The restore doesn't terminate, it loops until dmgtd stop is called again. This is what is confusing me. The fact that "dmgtd stop" reports that dmgtd is not running the first time doesn't make any sense. I've been over the code numerous times, and I just can't see why that would happen. I am hoping the sh -x output sheds more light on this.

yjdabear Thu, 05/10/2007 - 11:35

The third restore has stalled. Restorebackup.log seems to be the verbose one. Attached below is the tailend of

restorebackup.log. I attached the full restorebackup.log, migration.log, and a current ps -efl out to Case 605960929.

-rw-r--r-- 1 root other 20319015 May 10 15:23 migration.log

-rw-r--r-- 1 casuser casusers 1855820 May 10 15:23 restorebackup.log

-rw-rw---- 1 casuser casusers 1731073 May 10 15:28 pidm.log

-rw-r----- 1 casuser casusers 101469 May 10 15:28 CampusOGSServer.log

-rw-rw---- 1 casuser casusers 449038 May 10 15:28 license.log

-rw-rw---- 1 casuser casusers 798 May 10 15:48 diskWatcher.log

-rw-r----- 1 casuser casusers 3002 May 10 15:54 RMEDbMonitor.log

-rw-r----- 1 casuser casusers 3002 May 10 15:54 CmfDbMonitor.log

-rw-rw---- 1 casuser casusers 36707 May 10 15:54 CTMJrmServer.log

-rw-rw---- 1 casuser casusers 405463 May 10 15:54 NCTemplateMgr.log

-rw-rw---- 1 root casusers 31753 May 10 15:54 rme_base.log

-rw-rw---- 1 root casusers 8222720 May 10 15:54 rme_ctm.log

-rw-rw---- 1 casuser casusers 5442462 May 10 15:54 IC_Server.log

-rw-rw---- 1 casuser casusers 1720983 May 10 15:54 dmgtd.log

-rw-rw---- 1 casuser casusers 75387 May 10 15:54 daemons.log

-rw-rw---- 1 casuser casusers 174526 May 10 15:54 RMECSTMServerlog

-rw-rw---- 1 casuser casusers 39813665 May 10 15:57 dmgtDbg.log

tail -f daemons.log

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;jrm;1;0

TomcatMonitor: non-handled Msg from dmgtd:1111708500

TomcatMonitor: non-handled Msg from dmgtd:1111708500

TomcatMonitor: non-handled Msg from dmgtd:1111708500

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;EDS-GCF;0;0

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;CampusOGSServer;0;0

TomcatMonitor: non-handled Msg from dmgtd:1111708500

TomcatMonitor: non-handled Msg from dmgtd:1111708500

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;CMFOGSServer;0;0

[ Thu May 10 15:54:56 EDT 2007 ]EssMonitor: Bcst Msg next line: jrm;DCRServer;0;0

Correct Answer
Joe Clarke Thu, 05/10/2007 - 12:20

Ah-ha! Found it. NMSROOT is reset by the fact that restorebackup.pl uses Perl. Instead of being the non-canonical /opt/CSCOpx, it is /product/CSCO/CSCOpx. Since dmgtd.sol always runs as /opt/CSCOpx/objects/dmgt/dmgtd.sol, the dmgtd script can't find it.

If you modify /etc/init.d/dmgtd and change line 30 to:

if echo ${i} | /usr/bin/cut -d, -f2 | /usr/bin/egrep "^/opt/CSCOpx/objects/dmgt/${DMGRFILE}$" >/dev/null 2>&1 ; then

This problem will go away. I don't like using hardcoded values like, but I think it's safe in this particular case. Sorry for the mess.

yjdabear Fri, 05/11/2007 - 12:12

Yeah, that did it. The fourth restore finished in under 50 mins. Is there no better way to make it cope with the $NMSROOT variable?

Joe Clarke Fri, 05/11/2007 - 13:14

The other solution would be to add:

unset PKG

Before obtaining NMSROOT in the dmgtd script. However, I think we're okay with the hardcoded /opt/CSCOpx here since dmgtd.sol will always be run as /opt/CSCOpx/objects/dmgt/dmgtd.sol.

Actions

This Discussion