05-02-2007 07:33 PM
This issue has now happened twice. The basic problem is that "dmgtd stop" in LMS 2.6 fails to stop all processes before shutting itself down. It leaves a trail of procs holding sockets in CLOSE_WAIT that never time out, even after I tried adjusting the timer on the fly with "/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 60000". Running "/opt/CSCOpx/bin/dbstop.pl all" does no good.
So eventually I turned to /etc/rc2.d/K90dmgtd. The last time I tried it (I'm unsure whether I used the stop or kill option), it killed both of my SSH sessions, sshd/telnetd/inetd, and the EMC storage manager daemon, but left the box pingable. Tonight, thinking last time was a coincidence, I tried it again, given the lingering CLOSE_WAITs. This time "./K90dmgtd stop" returned "Daemon Management is not running." Then as soon as I issued "./K90dmgtd kill", I got "INFO: Terminating all the processes launched by dmgr", followed by immediate termination of that particular SSH session. My second SSH session remains alive, but no new telnet/ssh connections are accepted. I imagine I'll have to reboot the server again.
05-16-2007 12:05 PM
Looks like you're missing two lines from the end of /etc/rc.config.d/CiscoRMCtrl:
PX_SYS_CHECK=1
export PX_SYS_CHECK
05-02-2007 07:34 PM
netstat -n | grep 43441
127.0.0.1.49254 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49254 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49294 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49294 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49327 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49327 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49375 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49375 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49479 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49479 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49499 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49499 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49523 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49523 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49592 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49592 32768 0 32768 0 FIN_WAIT_2
after 15 mins
127.0.0.1.49254 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49294 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49327 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49375 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49479 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49499 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49523 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49592 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
lsof -i :43441
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 16371 casuser 5u IPv4 0x3003bc2d0b0 0t9218 TCP localhost:49254->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 10u IPv4 0x3001b576960 0t4996 TCP localhost:49294->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 12u IPv4 0x3003c2924d8 0t5192 TCP localhost:49479->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 13u IPv4 0x3001e8d2648 0t15837 TCP localhost:49523->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 19u IPv4 0x300242f0328 0t8546 TCP localhost:49592->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16447 casuser 8u IPv4 0x3001b568c70 0t88176 TCP localhost:49327->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16468 casuser 12u IPv4 0x30024349ce0 0t39726 TCP localhost:49499->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16536 casuser 8u IPv4 0x3003bc2c330 0t4974 TCP localhost:49375->localhost:cscocmfdb (CLOSE_WAIT)
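For anyone repeating this check, here is a minimal sketch (not part of LMS) that tallies socket states from "netstat -n" style output. The function name count_tcp_states is made up for illustration. One point worth keeping in mind when reading the numbers: CLOSE_WAIT means the peer has closed and the local process has not yet closed its end, so the socket stays as long as the owning process holds the descriptor. tcp_time_wait_interval only governs TIME_WAIT, which would explain why the ndd change had no effect on these.

```shell
#!/bin/sh
# count_tcp_states (hypothetical helper): read netstat -n style lines on
# stdin and print "<count> <state>" for each state name found in the
# last column, sorted by state name.
count_tcp_states() {
    awk 'NF > 1 && $NF ~ /^[A-Z][A-Z_0-9]*$/ { n[$NF]++ }
         END { for (s in n) print n[s], s }' | sort -k2
}

# Demo with two lines copied from the netstat output in this post.
count_tcp_states <<'EOF'
127.0.0.1.49254 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49254 32768 0 32768 0 FIN_WAIT_2
EOF
```

On a live box the same function could be fed with "netstat -n | grep 43441 | count_tcp_states".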
05-02-2007 07:34 PM
60 mins later
ps -ef | grep -i cscopx
casuser 16447 1 0 22:26:45 ? 0:21 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx192m
casuser 16469 1 0 22:26:49 ? 0:19 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16510 1 0 22:26:57 ? 0:04 /opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx:/product/CSCO/CSCOpx/lib/classpath:/pr
casuser 16371 1 0 22:26:34 ? 0:09 /opt/CSCOpx/bin/cwjava -cw:jre /opt/CSCOpx/MDC/jre -Xmx256m -Xss1m -cp:p /opt/C
casuser 16471 1 0 22:26:49 ? 0:00 /opt/CSCOpx/lib/vbroker/bin/osagent -p 42342
casuser 16575 1 0 22:27:29 ? 0:04 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms256m -Xmx256
root 21202 6485 0 23:29:43 pts/8 0:00 grep -i cscopx
casuser 16576 1 0 22:27:29 ? 0:00 /opt/CSCOpx/objects/db/bin/diskWatcher diskWatcher
casuser 16446 1 0 22:26:45 ? 0:16 /opt/CSCOpx/bin/cwjava -cw:jre /product/CSCO/CSCOpx/MDC/jre -Xmx512m -cp /produ
casuser 16536 1 0 22:27:07 ? 0:04 /opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx com.cisco.nm.cmf.eds.gcf.Main
casuser 16550 1 0 22:27:16 ? 0:04 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16500 1 0 22:26:53 ? 0:02 /opt/CSCOpx/bin/cwjava -cp:a lib/classpath/servlet.jar -Dvbroker.agent.port=423
casuser 16574 1 0 22:27:29 ? 0:05 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16445 1 0 22:26:45 ? 0:13 /opt/CSCOpx/bin/cwjava -cw:jre /opt/CSCOpx/lib/jre -Xms64m -Xmx512m -cp:a /opt/
casuser 16468 1 0 22:26:49 ? 0:26 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16470 1 0 22:26:49 ? 0:18 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx512m
05-02-2007 07:53 PM
I'm told a whole slew of Solaris daemons were killed as a result of "./K90dmgtd kill", including crond and LPD/LPR.
05-02-2007 09:29 PM
I've heard reports of this before, but an analysis of the code doesn't reveal any obvious errors unless there is a bug in ptree.
What would really help, if you can reproduce this again, is to modify K90dmgtd (after backing it up), comment out line 220, and add a line above 218:
echo "XXX: Executing /usr/bin/kill -\"$SIG\" $line > /dev/null 2>&1"
Add a line before the original 645:
/usr/bin/ptree $dmpid | /usr/bin/fgrep -v dbsrv > /opt/dm_procs
Then, when things fail, get /opt/dm_pids, /opt/dm_procs, and any output from the command that you can capture.
Note: dmgtd kill should never be used except under extreme circumstances. You should be using dmgtd stop instead. Of course, this should not happen regardless. In any event, the socket problems are mostly done away with in newer versions of Solaris. Solaris 8 will no longer be supported in LMS 3.0, so this should not be a problem moving forward.
05-02-2007 09:47 PM
Something else just occurred to me. If you're running another process that has "dmgtd.sol" in its name, then the script will kill off that process tree as well. I think I can fix this.
Try changing lines 644 and 630 to:
dmpid=`/usr/bin/ps -ef | /usr/bin/fgrep "${NMSROOT}/objects/dmgt/${DMGRFILE}" | /usr/bin/fgrep -v fgrep | /usr/bin/awk '{ print $2 }'`
That should fix the problem. I'll file a bug on this.
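To illustrate why matching on the full path helps, here is a small sketch that compares a loose match on the bare file name with the stricter full-path match. The helper names loose_match and strict_match and the sample ps lines are invented for the demo. NMSROOT and DMGRFILE use the values that show up for this install elsewhere in the thread. "grep -F" stands in for fgrep so the sketch runs on current systems without warnings. This is not the actual dmgtd code.

```shell
#!/bin/sh
# Assumed values for this install.
NMSROOT=/opt/CSCOpx
DMGRFILE=dmgtd.sol

# loose_match: PIDs of every ps -ef line that merely contains the file name.
loose_match() {
    grep -F "$DMGRFILE" | grep -F -v fgrep | awk '{ print $2 }'
}

# strict_match: PIDs of lines that contain the daemon manager's full path.
strict_match() {
    grep -F "${NMSROOT}/objects/dmgt/${DMGRFILE}" | grep -F -v fgrep |
        awk '{ print $2 }'
}

# Made-up ps -ef lines: the real daemon manager plus an unrelated process
# whose command line happens to contain "dmgtd.sol".
sample='root 501 1 0 22:26:30 ? 0:01 /opt/CSCOpx/objects/dmgt/dmgtd.sol
root 777 1 0 21:00:00 ? 0:00 /usr/local/bin/watch-dmgtd.sol.sh'

echo "loose:  $(printf '%s\n' "$sample" | loose_match | tr '\n' ' ')"
echo "strict: $(printf '%s\n' "$sample" | strict_match | tr '\n' ' ')"
```

With the loose match, the unrelated PID would be handed to ptree and its whole process tree would be signalled along with the CiscoWorks processes.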
05-02-2007 10:24 PM
I filed CSCsi74887 to track this problem. I came up with a better patch that can be made available by calling the TAC.
05-03-2007 06:02 AM
Thanks for your prompt response. Does the patch address the "dmgtd stop" not shutting down all procs issue or the K90dmgtd hazard or both?
Would it be feasible to manually kill off those cwjava procs in that situation, instead of risking it with K90dmgtd again? If so, in what order should those procs be terminated?
05-03-2007 07:33 AM
My patch addresses the killing of non-CiscoWorks processes. I have not been able to reproduce the problem with dmgtd not shutting down all processes. The ptree command should take care of this. ptree is used for both dmgtd stop and dmgtd kill. It basically comes down to a question of signal (15 vs. 9). But I think the patch will definitely help you.
As for killing cwjava processes manually, there is no order. Just make sure that the databases are stopped using dbstop.pl all, and that the databases are stopped last.
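A cautious way to approach the manual cleanup is to generate the commands first and review them before running anything. The sketch below is a dry run only: it reads "ps -ef" output, prints a "kill -15" line for each cwjava process owned by the given user, and prints the dbstop.pl call last, following the advice above. The function name plan_manual_stop is made up, and the choice of signal 15 before anything harsher is an assumption, not a documented procedure.

```shell
#!/bin/sh
# plan_manual_stop USER (hypothetical helper, dry run):
# read ps -ef lines on stdin, print the commands that would be run.
# Nothing is killed; the output is meant to be reviewed by hand.
plan_manual_stop() {
    awk -v u="$1" '
        # Only lines owned by the given user whose command is .../cwjava
        $1 == u && $0 ~ /\/cwjava( |$)/ { print "kill -15 " $2 }
        # Databases go last
        END { print "/opt/CSCOpx/bin/dbstop.pl all" }
    '
}

# Demo with three lines copied from the ps output earlier in the thread.
plan_manual_stop casuser <<'EOF'
casuser 16447 1 0 22:26:45 ? 0:21 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx192m
casuser 16471 1 0 22:26:49 ? 0:00 /opt/CSCOpx/lib/vbroker/bin/osagent -p 42342
root 21202 6485 0 23:29:43 pts/8 0:00 grep -i cscopx
EOF
```

Other leftover CiscoWorks processes such as osagent or diskWatcher are deliberately not matched here and would need their own review.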
05-04-2007 06:57 AM
I received the patch from TAC. I'm wondering if I need to make those changes you provided earlier to the patched dmgtd, or the patch incorporates those already. Will the patch yield /opt/dm_pids, /opt/dm_procs, etc. mentioned above?
A somewhat related question: It appears /etc/rc2.d/K90dmgtd and /etc/rc3.d/S10dmgtd are hard links to /etc/init.d/dmgtd, while /etc/rc2.d/K40ipm and /etc/rc3.d/S93ipm are distinct (but identical) copies of /etc/init.d/ipmrc. Why the different approach?
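For reference, one way to confirm whether two paths are hard links to the same file rather than separate copies is to compare their inode numbers. A minimal sketch, with a made-up helper name same_inode, and assuming both paths live on the same filesystem (inode numbers are only comparable within one filesystem):

```shell
#!/bin/sh
# same_inode A B (hypothetical helper): succeed when both paths report
# the same inode number, i.e. they are hard links to one file.
same_inode() {
    a=$(ls -di "$1" 2>/dev/null | awk '{ print $1 }')
    b=$(ls -di "$2" 2>/dev/null | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, "same_inode /etc/rc2.d/K90dmgtd /etc/init.d/dmgtd && echo hard link" versus "cmp /etc/rc2.d/K40ipm /etc/init.d/ipmrc" to check that separate copies are at least identical in content.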
05-04-2007 08:25 AM
Since I've isolated (and reproduced) the most likely cause of the non-CiscoWorks process termination, you do not need to modify the dmgtd script further.
IPM evolved from CiscoWorks Blue, and many of the things that used to happen there were carried over. This is all going away in LMS 3.0 when IPM is put under dmgtd control.
05-08-2007 06:39 PM
Good stuff. Two for two with dmgtd stop tonight without a hiccup.
05-16-2007 11:55 AM
I know this is unsupported and all, but I copied the entire LMS 2.6 installation over from one box to another. Why am I getting the following when trying to bring up dmgtd?
-r-xr-x--- 1 casuser casusers 23183 May 16 15:30 dmgtd
# /etc/init.d/dmgtd start
/etc/init.d/dmgtd: test: argument expected
Is this a shell problem?
05-16-2007 11:57 AM
Highly unsupported.
I'm betting you didn't copy over the /var/sadm stuff, and thus nothing will work. "sh -x /etc/init.d/dmgtd start" will shed more light on this.
05-16-2007 11:59 AM
I did copy over the /var/sadm/pkg/CSCO* and /var/sadm/install/contents. Here's what "sh -x /etc/init.d/dmgtd start" returns:
sh -x /etc/init.d/dmgtd start
LC_ALL=C
+ export LC_ALL
+ . /opt/CSCOpx/etc/commonscript.sh
+ set +u
PATH=/usr/bin:/usr/sbin:/usr/dt/bin:/usr/openwin/bin:/bin:/usr/ucb:/etc:/usr/local/bin:/opt/OV/bin:/usr/sbin:/usr/bin:/usr/etc:/usr/ucb
+ export PATH
PX_TMP_INSTBASE=/var/tmp/.CSCOsave
FILE_VERSION=3.0
RANDOM_PORT_FILE=/tmp/cscotmp/random_ports
+ GetOS
+ Debug GetOS .
+ [ -z -o = 0 ]
+ return 0
+ uname
OS=SunOS
OS=SOL
+ tr [:upper:] [:lower:]
+ echo SOL
LC_OS=sol
+ export OS LC_OS
+ return 0
+ GetDebug
+ [ -f /tmp/cscotmp/CSCO.debug ]
+ [ SOL = AIX -o SOL = HPUX ]
DEBUG=0
+ export DEBUG
+ return 0
+ [ SOL = HPUX ]
+ [ SOL = HPUX ]
+ GetLibPath
+ Debug GetLibPath .
+ [ -z 0 -o 0 = 0 ]
+ return 0
ENV_LIBPATH=LD_LIBRARY_PATH
+ return 0
+ GetInitDir
+ Debug GetInitDir .
+ [ -z 0 -o 0 = 0 ]
+ return 0
INIT_DIR=/etc
+ return 0
+ GetBootScript
+ Debug GetBootScript .
+ [ -z 0 -o 0 = 0 ]
+ return 0
BOOT_SCRIPT=/etc/init.d/dmgtd
+ return 0
+ GetNMSRoot
+ Debug GetNMSRoot .
+ [ -z 0 -o 0 = 0 ]
+ return 0
+ [ -f /var/adm/CSCOpx/CSCO.nmsroot ]
+ cat /var/adm/CSCOpx/CSCO.nmsroot
NMSROOT=/product/CSCO/CSCOpx
IU_ROOT=/product/CSCO/CSCOpx
+ return 0
+ GetIMode
+ Debug GetIMode .
+ [ -z 0 -o 0 = 0 ]
+ return 0
+ [ ! -f /tmp/cscotmp/CSCO.i_mode ]
I_MODE=NEW
+ return 0
+ Get_UserGroup
+ [ -f /tmp/cscotmp/CSCO.owner ]
+ grep uninstall.sh+ echo
/etc/init.d/dmgtd
+ [ 1 = 1 ]
+ trap SigTrap 2
PATH=/usr/sbin:/usr/bin:/usr/etc:/usr/ucb
+ export PATH
+ CheckUID
+ sed -e s/(.*$// -e s/^.*=//
+ /bin/id
CUR_UID=0
+ [ 0 != 0 ]
+ unset CUR_UID
+ return 0
+ GetOS
+ uname
OS=SunOS
OS=SOL
EXT=sol
+ [ SOL = HPUX ]
PKGNAME=CSCOmd
DB=CSCOdb
LOGFILE=/var/adm/CSCOpx/log/daemons.log
SAVEFILE=/var/adm/CSCOpx/log/daemonsbackup.log
PX_LOGFILE=/var/adm/CSCOpx/log/dmgtd.log
+ [ SOL = AIX ]
CRTL_DIR=/etc/rc.config.d
CTRL_FILE=CiscoRMCtrl
DMGRD=dmgtd.
DMGR_OS=sol
DMGRFILE=dmgtd.sol
MAXLOGSIZE=30000
+ GetPkgParam CSCOmd NMSROOT
+ [ 2 -ne 2 ]
PKG_NAME=CSCOmd
PARAM_NAME=NMSROOT
ANS_FILE=/product/CSCO/CSCOpx/etc/.CSCOmd.ans
+ [ SOL = HPUX -o SOL = AIX ]
+ CheckPkgInstalled CSCOmd
+ [ 1 -ne 1 ]
PKG_NAME=CSCOmd
+ pkginfo -q CSCOmd
+ return 0
+ [ 0 != 0 ]
+ [ SOL = SOL ]
+ [ x = xCSCOmd ]
+ pkgparam CSCOmd NMSROOT
L_RET_VAL=0
+ [ 0 != 0 ]
+ unset ANS_FILE PKG_NAME PARAM_NAME
+ return 0
NMSROOT=/opt/CSCOpx
+ : /opt/CSCOpx
+ GetPkgParam CSCOmd PX_USER
+ [ 2 -ne 2 ]
PKG_NAME=CSCOmd
PARAM_NAME=PX_USER
ANS_FILE=/opt/CSCOpx/etc/.CSCOmd.ans
+ [ SOL = HPUX -o SOL = AIX ]
+ CheckPkgInstalled CSCOmd
+ [ 1 -ne 1 ]
PKG_NAME=CSCOmd
+ pkginfo -q CSCOmd
+ return 0
+ [ 0 != 0 ]
+ [ SOL = SOL ]
+ [ x = xCSCOmd ]
+ pkgparam CSCOmd PX_USER
L_RET_VAL=0
+ [ 0 != 0 ]
+ unset ANS_FILE PKG_NAME PARAM_NAME
+ return 0
PX_USER=casuser
+ : casuser
+ GetPkgParam CSCOmd PX_GROUP
+ [ 2 -ne 2 ]
PKG_NAME=CSCOmd
PARAM_NAME=PX_GROUP
ANS_FILE=/opt/CSCOpx/etc/.CSCOmd.ans
+ [ SOL = HPUX -o SOL = AIX ]
+ CheckPkgInstalled CSCOmd
+ [ 1 -ne 1 ]
PKG_NAME=CSCOmd
+ pkginfo -q CSCOmd
+ return 0
+ [ 0 != 0 ]
+ [ SOL = SOL ]
+ [ x = xCSCOmd ]
+ pkgparam CSCOmd PX_GROUP
L_RET_VAL=0
+ [ 0 != 0 ]
+ unset ANS_FILE PKG_NAME PARAM_NAME
+ return 0
PX_GROUP=casusers
+ : casusers
CUR_SYS_TZ=US/Eastern
+ [ -f /etc/rc.config.d/CiscoRMCtrl ]
+ . /etc/rc.config.d/CiscoRMCtrl
PX_START_UP=1
PX_RETRY_COUNTER=3
PX_RESET_TIME=300
PX_CORE_SIZE=10
+ export PX_START_UP PX_RETRY_COUNTER PX_RESET_TIME PX_CORE_SIZE
PX_STOP_COUNTER=3
PX_START_DELAY=60000
PX_STOP_DELAY=1000
+ export PX_START_DELAY PX_STOP_COUNTER PX_STOP_DELAY
PX_SETPERM=casuser
+ export PX_SETPERM
PX_OPENFILES=256
+ export PX_OPENFILES
+ [ -eq 1 ]
/etc/init.d/dmgtd: test: argument expected
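The last trace line, "[ -eq 1 ]", is what an unquoted comparison such as "[ $VAR -eq 1 ]" expands to when the variable is empty or unset: test is left with no left-hand operand and reports a syntax error. The sketch below reproduces that failure mode and shows a guarded form. Using PX_SYS_CHECK as the variable is an assumption based on the CiscoRMCtrl reply in this thread, and flag_enabled is a made-up helper, not code from dmgtd.

```shell
#!/bin/sh
# flag_enabled VALUE (hypothetical helper): succeed only when VALUE is 1.
# "${1:-0}" substitutes 0 for an unset or empty value, so test always
# receives a left-hand operand.
flag_enabled() {
    [ "${1:-0}" -eq 1 ] 2>/dev/null
}

PX_SYS_CHECK=""   # simulate the value being absent from the sourced file

# Unguarded form: expands to `[ -eq 1 ]`, which test rejects as malformed.
# Run in a subshell with stderr discarded so the demo keeps going.
if ( [ $PX_SYS_CHECK -eq 1 ] ) 2>/dev/null; then
    echo "unguarded: enabled"
else
    echo "unguarded: test errored or flag is off"
fi

# Guarded form: an empty value is simply treated as "off".
if flag_enabled "$PX_SYS_CHECK"; then
    echo "guarded: enabled"
else
    echo "guarded: flag is off"
fi
```

The guarded form only hides the symptom. If the control file is missing settings after the copy, restoring them is still the actual fix.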