Major hazard with /etc/rc2.d/K90dmgtd on Solaris 8

Answered Question
May 2nd, 2007
User Badges:
  • Gold, 750 points or more

A similar issue has now happened twice. The basic problem is that LMS 2.6's "dmgtd stop" fails to stop all procs before shutting itself down, so it leaves a trail of procs holding sockets in CLOSE_WAIT that never time out, even after I tried adjusting on the fly with "/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 60000". "/opt/CSCOpx/bin/dbstop.pl all" does no good either.


So eventually I turned to /etc/rc2.d/K90dmgtd. The last time I tried it (I'm unsure whether I used the stop or kill option), it killed both of my SSH sessions, sshd/telnetd/inetd, and the EMC storage manager daemon, but left the box pingable. Tonight, thinking last time was a coincidence, I tried it again, given the CLOSE_WAITs above. This time "./K90dmgtd stop" returned "Daemon Management is not running." Then, as soon as I issued "./K90dmgtd kill", I got "INFO: Terminating all the processes launched by dmgr", followed by immediate termination of that particular SSH session. My second SSH session remains alive, but no new telnet/ssh connections are accepted. I imagine I'll have to reboot the server again.

yjdabear Wed, 05/02/2007 - 19:34

netstat -n | grep 43441
127.0.0.1.49254 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49254 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49294 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49294 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49327 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49327 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49375 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49375 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49479 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49479 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49499 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49499 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49523 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49523 32768 0 32768 0 FIN_WAIT_2
127.0.0.1.49592 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.43441 127.0.0.1.49592 32768 0 32768 0 FIN_WAIT_2


After 15 minutes:

127.0.0.1.49254 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49294 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49327 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49375 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49479 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49499 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49523 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT
127.0.0.1.49592 127.0.0.1.43441 32768 0 32768 0 CLOSE_WAIT


lsof -i :43441
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 16371 casuser 5u IPv4 0x3003bc2d0b0 0t9218 TCP localhost:49254->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 10u IPv4 0x3001b576960 0t4996 TCP localhost:49294->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 12u IPv4 0x3003c2924d8 0t5192 TCP localhost:49479->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 13u IPv4 0x3001e8d2648 0t15837 TCP localhost:49523->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 19u IPv4 0x300242f0328 0t8546 TCP localhost:49592->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16447 casuser 8u IPv4 0x3001b568c70 0t88176 TCP localhost:49327->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16468 casuser 12u IPv4 0x30024349ce0 0t39726 TCP localhost:49499->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16536 casuser 8u IPv4 0x3003bc2c330 0t4974 TCP localhost:49375->localhost:cscocmfdb (CLOSE_WAIT)
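
For reference, the lsof listing above can be tallied per process. CLOSE_WAIT means the peer has closed its end and the local process (cwjava here) has not yet closed its own socket, so tcp_time_wait_interval, which governs TIME_WAIT, would not be expected to clear these. A minimal sketch, assuming the lsof output has been captured as text; the function name count_close_wait is hypothetical:

```shell
#!/bin/sh
# lsof output copied from the post above.
LSOF_OUT='COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 16371 casuser 5u IPv4 0x3003bc2d0b0 0t9218 TCP localhost:49254->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 10u IPv4 0x3001b576960 0t4996 TCP localhost:49294->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 12u IPv4 0x3003c2924d8 0t5192 TCP localhost:49479->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 13u IPv4 0x3001e8d2648 0t15837 TCP localhost:49523->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16445 casuser 19u IPv4 0x300242f0328 0t8546 TCP localhost:49592->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16447 casuser 8u IPv4 0x3001b568c70 0t88176 TCP localhost:49327->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16468 casuser 12u IPv4 0x30024349ce0 0t39726 TCP localhost:49499->localhost:cscocmfdb (CLOSE_WAIT)
cwjava 16536 casuser 8u IPv4 0x3003bc2c330 0t4974 TCP localhost:49375->localhost:cscocmfdb (CLOSE_WAIT)'

# Read lsof text on stdin; print "PID count" for sockets in CLOSE_WAIT,
# sorted by PID. The state is the last whitespace-separated field.
count_close_wait() {
    awk '$NF == "(CLOSE_WAIT)" { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

printf '%s\n' "$LSOF_OUT" | count_close_wait
```

On this data, PID 16445 holds four of the eight stuck sockets, which makes it the first process to look at.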


yjdabear Wed, 05/02/2007 - 19:34

60 mins later

ps -ef | grep -i cscopx
casuser 16447 1 0 22:26:45 ? 0:21 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx192m
casuser 16469 1 0 22:26:49 ? 0:19 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16510 1 0 22:26:57 ? 0:04 /opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx:/product/CSCO/CSCOpx/lib/classpath:/pr
casuser 16371 1 0 22:26:34 ? 0:09 /opt/CSCOpx/bin/cwjava -cw:jre /opt/CSCOpx/MDC/jre -Xmx256m -Xss1m -cp:p /opt/C
casuser 16471 1 0 22:26:49 ? 0:00 /opt/CSCOpx/lib/vbroker/bin/osagent -p 42342
casuser 16575 1 0 22:27:29 ? 0:04 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms256m -Xmx256
root 21202 6485 0 23:29:43 pts/8 0:00 grep -i cscopx
casuser 16576 1 0 22:27:29 ? 0:00 /opt/CSCOpx/objects/db/bin/diskWatcher diskWatcher
casuser 16446 1 0 22:26:45 ? 0:16 /opt/CSCOpx/bin/cwjava -cw:jre /product/CSCO/CSCOpx/MDC/jre -Xmx512m -cp /produ
casuser 16536 1 0 22:27:07 ? 0:04 /opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx com.cisco.nm.cmf.eds.gcf.Main
casuser 16550 1 0 22:27:16 ? 0:04 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16500 1 0 22:26:53 ? 0:02 /opt/CSCOpx/bin/cwjava -cp:a lib/classpath/servlet.jar -Dvbroker.agent.port=423
casuser 16574 1 0 22:27:29 ? 0:05 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16445 1 0 22:26:45 ? 0:13 /opt/CSCOpx/bin/cwjava -cw:jre /opt/CSCOpx/lib/jre -Xms64m -Xmx512m -cp:a /opt/
casuser 16468 1 0 22:26:49 ? 0:26 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -cp:p MDC/tomcat
casuser 16470 1 0 22:26:49 ? 0:18 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx512m


yjdabear Wed, 05/02/2007 - 19:53

I'm told a whole slew of Solaris daemons were killed as a result of "./K90dmgtd kill", including crond and LPD/LPR.

Joe Clarke Wed, 05/02/2007 - 21:29
User Badges:
  • Cisco Employee
  • Hall of Fame, Founding Member

I've heard reports of this before, but an analysis of the code doesn't reveal any obvious errors unless there is a bug in ptree.


What would really help, if you can reproduce this again, is to back up K90dmgtd and then modify it: comment out line 220, and add this line above line 218:


echo "XXX: Executing /usr/bin/kill -\"$SIG\" $line > /dev/null 2>&1"


Add a line before the original 645:


/usr/bin/ptree $dmpid | /usr/bin/fgrep -v dbsrv > /opt/dm_procs


Then, when things fail, get /opt/dm_pids, /opt/dm_procs, and any output from the command that you can capture.


Note: dmgtd kill should never be used except under extreme circumstances. You should be using dmgtd stop instead. Of course, this should not happen regardless. In any event, the socket problems are mostly done away with in newer versions of Solaris. Solaris 8 will no longer be supported in LMS 3.0, so this should not be a problem moving forward.

Joe Clarke Wed, 05/02/2007 - 21:47

Something else just occurred to me. If you're running another process that has "dmgtd.sol" in its name, then the script will kill off that process tree as well. I think I can fix this.


Try changing lines 630 and 644 to:


dmpid=`/usr/bin/ps -ef | /usr/bin/fgrep "${NMSROOT}/objects/dmgt/${DMGRFILE}" | /usr/bin/fgrep -v fgrep | /usr/bin/awk '{ print $2 }'`


That should fix the problem. I'll file a bug on this.
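
The effect of matching on the full path rather than the bare file name can be seen on captured `ps -ef` text. This is a minimal sketch, not the shipped script: the sample process list, the stray `tail` process, and the function names are hypothetical, and the "loose" pattern is only assumed from the description above. It uses `grep -F` for portability where the thread uses `/usr/bin/fgrep`.

```shell
#!/bin/sh
# Hypothetical `ps -ef` excerpt: the real daemon manager, an unrelated process
# whose command line merely contains "dmgtd.sol", and the script's own fgrep.
SAMPLE_PS='casuser 16300     1  0 22:26:30 ?     0:02 /opt/CSCOpx/objects/dmgt/dmgtd.sol
root    20111     1  0 21:00:00 ?     0:00 /usr/local/bin/tail -f /tmp/dmgtd.sol.log
root    21202  6485  0 23:29:43 pts/8 0:00 fgrep dmgtd.sol'

NMSROOT=/opt/CSCOpx
DMGRFILE=dmgtd.sol

# Loose match (assumed pre-patch behaviour): any command line containing the
# file name is selected, so unrelated processes are picked up too.
loose_pids() {
    printf '%s\n' "$SAMPLE_PS" | grep -F "$DMGRFILE" | grep -F -v fgrep | awk '{ print $2 }'
}

# Anchored match, following the suggested change: require the full path.
anchored_pids() {
    printf '%s\n' "$SAMPLE_PS" | grep -F "${NMSROOT}/objects/dmgt/${DMGRFILE}" | grep -F -v fgrep | awk '{ print $2 }'
}

echo "loose:    $(loose_pids | tr '\n' ' ')"
echo "anchored: $(anchored_pids | tr '\n' ' ')"
```

With the loose pattern, the unrelated PID 20111 would be handed to ptree and its whole process tree signalled; the anchored pattern selects only 16300.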

Joe Clarke Wed, 05/02/2007 - 22:24

I filed CSCsi74887 to track this problem. I came up with a better patch that can be made available by calling the TAC.

yjdabear Thu, 05/03/2007 - 06:02

Thanks for your prompt response. Does the patch address the "dmgtd stop" not shutting down all procs issue or the K90dmgtd hazard or both?


Would it be feasible to manually kill off those cwjava procs in that situation, instead of risking it with K90dmgtd again? If so, in what order should those procs be terminated?

Joe Clarke Thu, 05/03/2007 - 07:33

My patch addresses the killing of non-CiscoWorks processes. I have not been able to reproduce the problem with dmgtd not shutting down all processes. The ptree command should take care of this. ptree is used for both dmgtd stop and dmgtd kill. It basically comes down to a question of signal (15 vs. 9). But I think the patch will definitely help you.


As for killing cwjava processes manually, there is no required order. Just make sure that the databases are stopped using dbstop.pl all, and that they are stopped last.
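
That sequence could be sketched as a dry run that only prints what it would do. The sample lines are adapted from the `ps -ef` output earlier in the thread; the function names are hypothetical, and the choice of signal 15 first is an assumption rather than something stated here.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands it would run; nothing is killed.
SAMPLE_PS='casuser 16447 1 0 22:26:45 ? 0:21 /opt/CSCOpx/bin/cwjava -cw /opt/CSCOpx -cw:jre lib/jre -server -Xms64m -Xmx192m
casuser 16471 1 0 22:26:49 ? 0:00 /opt/CSCOpx/lib/vbroker/bin/osagent -p 42342
root 21202 6485 0 23:29:43 pts/8 0:00 grep -i cscopx
casuser 16536 1 0 22:27:07 ? 0:04 /opt/CSCOpx/bin/cwjava -cp:a /opt/CSCOpx com.cisco.nm.cmf.eds.gcf.Main'

# Print PIDs of casuser-owned processes whose command is the cwjava binary.
# Field 8 is the command only when STIME is a single token, as it is here.
list_cwjava_pids() {
    awk '$1 == "casuser" && $8 == "/opt/CSCOpx/bin/cwjava" { print $2 }'
}

plan_cleanup() {
    for pid in $(printf '%s\n' "$SAMPLE_PS" | list_cwjava_pids); do
        echo "kill -15 $pid"
    done
    # Databases go last, per the advice above.
    echo "/opt/CSCOpx/bin/dbstop.pl all"
}

plan_cleanup
```

On a live system the sample text would be replaced by real `ps -ef` output, and the printed plan reviewed before anything is executed.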

yjdabear Fri, 05/04/2007 - 06:57

I received the patch from TAC. I'm wondering whether I need to make the changes you provided earlier to the patched dmgtd, or whether the patch already incorporates them. Will the patch yield /opt/dm_pids, /opt/dm_procs, etc. mentioned above?


A somewhat related question: It appears /etc/rc2.d/K90dmgtd and /etc/rc3.d/S10dmgtd are hard links to /etc/init.d/dmgtd, while /etc/rc2.d/K40ipm and /etc/rc3.d/S93ipm are distinct (but identical) files to /etc/init.d/ipmrc. Why the different approach?
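
One way to tell a hard link from an identical copy is to compare inode numbers. The sketch below works in a scratch directory with stand-in file names (hypothetical, chosen to mirror the ones above); the helper names are also hypothetical.

```shell
#!/bin/sh
# Build a scratch directory with one hard link and one copy.
dir=$(mktemp -d)
printf '#!/bin/sh\n' > "$dir/dmgtd"
ln "$dir/dmgtd" "$dir/K90dmgtd"   # hard link: same inode as dmgtd
cp "$dir/dmgtd" "$dir/K40ipm"     # copy: same content, different inode

# Print the inode number of a file.
inode_of() { ls -i "$1" | awk '{ print $1 }'; }

# True when both paths refer to the same inode (valid within one filesystem).
same_file() { [ "$(inode_of "$1")" = "$(inode_of "$2")" ]; }

if same_file "$dir/dmgtd" "$dir/K90dmgtd"; then link_kind="hard link"; else link_kind="separate file"; fi
if same_file "$dir/dmgtd" "$dir/K40ipm"; then copy_kind="hard link"; else copy_kind="separate file"; fi
if cmp -s "$dir/dmgtd" "$dir/K40ipm"; then copy_content="identical"; else copy_content="different"; fi

echo "K90dmgtd: $link_kind"
echo "K40ipm:   $copy_kind, content $copy_content"
rm -rf "$dir"
```

On the real system, `ls -li /etc/init.d/dmgtd /etc/rc2.d/K90dmgtd /etc/rc3.d/S10dmgtd` shows the same comparison directly, along with the link count.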

Correct Answer
Joe Clarke Fri, 05/04/2007 - 08:25

Since I've isolated (and reproduced) the most likely cause of the non-CiscoWorks process termination, you do not need to modify the dmgtd script further.


IPM evolved from CiscoWorks Blue, and many of the things that used to happen there were carried over. This is all going away in LMS 3.0 when IPM is put under dmgtd control.

yjdabear Tue, 05/08/2007 - 18:39

Good stuff. Two for two with dmgtd stop tonight without a hiccup.

yjdabear Wed, 05/16/2007 - 11:55

I know this is unsupported and all, but I copied everything for LMS 2.6 over from one box to another. Why am I getting the following when trying to bring up dmgtd?


-r-xr-x--- 1 casuser casusers 23183 May 16 15:30 dmgtd


# /etc/init.d/dmgtd start

/etc/init.d/dmgtd: test: argument expected


Is this a shell problem?


Joe Clarke Wed, 05/16/2007 - 11:57

Highly unsupported.


I'm betting you didn't copy over the /var/sadm stuff, and thus nothing will work. "sh -x /etc/init.d/dmgtd start" will shed more light on this.

yjdabear Wed, 05/16/2007 - 11:59

I did copy over the /var/sadm/pkg/CSCO* and /var/sadm/install/contents. Here's what "sh -x /etc/init.d/dmgtd start" returns:


sh -x /etc/init.d/dmgtd start

LC_ALL=C

+ export LC_ALL

+ . /opt/CSCOpx/etc/commonscript.sh

+ set +u

PATH=/usr/bin:/usr/sbin:/usr/dt/bin:/usr/openwin/bin:/bin:/usr/ucb:/etc:/usr/local/bin:/opt/OV/bin:/usr/sbin:/usr/bin:/usr/etc:/usr/ucb

+ export PATH

PX_TMP_INSTBASE=/var/tmp/.CSCOsave

FILE_VERSION=3.0

RANDOM_PORT_FILE=/tmp/cscotmp/random_ports

+ GetOS

+ Debug GetOS .

+ [ -z -o = 0 ]

+ return 0

+ uname

OS=SunOS

OS=SOL

+ tr [:upper:] [:lower:]

+ echo SOL

LC_OS=sol

+ export OS LC_OS

+ return 0

+ GetDebug

+ [ -f /tmp/cscotmp/CSCO.debug ]

+ [ SOL = AIX -o SOL = HPUX ]

DEBUG=0

+ export DEBUG

+ return 0

+ [ SOL = HPUX ]

+ [ SOL = HPUX ]

+ GetLibPath

+ Debug GetLibPath .

+ [ -z 0 -o 0 = 0 ]

+ return 0

ENV_LIBPATH=LD_LIBRARY_PATH

+ return 0

+ GetInitDir

+ Debug GetInitDir .

+ [ -z 0 -o 0 = 0 ]

+ return 0

INIT_DIR=/etc

+ return 0

+ GetBootScript

+ Debug GetBootScript .

+ [ -z 0 -o 0 = 0 ]

+ return 0

BOOT_SCRIPT=/etc/init.d/dmgtd

+ return 0

+ GetNMSRoot

+ Debug GetNMSRoot .

+ [ -z 0 -o 0 = 0 ]

+ return 0

+ [ -f /var/adm/CSCOpx/CSCO.nmsroot ]

+ cat /var/adm/CSCOpx/CSCO.nmsroot

NMSROOT=/product/CSCO/CSCOpx

IU_ROOT=/product/CSCO/CSCOpx

+ return 0

+ GetIMode

+ Debug GetIMode .

+ [ -z 0 -o 0 = 0 ]

+ return 0

+ [ ! -f /tmp/cscotmp/CSCO.i_mode ]

I_MODE=NEW

+ return 0

+ Get_UserGroup

+ [ -f /tmp/cscotmp/CSCO.owner ]

+ grep uninstall.sh+ echo

/etc/init.d/dmgtd

+ [ 1 = 1 ]

+ trap SigTrap 2

PATH=/usr/sbin:/usr/bin:/usr/etc:/usr/ucb

+ export PATH

+ CheckUID

+ sed -e s/(.*$// -e s/^.*=//

+ /bin/id

CUR_UID=0

+ [ 0 != 0 ]

+ unset CUR_UID

+ return 0

+ GetOS

+ uname

OS=SunOS

OS=SOL

EXT=sol

+ [ SOL = HPUX ]

PKGNAME=CSCOmd

DB=CSCOdb

LOGFILE=/var/adm/CSCOpx/log/daemons.log

SAVEFILE=/var/adm/CSCOpx/log/daemonsbackup.log

PX_LOGFILE=/var/adm/CSCOpx/log/dmgtd.log

+ [ SOL = AIX ]

CRTL_DIR=/etc/rc.config.d

CTRL_FILE=CiscoRMCtrl

DMGRD=dmgtd.

DMGR_OS=sol

DMGRFILE=dmgtd.sol

MAXLOGSIZE=30000

+ GetPkgParam CSCOmd NMSROOT

+ [ 2 -ne 2 ]

PKG_NAME=CSCOmd

PARAM_NAME=NMSROOT

ANS_FILE=/product/CSCO/CSCOpx/etc/.CSCOmd.ans

+ [ SOL = HPUX -o SOL = AIX ]

+ CheckPkgInstalled CSCOmd

+ [ 1 -ne 1 ]

PKG_NAME=CSCOmd

+ pkginfo -q CSCOmd

+ return 0

+ [ 0 != 0 ]

+ [ SOL = SOL ]

+ [ x = xCSCOmd ]

+ pkgparam CSCOmd NMSROOT

L_RET_VAL=0

+ [ 0 != 0 ]

+ unset ANS_FILE PKG_NAME PARAM_NAME

+ return 0

NMSROOT=/opt/CSCOpx

+ : /opt/CSCOpx

+ GetPkgParam CSCOmd PX_USER

+ [ 2 -ne 2 ]

PKG_NAME=CSCOmd

PARAM_NAME=PX_USER

ANS_FILE=/opt/CSCOpx/etc/.CSCOmd.ans

+ [ SOL = HPUX -o SOL = AIX ]

+ CheckPkgInstalled CSCOmd

+ [ 1 -ne 1 ]

PKG_NAME=CSCOmd

+ pkginfo -q CSCOmd

+ return 0

+ [ 0 != 0 ]

+ [ SOL = SOL ]

+ [ x = xCSCOmd ]

+ pkgparam CSCOmd PX_USER

L_RET_VAL=0

+ [ 0 != 0 ]

+ unset ANS_FILE PKG_NAME PARAM_NAME

+ return 0

PX_USER=casuser

+ : casuser

+ GetPkgParam CSCOmd PX_GROUP

+ [ 2 -ne 2 ]

PKG_NAME=CSCOmd

PARAM_NAME=PX_GROUP

ANS_FILE=/opt/CSCOpx/etc/.CSCOmd.ans

+ [ SOL = HPUX -o SOL = AIX ]

+ CheckPkgInstalled CSCOmd

+ [ 1 -ne 1 ]

PKG_NAME=CSCOmd

+ pkginfo -q CSCOmd

+ return 0

+ [ 0 != 0 ]

+ [ SOL = SOL ]

+ [ x = xCSCOmd ]

+ pkgparam CSCOmd PX_GROUP

L_RET_VAL=0

+ [ 0 != 0 ]

+ unset ANS_FILE PKG_NAME PARAM_NAME

+ return 0

PX_GROUP=casusers

+ : casusers

CUR_SYS_TZ=US/Eastern

+ [ -f /etc/rc.config.d/CiscoRMCtrl ]

+ . /etc/rc.config.d/CiscoRMCtrl

PX_START_UP=1

PX_RETRY_COUNTER=3

PX_RESET_TIME=300

PX_CORE_SIZE=10

+ export PX_START_UP PX_RETRY_COUNTER PX_RESET_TIME PX_CORE_SIZE

PX_STOP_COUNTER=3

PX_START_DELAY=60000

PX_STOP_DELAY=1000

+ export PX_START_DELAY PX_STOP_COUNTER PX_STOP_DELAY

PX_SETPERM=casuser

+ export PX_SETPERM

PX_OPENFILES=256

+ export PX_OPENFILES

+ [ -eq 1 ]

/etc/init.d/dmgtd: test: argument expected


Correct Answer
Joe Clarke Wed, 05/16/2007 - 12:05

Looks like you're missing two lines from the end of /etc/rc.config.d/CiscoRMCtrl:


PX_SYS_CHECK=1

export PX_SYS_CHECK
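
The trace ends at `[ -eq 1 ]`, which is what an unquoted, unset variable produces inside a test. A minimal sketch of that failure mode follows; it assumes the script tests PX_SYS_CHECK unquoted (inferred from the trace, since the script itself is not shown), and the `check_enabled` helper with its default of 0 is a hypothetical hardening, not something the shipped script is known to do.

```shell
#!/bin/sh
# With the variable unset, the unquoted test expands to `[ -eq 1 ]`.
unset PX_SYS_CHECK

# Run in a subshell with stderr silenced so the error does not abort or
# clutter this script; a non-zero status greater than 1 signals a test error.
if ( [ $PX_SYS_CHECK -eq 1 ] ) 2>/dev/null; then unquoted_status=0; else unquoted_status=$?; fi
echo "unquoted, unset: exit status $unquoted_status"

# Hypothetical defensive form: quote the expansion and default to 0.
check_enabled() {
    [ "${PX_SYS_CHECK:-0}" -eq 1 ]
}

if check_enabled; then defaulted_status=0; else defaulted_status=$?; fi
echo "defaulted, unset: exit status $defaulted_status"

# The fix from this thread: define and export the variable.
PX_SYS_CHECK=1
export PX_SYS_CHECK

if check_enabled; then fixed_status=0; else fixed_status=$?; fi
echo "after setting PX_SYS_CHECK=1: exit status $fixed_status"
```

The exact error text differs between shells ("test: argument expected" on the Solaris /bin/sh in the trace), but the cause is the same missing operand.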

yjdabear Wed, 05/16/2007 - 12:11

Thanks! I wasn't aware of /etc/rc.config.d/CiscoRMCtrl until now.

Nick Egloff Wed, 05/16/2007 - 13:09

Hi - You mention LMS 3.0. Is there even a rough schedule for its release (so I can budget for it)?


Thanks!

Joe Clarke Wed, 05/16/2007 - 13:10

It will begin shipping in early June of this year.
