12-27-2007 08:07 AM
Some of the Cat6k (CatOS 8.4) switches are out of VTY lines due to unterminated SSH sessions from RME 4.0.5. All these SSH sessions are in ESTABLISHED state.
1. Would it be safe to terminate the cwjava processes by pid? I assume these are results of once-a-day Config Archive jobs.
2. Is this SSH issue a CatOS or RME bug? A search in Bug Tool came up empty for any CatOS bug that's close.
04-10-2008 09:21 AM
It would help to see a sniffer trace of a socket that stays open to get an idea of what type of tcp/22 socket it is. It will help narrow down the code path.
04-10-2008 09:23 AM
I haven't touched any of the 12 sessions to this switch. Would a snoop (Solaris) capture be equivalent?
I was told the switch thinks it's got those 12 SSH sessions too.
04-10-2008 09:29 AM
I'm not seeing active traffic to/from the switch filtering on port 22 via snoop.
All sessions are still ESTABLISHED according to "netstat -a".
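To see at a glance how many SSH sockets are held open per switch, the netstat output can be summarized. This is only a minimal sketch, not anything shipped with LMS: the function name `count_ssh_established` is made up, and it assumes Solaris-style `netstat -an` output (numeric addresses with the port appended after a final dot, e.g. `10.1.1.1.22`, and the TCP state in the last column).

```shell
# Hypothetical helper: reads `netstat -an` output on stdin and prints
# "<remote-address> <count>" for every remote port-22 socket in ESTABLISHED state.
# Assumes the Solaris address format, where the port follows the last dot.
count_ssh_established() {
    awk '$NF == "ESTABLISHED" && $2 ~ /\.22$/ {
             host = $2
             sub(/\.22$/, "", host)   # strip the trailing ".22" port suffix
             n[host]++
         }
         END { for (h in n) print h, n[h] }' | sort
}
```

It would be run as `netstat -an | count_ssh_established`. A count that stays the same between Config Archive runs would suggest the sockets are idle leftovers rather than sessions in use.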
04-10-2008 09:37 AM
You will need to clear out at least one VTY and start a snoop trace. If you know what steps you can take to reproduce the lock-up, then it shouldn't take long to get a useful trace. If not, however, you will have to leave snoop running for a while until the VTYs get occupied. For this, I would run:
snoop -o outfile -s 1518 host x.x.x.x and tcp port 22
04-10-2008 09:53 AM
Do you mean clearing a VTY line from the router or LMS side? If I try to end one of the non-lowest-pid sessions from LMS, would it close all 12?
Would the above be a moot point since I don't know how to artificially make this logjam recur? Judging by the two incidents so far, it happens "naturally" ~4 months apart, so I'm skeptical it's feasible to leave snoop running that long. Also, does the "set logout 0" switch config have any bearing on the issue? We're going to address this misconfiguration on these few out-of-compliance devices to prevent their VTY lines from being hogged, if not by LMS then certainly by forgetful human operators.
Is 1518 another port used by RME (ConfigMgmt) to connect to the switches?
04-10-2008 11:25 AM
If you restart ConfigMgmtServer, all open sockets should be closed.
Every four months a new socket is allocated? That's impossible. You would still be on LMS 2.2, and would not have restarted ever.
If you mean that all the VTY lines get occupied after about four months, then you don't need to wait that long. All I need is a sniffer trace showing ONE socket that doesn't close to be able to identify possible code paths.
While set logout 0 isn't helping this issue, RME should be closing all the sockets it opens.
The -s 1518 argument to snoop specifies a per-frame slice size of 1518 bytes. I want to make sure the whole frame is captured and written to the file.
04-11-2008 09:40 AM
I have one of the 12 connections terminated from the switch end. Now I have snoop waiting for the next activity. From what I have seen overnight, the only SSH activity occurs in the early morning when RME does the ArchivePoller job scheduled at 6:00am.
04-11-2008 10:31 AM
I needed to stop dmgtd for an unrelated reason. This triggered SSH from LMS to the switch. However, dmgtd seems to be hung. I suspect this might have something to do with the SSH sessions to the switch too.
-rw-r----- 1 casuser casusers 22482 Apr 11 14:22 ess.log
-rw-rw---- 1 casuser casusers 6423480 Apr 11 14:23 SyslogAnalyzerUI.log
-rw-rw---- 1 casuser casusers 9444607 Apr 11 14:28 dmgtd.log
-rw-rw---- 1 casuser casusers 1803527 Apr 11 14:28 daemons.log
/var/adm/CSCOpx/log)->tail -f dmgtd.log
Apr 11 14:22:37 nms dmgt[4451]: [ID 306211 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application EDS-GCF due to its dependence on stopped applications.
Apr 11 14:22:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1186800K
Apr 11 14:22:41 nms dmgt[4451]: [ID 410731 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application CmfDbMonitor due to its dependence on stopped applications.
Apr 11 14:22:45 nms dmgt[4451]: [ID 378359 local1.info] #3017:TYPE=INFO:Application (CmfDbEngine, pid=434640) stopped by request.
Apr 11 14:22:49 nms dmgt[4451]: [ID 337540 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application RMEDbMonitor due to its dependence on stopped applications.
Apr 11 14:23:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1064224K
Apr 11 14:24:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1032688K
Apr 11 14:25:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1027024K
Apr 11 14:26:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024680K
Apr 11 14:28:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024560K
/var/adm/CSCOpx/log)->tail -f daemons.log
Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364
Stopping database engine cmfEng
Adaptive Server Anywhere Stop Engine Utility Version 9.0.0.1364
Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364
Stopping database engine rmengEng
Cache size adjusted to 1064224K
Cache size adjusted to 1032688K
Cache size adjusted to 1027024K
Cache size adjusted to 1024680K
Cache size adjusted to 1024560K
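The #2115 warnings in dmgtd.log above are easier to scan if only the application names are pulled out. The following is a rough sketch that assumes the warning wording stays exactly as shown in the log lines above; `blocked_apps` is a made-up name, not a CiscoWorks utility.

```shell
# Hypothetical helper: reads dmgtd.log lines on stdin and prints, once each and
# in order of first appearance, the applications the daemon manager could not
# start because they depend on stopped applications.
blocked_apps() {
    sed -n 's/.*cannot complete the startup of an application \([^ ]*\) due to.*/\1/p' |
        awk '!seen[$0]++'   # drop repeats, keep original order
}
```

It would be run as `blocked_apps < dmgtd.log`. For the excerpt above it should list EDS-GCF, CmfDbMonitor and RMEDbMonitor.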
04-11-2008 10:34 AM
dmgtd eventually returned:
/etc/init.d/dmgtd stop
Daemon Management stopping. This may take a few minutes.
WARNING: Daemon Manager terminated with SIGKILL.
INFO : Stopping DBEngine processes registered to Daemon Manager
WARNING: Please check if all processes have been terminated using
WARNING: the command - "ps -ef|grep CSCOpx" and
WARNING: terminate them if any processes are running.
INFO : Stopping DBEngine processes registered to Daemon Manager
It's leaving behind a bunch of orphans:
ps -ef |grep -i csco
root 14635 14624 0 14:22:51 ? 0:00 /opt/CSCOpx/objects/db/bin/dbstop @/opt/CSCOpx/bin/.temp114624.txt
casuser 4531 1 0 Mar 17 ? 765:01 /opt/CSCOpx/objects/db/bin/dbsrv9 -x tcpip{HOST=localhost;DOBROADCAST=NO;Server
casuser 16309 24198 0 14:35:38 pts/5 0:00 grep -i csco
root 14624 14623 0 14:22:50 ? 0:00 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine
root 14623 1 0 14:22:50 ? 0:00 sh -c /opt/CSCOpx/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine
nobody 8326 25944 0 12:56:49 pts/2
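Since the dmgtd warning asks for a check of leftover processes with "ps -ef|grep CSCOpx", the PIDs can be listed without the grep process itself showing up. This is a small sketch only: `list_cscopx_pids` is a made-up name, it relies on the PID being the second column of `ps -ef` output, and it only lists PIDs without terminating anything.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PID
# (second column) of every line that mentions CSCOpx, skipping grep itself.
list_cscopx_pids() {
    awk '/CSCOpx/ && !/grep/ { print $2 }'
}
```

It would be run as `ps -ef | list_cscopx_pids`. Each PID can then be checked individually before deciding what to do with it.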
04-11-2008 11:44 AM
This could be a database problem. You should probably check the consistency of the rmeng database:
NMSROOT/objects/db/conf/configureDb.pl action=validate dsn=rmeng
04-16-2008 06:03 AM
Could you take a look at TAC case 608418091? I've attached a snoop capture taken over several days, during which three SSH sessions to the same CatOS switch have remained active.
04-11-2008 09:46 AM
1. No, do not terminate the process by PID. Use pdterm/pdexec to restart CiscoWorks daemons.
2. This will be a CiscoWorks bug assuming we are not actually closing the socket.
04-11-2008 10:01 AM
I think you just re-addressed my questions from December 2007.