Solved: Re: SSH issue: CatOS or RME bug?

yjdabear · ‎12-27-2007

Some of the Cat6k (CatOS 8.4) switches are out of VTY lines due to unterminated SSH sessions from RME 4.0.5. All these SSH sessions are in ESTABLISHED state.

1. Would it be safe to terminate the cwjava processes by pid? I assume these are results of once-a-day Config Archive jobs.

2. Is this SSH issue a CatOS or RME bug? A search in Bug Tool came up empty for any CatOS bug that's close.

Joe Clarke · ‎04-10-2008

It would help to see a sniffer trace of a socket that stays open to get an idea of what type of tcp/22 socket it is. It will help narrow down the code path.

yjdabear · ‎04-10-2008

I have left all 12 of them untouched on this switch. Would a snoop (Solaris) capture be equivalent?

I was told the switch thinks it's got those 12 SSH sessions too.

yjdabear · ‎04-10-2008

I'm not seeing active traffic to/from the switch filtering on port 22 via snoop.

All sessions are still ESTABLISHED according to "netstat -a".
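For anyone wanting to tally these sessions rather than eyeball them, here is a minimal Python sketch that counts ESTABLISHED connections to remote port 22 per host. It assumes numeric output (as from "netstat -an") in the Solaris layout, where the port is the last dot-separated field of each address. The sample rows and addresses are made up for illustration and are not from the server in this thread.

```python
from collections import Counter

# Illustrative sample only: Solaris-style "netstat -an" TCP rows.
# Columns: local address, remote address, Swind, Send-Q, Rwind, Recv-Q, state.
SAMPLE = """\
192.0.2.10.40112   198.51.100.7.22   49640  0 49640  0 ESTABLISHED
192.0.2.10.40113   198.51.100.7.22   49640  0 49640  0 ESTABLISHED
192.0.2.10.40200   198.51.100.9.22   49640  0 49640  0 TIME_WAIT
192.0.2.10.1741    192.0.2.55.51234  49640  0 49640  0 ESTABLISHED
"""

def established_ssh_by_host(netstat_text, port="22"):
    """Count ESTABLISHED connections to the given remote port, per remote host."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and sockets in any other state.
        if len(fields) < 3 or fields[-1] != "ESTABLISHED":
            continue
        # The remote address is "a.b.c.d.port"; split off the trailing port.
        host, _, remote_port = fields[1].rpartition(".")
        if remote_port == port:
            counts[host] += 1
    return counts

print(established_ssh_by_host(SAMPLE))
```

A count that only grows from one day to the next for the same switch would point to sessions that are never torn down.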

Joe Clarke · ‎04-10-2008

You will need to clear out at least one VTY and start a snoop trace. If you know what steps you can take to reproduce the lock-up, then it shouldn't take long to get a useful trace. If not, however, you will have to leave snoop running for a while until the VTYs get occupied. For this, I would run:

snoop -o outfile -s 1518 host x.x.x.x and tcp port 22

yjdabear · ‎04-10-2008

Do you mean clearing a VTY line from the switch or LMS side? If I try to end one of the non-lowest-pid sessions from LMS, would it close all 12?

Would the above be a moot point, since I don't know how to artificially make this logjam recur? Judging by the two incidents so far, it happens "naturally" about four months apart, so I'm skeptical it's feasible to leave snoop running that long. Also, does the "set logout 0" switch config have any bearing on the issue? We're going to fix this misconfiguration on the few out-of-compliance devices anyway, to keep their VTY lines from being hogged, if not by LMS then certainly by forgetful human operators.

Is 1518 another port used by RME (ConfigMgmt) to connect to the switches?

Joe Clarke · ‎04-10-2008

If you restart ConfigMgmtServer, all open sockets should be closed.

Every four months a new socket is allocated? That's impossible. You would still be on LMS 2.2, and would not have restarted ever.

If you mean that all the VTY lines get occupied after about four months, then you don't need to wait that long. All I need is a sniffer trace showing ONE socket that doesn't close to be able to identify possible code paths.
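To illustrate what "one socket that doesn't close" looks like in a trace, here is a small Python sketch. It assumes the capture has already been reduced to (client port, TCP flags) records. That record format and the sample data are hypothetical, and reading the snoop file itself is out of scope here.

```python
def unclosed_connections(packets):
    """Return client ports that saw a SYN but never a FIN or RST."""
    opened, closed = set(), set()
    for client_port, flags in packets:
        if "SYN" in flags:
            opened.add(client_port)
        if "FIN" in flags or "RST" in flags:
            closed.add(client_port)
    return sorted(opened - closed)

# Hypothetical trace: port 40112 is torn down with a FIN, port 40113 never is.
TRACE = [
    (40112, {"SYN"}), (40112, {"SYN", "ACK"}), (40112, {"FIN", "ACK"}),
    (40113, {"SYN"}), (40113, {"SYN", "ACK"}), (40113, {"PSH", "ACK"}),
]

print(unclosed_connections(TRACE))
```

The last packets exchanged on such a connection before it goes quiet are the interesting part, since they show what the session was doing when it was abandoned.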

While set logout 0 isn't helping this issue, RME should be closing all the sockets it opens.

The -s 1518 argument to snoop specifies a per-frame slice size of 1518 bytes. I want to make sure the whole frame is captured and written to the file.
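As a quick check on where 1518 comes from: it matches the maximum size of a standard, untagged Ethernet frame, so it is a byte count and not a port number. The arithmetic below assumes no 802.1Q tag (which would add 4 bytes) and no jumbo frames.

```python
ETH_HEADER = 14     # destination MAC (6) + source MAC (6) + EtherType (2)
MAX_PAYLOAD = 1500  # standard Ethernet MTU
FCS = 4             # frame check sequence trailer

# Largest untagged Ethernet frame, header through FCS.
max_frame = ETH_HEADER + MAX_PAYLOAD + FCS
print(max_frame)
```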

yjdabear · ‎04-11-2008

I have terminated one of the 12 connections from the switch end. Now I have snoop waiting for the next activity. From what I have seen overnight, the only SSH activity occurs in the early morning, when RME runs the ArchivePoller job scheduled at 6:00am.

yjdabear · ‎04-11-2008

I needed to stop dmgtd for an unrelated reason. This triggered SSH from LMS to the switch. However, dmgtd seems to be hung. I suspect this might have something to do with the SSH sessions to the switch too.

-rw-r----- 1 casuser casusers 22482 Apr 11 14:22 ess.log

-rw-rw---- 1 casuser casusers 6423480 Apr 11 14:23 SyslogAnalyzerUI.log

-rw-rw---- 1 casuser casusers 9444607 Apr 11 14:28 dmgtd.log

-rw-rw---- 1 casuser casusers 1803527 Apr 11 14:28 daemons.log

/var/adm/CSCOpx/log)->tail -f dmgtd.log

Apr 11 14:22:37 nms dmgt[4451]: [ID 306211 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application EDS-GCF due to its dependence on stopped applications.

Apr 11 14:22:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1186800K

Apr 11 14:22:41 nms dmgt[4451]: [ID 410731 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application CmfDbMonitor due to its dependence on stopped applications.

Apr 11 14:22:45 nms dmgt[4451]: [ID 378359 local1.info] #3017:TYPE=INFO:Application (CmfDbEngine, pid=434640) stopped by request.

Apr 11 14:22:49 nms dmgt[4451]: [ID 337540 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application RMEDbMonitor due to its dependence on stopped applications.

Apr 11 14:23:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1064224K

Apr 11 14:24:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1032688K

Apr 11 14:25:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1027024K

Apr 11 14:26:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024680K

Apr 11 14:28:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024560K

/var/adm/CSCOpx/log)->tail -f daemons.log

Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364

Stopping database engine cmfEng

Adaptive Server Anywhere Stop Engine Utility Version 9.0.0.1364

Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364

Stopping database engine rmengEng

Cache size adjusted to 1064224K

Cache size adjusted to 1032688K

Cache size adjusted to 1027024K

Cache size adjusted to 1024680K

Cache size adjusted to 1024560K

yjdabear · ‎04-11-2008

dmgtd eventually returned:

/etc/init.d/dmgtd stop

Daemon Management stopping. This may take a few minutes.

WARNING: Daemon Manager terminated with SIGKILL.

INFO : Stopping DBEngine processes registered to Daemon Manager

WARNING: Please check if all processes have been terminated using

WARNING: the command - "ps -ef|grep CSCOpx" and

WARNING: terminate them if any processes are running.

INFO : Stopping DBEngine processes registered to Daemon Manager

It's leaving behind a bunch of orphans:

ps -ef |grep -i csco

root 14635 14624 0 14:22:51 ? 0:00 /opt/CSCOpx/objects/db/bin/dbstop @/opt/CSCOpx/bin/.temp114624.txt

casuser 4531 1 0 Mar 17 ? 765:01 /opt/CSCOpx/objects/db/bin/dbsrv9 -x tcpip{HOST=localhost;DOBROADCAST=NO;Server

casuser 16309 24198 0 14:35:38 pts/5 0:00 grep -i csco

root 14624 14623 0 14:22:50 ? 0:00 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine

root 14623 1 0 14:22:50 ? 0:00 sh -c /opt/CSCOpx/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine

nobody 8326 25944 0 12:56:49 pts/2
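A first-pass way to pick out candidates from a listing like this is to look for processes whose parent PID is 1, meaning they now belong to init. Here is a minimal Python sketch that uses the complete lines from the output above as sample data. A PPID of 1 does not prove a process is an orphan, since daemons can legitimately be children of init, so treat the result as a list to inspect by hand and not as a kill list.

```python
# Complete lines copied from the "ps -ef | grep -i csco" output above.
PS_OUTPUT = """\
root 14635 14624 0 14:22:51 ? 0:00 /opt/CSCOpx/objects/db/bin/dbstop @/opt/CSCOpx/bin/.temp114624.txt
casuser 4531 1 0 Mar 17 ? 765:01 /opt/CSCOpx/objects/db/bin/dbsrv9 -x tcpip{HOST=localhost;DOBROADCAST=NO;Server
root 14624 14623 0 14:22:50 ? 0:00 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine
root 14623 1 0 14:22:50 ? 0:00 sh -c /opt/CSCOpx/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine
"""

def reparented_to_init(ps_text):
    """Return PIDs whose parent PID (third column of ps -ef) is 1."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Columns: UID, PID, PPID, ... ; only the first three are needed here.
        if len(fields) >= 3 and fields[1].isdigit() and fields[2] == "1":
            pids.append(int(fields[1]))
    return pids

print(reparented_to_init(PS_OUTPUT))
```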

Joe Clarke · ‎04-11-2008

This could be a database problem. You should probably check the consistency of the rmeng database:

NMSROOT/objects/db/conf/configureDb.pl action=validate dsn=rmeng

yjdabear · ‎04-16-2008

Could you take a look at TAC case 608418091? I've attached a snoop capture taken over several days, during which three SSH sessions to the same CatOS switch were left active.

Joe Clarke · ‎04-11-2008

1. No, do not terminate the process by PID. Use pdterm/pdexec to restart CiscoWorks daemons.

2. This will be a CiscoWorks bug assuming we are not actually closing the socket.

yjdabear · ‎04-11-2008

I think you just re-addressed my questions from Dec 2007.