SSH issue: CatOS or RME bug?

Answered Question
Dec 27th, 2007

Some of the Cat6k (CatOS 8.4) switches are out of VTY lines due to unterminated SSH sessions from RME 4.0.5. All these SSH sessions are in ESTABLISHED state.

1. Would it be safe to terminate the cwjava processes by pid? I assume these are results of once-a-day Config Archive jobs.

2. Is this SSH issue a CatOS or RME bug? A search in Bug Tool came up empty for any CatOS bug that's close.

I have this problem too.
0 votes
Correct Answer by Joe Clarke about 8 years 11 months ago

Restarting ConfigMgmtServer should clear these up. The problem is documented in CSCse20477, and a script is available from TAC that will disable syslog-triggered config archive for these devices.

  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 4.7 (3 ratings)
Loading.
Correct Answer
Joe Clarke Thu, 12/27/2007 - 09:59

Restarting ConfigMgmtServer should clear these up. The problem is documented in CSCse20477, and a script is available from TAC that will disable syslog-triggered config archive for these devices.

yjdabear Thu, 12/27/2007 - 10:10

The bug description goes: "This only happens when using TFTP in transport protocol which uses snmp read-write when trying to

pull the config." and "* Do not use TFTP as a configuration fetch protocol. Instead use telnet, SSH, SCP, or RCP". In my case, it's the SSH sessions that're backed up. My Transports Settings for ArchiveMgmt is: SSH, SCP, Telnet, and TFTP, in that order. I'm not sure this is the bug.

More findings: There were about a dozen SSH sessions to each of those two switches (sw1 and sw2). After doing a "kill " against the oldest (by pid) of these to sw1, all the other SSH sessions to both switches terminated as well.

Joe Clarke Thu, 12/27/2007 - 10:16

Yes, the problem is trigger by TFTP. Typically, if a CatOS device's lines are being tied up, this is the bug. What was the actual process (process name) you killed? If it was a cwjava process, it would have helped to know which dmgtd daemon it was to figure out why the lines were hung.

yjdabear Thu, 12/27/2007 - 10:21

I didn't do a thorough capture of pid 27601, which I just realized was the same pid implicated by every active socket connection. Hopefully, what's left on my screen can still shed some light:

netstat -f inet | grep sw2

nms.fqdn.com.59446 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.36623 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.57115 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.53398 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.38539 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.65329 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54673 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.38238 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.49460 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.44512 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.48385 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.39146 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

netstat -f inet | grep -i sw1

nms.fqdn.com.65023 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.43251 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.47566 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.50480 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54475 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.61729 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.36249 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.52040 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54341 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.61546 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

lsof -i :44512

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 27601 casuser 256u IPv4 0x30020f413b0 0t463912 TCP nms.fqdn.com:44512->sw2.fqdn.com.com:ssh (ESTABLISHED)

lsof -i :36249

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 27601 casuser 197u IPv4 0x3001a280dc8 0t432792 TCP nms.fqdn.com:36249->sw1.fqdn.com:ssh (ESTABLISHED)

kill 27601

Joe Clarke Thu, 12/27/2007 - 10:28

No, this does not help identify the CiscoWorks daemon. You would have needed to use pdshow to translate the PID to a daemon name. Then, a pdterm on that daemon would have been much better than simply killing it.

yjdabear Thu, 12/27/2007 - 10:34

Thanks, I'll keep that in mind. In the meantime, are RMEs beyond v4.0.5 still affected by CSCse20477? I cannot disable TFTP due to a few straggler devices that do not have SSH support. I've also found the switches involved all have "set logging server severity 5" so RME activities couldn't be triggered by %SYS-6-CFG_CHG.

Joe Clarke Thu, 12/27/2007 - 10:35

All versions of RME from 3.5 up are affected. There will be a fix going in to RME 4.2 to alleviate this.

yjdabear Thu, 04/10/2008 - 05:44

This problem cropped up again, so far affecting one of the same two devices across the continent:

nms.domain.tld.50366 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.50967 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.51132 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.53207 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.54888 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.54964 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.56634 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.56796 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.57613 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.58926 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.61398 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.65402 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

lsof -i:50366

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 4784 casuser 44u IPv4 0x300100ab3f0 0t462140 TCP nms.domain.tld:50366->catos-switch.domain.tld:ssh (ESTABLISHED)

pdshow

Process= ConfigMgmtServer

State = Program started - No mgt msgs received

Pid = 4784

RC = 0

Signo = 0

Start = 03/17/08 09:06:04

Stop = Not applicable

Core = Not applicable

Info = Application started by administrator request.

Questions:

1) While it's ConfigMgmtServer as described in the bug, does the same bug still apply when the switches are configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification?

2) Since it appears to affect the same device(s), does that mean the device OS is also buggy? Is there anything that can be done on the device side to prevent this (other than bumping up the number of VTY lines)?

yjdabear Thu, 04/10/2008 - 07:06

The switches impacted all have "set logout 0" configured. This raises the question of how LMS (specifically ConfigMgmtServer) terminates the SSH sessions to the devices. Does it send an explicit "exit"? It seems it doesn't.

Joe Clarke Thu, 04/10/2008 - 08:27

Not all of the SSH sessions are true SSH sessions. RME sends a SYN to port 22 to test for SSH connectivity, and to determine the supported version of SSH. This socket (as with all sockets) is explicitly closed in Java. I made it my mission to hunt down socket leaks in LMS, and all known socket leaks should be fixed in RME 4.0.6 and 4.1.1.

Joe Clarke Thu, 04/10/2008 - 08:24

1. If the switches are sending these messages to RME, and RME has not been configured to ignore them, and RME is using TFTP to fetch the config, then RME will loop trying to fetch the config, and you will see this problem with VTYs being eaten up.

2. The problem only started occurring on later versions of CatOS. Earlier versions did not send the %SYS-6-CFG_CHG message when the tftpGrp was modified via SNMP. This really can't be considered a device bug.

yjdabear Thu, 04/10/2008 - 09:13

My bad. I left out the "not" as in "the switches are *not* configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification? "

Joe Clarke Thu, 04/10/2008 - 09:21

It would help to see a sniffer trace of a socket that stays open to get an idea of what type of tcp/22 socket it is. It will help narrow down the code path.

yjdabear Thu, 04/10/2008 - 09:23

I haven't touched all 12 of them with this switch. Would a snoop (Solaris) capture be equivalent?

I was told the switch thinks it's got those 12 SSH sessions too.

yjdabear Thu, 04/10/2008 - 09:29

I'm not seeing active traffic to/from the switch filtering on port 22 via snoop.

All sessions are still ESTABLISHED according to "netstat -a".

Joe Clarke Thu, 04/10/2008 - 09:37

You will need to clear out at least one VTY, and start a snoop trace. If you know what steps you can take to reproduce the lock-up, then it shouldn't take long to get a useful trace. If not, however, you will have to leave snoop running for a while until the VTY get occupied. For this, I would run:

snoop -o outfile -s 1518 host x.x.x.x and tcp port 22

yjdabear Thu, 04/10/2008 - 09:53

Do you mean clearing a VTY line from the router or LMS side? If I try to end one of the non-lowest-pid sessions from LMS, would it close all 12?

Would the above be a moot point since I don't know how to artificially make this logjam recur? Judging by the two incidents so far, it happens "naturally" ~4 months apart, so I'm skeptical it's feasible to leave snoop running that long. Also, does the "set logout 0" switch config have any bearing on the issue? Because we're going to address this misconfiguration on these few out-of-compliance devices to prevent their VTY lines from being hogged, if not by LMS but certainly by forgetful human operators.

Is 1518 another port used by RME (ConfigMgmt) to connect to the switches?

Joe Clarke Thu, 04/10/2008 - 11:25

If you restart ConfigMmtServer, all open sockets should be closed.

Every four months a new socket is allocated? That's impossible. You would still be on LMS 2.2, and would not have restarted ever.

If you mean that all the VTY lines get occupied after about four months, then you don't need to wait that long. All I need is a sniffer trace showing ONE socket that doesn't close to be able to identify possible code paths.

While set logout 0 isn't helping this issue, RME should be closing all the sockets it opens.

The -s 1518 argument to snoop specifies a per-frame slice size of 1518 bytes. I want to make sure the whole frame is captured and written to the file.

yjdabear Fri, 04/11/2008 - 09:40

I have one of the 12 connections terminated from the switch end. Now I have snoop waiting for the next activity. From I have seen overnight, the only SSH activity occurs in the early morning when RME does the ArchivePoller job scheduled at 6:00am.

yjdabear Fri, 04/11/2008 - 10:31

I needed to stopped dmgtd for an unrelated reason. This triggered SSH from LMS to the switch. However, dmgtd seems to be hung. I suspect this might have something to do with the SSH sessions to the switch too.

-rw-r----- 1 casuser casusers 22482 Apr 11 14:22 ess.log

-rw-rw---- 1 casuser casusers 6423480 Apr 11 14:23 SyslogAnalyzerUI.log

-rw-rw---- 1 casuser casusers 9444607 Apr 11 14:28 dmgtd.log

-rw-rw---- 1 casuser casusers 1803527 Apr 11 14:28 daemons.log

/var/adm/CSCOpx/log)->tail -f dmgtd.log

Apr 11 14:22:37 nms dmgt[4451]: [ID 306211 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application EDS-GCF due to its dependence on stopped applications.

Apr 11 14:22:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1186800K

Apr 11 14:22:41 nms dmgt[4451]: [ID 410731 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application CmfDbMonitor due to its dependence on stopped applications.

Apr 11 14:22:45 nms dmgt[4451]: [ID 378359 local1.info] #3017:TYPE=INFO:Application (CmfDbEngine, pid=434640) stopped by request.

Apr 11 14:22:49 nms dmgt[4451]: [ID 337540 local1.warning] #2115:TYPE=WARNING:The daemon manager cannot complete the startup of an application RMEDbMonitor due to its dependence on stopped applications.

Apr 11 14:23:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1064224K

Apr 11 14:24:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1032688K

Apr 11 14:25:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1027024K

Apr 11 14:26:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024680K

Apr 11 14:28:41 nms SQLAnywhere(rmengeng): [ID 702911 local1.notice] Cache size adjusted to 1024560K

/var/adm/CSCOpx/log)->tail -f daemons.log

Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364

Stopping database engine cmfEng

Adaptive Server Anywhere Stop Engine Utility Version 9.0.0.1364

Adaptive Server Anywhere Command File Hiding Utility Version 9.0.0.1364

Stopping database engine rmengEng

Cache size adjusted to 1064224K

Cache size adjusted to 1032688K

Cache size adjusted to 1027024K

Cache size adjusted to 1024680K

Cache size adjusted to 1024560K

yjdabear Fri, 04/11/2008 - 10:34

dmgtd eventually returned:

/etc/init.d/dmgtd stop

Daemon Management stopping. This may take a few minutes.

WARNING: Daemon Manager terminated with SIGKILL.

INFO : Stopping DBEngine processes registered to Daemon Manager

WARNING: Please check if all processes have been terminated using

WARNING: the command - "ps -ef|grep CSCOpx" and

WARNING: terminate them if any processes are running.

INFO : Stopping DBEngine processes registered to Daemon Manager

It's leaving behind a bunch of orphans:

ps -ef |grep -i csco

root 14635 14624 0 14:22:51 ? 0:00 /opt/CSCOpx/objects/db/bin/dbstop @/opt/CSCOpx/bin/.temp114624.txt

casuser 4531 1 0 Mar 17 ? 765:01 /opt/CSCOpx/objects/db/bin/dbsrv9 -x tcpip{HOST=localhost;DOBROADCAST=NO;Server

casuser 16309 24198 0 14:35:38 pts/5 0:00 grep -i csco

root 14624 14623 0 14:22:50 ? 0:00 /opt/CSCOpx/objects/perl/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine

root 14623 1 0 14:22:50 ? 0:00 sh -c /opt/CSCOpx/bin/perl /opt/CSCOpx/bin/dbstop.pl RMEDbEngine

nobody 8326 25944 0 12:56:49 pts/2

Joe Clarke Fri, 04/11/2008 - 11:44

This could be a database problem. You should probably check the consistency of the rmeng database:

NMSROOT/objects/db/conf/configureDb.pl action=validate dsn=rmeng

yjdabear Wed, 04/16/2008 - 06:03

Could you take a look at TAC case 608418091? I've attached the snoop capture over several days, that has resulted in three active SSH sessions to the same CatOS switch.

Joe Clarke Fri, 04/11/2008 - 09:46

1. No, do not terminate the process by PID. Use pdterm/pdexec to restart CiscoWorks daemons.

2. This will be a CiscoWorks bug assuming we are not actually closing the socket.

yjdabear Fri, 04/11/2008 - 10:01

I think you just re-addressed my questions from last Dec 2007.

Actions

This Discussion