Solved: SSH issue: CatOS or RME bug?

yjdabear · ‎12-27-2007

Some of the Cat6k (CatOS 8.4) switches are out of VTY lines due to unterminated SSH sessions from RME 4.0.5. All these SSH sessions are in ESTABLISHED state.

1. Would it be safe to terminate the cwjava processes by pid? I assume these are results of once-a-day Config Archive jobs.

2. Is this SSH issue a CatOS or RME bug? A search in Bug Tool came up empty for any CatOS bug that's close.

Joe Clarke · ‎12-27-2007

Restarting ConfigMgmtServer should clear these up. The problem is documented in CSCse20477, and a script is available from TAC that will disable syslog-triggered config archive for these devices.

yjdabear · ‎12-27-2007

The bug description says: "This only happens when using TFTP in transport protocol which uses snmp read-write when trying to pull the config." and "* Do not use TFTP as a configuration fetch protocol. Instead use telnet, SSH, SCP, or RCP". In my case, it's the SSH sessions that are backed up. My Transport Settings for ArchiveMgmt are SSH, SCP, Telnet, and TFTP, in that order. I'm not sure this is the bug.

More findings: There were about a dozen SSH sessions to each of those two switches (sw1 and sw2). After doing a "kill " against the oldest (by pid) of these to sw1, all the other SSH sessions to both switches terminated as well.

Joe Clarke · ‎12-27-2007

Yes, the problem is triggered by TFTP. Typically, if a CatOS device's lines are being tied up, this is the bug. What was the actual process (process name) you killed? If it was a cwjava process, it would have helped to know which dmgtd daemon it was to figure out why the lines were hung.

yjdabear · ‎12-27-2007

I didn't do a thorough capture of pid 27601, which I just realized was the same pid implicated by every active socket connection. Hopefully, what's left on my screen can still shed some light:

netstat -f inet | grep sw2

nms.fqdn.com.59446 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.36623 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.57115 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.53398 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.38539 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.65329 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54673 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.38238 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.49460 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.44512 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.48385 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.39146 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

netstat -f inet | grep -i sw1

nms.fqdn.com.65023 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.43251 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.47566 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.50480 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54475 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.61729 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.36249 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.52040 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.54341 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

nms.fqdn.com.61546 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED

lsof -i :44512

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 27601 casuser 256u IPv4 0x30020f413b0 0t463912 TCP nms.fqdn.com:44512->sw2.fqdn.com:ssh (ESTABLISHED)

lsof -i :36249

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 27601 casuser 197u IPv4 0x3001a280dc8 0t432792 TCP nms.fqdn.com:36249->sw1.fqdn.com:ssh (ESTABLISHED)

kill 27601
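The netstat listings above can be tallied per switch rather than counted by eye. This is a minimal sketch, assuming the column layout shown in this post (local address, remote address, four numeric columns, state) and that the port is the last dot-separated part of the address. The function name `count_ssh_sessions` is made up for illustration.

```python
from collections import Counter


def count_ssh_sessions(netstat_text):
    """Count ESTABLISHED connections to the ssh port, keyed by remote host."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and sockets in any other state.
        if len(fields) < 3 or fields[-1] != "ESTABLISHED":
            continue
        # Assumed layout: the remote address is the second column and ends
        # in ".<port>", e.g. "sw2.fqdn.com.ssh".
        host, _, port = fields[1].rpartition(".")
        if port == "ssh":
            counts[host] += 1
    return counts


sample = """\
nms.fqdn.com.59446 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.36623 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.65023 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
"""

for host, n in count_ssh_sessions(sample).most_common():
    print(host, n)
```

Feeding it the full output of "netstat -f inet" would show at a glance which switches are close to running out of VTY lines.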

Joe Clarke · ‎12-27-2007

No, this does not help identify the CiscoWorks daemon. You would have needed to use pdshow to translate the PID to a daemon name. Then, a pdterm on that daemon would have been much better than simply killing it.
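As a rough illustration of translating a PID to a daemon name, the sketch below parses saved pdshow output into a PID-to-process map. It assumes pdshow prints blocks of "Key = value" lines beginning with a "Process=" line, as in the pdshow output quoted later in this thread. The function name `pid_to_daemon` is hypothetical and not part of CiscoWorks.

```python
def pid_to_daemon(pdshow_text):
    """Map each non-zero Pid to the Process name of the block it appears in."""
    mapping = {}
    current = None
    for line in pdshow_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a "Key = value" line
        key, value = key.strip(), value.strip()
        if key == "Process":
            current = value
        elif key == "Pid" and current is not None and value.isdigit():
            pid = int(value)
            if pid > 0:  # assumption: a Pid of 0 means the daemon is not running
                mapping[pid] = current
    return mapping


sample = """\
Process= ConfigMgmtServer
State = Program started - No mgt msgs received
Pid = 4784
RC = 0
Signo = 0
Start = 03/17/08 09:06:04
"""

print(pid_to_daemon(sample).get(4784))
```

With the daemon name in hand, pdterm can be run against that daemon instead of killing the PID directly.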

yjdabear · ‎12-27-2007

Thanks, I'll keep that in mind. In the meantime, are RMEs beyond v4.0.5 still affected by CSCse20477? I cannot disable TFTP due to a few straggler devices that do not have SSH support. I've also found the switches involved all have "set logging server severity 5" so RME activities couldn't be triggered by %SYS-6-CFG_CHG.
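The reasoning about severity can be checked mechanically. This is a small sketch, assuming the usual Cisco %FACILITY-SEVERITY-MNEMONIC message naming and that a server severity of N forwards only messages at level N or numerically lower (more severe). The helper name `sent_to_server` is made up for illustration.

```python
import re

# Matches names such as %SYS-6-CFG_CHG: facility, severity digit, mnemonic.
MNEMONIC = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+)")


def sent_to_server(message, server_severity):
    """Return True if the message's severity passes the server severity filter."""
    match = MNEMONIC.search(message)
    if match is None:
        raise ValueError("no %FACILITY-SEVERITY-MNEMONIC found in: " + repr(message))
    # Lower numbers are more severe; forward only levels <= the configured one.
    return int(match.group(2)) <= server_severity


print(sent_to_server("%SYS-6-CFG_CHG", 5))
print(sent_to_server("%SYS-6-CFG_CHG", 6))
```

Under these assumptions, a level-6 message such as %SYS-6-CFG_CHG is filtered out when the server severity is 5 and forwarded when it is 6.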

Joe Clarke · ‎12-27-2007

All versions of RME from 3.5 up are affected. A fix is going into RME 4.2 to alleviate this.

yjdabear · ‎04-10-2008

This problem cropped up again, so far affecting one of the same two devices across the continent:

nms.domain.tld.50366 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.50967 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.51132 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.53207 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.54888 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.54964 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.56634 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.56796 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.57613 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.58926 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.61398 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

nms.domain.tld.65402 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED

lsof -i:50366

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

cwjava 4784 casuser 44u IPv4 0x300100ab3f0 0t462140 TCP nms.domain.tld:50366->catos-switch.domain.tld:ssh (ESTABLISHED)

pdshow

Process= ConfigMgmtServer

State = Program started - No mgt msgs received

Pid = 4784

RC = 0

Signo = 0

Start = 03/17/08 09:06:04

Stop = Not applicable

Core = Not applicable

Info = Application started by administrator request.

Questions:

1) While it's ConfigMgmtServer as described in the bug, does the same bug still apply when the switches are configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification?

2) Since it appears to affect the same device(s), does that mean the device OS is also buggy? Is there anything that can be done on the device side to prevent this (other than bumping up the number of VTY lines)?

yjdabear · ‎04-10-2008

The switches impacted all have "set logout 0" configured. This raises the question of how LMS (specifically ConfigMgmtServer) terminates the SSH sessions to the devices. Does it send an explicit "exit"? It seems it doesn't.

Joe Clarke · ‎04-10-2008

Not all of the SSH sessions are true SSH sessions. RME sends a SYN to port 22 to test for SSH connectivity, and to determine the supported version of SSH. This socket (as with all sockets) is explicitly closed in Java. I made it my mission to hunt down socket leaks in LMS, and all known socket leaks should be fixed in RME 4.0.6 and 4.1.1.
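For readers unfamiliar with this kind of probe, here is a minimal Python sketch of the general technique: open a TCP connection to port 22, read the server's identification line, and close the socket on every path. It is not RME's code (RME is written in Java). The function names are hypothetical, and the sketch assumes the identification line is the first line the server sends.

```python
import socket


def parse_ssh_banner(banner):
    """Return the protocol version from an 'SSH-protoversion-softwareversion' line."""
    banner = banner.strip()
    parts = banner.split("-", 2)
    if len(parts) < 3 or parts[0] != "SSH":
        raise ValueError("not an SSH identification string: " + repr(banner))
    return parts[1]


def probe_ssh_version(host, port=22, timeout=5.0):
    """Connect, read the banner, and always close the socket afterwards."""
    # The with-block closes the socket even if recv() raises or times out,
    # so the probe does not leave a connection open on the device.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = sock.recv(256)
    text = data.decode("ascii", errors="replace")
    first_line = text.splitlines()[0] if text else ""
    return parse_ssh_banner(first_line)


print(parse_ssh_banner("SSH-1.99-Cisco-1.25\r\n"))
```

A version of 1.99 conventionally means the server accepts both SSH 1.x and 2.0 clients. The point of the with-block is that a probe which skips the close step would hold a line on the switch until the device times it out.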

Joe Clarke · ‎04-10-2008

1. If the switches are sending these messages to RME, and RME has not been configured to ignore them, and RME is using TFTP to fetch the config, then RME will loop trying to fetch the config, and you will see this problem with VTYs being eaten up.

2. The problem only started occurring on later versions of CatOS. Earlier versions did not send the %SYS-6-CFG_CHG message when the tftpGrp was modified via SNMP. This really can't be considered a device bug.

yjdabear · ‎04-10-2008

My bad. I left out the "not", as in "the switches are *not* configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification?"

Joe Clarke · ‎04-10-2008

And you're seeing this problem with RME 4.0.6?

yjdabear · ‎04-10-2008

Yes.