12-27-2007 08:07 AM
Some of the Cat6k (CatOS 8.4) switches are out of VTY lines due to unterminated SSH sessions from RME 4.0.5. All these SSH sessions are in ESTABLISHED state.
1. Would it be safe to terminate the cwjava processes by pid? I assume these are results of once-a-day Config Archive jobs.
2. Is this SSH issue a CatOS or RME bug? A search in Bug Tool came up empty for any CatOS bug that's close.
12-27-2007 09:59 AM
Restarting ConfigMgmtServer should clear these up. The problem is documented in CSCse20477, and a script is available from TAC that will disable syslog-triggered config archive for these devices.
12-27-2007 10:10 AM
The bug description says: "This only happens when using TFTP in transport protocol which uses snmp read-write when trying to pull the config." and "* Do not use TFTP as a configuration fetch protocol. Instead use telnet, SSH, SCP, or RCP". In my case, it's the SSH sessions that are backed up. My Transport Settings for ArchiveMgmt are SSH, SCP, Telnet, and TFTP, in that order. I'm not sure this is the bug.
More findings: There were about a dozen SSH sessions to each of those two switches (sw1 and sw2). After doing a "kill
12-27-2007 10:16 AM
Yes, the problem is triggered by TFTP. Typically, if a CatOS device's lines are being tied up, this is the bug. What was the actual process (process name) you killed? If it was a cwjava process, it would have helped to know which dmgtd daemon it was to figure out why the lines were hung.
12-27-2007 10:21 AM
I didn't do a thorough capture of pid 27601, which I just realized was the same pid implicated by every active socket connection. Hopefully, what's left on my screen can still shed some light:
netstat -f inet | grep sw2
nms.fqdn.com.59446 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.36623 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.57115 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.53398 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.38539 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.65329 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.54673 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.38238 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.49460 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.44512 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.48385 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.39146 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
netstat -f inet | grep -i sw1
nms.fqdn.com.65023 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.43251 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.47566 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.50480 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.54475 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.61729 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.36249 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.52040 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.54341 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.61546 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
lsof -i :44512
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 27601 casuser 256u IPv4 0x30020f413b0 0t463912 TCP nms.fqdn.com:44512->sw2.fqdn.com:ssh (ESTABLISHED)
lsof -i :36249
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 27601 casuser 197u IPv4 0x3001a280dc8 0t432792 TCP nms.fqdn.com:36249->sw1.fqdn.com:ssh (ESTABLISHED)
kill 27601
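The per-switch tally above can be automated. The following is a minimal Python sketch, assuming netstat output in the seven-column layout pasted above (local address, remote address, four numeric columns, state); the function name `count_established_ssh` is hypothetical and not part of any CiscoWorks tooling.

```python
from collections import Counter

def count_established_ssh(netstat_text):
    """Count ESTABLISHED sessions to remote port 'ssh', keyed by remote host."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: local, remote, swind, send-q, rwind, recv-q, state
        if len(fields) < 7 or fields[-1] != "ESTABLISHED":
            continue
        # The remote address looks like "host.domain.port"; split off the port
        host, _, port = fields[1].rpartition(".")
        if port == "ssh":
            counts[host] += 1
    return counts

sample = """\
nms.fqdn.com.59446 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.36623 sw2.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
nms.fqdn.com.65023 sw1.fqdn.com.ssh 4096 0 64240 0 ESTABLISHED
"""
print(count_established_ssh(sample))
```

A host whose count approaches its number of VTY lines is a likely candidate for the hung-session problem discussed in this thread.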
12-27-2007 10:28 AM
No, this does not help identify the CiscoWorks daemon. You would have needed to use pdshow to translate the PID to a daemon name. Then, a pdterm on that daemon would have been much better than simply killing it.
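As a rough illustration of the PID-to-daemon lookup, here is a minimal Python sketch that parses pdshow text. It assumes the `Process=` and `Pid =` field layout seen in the pdshow output posted later in this thread; the helper name `pid_to_daemon` is hypothetical, and skipping a Pid of 0 is an assumption about how stopped daemons might be reported.

```python
def pid_to_daemon(pdshow_text):
    """Map each Pid value to the Process name from pdshow-style output."""
    mapping = {}
    current = None
    for line in pdshow_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a "key = value" line
        key, value = key.strip(), value.strip()
        if key == "Process":
            current = value
        elif key == "Pid" and current is not None and value.isdigit():
            # Assumption: a Pid of 0 means the daemon is not running
            if int(value) > 0:
                mapping[int(value)] = current
    return mapping

sample = """\
Process= ConfigMgmtServer
State = Program started - No mgt msgs received
Pid = 4784
RC = 0
"""
daemons = pid_to_daemon(sample)
print(daemons.get(4784))
```

With the daemon name in hand, the PID reported by lsof can be matched to a daemon and that daemon stopped with pdterm rather than a bare kill.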
12-27-2007 10:34 AM
Thanks, I'll keep that in mind. In the meantime, are RME versions beyond 4.0.5 still affected by CSCse20477? I cannot disable TFTP because a few straggler devices do not support SSH. I've also found that the switches involved all have "set logging server severity 5", so RME activity couldn't have been triggered by %SYS-6-CFG_CHG.
12-27-2007 10:35 AM
All versions of RME from 3.5 up are affected. A fix to alleviate this will go into RME 4.2.
04-10-2008 05:44 AM
This problem cropped up again, so far affecting one of the same two devices across the continent:
nms.domain.tld.50366 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.50967 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.51132 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.53207 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.54888 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.54964 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.56634 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.56796 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.57613 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.58926 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.61398 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
nms.domain.tld.65402 catos-switch.domain.tld.ssh 4096 0 64240 0 ESTABLISHED
lsof -i:50366
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cwjava 4784 casuser 44u IPv4 0x300100ab3f0 0t462140 TCP nms.domain.tld:50366->catos-switch.domain.tld:ssh (ESTABLISHED)
pdshow
Process= ConfigMgmtServer
State = Program started - No mgt msgs received
Pid = 4784
RC = 0
Signo = 0
Start = 03/17/08 09:06:04
Stop = Not applicable
Core = Not applicable
Info = Application started by administrator request.
Questions:
1) While it's ConfigMgmtServer as described in the bug, does the same bug still apply when the switches are configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification?
2) Since it appears to affect the same device(s), does that mean the device OS is also buggy? Is there anything that can be done on the device side to prevent this (other than bumping up the number of VTY lines)?
04-10-2008 07:06 AM
The switches impacted all have "set logout 0" configured. This raises the question of how LMS (specifically ConfigMgmtServer) terminates the SSH sessions to the devices. Does it send an explicit "exit"? It seems it doesn't.
04-10-2008 08:27 AM
Not all of the SSH sessions are true SSH sessions. RME sends a SYN to port 22 to test for SSH connectivity, and to determine the supported version of SSH. This socket (as with all sockets) is explicitly closed in Java. I made it my mission to hunt down socket leaks in LMS, and all known socket leaks should be fixed in RME 4.0.6 and 4.1.1.
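To illustrate the kind of probe described here, below is a minimal Python sketch, not RME's actual Java code. It connects to port 22, reads the server's identification string (the `SSH-<protoversion>-<software>` line defined by the SSH protocol), and closes the socket whether or not the read succeeds. The function names `parse_ssh_banner` and `probe_ssh` are hypothetical.

```python
import socket

def parse_ssh_banner(line):
    """Return the protocol version from an 'SSH-<proto>-<software>' line, or None."""
    parts = line.strip().split("-", 2)
    if len(parts) == 3 and parts[0] == "SSH":
        return parts[1]
    return None

def probe_ssh(host, port=22, timeout=5.0):
    """Connect, read the first banner line, and always close the socket."""
    # The with-block closes the socket even if recv() raises or times out,
    # so the probe itself cannot leave a session open on the device.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = sock.recv(255)
    lines = data.decode("ascii", errors="replace").splitlines()
    return parse_ssh_banner(lines[0]) if lines else None

print(parse_ssh_banner("SSH-1.99-Cisco-1.25\r\n"))
```

A version of 1.99 conventionally means the server accepts both SSH 1 and SSH 2 clients. Calling `probe_ssh("sw1.fqdn.com")` would run the check against a real device; the hostname is only the placeholder used in this thread.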
04-10-2008 08:24 AM
1. If the switches are sending these messages to RME, and RME has not been configured to ignore them, and RME is using TFTP to fetch the config, then RME will loop trying to fetch the config, and you will see this problem with VTYs being eaten up.
2. The problem only started occurring on later versions of CatOS. Earlier versions did not send the %SYS-6-CFG_CHG message when the tftpGrp was modified via SNMP. This really can't be considered a device bug.
04-10-2008 09:13 AM
My bad. I left out the "not" as in "the switches are *not* configured to send %SYS-6-CFG_CHG so there's no chance it's RME acting on syslog notification? "
04-10-2008 09:17 AM
And you're seeing this problem with RME 4.0.6?
04-10-2008 09:18 AM
Yes.