We have a pair of CSS 11503s in a one-armed configuration, running VIP redundancy on a single VLAN with a single virtual router, without ASR.
Recently the device holding the Master Virtual Router developed a problem with TCP port keepalives. Some services constantly switched between Alive and Dying, while others failed completely to the Dead state.
We shut down the interface supporting this CSS, and the Backup Virtual Router took over within the 3-second default failover time. The new master has been running like this for several days now.
When we opened the interface back up on the old master, it came up and stayed as backup, which is as expected. However, the keepalives continued to transition.
When we analyse a trace of this keepalive traffic, a keepalive with the default frequency, max retries and retry period (5, 3, 5) actually seems to transmit at just over 10-second intervals.
The keepalives we can see are all good sequences, with the SYN/ACK received in 5-6 milliseconds.
My view is that the initial keepalive for an Alive service never actually leaves the CSS. The CSS marks the service as Dying and sends the next keepalive at the retry interval: 5 seconds for the initial keepalive (not seen on the trace) plus 5 seconds for the retry interval. This one does leave the CSS and marks the service Alive again.
Has anyone else seen this type of activity?
The behaviour does change slightly depending on whether we use keepalive tcp-close fin or rst. However, the services still transition constantly or stay Dead.
win-glou-02# show ver
Version: sg0730203 (07.30.2.03)
Flash (Locked): 07.20.0.03
Flash (Operational): 07.30.2.03
Licensed Cmd Set(s): Standard Feature Set
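To illustrate what a "keepalive type tcp" probe does on the wire, and the difference between a FIN close and an RST close (roughly what "keepalive tcp-close fin|rst" selects), here is a plain-socket sketch in Python. This is not the CSS implementation, just the same handshake-then-close pattern; the local listener is only there to make the example self-contained:

```python
import socket
import struct
import threading

def probe_tcp(host, port, timeout=1.0, close_with_rst=False):
    """Open a TCP connection and report whether the three-way handshake
    completed, mimicking a 'keepalive type tcp' probe. With
    close_with_rst=True the close emits an RST (like tcp-close rst);
    otherwise it is a normal FIN close (like tcp-close fin)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))   # SYN -> SYN/ACK -> ACK
        alive = True
    except OSError:
        alive = False
    if close_with_rst:
        # SO_LINGER with a zero timeout makes close() send an RST
        s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                     struct.pack('ii', 1, 0))
    s.close()
    return alive

# Throwaway local listener so the probes have something to answer them
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(5)
port = listener.getsockname()[1]
threading.Thread(target=lambda: [listener.accept() for _ in range(2)],
                 daemon=True).start()

print(probe_tcp('127.0.0.1', port, close_with_rst=False))  # FIN close
print(probe_tcp('127.0.0.1', port, close_with_rst=True))   # RST close
```

Both probes report True here; the only on-the-wire difference is the closing segment, which is exactly the knob the tcp-close option changes.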
Where did you capture this trace?
If on the server, I would suggest recapturing, this time on the CSS port side.
The CSS actually waits (frequency - 1) seconds for the server response.
So, if you see a new SYN from the CSS every ~10 seconds, it means the CSS is not actually receiving the SYN/ACK from the server and is waiting ~4 seconds before declaring the keepalive down.
If you have an ACL, I would also verify that it permits traffic from the server to the CSS.
You should also try a different cable/port to make sure the hardware is OK.
Thanks for rating this answer.
I am going to attempt to attach the trace, which was taken at the L2 Catalyst using a NAM data capture on the port attached to the CSS. We filtered this trace to anything to/from the IP address of the "Service", i.e. 220.127.116.11.
The CSS has a single Interface (one-armed config).
You will see that all the keepalives are successful as far as the LAN trace on the CSS interface is concerned.
However the keepalive config for this service was a simple tcp keepalive as follows:
ip address 18.104.22.168
keepalive type tcp
This showed as:
Name: kal-test Index: 33
Type: Local State: Down
Rule ( 22.214.171.124 TCP 51000 )
Session Redundancy: Disabled
Keepalive: (TCP-51000 5 3 5 )
Last Clearing of Stats Counters: 10/20/2005 09:22:16
Mtu: 1500 State Transitions: 10
Total Local Connections: 0 Total Backup Connections: 0
Current Local Connections: 0 Current Backup Connections: 0
Total Connections: 0 Max Connections: 65534
Total Reused Conns: 0
Weight: 1 Load: 255
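For anyone reading along, the "Keepalive: (TCP-51000 5 3 5)" line is the probed port followed by frequency, max retries and retry period. A quick sketch of how those numbers translate into failure-detection time; the field order follows the "(5,3,5)" description in this thread, and the detection formula is my reading of it rather than an official reference:

```python
def parse_kal(line):
    """Parse a CSS 'Keepalive:' line like '(TCP-51000 5 3 5)' into
    (port, frequency, max_retries, retry_period). Field order is
    assumed from the thread's description, purely illustrative."""
    fields = line.strip('() ').split()
    port = int(fields[0].split('-')[1])
    freq, retries, retry_period = map(int, fields[1:4])
    return port, freq, retries, retry_period

port, freq, retries, period = parse_kal('(TCP-51000 5 3 5 )')
print(port, freq, retries, period)   # 51000 5 3 5

# If a healthy service stops answering, it is probed once at the
# frequency, then retried max-retries times retry-period apart before
# going Dead -- so detection takes roughly freq + retries * period.
print('worst-case detection:', freq + retries * period, 'seconds')
```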
Thus the keepalives should be transmitting at 5-second intervals, as the success is immediate.
What we see is an interval of nearly 11 seconds, and all during this time the service is transitioning between Alive and Dying.
Perhaps, something like this occurs...
Keepalive success: the service is marked Alive and a keepalive is scheduled at the configured frequency, i.e. 5 seconds after the last success. This one does not leave the CSS, but the CSS tests for success after the 1-second wait. It sees no answer to the keepalive it never actually transmitted, so it marks the service Dying and schedules a retry at the retry period, another 5 seconds later. A total of 11 seconds! Hey presto, the retry is successful, so the service is marked Alive again.
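That timeline can be written out as a simple event list. The constants are the defaults from this thread, and the "never transmitted" step is the theory above, not confirmed CSS behaviour:

```python
# Hypothesised timeline for the default (5, 3, 5) keepalive when the
# scheduled probe silently never leaves the CSS.
FREQUENCY     = 5  # seconds between probes after a success
RESPONSE_WAIT = 1  # seconds the CSS waits for a SYN/ACK
RETRY_PERIOD  = 5  # seconds before a retry after a miss

t = 0
events = [(t, 'keepalive succeeds, service marked Alive')]
t += FREQUENCY
events.append((t, 'next probe scheduled but (theory) never transmitted'))
t += RESPONSE_WAIT
events.append((t, 'no SYN/ACK seen, service marked Dying'))
t += RETRY_PERIOD
events.append((t, 'retry leaves the CSS, succeeds, service Alive again'))

for when, what in events:
    print(f't={when:2d}s  {what}')
print(f'interval between successes: {t} seconds')  # 11
```

The 5 + 1 + 5 = 11 seconds between successful probes matches the ~11-second inter-keepalive gap seen on the trace.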
NB: this is a redundant VIP configuration, and we have only seen this phenomenon on one of the two boxes. Both boxes have a single Gigabit connection to different Layer 2 Catalysts.
I think you are hitting the following bug:
CSCeg60264 - keepalives remain in a
This was fixed in version 7.30(4.02) and 7.40(2.02).
I would recommend to migrate to a 7.40 or 7.50 version.
Thanks for rating this answer.
Thanks for this. I did not find this bug because I searched for "keepalive" rather than "keepalives" (and there is a spelling mistake in the bug). I will be a bit more thorough next time.
Anyway, I am not completely convinced as:
The services do not always remain in
The published workaround does not work. It does not seem to matter whether we use tcp-close fin or rst; we still get a bouncing or dead KAL status.
Did I mention that ICMP keepalives work well? However, we cannot run ICMP keepalives on these services, as the servers normally block pings.
Anyway, I will try to persuade the operational area to upgrade to a later version. I think only time will tell whether we have cleared the problem.
Unfortunately, we are currently regression-testing our apps etc. against 7.40(1.03), which was the latest published 7.4 version at the time. I have to persuade them that this bug is worth a move further forward. We have a policy of staying at the latest release minus one, which is why we are not moving to 7.5 yet unless we hit a bug that is only solved in that release.
I will also be asking our Account team for advice on this policy.
For the workaround to work, you have to make the config changes and then reboot.
The problem is that some ports get stuck, and when the CSS tries to use such a port, the keepalive fails.
In the end, the number of affected ports can grow so high that the keepalive stays down permanently.