CSS TCP KAL Failures in Redundant pair config

andrew.thomson
Level 1

We have a pair of CSS 11503, one armed configurations running VIP redundancy on a single VLAN with a single virtual router, without ASR.

Recently the device that held the Master Virtual Router developed a problem with TCP port keepalives. Some services constantly switched between Alive and Dying, while others failed completely to the Dead state.

We shut down the interface supporting this CSS, and the Backup Virtual Router took over within the 3-second default failover time. The new master has been running like this for several days now.

When we opened the interface back up on the old master, it came up and stayed as backup. That part is good. However, the keepalives continued to transition.

When we analyse a trace of this keepalive traffic, a keepalive with the default frequency, max retries and retry period (5, 3, 5) actually seems to transmit at just over 10-second intervals.

The keepalives we can see are all good sequences, with the SYN/ACK received in 5-6 milliseconds.

My view is that the initial keepalive for an Alive service is not actually leaving the CSS. The CSS marks the service as Dying and sends the next keepalive after the retry period: 5 seconds for the first keepalive (not seen on the trace) plus 5 seconds for the retry. That retry does leave the CSS and marks the service Alive again.

Has anyone else seen this type of activity?

The behaviour does change slightly depending on whether we use keepalive tcp-close fin or rst. However, the services still transition constantly or stay Dead.

7 Replies

Gilles Dufour
Cisco Employee

What version of CSS?

That makes it easier to search known issues.

Gilles.

Sorry...forgot

win-glou-02# show ver

Version: sg0730203 (07.30.2.03)

Flash (Locked): 07.20.0.03

Flash (Operational): 07.30.2.03

Type: PRIMARY

Licensed Cmd Set(s): Standard Feature Set

Secure Management

Another question.

Where did you capture this trace?

If on the server, I would suggest recapturing, this time on the CSS port side.

The CSS actually waits (frequency - 1) seconds for the server response.

So, if you see a new SYN from the CSS every ~10 seconds, it means the CSS is not receiving the SYN/ACK from the server and is waiting ~4 seconds before declaring the keepalive down.
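To make that arithmetic concrete, here is a minimal Python sketch of the two cadences. The constant names and the scheduler model are my assumptions based on the description above, not CSS internals:

# Rough timing model of the CSS TCP keepalive cadence, using the
# default 5/3/5 parameters from this thread. Names are illustrative.
FREQUENCY = 5                  # seconds between probes while Alive
RETRYPERIOD = 5                # seconds between retries after a failure
RESPONSE_WAIT = FREQUENCY - 1  # CSS waits (frequency - 1) s for the SYN/ACK

# Healthy service: the SYN/ACK arrives in milliseconds, so successive
# SYNs are simply one frequency apart.
print(f"healthy: new SYN every ~{FREQUENCY} s")

# Failing service: the CSS waits out the response timer, declares the
# probe failed, then schedules a retry.
print(f"failing: new SYN every ~{RESPONSE_WAIT + RETRYPERIOD} s")  # ~9-10 s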

If you have an ACL, I would also verify that it permits traffic from the server to the CSS.

You should also try a different cable/port to make sure the hardware is OK.

Regards,

Gilles.

Thanks for rating this answer.

I am going to attempt to attach the trace, which was taken at the L2 Catalyst using a NAM data capture on the port attached to the CSS. We filtered the trace to anything to/from the IP address of the "Service", i.e. 31.248.50.241.

The CSS has a single interface (one-armed config).

You will see that all the keepalives are successful as far as the LAN trace on the CSS interface is concerned:

SYN-->

<---SYN-ACK

ACK-->

RST-->
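That four-packet exchange can be reproduced off-box with a short Python sketch. The SO_LINGER trick to force an RST close (the analogue of keepalive tcp-close rst) and the sample address are assumptions for illustration only, not CSS internals:

import socket
import struct

def tcp_probe(host, port, close_with_rst=True, timeout=4.0):
    """Open and immediately close a TCP connection, mimicking the
    probe above (SYN -->, <---SYN-ACK, ACK -->, then RST or FIN)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))     # three-way handshake
    except OSError:
        return False                # no SYN/ACK: the probe has failed
    if close_with_rst:
        # SO_LINGER with a zero timeout makes close() send RST instead
        # of FIN, analogous to 'keepalive tcp-close rst' on the CSS.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                     struct.pack("ii", 1, 0))
    s.close()                       # FIN by default, RST with linger 0
    return True

# e.g. tcp_probe("31.248.50.241", 51001) against the service in the trace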

However, the keepalive config for this service was a simple TCP keepalive, as follows:

service kal-test

ip address 31.248.50.241

protocol tcp

port 51001

keepalive type tcp

active

This showed as:

Name: kal-test Index: 33

Type: Local State: Down

Rule ( 31.248.50.241 TCP 51000 )

Session Redundancy: Disabled

Redirect Domain:

Redirect String:

Keepalive: (TCP-51000 5 3 5 )

Last Clearing of Stats Counters: 10/20/2005 09:22:16

Mtu: 1500 State Transitions: 10

Total Local Connections: 0 Total Backup Connections: 0

Current Local Connections: 0 Current Backup Connections: 0

Total Connections: 0 Max Connections: 65534

Total Reused Conns: 0

Weight: 1 Load: 255

DFP: Disable

Thus the keepalives should be transmitting at 5-second intervals, as the success is immediate.

What we see is an interval of nearly 11 seconds, and all during this time the service is transitioning between Alive and Dying.

Perhaps something like this occurs...

Keepalive success: the service is marked Alive and a keepalive is scheduled at the configured frequency, i.e. 5 seconds after the last success. This keepalive does not leave the CSS, but the CSS still tests for success after its 1-second wait. Seeing no answer to a keepalive it never actually transmitted, it marks the service Dying and schedules a retry after the retry period, another 5 seconds. A total of 11 seconds! Hey presto, the retry is successful, so the service is marked Alive again.
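If that is what is happening, a toy model of the scheduler reproduces the ~11-second cadence on the wire. This is only a sketch of the suspected behaviour (the 1-second test wait is my reading of the trace), not CSS code:

# Toy model of the suspected bug: the probe scheduled at the configured
# frequency is silently dropped, so only the retries reach the wire.
FREQUENCY, TEST_WAIT, RETRYPERIOD = 5, 1, 5

t = 0.0
wire_times = []
for _ in range(4):        # four Alive -> Dying -> Alive cycles
    t += FREQUENCY        # probe scheduled... but never transmitted
    t += TEST_WAIT        # CSS sees no answer, marks the service Dying
    t += RETRYPERIOD      # a retry is scheduled
    wire_times.append(t)  # the retry does leave the box -> Alive again

print([b - a for a, b in zip(wire_times, wire_times[1:])])  # [11.0, 11.0, 11.0]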

NB: this is a redundant VIP configuration and we have only seen this phenomenon on one of the two boxes. Both boxes have a single Gigabit connection to different Layer 2 Catalysts.

I think you are hitting the following bug:

CSCeg60264 - keepalives remain in a state

This was fixed in versions 7.30(4.02) and 7.40(2.02).

I would recommend migrating to a 7.40 or 7.50 version.

Thanks for rating this answer.

Regards,

Gilles.

Gilles,

thanks for this. I did not find this bug because I searched for "keepalive" rather than "keepalives" (and there is a spelling mistake in the bug). I will be a bit more thorough next time.

Anyway, I am not completely convinced, as:

The services do not always remain in one state: they transition constantly between ALIVE and DYING. However, if they do make it to DOWN, they often stay DOWN.

The published workaround does not work. It does not seem to matter whether we are in tcp-close fin or rst; we still get bouncing or dead KAL status.

Did I mention that ICMP keepalives work well? However, we cannot run ICMP keepalives on these services, as the servers normally block pings.

Anyway, I will try to persuade the operational area to upgrade to a later version. I think only time will tell whether we have cleared the problem.

Unfortunately, we are currently regression testing our apps against 7.40(1.03), which was the latest published version of 7.4 at the time. I will have to persuade them that this bug is worth a further move forward. We have a policy of staying at the latest release minus one, which is why we are not moving to 7.5 yet unless we hit a bug that is only solved in that release.

I will also be asking our Account team for advice on this policy.

For the workaround to work, you have to make the config changes and then reboot.

The problem is that some ports get stuck, and when the CSS tries to use such a port, the keepalive fails.

Eventually, the number of affected ports can grow so high that the keepalive stays down permanently.
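As a rough analogy using ordinary host sockets (not CSS internals), a source port still held by an old connection simply cannot be reused, and anything scheduled onto it fails:

import errno
import socket

PORT = 51001  # illustrative local port

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", PORT))        # first user of the port

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", PORT))    # a "stuck" port: the second bind fails
except OSError as e:
    print(e.errno == errno.EADDRINUSE)  # True
finally:
    s1.close()
    s2.close()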

Gilles.
