I hope someone can help me out with this problem. We are running A3(2.2), and two of our rservers (the only two for a particular farm, and the only two running the same OS/web server combination, Ubuntu 10.04/nginx) are reporting a very high number of connections.
Three items may be of interest:
1) The number of connections showing with 'show rserver <name>' is about 10 times as high as a 'netstat' on the server itself.
2) show conn rserver <name> detail is showing the detail of many more connections than are actually established on the server. Some of the connections are shown as being idle for up to 1 hour; it seems the ACE gets rid of the connection after an hour of idle time.
9228 1 in TCP 100 x.y.125.12:3103 a.b.c.d:80 ESTAB
[ idle time : 00:52:12, byte count : 1471 ]
[ elapsed time: 00:52:18, packet count: 8 ]
285 1 out TCP 100 a.b.c.d:80 a.y.125.12:3103 CLOSED
[ conn in reuse pool : FALSE]
[ idle time : 00:52:13, byte count : 7248 ]
[ elapsed time: 00:52:18, packet count: 9 ]
3) show rserver <name> detail shows "total conn-failures" steadily increasing all the time, much more so than on all our other servers, even servers that process a higher number of requests. The total failure rate is close to 3%.
I have stress tested the webservers with wget and the MS Web Application Stress Tool, but cannot get a connection failure.
My theory is that the failed connections are sticking around until the ACE times them out at 1 hour. The question is, how can I debug why the ACE is failing to make the connection? None of the other servers behind the ACE are exhibiting this problem.
For an explanation about the conn-failure counter, have a look here:
-------------------------
As per the design, a connection that gets closed before the three-way handshake completes is considered a connection failure, and the corresponding server counter is incremented. The following describes how this works for L4/L7 connections with and without normalization.
1. With normalization on
The count will increment if the three-way handshake fails to establish for one of the following reasons:
a. A RST comes from the client or server after the SYN-ACK.
b. The server never replies to a SYN. The connection will then time out.
2. With normalization off
The count will not increment.
For L7 (normalization is always on)
The count will increment if the three-way handshake fails to establish for one of the following reasons:
a. A RST comes from the server after the front-end connection is established.
b. The server never replies to a SYN. The connection will then time out.
----------------------------
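The rules quoted above can be written down as a small decision function. This is only a sketch of the quoted description in Python, not ACE code: the function name, the event labels, and the reading that items 1 and 2 apply to L4 policies are assumptions made for illustration.

```python
def counts_as_conn_failure(policy: str, normalization: bool, event: str) -> bool:
    """Return True if the quoted design note says conn-failures is incremented.

    policy        -- "L4" or "L7" (hypothetical labels)
    normalization -- only meaningful for L4; the note says L7 always has it on
    event         -- "rst_from_client", "rst_from_server", "syn_timeout",
                     or "handshake_ok" (hypothetical labels)
    """
    if policy == "L7":
        # L7: RST from the server after the front-end connection is up,
        # or the server never answers the SYN.
        return event in ("rst_from_server", "syn_timeout")
    if policy == "L4":
        if not normalization:
            # L4 with normalization off: the counter never increments.
            return False
        # L4 with normalization on: RST from either side after SYN-ACK,
        # or the server never answers the SYN.
        return event in ("rst_from_client", "rst_from_server", "syn_timeout")
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for policy, norm in (("L4", True), ("L4", False), ("L7", True)):
        for event in ("rst_from_client", "rst_from_server", "syn_timeout"):
            print(policy, norm, event, counts_as_conn_failure(policy, norm, event))
```

Note that, read literally, a connection that completes the handshake and is later torn down never matches any of these cases.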
Keep in mind that, by default, the ACE keeps TCP connections in its database in the idle state for one hour (unless the connection is closed from the client), after which it times them out. You can modify this behaviour here:
If the connection from the client side is not closed, then even though connections might not be active on the server side (and therefore not present in the netstat output), they will be counted in the connections output for show rserver.
To debug why you see conn-failure increasing, you might take a sniffer trace on the tengig interface or on the real server and check whether all connections are able to be established through a normal three-way handshake.
Otherwise, since you are saying the other servers are fine, it would be interesting to understand what the differences are between the servers that are working and those that are not.
i.e. type of connection/TCP port/parameter/load/VLAN/type of server?
Hope this helps.
Thank you for the detailed information and confirming my assumptions about the long-active connections.
The environment is a virtual environment, and everything (VLAN, Host, protocol, port) is the same. The only difference is OS and web server. However, whether I connect through the load balancer, or directly through the servers, I seem to be getting a connection each time. I did a MS Network Monitor packet capture, and the 3 way handshake seems to work every time. So I'm not sure why the connection errors on the ACE are increasing so fast - from a user perspective, everything looks normal.
Is there a way to log connection errors to syslog or something? Then I can try to correlate the source port/ip with the customer's web server logs. Packet capture from the server side is impossible for our environment since we don't have access to the server, and it is virtual so sniffing the switch port is not possible.
Is it an L4 policy? If yes, can you verify whether you are still getting the conn-failures after disabling normalization?
If it is an L7 policy, did you by any chance apply a parameter map with the server-conn reuse command?
As for the traces, I understand you took them from the server. However, if for some reason the SYN packet does not arrive at the server, the counter will increase and we won't see the missed SYN in the trace.
The best would be to collect a trace from the tengig of the cat6k:
-------------------------------------------------------------------
TO SETUP MONITOR SESSION
Set a monitor session having source of the span the ACE TenGigabit interface and destination of the span a destination port connected to a network capture device. For example, having the ACE in slot 2 and the capture device connected on GE 6/9:
    Router(config)#monitor session 1 source interface TenGigabitEthernet 2/1 both
    Router(config)#monitor session 1 destination interface GigabitEthernet 6/9
If you wish you can filter by VLAN. For example if you wish to span only VLAN 3 to 5 and 10:
    Router(config)#monitor session 1 filter vlan 3 - 5 , 10
Configure the destination port as a trunk port so that the VLAN IDs will be preserved:
    interface GigabitEthernet6/9
     switchport
     switchport trunk encapsulation dot1q
     switchport mode trunk
     switchport nonegotiate
NOTE: When connecting to GigabitEthernet 6/9 be sure to use a network capture device that can monitor VLAN tagging (a trunked port). In this way, VLAN tags will be preserved, and we will be able to clearly see which VLAN a packet arrived on, and which VLAN it exited on. Please refer the following links how to preserve the VLAN tags for different operating systems:
    http://wiki.ethereal.com/CaptureSetup/VLAN
    http://wiki.wireshark.org/CaptureSetup/VLAN
NOTE: Please note that on the packet trace tool (WireShark, Ethereal, SnifferPro), the frame snap size should be set to unlimited otherwise, only the first 68 bytes of each frame may be captured.
More info regarding the configuring the SPAN:
    http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/span.html
-------------------------------------------------------------------------
Our rules are L7. Using Microsoft network monitor, I think I have made a significant discovery:
To an IIS7 server (connections do not get stuck on the load balancer for 60 minutes), the "TCP State" per connection is as follows:
1) Half connected
To nginx, the "TCP State" per connection appears in the monitor as follows:
1) Half connected
Please correct me if I'm wrong, but the missing FinWait2 means that the client (my PC) never gets an ACK for the FIN that it sends to the server? From the link you sent me: "To configure a timeout for TCP embryonic connections (connections that result from an incomplete three-way handshake) and half-closed connections (connections where the client has sent a FIN and the server has not responded), use the set tcp timeout command. Use the no form of this command to reset TCP timeout values to their default settings." - I assume the exact behaviour we are seeing is explained by the missing ACK.
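For reference, the client-side (active close) part of the standard TCP state machine can be sketched as a lookup table. This is a minimal Python illustration of the transitions in the TCP specification, not of anything specific to the ACE or nginx; the event labels are made up for the example, and it is an assumption that Network Monitor's "FinWait1"/"FinWait2" correspond to FIN_WAIT_1/FIN_WAIT_2.

```python
# Active-close transitions from the TCP specification (RFC 793 / RFC 9293).
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("ESTABLISHED", "send_fin"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv_ack_of_fin"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",
    ("FIN_WAIT_1", "recv_fin_and_ack_of_fin"): "TIME_WAIT",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSING", "recv_ack_of_fin"): "TIME_WAIT",
}


def trace(events, state="ESTABLISHED"):
    """Return the list of states visited while applying events in order."""
    states = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        states.append(state)
    return states


if __name__ == "__main__":
    # Server ACKs the FIN first, then sends its own FIN later.
    print(trace(["send_fin", "recv_ack_of_fin", "recv_fin"]))
    # Server answers with FIN and ACK in one segment: FIN_WAIT_2 is skipped.
    print(trace(["send_fin", "recv_fin_and_ack_of_fin"]))
    # The FIN is never acknowledged: the client stays in FIN_WAIT_1.
    print(trace(["send_fin"]))
```

A client that never shows FIN_WAIT_2 is therefore consistent with a missing ACK, but also with the server sending its FIN together with the ACK in a single segment, so the packet list is worth checking before drawing a conclusion.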
Thanks for your assistance with this.
conn-failure should increase only during the three-way handshake. Therefore, if the connection is terminated by a timeout or by a standard FIN - FIN/ACK - ACK exchange, this should not impact the counter.
Another thing, where did you put the Microsoft network monitor? Is it on the client or rather on the server? You should watch the server side.
Is the conn-failure counter increasing for each connection established? I.e., are you able to see the counter increase even when establishing only one session?
And if yes, does the counter increase when you initiate the connection, while it is established, or at the closure?
Or are you running a performance test, and the problem shows only in that situation? Statistically, how many conn-failures do you see out of how many total connections?
Also, can you tell me if you are using the server conn-reuse parameter?
Apparently, in some ACE releases, enabling this parameter might lead to mistaken conn-failure counts.
I am running on the client, as the MS Network Monitor only runs on Windows, and I do not have access to the (Linux) servers in question. We do not have the word 'reuse' anywhere in our configuration, so it's definitely not that.
I think we may have two different issues here (which may or may not be related):
1) High number of connections showing with "show conn rserver", with long idle times
2) High number of connection failures
I took about a dozen or so of the stale (>40 minutes idle) connections listed with "show conn rserver detail", and queried our netflow database for any flows matching the client's IP and port number. I can definitely see flows from the client to the load balancer, from the load balancer to the client, and from the rserver's IP to the client's IP. The number of bytes being sent from the rserver through the load balancer to the client (10KB-100KB) implies that the three-way handshake did work, and that the connections were successful. From this I think we can conclude that at least some of the "stale" connections are not being caused by failed connections. In other words, I don't think #1 above is being caused by #2, at least not entirely. What are your thoughts on this? Is my theory regarding the missing ACK to the client's FIN valid to explain this?
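Picking out the stale entries by hand gets tedious beyond a dozen or so. Below is a minimal Python sketch that pulls the client IP and port of inbound entries idle for more than 40 minutes out of pasted "show conn rserver <name> detail" output. It assumes the layout shown in the sample earlier in this thread (a connection line followed by bracketed detail lines); the column meanings are inferred from that sample, and the function name is made up for illustration.

```python
import re

# Connection line, e.g. "9228 1 in TCP 100 x.y.125.12:3103 a.b.c.d:80 ESTAB".
# Column meanings (id, np, direction, protocol, vlan, source, destination,
# state) are inferred from the sample output, not from a format specification.
CONN_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(in|out)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)
IDLE_RE = re.compile(r"idle time\s*:\s*(\d+):(\d+):(\d+)")

SAMPLE = """\
9228 1 in TCP 100 x.y.125.12:3103 a.b.c.d:80 ESTAB
[ idle time : 00:52:12, byte count : 1471 ]
[ elapsed time: 00:52:18, packet count: 8 ]
285 1 out TCP 100 a.b.c.d:80 a.y.125.12:3103 CLOSED
[ conn in reuse pool : FALSE]
[ idle time : 00:52:13, byte count : 7248 ]
[ elapsed time: 00:52:18, packet count: 9 ]
"""


def stale_inbound(text, min_idle_seconds=40 * 60):
    """Return (client_ip, client_port, idle_seconds) for stale inbound entries."""
    results = []
    current = None
    for line in text.splitlines():
        conn = CONN_RE.match(line)
        if conn:
            current = {"dir": conn.group(3), "source": conn.group(6)}
            continue
        idle = IDLE_RE.search(line)
        if idle and current is not None:
            hours, minutes, seconds = (int(part) for part in idle.groups())
            idle_seconds = hours * 3600 + minutes * 60 + seconds
            if current["dir"] == "in" and idle_seconds >= min_idle_seconds:
                ip, _, port = current["source"].rpartition(":")
                results.append((ip, int(port), idle_seconds))
            current = None
    return results


if __name__ == "__main__":
    for ip, port, idle_seconds in stale_inbound(SAMPLE):
        print(ip, port, idle_seconds)
```

The resulting IP/port pairs can then be fed to whatever query the netflow database supports.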
Onto #2: The conn-failures do not increase for every connection. This value increases at a rate of 2-3% of the total number of connections (whether or not I'm doing load testing). Since there are so many connections going to the server (500K+/day), I cannot see if the counter is increasing on the open rather than the close. However, there does seem to be a loose correlation between a stale connection hitting 00:59:59 idle and when the connection failure counter increases - could you perhaps find out if the counter does increment when a connection is forcefully expired by the ACE?
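One rough way to test whether #2 could account for #1 is a back-of-the-envelope estimate: if every failed connection lingered for the full idle timeout, how many lingering entries would that produce at steady state? The Python sketch below uses the figures from this thread (500K connections/day, 2-3% failures, one-hour timeout) and assumes an even arrival rate over the day, which real traffic will not have; the function name is made up for illustration.

```python
def expected_lingering(conns_per_day, lingering_fraction, idle_timeout_hours=1.0):
    """Steady-state count of lingering entries (arrival rate x time held).

    Assumes connections arrive evenly over 24 hours and that each lingering
    connection is held for the full idle timeout. Both are simplifications.
    """
    conns_per_hour = conns_per_day / 24
    return conns_per_hour * lingering_fraction * idle_timeout_hours


if __name__ == "__main__":
    for fraction in (0.02, 0.03):
        estimate = expected_lingering(500_000, fraction)
        print(f"{fraction:.0%} lingering -> about {estimate:.0f} entries")
```

If the number of stale entries shown by the ACE is far above this estimate (roughly 400 to 600 with these inputs), failed connections alone would not explain the stale connections.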
Sniffing packets is going to require me to install another physical server at the site (which is 10000km away), so I will have to put that on hold for now. I am not able to generate connection errors from a few test servers, whether I connect to the rservers directly, or through the load balancer.