function of health probe timers on CSM

thorsten.steffen · 04-22-2009

Hi,

We use the following configuration on a CSM to monitor a server farm, and I'm wondering how exactly the probe timers work.

===

serverfarm sf
 nat server
 nat client natpool1
 failaction purge
 real name serv1
  weight 1
  inservice
 real name serv2
  weight 1
  inservice
 probe probe1

probe probe1 script
 script LDAP_PROBE
 interval 5
 retries 2
 receive 1
 port 389

===

As I understand it, probes are sent every 5 seconds. If a probe isn't answered within one second, it is marked as failed. After two failed probes (retries 2), the real server is marked as down.

Is this correct?

In a network trace I see a different behaviour: Probes are sent every 5 seconds. If a real server goes out-of-service I see a probe which is not answered and the next probe is sent after 10 seconds (I expected 5 seconds). 5 seconds later the real server is marked down in the switch log.

I would appreciate any help.

Best Regards,

Thorsten Steffen

dario.didio · 04-27-2009

Hi,

Here is the meaning of the parameters:

Router(config-slb-probe)# interval seconds

Sets the interval between probes in seconds (from the end of the previous probe to the beginning of the next probe) when the server is healthy.

Range = 2-65535 seconds

Default = 120 seconds

Router(config-slb-probe)# retries retry-count

Sets the number of failed probes that are allowed before marking the server as failed.

Range = 0-65535

Default = 3

Router(config-slb-probe)# failed failed-interval

Sets the time between health checks when the server has been marked as failed. The time is in seconds.

Range = 2-65535

Default = 300 seconds

Router(config-slb-probe)# open open-timeout

Sets the maximum time to wait for a TCP connection. This command is not used for any non-TCP health checks (ICMP or DNS).

Range = 1-65535

Default = 10 seconds

There are two different timeout values: open and receive. The open timeout specifies how many seconds to wait for the connection to open (that is, how many seconds to wait for the SYN/ACK after sending the SYN). The receive timeout specifies how many seconds to wait for data to be received (that is, how many seconds to wait for an HTTP reply after sending a GET/HEAD request). Because TCP probes close as soon as they open without sending any data, the receive timeout is not used for them.
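
The difference between the two timeouts can be illustrated outside the CSM with a small Python sketch. This is not how the CSM implements its probes; `tcp_probe`, `silent_listener`, and the result strings are hypothetical names chosen for illustration. The listener completes the TCP handshake (the kernel does this for any listening socket) but never sends application data, which is similar to a server whose application has hung.

```python
import socket


def tcp_probe(host, port, payload, open_timeout, receive_timeout):
    """Return 'open-timeout', 'refused', 'receive-timeout', 'closed' or 'ok'."""
    try:
        # Phase 1: wait at most open_timeout for the handshake (SYN -> SYN/ACK).
        sock = socket.create_connection((host, port), timeout=open_timeout)
    except socket.timeout:
        return "open-timeout"
    except OSError:
        return "refused"
    with sock:
        sock.sendall(payload)
        # Phase 2: the handshake is done; now wait for application data.
        sock.settimeout(receive_timeout)
        try:
            data = sock.recv(4096)
        except socket.timeout:
            return "receive-timeout"
        return "ok" if data else "closed"


def silent_listener():
    """Listening socket that accepts the handshake but never replies."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv


if __name__ == "__main__":
    srv = silent_listener()
    port = srv.getsockname()[1]
    print(tcp_probe("127.0.0.1", port, b"request", 1.0, 0.5))
    srv.close()
```

In this sketch the handshake succeeds immediately, so only the second timer can expire. A short open timeout would not shorten the wait for a reply that never arrives.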

When sniffing, you should see a probe every 5 seconds. When a probe fails for the first time, a second probe should be sent after 5 seconds. When this probe fails too, the server is taken out of service.

That is the behaviour you should see.

HTH,

Dario

thorsten.steffen · 05-06-2009

Hi Dario,

I took another trace with Wireshark, and while the server was failing I saw the following packets:

====

Second 0: TCP Handshake and LDAP Bind Request from CSM to Real Server -> Real Server acks the LDAP Request but does not send an answer because the LDAP service has failed

Second 5: TCP FIN from CSM to Real Server

Second 10: Next TCP Handshake and LDAP Bind Request from CSM to Real Server -> same behaviour as above

Second 15: TCP FIN from CSM to Real Server

Second 15: Syslog Message with health probe failed

Second 315: Next TCP Handshake and LDAP Bind Request from CSM to Real Server and Bind Response from the Real Server which is alive again.

====

So it seems to me that the receive timer does not work as expected: the CSM waits 5 seconds (instead of the configured 1 second) before it closes a session in which it received no LDAP response.
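
The timestamps in the trace are consistent with a simple model, sketched below in Python. The model is an assumption, not confirmed CSM behaviour: each unanswered probe occupies `wait` seconds, `interval` runs from the end of one probe to the start of the next (as in the parameter description earlier in the thread), the server is marked down after `retries` consecutive failures, and the next check follows after the failed interval (300 seconds by default, since `failed` is not set in the configuration). The function name `probe_timeline` and its parameter names are hypothetical.

```python
def probe_timeline(interval, wait, retries, failed_interval):
    """Return (probe_start_times, down_at, first_recheck_at), all in seconds.

    Assumed model (not confirmed CSM internals):
    - an unanswered probe ends `wait` seconds after it starts;
    - `interval` is counted from the end of one probe to the start of the next;
    - the server is marked down once `retries` consecutive probes have failed;
    - the first recheck starts `failed_interval` seconds after that.
    """
    starts = []
    t = 0
    for attempt in range(retries):
        starts.append(t)
        t += wait                  # probe gives up (FIN in the trace)
        if attempt < retries - 1:
            t += interval          # idle gap before the next probe
    down_at = t
    return starts, down_at, down_at + failed_interval


# Observed 5-second wait: matches the trace (probes at 0 and 10, down at 15,
# next probe at 315).
print(probe_timeline(interval=5, wait=5, retries=2, failed_interval=300))

# What a 1-second wait (the configured "receive 1") would give in this model.
print(probe_timeline(interval=5, wait=1, retries=2, failed_interval=300))
```

With a 5-second wait the model returns probe starts at 0 and 10, a down event at 15, and a recheck at 315, which are the values in the trace. With a 1-second wait it would return starts at 0 and 6 and a down event at 7, which is not what the trace shows.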

Do you have any idea concerning this behaviour?

Furthermore, does the receive timer include the TCP handshake time, or does it start once the handshake is done? In the latter case, is it correct that we should also use the open timer to guard against long TCP handshake times?

Best Regards,

Thorsten

dario.didio · 05-06-2009

Hi,

As I said in my previous post, the receive timer is only used with an HTTP probe.

"The receive timeout specifies how many seconds to wait for data to be received (that is, how many seconds to wait for an HTTP reply after sending a GET/HEAD request). Because TCP probes close as soon as they open without sending any data, the receive timeout is not used."

You could use the open timer to guard against long TCP handshakes, because it covers the time needed to receive the SYN/ACK after the SYN is sent.

HTH,

Dario

thorsten.steffen · 05-06-2009

Hello Dario,

thanks for the hint.

So is there any way to configure the receive timeout (which seems to be 5 seconds) for the LDAP response, or more generally for responses from applications other than HTTP, once the TCP handshake has completed successfully?

Regards,

Thorsten