CSS 11503 Load Balancing Verification

jfraasch · 04-08-2010

Alright, so I have toiled long and hard to get this right. I think I have the config down, but I am unsure how to verify that the load balancing is working as intended.

Here is the Content Config that I am speaking of:

content cad-rule
    add service wls1-e0
    add service wls1-e1
    add service wls2-e0
    add service wls2-e1
    add service wls3-e0
    add service wls3-e1
    add service wls4-e0
    add service wls4-e1
    add service wls5-e0
    add service wls5-e1
    add service wls6-e0
    add service wls6-e1
    arrowpoint-cookie expiration 00:00:15:00
    advanced-balance arrowpoint-cookie
    redundant-index 2
    vip address 172.30.194.195 range 2
    arrowpoint-cookie name TOQ
    protocol tcp
    port 8001
    url "/*"
    active

Each service in the rule above is configured as follows:

service wls1-e1
    port 8001
    protocol tcp
    strin ags001-e1
    ip address 172.30.193.81
    keepalive type http
    keepalive uri "/cad/index.html"
    redundant-index 12
    keepalive frequency 20
    keepalive maxfailure 10
    keepalive retryperiod 2
    active

I am using the advanced arrowpoint cookies because I need some stickiness here. Straight round-robin would not have done what I needed it to do.

Now, when I run show summary, this is what I see for this rule:

                 cad-rule    Master   wls1-e0 84274
                                      wls1-e1 13144
                                      wls2-e0 96884
                                      wls2-e1 26374
                                      wls3-e0 71145
                                      wls3-e1 16592
                                      wls4-e0 76403
                                      wls4-e1 8657
                                      wls5-e0 118623
                                      wls5-e1 22760
                                      wls6-e0 30836
                                      wls6-e1 20464

The far right column shows the service hits. I originally had the e1 services suspended and activated them later on. So if this were true round robin, all the e0 services should have the same number of hits and all the e1 services should have the same number of hits. But as you can see, the wls5 server is getting hit the most while the wls6 server is sitting there twiddling its thumbs.

Now, understanding how the arrowpoint cookies do their load balancing (inserting a cookie into the flow and then timing out after 15 minutes as configured above), I would not expect a 1:1 ratio of load balancing between servers. But the distribution above seems rather extreme.
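To put a number on "rather extreme", the hits can be totaled per server and turned into percentages. This is only a rough Python sketch run off-box, using the counts from the show summary output above. It assumes that each e0/e1 pair belongs to the same physical server, and the function name is just illustrative:

```python
# Hit counts copied from the "show summary" output above.
hits = {
    "wls1-e0": 84274, "wls1-e1": 13144,
    "wls2-e0": 96884, "wls2-e1": 26374,
    "wls3-e0": 71145, "wls3-e1": 16592,
    "wls4-e0": 76403, "wls4-e1": 8657,
    "wls5-e0": 118623, "wls5-e1": 22760,
    "wls6-e0": 30836, "wls6-e1": 20464,
}


def per_server_totals(service_hits):
    """Sum e0 and e1 hits per server (assumption: the name before the dash is the server)."""
    totals = {}
    for service, count in service_hits.items():
        server = service.split("-")[0]
        totals[server] = totals.get(server, 0) + count
    return totals


totals = per_server_totals(hits)
grand_total = sum(totals.values())
even_share = 100.0 / len(totals)  # what a perfectly even split would give each server

for server in sorted(totals):
    share = 100.0 * totals[server] / grand_total
    print(f"{server}: {totals[server]:>7} hits, {share:5.1f}% (even split: {even_share:.1f}%)")
```

With these counts, wls5 ends up with roughly a quarter of all hits and wls6 with under a tenth, against an even split of about 17% each.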

Does anyone have any suggestions on how to A) verify that this is the right config and B) show my boss that this is working the way it should?

Thanks!

James

jfraasch · 04-08-2010

I just came across this link:

http://www.cisco.com/en/US/products/hw/contnetw/ps789/products_tech_note09186a00801c8c2f.shtml

It suggests that my cookie timeout might be ONE YEAR instead of 15 minutes.

Here is my show ver:

CSS11503# sh ver
Version:               sg0740004 (07.40.0.04)
Flash (Locked):        07.30.2.03
Flash (Operational):   07.40.0.04
Type:                  PRIMARY
Licensed Cmd Set(s):   Standard Feature Set

Looks like I might be able to remove the cookie timeout to get the load balancing in better order...

Thoughts?

James

Diego Vargas · 04-08-2010

Hi James,

There are several reasons for the uneven load balancing that you are seeing (based on the show summary). First of all, the CSS is configured to do stickiness (advanced-balance).

With the arrowpoint-cookie method of stickiness (HTTP only), only requests that arrive with the same cookie are stuck to the same server. Since the cookie is lost when the browser is closed (or when it expires), the stickiness is session based: if the same client opens a new session, it is load balanced again.

It is important to understand that with stickiness, no truly even load balancing is going to happen, since we are sticking new flows to the same server. That holds even though layer 5 stickiness permits more even balancing than layer 3 stickiness (source IP based).
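To illustrate why stickiness skews the counters, here is a minimal Python simulation. It is not CSS code: the dictionary merely stands in for the server information that the cookie carries, and the client names and request counts are made up.

```python
from itertools import cycle

servers = ["wls1", "wls2", "wls3"]
next_server = cycle(servers)  # round robin, used only for clients with no cookie yet
cookie_to_server = {}         # stand-in for the server identity carried in the cookie
hits = {name: 0 for name in servers}


def handle_request(client_id):
    """Route to the server the client is stuck to, or pick one round robin if it is new."""
    if client_id not in cookie_to_server:
        cookie_to_server[client_id] = next(next_server)
    server = cookie_to_server[client_id]
    hits[server] += 1
    return server


# Three clients, one far busier than the others (hypothetical numbers).
requests_per_client = {"client-a": 500, "client-b": 20, "client-c": 30}
for client, count in requests_per_client.items():
    for _ in range(count):
        handle_request(client)

print(hits)  # {'wls1': 500, 'wls2': 20, 'wls3': 30}
```

Each client was assigned fairly, one per server, yet the hit counters differ by more than an order of magnitude because one client is simply busier.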

Also consider that "show summary" shows the hits (requests) balanced to a specific server. It is a good command for checking the load balancing, but the CSS balances connections (flows), and a persistent connection can carry many requests. All of those requests go to the same server (incrementing the hit counter), while a non-persistent connection carries just one request (refer to HTTP persistence).
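The difference between balanced connections and counted hits can be sketched in a few lines of Python. Again, this is only an illustration with made-up numbers, not a model of the CSS internals:

```python
from itertools import cycle

servers = ["wls1", "wls2", "wls3"]

# Requests carried by each connection, in arrival order (hypothetical numbers).
# 1 models a non-persistent connection; larger values model a persistent one.
requests_per_connection = [1, 40, 1, 1, 35, 1]

connections = {name: 0 for name in servers}
hits = {name: 0 for name in servers}

for server, requests in zip(cycle(servers), requests_per_connection):
    connections[server] += 1  # the balancing decision is made once per connection
    hits[server] += requests  # every request on that connection counts as a hit

print("connections:", connections)  # 2 per server
print("hits:", hits)                # wls2 has 75, the others 2 each
```

The connections are split perfectly evenly, but the hit counters are not, because two of the connections happened to be long-lived.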


Also keep in mind that if a service is taken out for maintenance, is added to the load balancing later than the others, or goes down for a period of time, the CSS balances among the remaining alive servers. When you add the server back, the other servers already have connections established. Since the CSS is doing round robin, the server added last will never have the same number of connections (or hits) as the others: while one could have 55, for example, the new one is getting its first connection, and when the first one gets its 56th, the other gets its second, and so on.
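The 55-connection example can be checked with a small Python sketch. It assumes plain round robin between two services and cumulative counters; the names are hypothetical:

```python
def round_robin(counts, order, new_connections):
    """Hand out new_connections one at a time, cycling through the services in order."""
    for i in range(new_connections):
        counts[order[i % len(order)]] += 1
    return counts


# The existing service already has 55 connections; the late one starts at 0.
counts = {"existing": 55, "late": 0}
round_robin(counts, ["existing", "late"], 1000)

print(counts)                               # {'existing': 555, 'late': 500}
print(counts["existing"] - counts["late"])  # 55
```

However many connections arrive afterwards, the gap between the two counters stays at 55, so the totals never converge even though every new connection is balanced evenly.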

Please let me know if this makes any sense.

Diego M

jfraasch · 04-09-2010

Diego,

Thanks for the response.

I think we are both on the right path. Cookie-based load balancing will never be equal. It could be that the user who is currently "stuck" to server 5 will simply move over to another server and crush it when the cookie expires.

Such is the dilemma with having users "stuck".

I will continue to monitor things. Even if I remove the timeout and reset the cookie each time a browser closes, whatever user is causing the tremendous load will just reconnect to another server and do the same thing there when his browser reopens.

Maybe the question is more about why the user is causing more havoc than others.

Thanks again.

James