ACE: problem with load balancing of RADIUS flows

k-gerasymenko · ‎05-16-2010

Hello!

There is one RADIUS client (IP: 10.10.10.60) and two RADIUS servers (IP: 10.10.10.15 and 10.10.10.16).

The ACE module is needed to balance RADIUS requests between the RADIUS servers. The requests
from the one RADIUS client must be balanced across the two RADIUS servers based exclusively
on their "calling-station-id".
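
The requirement above (the same calling-station-id always goes to the same server) can be sketched in Python as a small sticky table. This is only an illustration of the idea, not ACE code: the class name `StickyTable`, the round-robin fallback for new keys, and the short calling-station-id values are assumptions made for the example.

```python
from itertools import cycle


class StickyTable:
    """Remember which server each Calling-Station-Id was first sent to."""

    def __init__(self, servers):
        self._next_server = cycle(servers)  # assumed fallback for new keys
        self._entries = {}

    def pick(self, calling_station_id):
        # A new key gets the next server in rotation; a known key is
        # always sent back to the server it was bound to first.
        if calling_station_id not in self._entries:
            self._entries[calling_station_id] = next(self._next_server)
        return self._entries[calling_station_id]


table = StickyTable(["10.10.10.15", "10.10.10.16"])
# Hypothetical calling-station-id values, for illustration only.
picks = [table.pick(cid) for cid in ["0501", "0502", "0501", "0503", "0502"]]
print(picks)
```

Repeated values ("0501", "0502") land on the server they were first bound to, while a new value ("0503") takes the next server in rotation.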

Right now I am testing this scheme with the ACE module, but there is another problem.
The ACE module in the Cisco 7604 is connected to the network in "one-arm" mode. When the RADIUS client sends requests,
the ACE terminates them and answers the client with success. The ACE balances these requests between the two RADIUS servers in a proportion of approximately 2/3.
But these calls fail on the server side. Sticky has not helped. Without the ACE, all requests are successful.

Is there a solution to this problem?

My config is:

access-list ANY line 8 extended permit ip any any

rserver host TEST1
ip address 10.10.10.15
inservice
rserver host TEST2
ip address 10.10.10.16
inservice

serverfarm host SERVERFARM1
rserver TEST1
inservice
rserver TEST2
inservice

class-map type management match-any MGMT_CLASS
2 match protocol icmp any
3 match protocol ssh any
4 match protocol telnet any
class-map match-any RADIUS_L4
2 match virtual-address 10.10.10.100 udp range 1812 1813
class-map type radius loadbalance match-any RADIUS_L7
2 match radius attribute calling-station-id ".*"

policy-map type management first-match MGMT_POLICY
class MGMT_CLASS
permit

policy-map type loadbalance radius first-match RADIUS_L7_POLICY
class RADIUS_L7
serverfarm SERVERFARM1

policy-map multi-match RADIUS
class RADIUS_L4
    loadbalance vip inservice
    loadbalance policy RADIUS_L7_POLICY
    loadbalance vip icmp-reply active
    nat dynamic 1 vlan 10

interface vlan 10
ip address 10.10.10.10 255.255.255.0
access-group input ANY
access-group output ANY
nat-pool 1 10.10.10.100 10.10.10.100 netmask 255.255.255.0 pat
service-policy input MGMT_POLICY
service-policy input RADIUS
no shutdown

Best regards

Konstantyn

Gilles Dufour · ‎05-17-2010

You haven't configured stickiness so far.

You are just matching RADIUS requests that contain a "calling-station-id".

Requests not having a calling-station-id will get dropped.
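
To illustrate the point about the match, here is a minimal Python sketch (not ACE code) of a first-match policy with a single class that requires the attribute. The function name `select_serverfarm`, the dictionary form of a request, and the sample attribute values are hypothetical; the sketch simply assumes, as stated above, that a request matching no class is dropped.

```python
import re

# Hypothetical model of the policy: one class that matches any request
# carrying a calling-station-id attribute (regex ".*").
CLASSES = [("RADIUS_L7", "calling-station-id", re.compile(r".*"), "SERVERFARM1")]


def select_serverfarm(attributes):
    """Return the serverfarm of the first matching class, or None (drop)."""
    for _name, attr, pattern, farm in CLASSES:
        value = attributes.get(attr)
        # The class can only match when the attribute is present at all.
        if value is not None and pattern.fullmatch(value):
            return farm
    return None  # no class matched: the request is not load balanced


with_id = select_serverfarm(
    {"user-name": "alice", "calling-station-id": "380501234567"}
)
without_id = select_serverfarm({"user-name": "bob"})
print(with_id, without_id)
```

In this model the regex ".*" accepts any value, so the only requests left out are those where the attribute is missing entirely.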

Did you try simple load balancing without RADIUS?

Did you get a sniffer trace in the current situation to see where it fails?

If the request gets to the server, where is the response from the server sent?

Is it correctly sent to the ACE?

Is ACE then correctly forwarding the response to the source?

Gilles.

Filip Talpa · ‎05-17-2010

From my experience, round-robin tends to end up in unequal distribution. I prefer using the least-connections predictor on the serverfarm.
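
A generic Python sketch of the two predictors may help show the difference. It is not how ACE implements them, and the starting load (TEST1 already holding 5 connections) is a made-up assumption for the example.

```python
from itertools import cycle


def round_robin(servers):
    """Return an iterator that rotates over servers, ignoring their load."""
    return cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections (ties: by name)."""
    return min(sorted(active), key=lambda name: active[name])


# Hypothetical load: TEST1 is already holding 5 long-lived connections.
active = {"TEST1": 5, "TEST2": 0}

rr = round_robin(["TEST1", "TEST2"])
rr_picks = [next(rr) for _ in range(4)]

lc_picks = []
for _ in range(4):
    choice = least_connections(active)
    active[choice] += 1  # the new connection now counts against that server
    lc_picks.append(choice)

print(rr_picks)
print(lc_picks)
```

Round-robin keeps alternating regardless of the existing load, while least-connections sends the new connections to the idle server until the counts even out.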

k-gerasymenko · ‎05-18-2010

Hi

Here is another config with sticky, but the result is the same.

Two RADIUS servers are active. For example, the RADIUS client sends 10 requests and all of them are successful. On the RADIUS server side I see that 15 calls have been received (6 calls on TEST1, 9 calls on TEST2) and only 8 calls are successful (3 on TEST1, 5 on TEST2). The log files on the servers show: call timeout detecting.

If one of the RADIUS servers is down, all calls are successful and the number of calls is the same on the RADIUS server as on the RADIUS client.

access-list ANY line 8 extended permit ip any any

probe icmp PROBE_ICMP
interval 2
faildetect 2
receive 5
probe radius PROBE_RADIUS

rserver host TEST1
ip address 10.10.10.15
inservice
rserver host TEST2
ip address 10.10.10.16
inservice

serverfarm host SERVERFARM1
predictor leastconns
probe PROBE_ICMP
rserver TEST1
inservice
rserver TEST2
inservice

sticky radius framed-ip calling-station-id STICKY-1
serverfarm SERVERFARM1

class-map type management match-any MGMT_CLASS
2 match protocol icmp any
3 match protocol ssh any
4 match protocol telnet any
class-map match-any RAD_L4_C
2 match virtual-address 10.10.10.100 udp range 1812 1813
class-map type radius loadbalance match-all RAD_L7_C
2 match radius attribute calling-station-id ".*"

policy-map type management first-match MGMT_POLICY
class MGMT_CLASS
permit

policy-map type loadbalance radius first-match RAD_L7_P
class RAD_L7_C
sticky-serverfarm STICKY-1

policy-map multi-match POLICY_L7
class RAD_L4_C
    loadbalance vip inservice
    loadbalance policy RAD_L7_P
    loadbalance vip icmp-reply active
    nat dynamic 1 vlan 10

interface vlan 10
ip address 10.10.10.10 255.255.255.0
access-group input ANY
access-group output ANY
nat-pool 1 10.10.10.100 10.10.10.100 netmask 255.255.255.0 pat
service-policy input MGMT_POLICY
service-policy input POLICY_L7
no shutdown

Gilles Dufour · ‎05-18-2010

You still have no idea what is happening in the network.

ACE is a network device, so we do not need to know what the server reports, but we need to know what packets come in, what packets go out, and what the content of each packet is.

So we need a sniffer trace.

Also, you keep talking about calls?

Is the call traffic also going through ACE?

G.

Filip Talpa · ‎05-18-2010

This is IMO where Cisco gets it all wrong: YOU need to understand the application behind it. After all, with this approach any server is just a network device.

Back on topic: I'll check if stickiness really works as it is supposed to. From the wait for call disconnect, one can assume that the server never gets the call-cleared event. (@g - see, no sniffer required...)

Gilles Dufour · ‎05-19-2010

For those following this thread and interested in getting better at troubleshooting, this is a perfect example of why people fail to solve their problems.

They start from a server error message and try to understand what happened somewhere in the network.

Since they can't figure out what caused the error, they start changing the configuration in all directions without knowing the source of the problem.

Even if by luck they change the right component, they would still be unable to explain why it fixed the problem.

This is a very common error.

The best way to troubleshoot a network device is to capture a sniffer trace.

Why?

Because a network device is there to transport traffic (most often TCP/IP)... and usually knowing the details about the application is useless.

In some cases, it is indeed necessary to go deeper into the packet in order to make a more intelligent routing/switching decision.

For example, when you do HTTP cookie stickiness or RADIUS load balancing.

But even in that case, the network device will work on a packet per packet basis and for each of them decide what to do with it.

With the sniffer trace you can first compare successful client-server exchanges to failed ones.

You can then see what is different (asymmetric path? traffic misrouted? traffic blocked? packet corrupted? fragmentation? ...)

These are all network problems which can lead to many different error messages on the server.

Messages that will be different depending on the application, the hardware, the OS, the vendor, ...
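
As an illustration of the comparison described above, the following Python sketch parses raw RADIUS payloads and lists requests that never received a reply. It assumes the UDP payloads have already been extracted from the capture (that step is not shown), and it pairs packets by the RADIUS identifier only, which is a simplification since identifiers are really scoped per client address and port. The helper names (`build_packet`, `parse_packet`, `unanswered`) and the sample values are hypothetical.

```python
import struct

CALLING_STATION_ID = 31   # RADIUS attribute type (RFC 2865)
REQUEST_CODES = {1, 4}    # Access-Request, Accounting-Request


def build_packet(code, identifier, attributes=()):
    """Build a RADIUS packet with a zeroed authenticator (test data only)."""
    body = b"".join(bytes([t, len(v) + 2]) + v for t, v in attributes)
    return struct.pack("!BBH16s", code, identifier, 20 + len(body), b"") + body


def parse_packet(data):
    """Return (code, identifier, {attribute type: value}) from raw bytes."""
    code, identifier, length, _auth = struct.unpack("!BBH16s", data[:20])
    end = min(length, len(data))
    attrs, pos = {}, 20
    while pos + 2 <= end:
        attr_type, attr_len = data[pos], data[pos + 1]
        if attr_len < 2:
            break  # malformed attribute; stop rather than loop forever
        attrs[attr_type] = data[pos + 2:pos + attr_len]
        pos += attr_len
    return code, identifier, attrs


def unanswered(packets):
    """Map identifier -> Calling-Station-Id for requests with no reply."""
    pending = {}
    for raw in packets:
        code, identifier, attrs = parse_packet(raw)
        if code in REQUEST_CODES:
            pending[identifier] = attrs.get(CALLING_STATION_ID)
        else:
            pending.pop(identifier, None)  # a reply clears the request
    return pending


# Hypothetical trace: request 10 is answered, request 11 is not.
trace = [
    build_packet(1, 10, [(CALLING_STATION_ID, b"380501111111")]),
    build_packet(2, 10),
    build_packet(1, 11, [(CALLING_STATION_ID, b"380502222222")]),
]
missing = unanswered(trace)
print(missing)
```

Running the same check on captures taken on the client side and on the server side would show on which leg a request or its reply goes missing.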

k-gerasymenko · ‎05-18-2010

Of course, there is some log.

One thing is certainly clear from the test: the RADIUS client and the RADIUS server work correctly when connected directly. In this case all calls are successful.

I have run the sniffer. It showed that everything in the RADIUS packets is all right.

Why do I see more calls on the server side than the client has sent? It would be good to run a debug on the ACE, if it has one.

Filip Talpa · ‎05-18-2010

What does the output of show sticky database show?

Well, ACE has a debug mode, but it requires special code to be loaded, much like on Nexus switches.

Gilles Dufour · ‎05-19-2010

Can you share the sniffer trace?

Do you know which flow in the trace the server reported as a failure?

Gilles.

k-gerasymenko · ‎05-19-2010

I have found the following in the ACE documentation:

"The ACE does not load balance RADIUS accounting on/off messages. Instead, it
replicates those messages to each real server in the server farm that is configured
in the RADIUS LB policy."
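
The behaviour quoted above can be modelled in a few lines of Python to show how replication alone makes the servers receive more messages than the client sent. This is a sketch under assumptions, not a reproduction of the test in this thread: the message mix is invented, and the alternating choice for ordinary messages is only a stand-in for the predictor or sticky decision.

```python
ACCT_STATUS_ON = 7   # Acct-Status-Type: Accounting-On (RFC 2866)
ACCT_STATUS_OFF = 8  # Acct-Status-Type: Accounting-Off (RFC 2866)

SERVERS = ["TEST1", "TEST2"]


def deliver(messages, servers=SERVERS):
    """Count messages per server: on/off go to all, others to one server."""
    received = {name: 0 for name in servers}
    for index, acct_status in enumerate(messages):
        if acct_status in (ACCT_STATUS_ON, ACCT_STATUS_OFF):
            for name in servers:  # replicated to every real server
                received[name] += 1
        else:
            # Placeholder choice; the real device would use its predictor
            # or sticky table here.
            received[servers[index % len(servers)]] += 1
    return received


# Hypothetical mix sent by one client: None marks a non-accounting request,
# 1 and 2 are Acct-Status-Type Start and Stop.
sent = [ACCT_STATUS_ON, None, 1, 2, None, ACCT_STATUS_OFF]
received = deliver(sent)
print(len(sent), sum(received.values()), received)
```

With two servers, every accounting on/off message is counted twice on the server side, so the server totals exceed the client total whenever such messages are present.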