I have about 50 routers and layer-3 switches that authenticate via TACACS+. The AAA server used to be a Linux machine running open-source tacacs+ that I built myself. I have a multi-threaded Perl script that logs into all 50 devices at the same time to collect statistics. Up to that point, everything worked fine.
I recently outsourced the AAA function to a 3rd party company, not by my choice. The 3rd party uses Cisco ACS version 4.2 with the latest patch, running on Windows 2003 Enterprise Server on IBM x3650-M2 hardware with 16GB RAM and four quad-core processors. The connectivity between the 3rd party and my company is through a DS-3 connection. Peak bandwidth usage over this DS-3 connection is less than 10Mbps.
For the past 3 months I have seen multiple failures with this Perl script due to authentication failures against the ACS server. If I run the script against just a few routers/switches, there are no issues; however, whenever I start the script to log into all 50 devices at the same time, it fails. If I configure all routers/switches to point back to the old open-source tacacs+ server, the issue goes away. The minute I switch back to the new ACS server, the issue comes back. If I modify the script to hit one device at a time, it works fine. I think the ACS server cannot handle a lot of AAA requests at the same time.
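A middle ground between "all 50 at once" and "one at a time" would be to cap how many logins run concurrently. The following is only a minimal sketch of that idea, written in Python rather than the Perl of the real script. `collect_stats`, the host names, and the cap of 5 are hypothetical placeholders, not values from the actual setup.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_stats(host):
    """Hypothetical placeholder: log in to `host` and return its statistics."""
    return {"host": host, "ok": True}


def poll_devices(hosts, login=collect_stats, max_concurrent=5):
    """Poll every host, never running more than `max_concurrent` logins at once."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(login, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # e.g. an authentication failure
                failures[host] = exc
    return results, failures


if __name__ == "__main__":
    hosts = [f"router{i:02d}" for i in range(1, 51)]  # made-up device names
    ok, bad = poll_devices(hosts, max_concurrent=5)
    print(len(ok), "succeeded,", len(bad), "failed")
```

Raising `max_concurrent` step by step until failures reappear would also give a rough idea of where the server starts rejecting logins.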
Does anyone know how many concurrent connections ACS 4.2, with the latest patches on Windows 2003 Enterprise Server with lots of memory and CPU power, can handle? I can't seem to find this anywhere on Cisco's website.
Thanks in advance.
Is there any kind of Host IPS or Cisco Security agent installed on the Cisco ACS server running on Windows?
If so, check if there are any alerts for connections being blocked.
You may get some idea of concurrent connections from this link.
There is a known issue with authentications per second in v4.1.4, but I am not aware of anything in v4.2.
Bug id - CSCsd46457.
I think the ACS TACACS+ server by default has a limited number of connections. My memory is failing, but I think it might be about 40. So if you are doing 50 concurrently, that might be an issue. With the old ACS (pre 4.0) there were numerous registry tweaks that we could use to increase the max concurrency, but I'm not sure what Cisco do now, as it's all in SQL Anywhere and/or locked within the appliance.
So out of interest, what's the timeout/retry config on the devices? Maybe they need to be a little looser. It could just be that the devices are timing out too quickly. You should have at least 10 seconds.
Retry is 20 seconds.
If the number of connections is limited to 40, how is ACS scalable in a service provider environment where there are hundreds of customers and thousands of routers and switches that require AAA authentication? Are you saying that ACS 4.2 performs worse than the open-source tacacs+ that Cisco released years ago?
I wonder if Cisco has a fix for this. Thanks in advance.
No, I'm not saying ACS cannot cope.
Concurrency and latency are very different things. ACS CSTacacs can handle many 100s of simple authentications/authorisations per second with users in the internal database. If 1000s of devices all send traffic in the same instant it would take some seconds to work through the backlog of traffic.
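To put rough numbers on that backlog, here is a back-of-the-envelope sketch. The authentication rates are assumptions chosen to stand in for "many 100s per second", not measured ACS figures, and the 20-second timeout is the value quoted earlier in the thread.

```python
def backlog_drain_seconds(pending_requests, auths_per_second):
    """Rough time, in seconds, until the last queued request is served."""
    if auths_per_second <= 0:
        raise ValueError("auths_per_second must be positive")
    return pending_requests / auths_per_second


def last_request_times_out(pending_requests, auths_per_second, device_timeout_s):
    """True if the last request in the queue would wait past the device timeout."""
    wait = backlog_drain_seconds(pending_requests, auths_per_second)
    return wait > device_timeout_s


if __name__ == "__main__":
    for rate in (100, 200, 500):  # assumed rates, not measured values
        wait = backlog_drain_seconds(1000, rate)
        late = last_request_times_out(1000, rate, 20)
        print(f"1000 requests at {rate}/s: about {wait:.1f} s, times out at 20 s: {late}")
```

Under these assumptions a burst of 1000 requests clears in 2 to 10 seconds, which stays inside a 20-second timeout. The estimate ignores multi-message logins and retries, so it is only a first approximation.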
It is also worth considering that a limited number of tasks (or threads) within ACS can actually handle a much greater number of "logins", because logins are generally multi-message exchanges, allowing ACS to keep lots of plates spinning.
If users are in an external database, the latency (per authentication) can increase depending on where the users are (e.g. Windows AD) and, if bad enough, can have a serious effect on the overall authentication rate. At that point customers normally turn to load balancing.
If your device timeouts are 20 seconds (totally reasonable) I suggest the issue is more likely to be something else... a bug, perhaps specific to v4.2?
"If 1000s of devices all send traffic in the same instant it would take some seconds to work through the backlog of traffic."
I never had any issues with Cisco's open-source tacacs+. I can send traffic from 1000+ devices to the open-source tacacs+ box without any issues.
I was informed by my provider that other customers are also experiencing the same thing. I guess when we first started, we did not have any issues because we were the very first customer. As more and more customers are added, the ACS server cannot handle a large number of concurrent connections, I guess.
Wonder if this issue is fixed in ACS 5.1?
When we approached Cisco support about intermittent login issues with our Cisco ACS server, they told us the limit for 4.2 on a Windows box is 200 connections per box.
This is a hard limit, and your logs will show it if you are reaching it.
We have over 700 routers doing AAA authentication events simultaneously every 10 minutes (due to an EEM applet) on our network, hitting two ACS servers which are on the latest 4.2 patch release.
They aren't too terribly specific, but in our case it is TACACS+ authentication via the ACS server, which connects to our Active Directory domain for username and password verification.
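As a rough sanity check on those figures, the arithmetic can be sketched as below. It assumes each router opens one connection at the same instant and that the load splits evenly across the servers; neither is confirmed by Cisco or by our logs.

```python
import math


def per_server_load(devices, servers):
    """Simultaneous connections per server, assuming an even split (an assumption)."""
    return math.ceil(devices / servers)


def servers_needed(devices, limit_per_server):
    """Minimum number of servers so that no server exceeds its connection limit."""
    return math.ceil(devices / limit_per_server)


if __name__ == "__main__":
    devices, servers, limit = 700, 2, 200  # figures quoted in this thread
    print("per-server load:", per_server_load(devices, servers))        # 350
    print("servers needed for the burst:", servers_needed(devices, limit))  # 4
```

With 700 routers and two servers, the even-split load is 350 per server, well over the 200 limit, and the same burst would need at least four servers. Staggering the EEM applet timers would be another way to bring the peak down.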
If you have the software version of ACS, you could use tactest (in the bin folder) to fire individual authentications at ACS. I'm sure it has a loop function too, so you can send 1000 authentications in one hit. It's a bit clumsy, but using multiple instances of tactest you can hammer ACS until it breaks; just keep adding more tactests and use perfmon to watch the ACS authentication rate counters.
At the end of the day, ACS will be limited by how fast your AD server is running. This is MUCH slower than internal PAP/CHAP/MS-CHAP authentications and can take 10 to 100x longer.
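To illustrate why that slowdown matters, here is a minimal sketch of how a fixed pool of workers caps the authentication rate as latency grows. The worker count of 40 and the 10 ms internal latency are hypothetical numbers chosen for illustration, not ACS specifications; only the 10x and 100x multipliers come from the statement above.

```python
def max_auth_rate(workers, latency_ms):
    """Upper bound on authentications per second when each worker handles
    one authentication at a time and each takes `latency_ms` milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return workers * 1000 / latency_ms


if __name__ == "__main__":
    workers = 40      # hypothetical worker count
    internal_ms = 10  # assumed latency for an internal-database authentication
    for factor in (1, 10, 100):  # internal, then 10x and 100x slower
        rate = max_auth_rate(workers, internal_ms * factor)
        print(f"{factor:>3}x latency: at most {rate:.0f} auths/s")
```

Under these assumptions the ceiling drops from 4000 to 400 to 40 authentications per second, so a burst that clears easily against the internal database can back up badly when every check goes to AD.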