I have IPCC Enterprise version 7, CVP version 4.0.1, and CallManager 5.1.3. The equipment is split between two data centers in separate cities. I recently had a failure where a switch in one of the data centers had an interface start flapping, causing a stack to spike to 100%. My question is: when the switch started flapping, the agents started losing access, so they couldn't log in and calls were not getting to them. Why wouldn't the PG at the failing site just go unavailable? We had to manually fail the PG so that the system would start handling calls. Is there a setting on the PGs that handles this type of scenario?
Hope this makes sense. Any help or advice would be appreciated.
The PG will generally only fail over when it loses its connection to either a) the peripheral (CUCM) or b) the Central Controller (Router). If both of these connections remain up, it won't fail over automatically.
Are you using CAD or CTIOS Agents? I assume from the description that the Agent PGs are split between the two sites, in a Clustering over the WAN arrangement?
What I am trying to say is that I have two data centers, each with a PG in it. In one of the data centers, the switch that the PG was attached to had its stack port start flapping, which caused the interface to flap. The stack spiked to 100%, so the PG was going up and down, causing CAD at the agents to log off and on. We had to fail the PG manually to allow the other one to take over. So the question is: is there nothing in the PG config that says, "if I see a failure like this, take me offline and use the other PG for, say, 30 minutes, then try to become active again"? Does this make sense?
Sorry, I was actually asking bilashece when he wrote:
"Redundancy plan over sites on PG failover is not so wise, am i right?"
I understood your original post.
Please note the following:
1. Only one PG Side will be active - this includes everything from the PIM connection to the CUCM, through to OPC that communicates with the CTI GW.
2. Only one CG side will be active. The active CG will communicate to the active PG. Please note that the active CG side will switch when no agents are logged in, and that the active CG side may not be the same as the active PG side.
3. Both CTIOS sides will be active. These will both communicate with the active CG.
4. Both CAD sides will be active. These will communicate with both CTIOS servers, and with the active CG.
I think your issues resulted because the active PG & CG were at the other data centre, while the CAD agents were trying to connect to CTIOS and CAD at the data centre with the failed switch.
CAD will detect failures of CTIOS and CG, and will switch each client over to the other CTIOS and CG side. Generally speaking, the CAD client should then remain on the side it has connected to, until the agent attempts to login again, or another failure is detected.
The problem with a switch "flapping" issue is that it can be difficult for any system to detect as a complete failure. The reason for this is that each client and server sends heartbeat messages to the connected servers, and declares a network failure only when it misses several consecutive heartbeats.
When the switch port is "flapping", it is possible that you won't miss the required number of heartbeats, and therefore can't declare a failure.
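To make the point concrete, here is a minimal Python sketch of a detector that only declares a failure after a run of consecutive missed heartbeats. It is an illustration only, not the actual ICM or CAD logic: the function name, the threshold of 5, and the two loss patterns are all assumptions chosen for the example.

```python
def declares_failure(heartbeats, threshold=5):
    """Return True if `threshold` heartbeats in a row are missed.

    `heartbeats` is a sequence of booleans: True = received, False = missed.
    Hypothetical model; the real product logic is not shown here.
    """
    missed_in_a_row = 0
    for received in heartbeats:
        if received:
            # Any heartbeat that gets through resets the counter.
            missed_in_a_row = 0
        else:
            missed_in_a_row += 1
            if missed_in_a_row >= threshold:
                return True
    return False


# Solid outage: every heartbeat is lost.
solid_outage = [False] * 10
# Flapping port (assumed pattern): three lost, one delivered, repeated.
flapping = [False, False, False, True] * 10

print(declares_failure(solid_outage))  # True
print(declares_failure(flapping))      # False
```

In the flapping case 75% of the heartbeats are lost, yet the counter never reaches 5 because one heartbeat keeps getting through and resetting it.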
So, basically, a partial failure is really hard to consistently detect and manage, because it would require the system to analyse the pattern over time.
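As a rough idea of what "analysing the pattern over time" could look like, the sketch below tracks the loss ratio over a sliding window of recent heartbeats instead of counting consecutive misses. This is purely hypothetical: nothing here is an ICM, PG, or CAD setting, and the class name, window size, and 50% loss limit are assumptions made up for the illustration.

```python
from collections import deque


class LossRateDetector:
    """Flags a link as unhealthy when too many recent heartbeats were lost.

    Hypothetical sketch; window size and loss limit are arbitrary.
    """

    def __init__(self, window=20, max_loss_ratio=0.5):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.max_loss_ratio = max_loss_ratio

    def observe(self, received):
        """Record one heartbeat result; return True if the link looks bad."""
        self.samples.append(received)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        lost = self.samples.count(False)
        return lost / len(self.samples) >= self.max_loss_ratio


def ever_flagged(pattern):
    detector = LossRateDetector()
    return any(detector.observe(received) for received in pattern)


flapping = [False, False, False, True] * 10   # assumed flapping pattern
occasional_loss = ([True] * 19 + [False]) * 5  # one loss in every twenty

print(ever_flagged(flapping))         # True
print(ever_flagged(occasional_loss))  # False
```

The trade-off is that a detector like this needs a full window of history before it can act, so it reacts more slowly than a simple consecutive-miss counter, and the window and limit still have to be tuned against normal traffic.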
Thank you. I agree with all you said, and it was put very well. I know we can deal with hardware failures; failures that are solid are great. But applications that misbehave, how can you even start to design error recovery around that? You can set an error threshold too low and it affects day-to-day operation.
I was just making sure that there were no parameters that might be adjustable, i.e. the heartbeat, so that instead of, let's say, 5 failures in 5 minutes before we take action, I could drop it to 5 failures in 3 minutes. Just an example, but thanks, you did a great job of confirming what I suspected.
I believe you can control the ICM behaviour in the specific area you mention through the registry, although I have not checked for that exactly. There is a ton of stuff in the registry. You could probably screw it up too. ;-)
Too true, Geoff. The challenge is tweaking the client registry settings to strike a balance between tolerating normal network behaviour (where one or two heartbeats may be delayed or lost every so often) and detecting a true network outage (which is why the settings are normally 5 consecutive heartbeats).
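A back-of-the-envelope calculation shows why lowering the consecutive-heartbeat threshold is risky. The Python sketch below estimates how many runs of lost heartbeats a healthy network would produce per day for different thresholds. All the numbers are assumptions for illustration only: heartbeats are taken to be lost independently with 1% probability, sent once per second, and none of this reflects actual ICM or CAD defaults other than the threshold of 5 mentioned above.

```python
def expected_false_triggers(loss_prob, threshold, heartbeats_per_day):
    """Expected number of windows of `threshold` consecutive lost heartbeats.

    Assumes each heartbeat is lost independently with probability
    `loss_prob` (a simplification; real loss tends to be bursty).
    """
    windows = max(heartbeats_per_day - threshold + 1, 0)
    return windows * loss_prob ** threshold


HEARTBEATS_PER_DAY = 86_400  # assumed: one heartbeat per second
LOSS_PROB = 0.01             # assumed: 1% random loss on a healthy network

for threshold in (2, 3, 5):
    per_day = expected_false_triggers(LOSS_PROB, threshold, HEARTBEATS_PER_DAY)
    print(f"threshold={threshold}: about {per_day:.2g} false triggers per day")
```

Under these assumptions a threshold of 2 would trip several times a day on a perfectly healthy network, while a threshold of 5 would almost never trip, which is the balance described above.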