Has anyone deployed IPCC Enterprise to support DR? If so, please share your experience with your deployment!!!
I am having a hard time getting my IPCC environment to work as designed. I have a duplexed IPCC Enterprise 6.0 environment with side A and side B at separate locations, connected via a full DS3 MPLS connection, and I cannot get the IVRs to go active when PG1a is offline. When I shut down PG1a, the IVR PIMs on PG1b do not go active for some reason and I get a busy signal on all of my toll-free numbers.
Please share your experience if you have run into an environment like mine!!! Thanks very much in advance!!!
We are running IPCC Enterprise 7.1.5 with duplexed system PGs and dual IVRs. Side A is located at our primary site, and side B is at our DR location. We have a DWDM connection, so straight fiber to our DR location.
Do you have a JTAPI connection for side B configured in CallManager? When side A is active, what do your process windows reflect (in regard to side B)?
I have a split CCM cluster, so I do have 2 CCM Subs at the DR site, and PG1b and IVR2 are talking to the CCM Sub at the DR site. Side A is always active and side B shows idle as far as the processes are concerned. I tested other components and they worked when side A was offline. When I shut down RoggerA, RoggerB goes active and everything still works, and the same with IVR1: when IVR1 goes offline, IVR2 goes active.
For some reason the IVRs don't go active when PG1a is offline.
My next question is: in a duplexed environment, when you run setup on the PG, do you have to select the preferred side (A or B), or do you select No side preference?
Thank you very much !! I appreciate your response !!!
Because you are running ICM v6.0, you have some restrictions in how the system is designed. I suspect that you might have a mis-configured system.
I'm assuming this design:
Site 1: RoggerA, PG1A, IVR1, CCM Sub1, CCM Sub2
Site 2: RoggerB, PG1B, IVR2, CCM Sub3
How many PIMs do you have configured on PG1A and on PG1B and what devices are they connected to?
Below is exactly what I have
Site 1: RoggerA, PG1A, IVR1, CCM Pub, CCM Sub1.
Site 2: RoggerB, PG1B, IVR2, CCM Sub2, CCM Sub3.
I agree with you that there might be a mis-configured setting or settings somewhere. I re-ran setup on the Roggers and the PGs but couldn't find anything.
I have 3 PIMs on each PG (CCM PIM, IVR1 PIM and IVR2 PIM). My setup is pretty straightforward, but I don't know what I have done wrong here. I am open to any suggestions!!
thank you very much !!! I appreciate the response !!!
You need an additional, dummy PG to achieve full redundancy. There is no need to have a PIM active on that one. Establish a side preference; this would allow the CC to stay up and connected to a majority of PGs if one side goes out of service. (Basically, your additional PG will connect to the active CC side, whether A or B, and allow it to keep working. This is fully described in the SRND.)
Second point: I hope the MPLS is used only for the connection between CC and PG. If it is also used for the private links, then your network configuration would not be supported, since it would not be able to meet the ICM high-priority traffic latency and failover requirements.
Third, what exactly is going out of service in the failover scenario? The PG? The PIMs? How is the PG seen by the CC? Is PGAgent still in service?
PS Remember to rate useful posts accordingly please.
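To make the "majority" reasoning in the post above concrete, here is a minimal Python sketch. It assumes a simplified rule in which a Call Router side stays up only while it can reach strictly more than half of the configured PGs. The function name and the PG counts are hypothetical, and how PGs are actually counted in a real deployment is covered in the SRND, not modeled here.

```python
def has_pg_majority(reachable_pgs: int, total_pgs: int) -> bool:
    """Assumed rule: strictly more than half of the configured PGs are reachable."""
    if total_pgs <= 0 or not 0 <= reachable_pgs <= total_pgs:
        raise ValueError("expected total_pgs > 0 and 0 <= reachable_pgs <= total_pgs")
    return 2 * reachable_pgs > total_pgs


# Hypothetical counts: two PGs configured, only one reachable after a site failure.
print(has_pg_majority(1, 2))  # False: exactly half is not a majority

# Same failure, but with an extra dummy PG that is still reachable.
print(has_pg_majority(2, 3))  # True
```

Under this assumed rule, the dummy PG matters only because it turns "exactly half" into "more than half" for the surviving side.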
Ok, so the basics first: we want to make sure that the PGs are correctly running in duplex. There are lots of ways to determine this, but we'll just look at a couple of simple indicators. We will look at two of the processes on each of the PGs.
- On the PGs, look at the title bar for the MDS (Message Delivery Service) process. One PG should show "InSvc PR-Enb Clk" and the other should show "InSvc Pr-Dsb Clk". We're looking for MDS to be in service and paired (Pr), with one MDS enabled and the other disabled. A normally functioning system (with event tracing at 0 - or normal) won't have many scrolling events in the process window.
- On the PGs, look at the title bar for the PGAG process. One PG should show "InSvc A:Active B: Idle" or "InSvc A:Idle B: Active" while the other PG will show "Not Active". This shows that one of the PGs is InSvc, or actively connected to the Call Router, and which Call Router it's actively connected to (A or B). The other PG should be Not Active. The non-active side (if you were to view the actual logs) would show that it has 3 idle connections to Call Router A and 3 to Call Router B. A normally functioning system (with event tracing at 0 - or normal) won't have many scrolling events in the process window.
- In both cases, your most recent log events in these windows should be showing something along the lines of 'reporting metering statistics'. If it looks like either of these processes is trying to do something with events like connection retry or connection failed then we've got a low layer problem.
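If you want to record these checks rather than eyeball them, the comparison can be sketched in Python. This is only an illustration: it assumes you copy the title-bar text from each PG by hand, the function names are made up, and the matching relies solely on the strings quoted above.

```python
def mds_pair_ok(title_a: str, title_b: str) -> bool:
    """Both MDS processes in service, one paired-enabled and one paired-disabled."""
    titles = [title_a.lower(), title_b.lower()]
    in_service = all("insvc" in t for t in titles)
    enabled = sum("pr-enb" in t for t in titles)
    disabled = sum("pr-dsb" in t for t in titles)
    return in_service and enabled == 1 and disabled == 1


def pgag_pair_ok(title_a: str, title_b: str) -> bool:
    """Exactly one PGAG in service and the other reporting 'Not Active'."""
    titles = [title_a.lower(), title_b.lower()]
    in_service = sum("insvc" in t for t in titles)
    not_active = sum("not active" in t for t in titles)
    return in_service == 1 and not_active == 1


# Title-bar strings as quoted in the post above.
print(mds_pair_ok("InSvc PR-Enb Clk", "InSvc Pr-Dsb Clk"))    # True
print(pgag_pair_ok("InSvc A:Active B: Idle", "Not Active"))   # True
```

A False result only says the titles do not match the expected pattern; the process logs are still where the real diagnosis happens.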
Let's start there and let us know.
I checked the processes on both PGs and all seem normal, pretty much exactly like what you described. I do not see any connection failed or connection retry messages but see a lot of "reporting meter statistics" messages. All seem normal when both sides are up.
Thank you very much !!! I really appreciate your response !!! Please let me know what else I should look for !!!
Ok Danny, that's good. Now we need to try to figure out why the PIMs won't go active. This part is service impacting.
Some background: PGs are a real-time synchronized design in that MDS is busy populating the OPC data between both sides of a duplex pair. You can actually watch messages occur in both OPC windows from the PIMs regardless of which side the PIMs are active on. It's also completely natural for the active PIM to move from one side to the other based upon network events, i.e. don't expect the PIM to only ever be active on one side. It is also completely natural for a PG with multiple PIMs to have the active PIMs shared between the two sides, i.e. not all on side A or all on side B.
I would like to try to see if we can get each of the PIMs active on the B side but individually. Remote into both PGs and confirm that all the PIMs are active on PG1A. Here is the service impacting part but we're only doing it to one peripheral at a time. On both PGs, open the PIM3 window so that you can watch what's happening. PG1A is active and PG1B should show idle. Crash the PIM3 process on PG1A by clicking on the X on the window, closing it. PIM3 on PG1B should immediately try to connect to IVR2, and ideally go active. The PIM3 process on PG1A will be restarted by Node Manager after a few seconds and will be in an idle state. If PG1B PIM3 doesn't go active it will eventually switch back to PG1A or you can crash the PG1B PIM3 process and it will go over to PG1A post haste.
If PIM3 didn't go active on PG1B, then dump the logfile and post it up.
Open a command prompt on PG1B and type:
For example: C:\> cdlog cust pg1b
This will bring you to the ICM logfiles directory (you can also cd down to it \icm\cust\pg1b\logfiles)
dumplog pim3 /last /o
This will dump the logfile into a text file named pim3.txt in this directory. Post the logfile up here and let us know what time you did the failover test.
Feel free to do the same for PIM1 and PIM2 to try to get them to go active on PG1B. I'm just trying to focus on one problem at a time, in baby steps. Riccardo has valid points but that's a bit higher in the stack.
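Once pim3.txt exists, a small script can pull out the lines worth a closer look before you post it. This is only a sketch: the keywords are guesses borrowed from the "connection retry" / "connection failed" wording used earlier in this thread, not a documented log format, and the function name is made up.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed keywords; adjust them to whatever your PIM log actually prints.
DEFAULT_KEYWORDS = ("retry", "fail", "disconnect")


def find_suspect_lines(
    lines: Iterable[str], keywords: Sequence[str] = DEFAULT_KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


# Usage against the dumped file (name taken from the dumplog step above):
# with open("pim3.txt", errors="replace") as log:
#     for number, text in find_suspect_lines(log):
#         print(number, text)
```

The match is case-insensitive and deliberately broad, so expect some false positives.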
P.S. what is your private network between the PGs?
You probably want to try Sub2 to Site 1 or, as Riccardo stated, add a third PG; this will keep the connectivity.
Essentially, all of your components are working independently but not in sync.
The third PG can be anywhere.
Sorry for not getting back to you sooner; I was out of town for a few days.
I am planning on scheduling an outage at the end of this month. I will test out your scenario and gather logs.
The private network and the visible network are connecting to the same WAN link (a full DS3 MPLS circuit) with QoS enabled.
I understand it's difficult to say anything without the logs, but do you have any idea what might be causing the IVR PIMs not to go active on PG1b?
Thank you very much !!! I really appreciate your responses !!!
Sorry to tell you, but your design is not supported. Did you get a specific BU agreement?
You can verify it in the SRND.
Otherwise I would suggest you check with your local sales rep and whoever deployed it.
Sharing the same WAN link for the private and public connections is not supported, even less so if it is an MPLS link.
Recently we have made some exceptions in the A2Q process where latency and redundancy meet certain requirements, which are specified in the 7.x SRND.
A few things could be wrong. The basics would be to check the PIM setup: make sure that the correct IP address is in there for the IVR and that you can ping the IVR from the PG. Is the TCP port correct? Each IVR only has one TCP port for the ICM Sub-System, typically port 5000. Is the PG set up for Service Control Interface? Is the Peripheral ID in the PIM correct (based on your design, I would guess 5002 for PIM3)?
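A ping only proves the IVR host is up; it says nothing about the TCP port. As a rough complement, here is a Python sketch that tries to open a TCP connection from the PG to the IVR. The function name and the hostname in the usage comment are placeholders, and 5000 is simply the typical port mentioned above, so substitute whatever your PIM is configured with.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example call from the PG (placeholder hostname, typical ICM sub-system port):
# print(tcp_port_reachable("ivr2.example.com", 5000))
```

This only checks reachability at the TCP level. It does not speak the Service Control Interface, so a True result still leaves the Peripheral ID and PIM settings to be verified.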
My intent was to take some very basic troubleshooting steps. While I agree that this design, as described, doesn't comply with SRND standards, there is no reason that PG1B shouldn't be connecting to the PIMs. We can work on systemic failover next as well as articulate design scenarios around dummy PGs and network link separation based upon your needs.
Where are your A and B sites? Same city, same state or cross country?
The A side is in CA and the B side is in TX, and the latency between the two locations is within the acceptable range (35-40 ms). I did read the SRND and understand that Cisco recommends a private P-t-P link for the private network. But I am pretty sure other customers have done it successfully with my type of design, and I still think there is a misconfigured setting somewhere.
Thank you very much Jeff !!! I really appreciate everyone taking time out of your busy schedule responding to my post !!!!
I hope an answer is found for you. I'm watching this thread like a hawk. We are experiencing our own headaches with heartbeats between our PGs. The whole DR scenario just seems to be very flaky for us. I'm hoping to glean some knowledge from this for our own scenario...
I understand you have a straight fiber link between side A and side B. If that is the case, you might want to look into your QoS settings for the private network. Heartbeat should not be an issue on this type of connection.
My setup is quite simple but it's frustrating when things don't work right. I am reaching out for any help I can get and Jeff Marshall has been very helpful.
Thanks to everyone for responding !!!
'Watching like a hawk', well in that case I hope I'm not the field rodent! ;-)
Danny: you are correct; there are customers that deviate from the high availability designs in the SRND. Of course, don't admit that in public. Most are successful. Some (considering Murphy's Law) who have never had a WAN problem suddenly see a major outage that takes down their call center for days (how much was that T-1?). Others just live with the inconvenience. I've worked with this product for too long to disrespect a 'by the book' design, but I understand business and monetary pressures, and concessions can be made. I think your issue is something else though. We'll get to the bottom of it.
Cherilynn: your concern seems to speak to a different issue. As I understand it, you have a normally functioning geographically split duplex PG pair that periodically disconnects. Is that correct? If by 'flaky' you mean 'not working as I thought it would in my mind' then we have some work to do and this audience is happy to help.
ICM is an incredibly robust and simple product at the core. Its design is centered on very basic networking needs that were available on the market 12 years ago and the only serious network-side enhancement has been the inclusion of QoS packet marking starting in v6.0 and enhanced in v7.0. The network has to be able to recognize that there are High, Medium and Low traffic streams to/from the real-time components of the system and (a) prioritize those streams correctly from the server and (b) prioritize them correctly across the network. There are some other basic failover caveats and conditions but if the network isn't right then nothing else will function as intended. To look at it another way, if the foundation of the house isn't straight, plumb and level then the rest of the house will never be true. These are all fixable issues but may require serious foundational adjustments and in some cases, investment.
P.S. I'm heading to VoiceCon next week, I hope everyone can stop by and say 'Hi' at the Dimension Data booth.
I hesitate to say normally functioning. We've been live since 9/17/07. The longest we've gone without issues is 29 days. We've done 5 ES patches, had a bug named in our honor, and one ET patch. More TAC cases than I know what to do with. We seem to have heartbeat issues, yet the network shows nothing even CLOSE to getting to that point. We have (and here's my term, flaky) issues happening with our agents and CTIOS: Agent A consult-transfers a call to Agent B, but Agent B doesn't get the call; instead Agent A gets connected to Agent C, who resides in the same skill group. To say flaky, to both myself and our end users... yeah.
Our PGs are geographically split, 10 miles apart, with a fiber connection. We have separate VLANs, and the heartbeat is on its own EoMPLS. I feel like we're all just grasping at straws, and the confidence factor of this solution to our business end is dwindling away fast.
Can you describe your installation a bit including software versions and product type (e.g. UCCE or System UCCE)?
I won't excuse ESs and bugs, but this is software and these things happen - I (and others here) have more than our share of these things attributed to our names. What is inexcusable is the dwindling confidence factor - there is no reason why someone should disbelieve in a system of this caliber. This is wrong, and in my mind product issues are always fixable. Kudos to you for participating in this forum; that's a huge step to success in my mind. Not that this forum will guarantee success in solving your issues, but it tells me that you are interested in personally investing in success.
The challenge here is identifying issues from the bottom up. If you are having DR issues then I consider that foundational, and it is either a design disconnect or a programmatic issue. Your consultative transfer issue is higher up - also fixable, but the foundation is key to success.
What can we help you with?
This has been an ongoing issue for quite some time and I'll be a happy man if I can get to the bottom of it. I am going to try to see if I can get an outage window sooner; I would like to run the test scenario you suggested and gather logs.
We have had the environment in production for about 4 years now. We did run into some issues at the beginning, but all in all the environment has been very stable for us. I'll be a very happy man if I can get DR working. I am not a contact center guy, but I have learned a lot working with it, and I did work with an integrator to put the environment together. When we installed the environment, we installed it with side A and side B on the same network, with both sides located in the same building. The mistake was that we never did any DR test until we moved side B over to our DR location in TX, so I don't know if DR was working before the move.
Thank you very much Jeff !!! I am glad that we have you on the forum !! I am gonna try to get the logs posted up ASAP. But in the meantime let me know if you come up with anything !!!
I won't be attending VoiceCon but I'm looking forward to seeing you at Networkers if you're gonna be there.
I am scheduling an outage to bring down our call center this week. I am going to try the test scenario you suggested and gather logs. Hope you have time to go over the logs for me.
Thank you very much !!
I was able to schedule an outage window last week and went through the test scenario you suggested, and I wanna give you a quick update. I did just as you suggested, shutting down one process at a time, and this time the test was successful: all processes failed over to PG1B and the call center was working normally on PG1b. The only thing that I changed since the last unsuccessful test a few months back was (please take a look at the attachment). I am not sure if the change I made has made any difference. I am planning on scheduling another test, but this time I will turn off PG1a completely instead of shutting down one process at a time, and will see how PG1b reacts.
Please take a look at the attachment and let me know what you think !!!!
thank you very much Jeff !!! I appreciate all your help !!!
At this point, you can feel comfortable that the individual components are configured correctly. The changes that you made do make sense but shouldn't have seriously impacted failover. Since the B side PG is LAN connected to Call Router B, I would have most certainly set that to 'local' in the lower portion as well as to prefer Call Router B at initial installation. Great job!
Let us know how the server failover works out for you. Now we're working a bit higher up in the stack and proving success along the way.
Can you elaborate on the "dummy PG" and how I can use it to get around the "majority" issue? I have a fully mirrored Side A and Side B solution, and I want B to be able to stay active in the event the entire A side data center goes down. Since it will only have "half", it won't go active.