We are running the following setup:
- Approx. 140-seat contact center
- IPCC Enterprise 7.1.5
- System PGs, duplexed for DR
- IPIVR 4.5(2)SR02ES09_Build092, again split for DR
- Nuance ASR (split for DR)
- Call Manager 5.1.3 (1 pub, 6 subs: pub and subs 1, 3, 5 located at site A; subs 2, 4, 6 located at DR site B)
We went "live" on 9/17/07. Since go-live, we have had multiple TAC cases where we appear to have heartbeat issues with the PGs. We are connected to our DR location via an optical DWDM connection. It's our own fiber, stretched the 10 miles between the two locations.
EVERYTHING we run, as a company, is load balanced between both sites. We have NEVER shown a network latency of greater than 1ms. Yet every 3-4 weeks the PGs seem to get out of sync(?), and logging will inevitably show that they missed heartbeats.
This has been our longest-standing issue, and we've never really gotten to the bottom of it. The first handful of TAC cases resulted in ES patches, which seemed (and I use this term because the problem has always come back) to handle the situation at the time. I honestly can't say for sure whether the patch took care of the problem, or whether recycling the PGs did. Yes, I agree that the patches took care of issues that probably would have crept up, but the fact that we keep going down the same path is just eating at me. And this is where our business users feel the pain, and their interpretation of our new phone system is less than stellar. (I'm being kind about the terms that have been used.) I KNOW this is a good system, but I don't think we've truly hit on what has caused our heartbeat issues. And of course, when those PGs have an issue, the effect is dramatic for our contact centers.
Okay, and here's one more thing I found while digging around. We're running McAfee on those PGs... It's blocking IRC ports. Our log shows a blocked IRC attempt, but the IP address involved was the one the heartbeat rides on!! Now, I can't tie it to the exact times of the heartbeat failures, but I can't imagine that this is a good thing.... !!
Is anyone running McAfee, and are there recommended settings???
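One way to test whether the blocked attempts line up with the heartbeat failures is to pull the timestamps out of both logs and look for block events that fall close to each failure. The following is only a minimal sketch of that comparison: the function name, the 60-second window, and the sample timestamps are all made up for illustration, and parsing the actual McAfee and PG logs is left out.

```python
from datetime import datetime, timedelta

def blocks_near_failures(block_times, failure_times, window_seconds=60):
    """Map each heartbeat failure time to the AV block events within the window."""
    window = timedelta(seconds=window_seconds)
    return {
        failure: [b for b in block_times if abs(b - failure) <= window]
        for failure in failure_times
    }

# Made-up timestamps, for illustration only (not taken from any real log).
av_blocks = [datetime(2008, 1, 1, 10, 0, 5), datetime(2008, 1, 1, 13, 30, 0)]
hb_failures = [datetime(2008, 1, 1, 10, 0, 20), datetime(2008, 1, 1, 16, 45, 0)]

for failure, nearby in blocks_near_failures(av_blocks, hb_failures).items():
    print(failure, "->", len(nearby), "block event(s) within 60 s")
```

If most failures have a block event nearby, the anti-virus is worth a closer look; if none do, the blocked attempts are probably unrelated noise.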
Cherilynn, is it possible for you to disable/deactivate McAfee on some/all of your PGs, and see if that would mitigate the issue?
I wouldn't do it if it puts your network at risk, but if you've been through a bunch of ES's and you're still not getting the results you desire, the test may get you actionable results the fastest.
Best of luck getting your deployment optimized!
We're working with our server/security team regarding the anti-virus.
Here's a question for all of you: I have two documents from Cisco, the IPCC/ICM Best Practices and the Staging Guide for ICM 7.x. Those two documents conflict when it comes to the NIC settings.... The best practices state that gigabit should be set to auto/auto. The staging guide says NO AUTO!
Which one is right? Is anyone out here using gigabit NICs and switchports with the speed/duplex set to auto? Or do you have it hardcoded to 100Full?
I agree with others about hard setting the NICs to the actual speed and duplex, and making sure the setting is the same on the switchports.
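The "same on both ends" rule can be written down as a small check. This is only an illustrative sketch: the function name and the way each side is represented are assumptions, not anything from a Cisco tool. The case it flags as risky is the classic one on 10/100 links, where one end is forced and the other is left on auto, so the auto end cannot negotiate and typically falls back to half duplex.

```python
def check_link(nic, switchport):
    """Classify a NIC/switchport pairing.

    Each side is either the string "auto" or a (speed_mbps, duplex) tuple,
    for example (100, "full"). This representation is hypothetical.
    """
    if nic == "auto" and switchport == "auto":
        return "ok: both ends negotiate"
    if nic == "auto" or switchport == "auto":
        return "risk: one end forced, one end auto"
    if nic == switchport:
        return "ok: both ends forced to the same values"
    return "mismatch: ends forced to different values"

print(check_link("auto", "auto"))
print(check_link((100, "full"), "auto"))
print(check_link((100, "full"), (100, "full")))
print(check_link((100, "full"), (100, "half")))
```

Either consistent choice passes this check; the trouble comes from mixing the two approaches on one link.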
Another thought, are all of your servers using the same NTP source?
Okay, but if our NIC and the switchport are Gigabit - why wouldn't we leave those at auto, as recommended?
Yes, all our servers are using the same NTP source!
My short answer, from personal experience, is to always take the path of removing doubt when it comes to configuring stuff. I know that's not a solid technical answer, so I asked a couple of Cisco networking veterans. They both agreed without hesitation that you should ALWAYS hard set NICs and switchports when it comes to servers...regardless of what the application is. Auto-negotiation is fine for client PCs, but they looked like they were about to have heartburn when I asked about auto-negotiation for servers. lol
As for the previous posts regarding the anti-virus: we use Symantec. I configured it to exclude files with the extensions "ems" and "hst". I can't remember which PDF recommends doing that. Also, I have it set to "Trust files on remote computers running Auto-Protect". I think your anti-virus settings are worth checking (again...to remove doubt if nothing else). :)
There looks to be some great feedback on your issues in the forum. There are two different questions/comments out there and here is my take on them.
- NIC Configuration. The current documentation out there is confusing. There is strong historical precedent from lower-speed LANs that setting the speed and duplex on both ends of a link is absolutely necessary. If the LAN is 10Mbps or 100Mbps then it is absolutely necessary to hard-set these values. The GigE specification, on the other hand (I'm told by experts in this area), is designed for negotiation, and Cisco has accepted leaving GigE systems at auto-negotiate (I'm pretty sure that this is echoed in the UCM documentation). This has been our deployment standard for over a year and we haven't had any negative results from customer and internal deployments. I'm not sure if Cisco AS does it this way or not.
- AV design/configuration is very important to consider. I would have to wonder why the TAC wouldn't have already reviewed this with you. The Security Best Practice Guide has the best detailed information in Chapter 13, Page 123. Here is the link: http://www.cisco.com/en/US/docs/voice_ip_comm/cust_contact/contact_center/icm_enterprise/icm_enterprise_7_2/reference/guide/security_guide.pdf. I agree with Chris on disabling AV for a period of time and determining the impact with that application. I would probably go as far as setting performance monitoring on the processors, memory and disk queuing for a period of time both before and after disabling AV and determining if there is a delta there.
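The before/after comparison suggested above could be sketched roughly as follows, assuming the counter samples have already been exported from Performance Monitor into plain lists of numbers. The counter names and sample values here are hypothetical placeholders, not real measurements.

```python
from statistics import mean

def counter_delta(before, after):
    """Return {counter: (mean_before, mean_after, delta)} for counters in both runs."""
    result = {}
    for name, samples in before.items():
        if name in after:
            avg_before = mean(samples)
            avg_after = mean(after[name])
            result[name] = (avg_before, avg_after, avg_after - avg_before)
    return result

# Hypothetical samples: "before" with AV enabled, "after" with AV disabled.
with_av = {"cpu_pct": [40, 50, 60], "disk_queue": [2, 4, 3]}
without_av = {"cpu_pct": [30, 35, 25], "disk_queue": [1, 1, 1]}

for name, (b, a, delta) in counter_delta(with_av, without_av).items():
    print(f"{name}: before={b} after={a} delta={delta}")
```

A consistent drop across processor, memory, and disk queue counters after disabling AV would point at the scanner; no meaningful delta would suggest looking elsewhere.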
Some other items of note:
- The PG's MDS process would be the process to watch relating to heartbeat failure. The challenge is getting the right value of tracing established and capturing the failure.
- What testing was accomplished on your network before turn-up? ICM has a decent tool called ICM NETGEN that can simulate QoS enabled ICM traffic. We run this at a reasonable load and then we statistically increase the load until we figure we are at 200-400% of the expected load (i.e. try to break the network) before we turn-up the system for user acceptance testing. This is a reasonable way to confirm that QoS is enabled across the network.
- Is Microsoft Packet Scheduler enabled on these servers? I don't believe in this tool and it's not part of our best practices.
- In ICM setup, are you using IP Addresses or hostnames in the hostnames fields? We changed to straight IP Addresses after proving many buggy DNS issues in Win2k and continue this today in Win2k3 on the ICM server components.
- I don't recall ever having a Win2k3 time issue kill a duplex PG pair without it being clearly identified in the logs. Not to say that it can't happen.
- What version of Win2k3 is installed on the PGs? Standard or Enterprise?
- IRC ports are not anywhere close to the ICM ports. Look deeper in the logs and make sure it's not trying to block the ICM ports. This doesn't seem likely, or you would see it as a permanent failure, not one that appears after a period of uptime.
- Is there any commonality on when the failure occurs such as time of day?
- Is the PG build a corporate image or a 'fresh' OEM install? If corporate, what other components are installed outside of default selections?
- Can you confirm that no GPOs are enabled in the OU outside of the optional ICM security template?
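For the time-of-day question in the list above, a quick tally of failures per hour is usually enough to show whether they cluster around something like a scheduled scan or backup. A minimal sketch, assuming the failure timestamps have already been extracted from the logs (the ones below are invented for illustration):

```python
from collections import Counter
from datetime import datetime

def failures_by_hour(failure_times):
    """Count failures per hour of day, returned in hour order."""
    counts = Counter(t.hour for t in failure_times)
    return dict(sorted(counts.items()))

# Invented timestamps, for illustration only.
failures = [
    datetime(2008, 1, 3, 9, 5),
    datetime(2008, 1, 3, 9, 30),
    datetime(2008, 1, 24, 14, 0),
]

for hour, count in failures_by_hour(failures).items():
    print(f"{hour:02d}:00  {count} failure(s)")
```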
Whew.... I have a bunch of stuff to review with the appropriate teams, so I can provide accurate data.
I'll start with what I know.
NIC issue: we're pushing back and saying "drop it." Find the appropriate reason. Our NICs aren't it. Glad to see that others are using auto/auto and it's working.
AV issue: Working with our team to turn off any extras with McAfee. That document will help!!!
PG build: Corporate image (I think we may have to revisit this... to find out what all is installed. I know of Acronis, McAfee...). Win2k3 R2: Enterprise Edition, SP1. 3.00GHz, 3.25 GB RAM. Packet Scheduler is not enabled.
There isn't a specific time of day that we are seeing the failures. We had 2 on the 19th, within 30 minutes of each other. Missed 3 heartbeats. At least I was able to get logs. This is just brutal.
We are using IP addresses vs. hostnames. (We had already gone down that rough road, prior to go-live.)
We'll look into the anti-virus around the IRC and see what we can find.
I was not in this position when the initial build and network/QoS testing was done. I will have to circle around with those who were...
This forum is helping us TREMENDOUSLY and I can't thank you all enough!
Bumping this conversation back to the top of the heap.
How are things going?
I have one comment about a corporate server image. At this moment, I'm having challenges with a customer that used a corporate image across all their ICM servers. I have a PG pair that is acting up with odd anomalies that just cannot be easily explained (e.g. ICM setup not completing correctly, registry entries missing, windows explorer working on one and not the other and a host of other odd occurrences). Obviously the internal IT staff is proud of their image and we're getting push-back on loading a clean OEM install (they'll reimage without question though). I've run into this before with other customers.
For you, I'd seriously consider rebuilding these PGs from the OEM Windows CD to see if that makes a difference. Although I don't have any serious qualms about Win2k3 R2, I would just concentrate on Win2k3 Standard and the original release. R2 has stuff that is completely unnecessary for ICM and it's my opinion that we should avoid additional OS bloat if at all possible. Your issues may not stem from this but we have to start at the foundational basics.