
Ask the Expert: Troubleshooting Unified Contact Center Enterprise

Jan 29th, 2014

With Goran Selthofer

Welcome to the Cisco Support Community Ask the Expert conversation. This is an opportunity to learn and ask questions about integrating Unified Contact Center Enterprise into your environment and troubleshooting the many features that are available with the Unified Contact Center Enterprise solution.

Cisco Unified Contact Center Enterprise delivers intelligent contact routing, call treatment, network-to-desktop computer telephony integration (CTI), and multichannel contact management over an IP infrastructure. It combines multichannel automatic call distributor (ACD) functionality with IP telephony in a unified solution. This makes it easier for your company to rapidly deploy a distributed contact center infrastructure.

Goran Selthofer is a team lead for the Cisco TAC EMEAR Contact Center team based in Brussels. He has supported UCCE, UCCX, CVP, and UCCE applications for the past seven years within the Cisco TAC. He has more than 13 years of overall experience in the industry, with broad experience in Cisco Unified Communications infrastructure solutions, having worked for a Cisco Gold Partner before joining Cisco TAC. Goran also provides internal training to TAC engineers on Contact Center topics. He graduated with a master's degree from the Technical Military Academy - Belgrade University. He also holds CCIE certification (number 27211) in voice as well as VMware Certified Professional certifications.

Remember to use the rating system to let Goran know if you have received an adequate response. 

Goran might not be able to answer each question due to the volume expected during this event. Remember that you can continue the conversation in the Contact Center discussion forum, a sub-community of the Collaboration, Voice and Video community, shortly after the event. This event lasts through February 14, 2014. Visit this forum often to view responses to your questions and the questions of other community members.

Overall Rating: 5 (37 ratings)
Jackson Braddock Mon, 02/03/2014 - 15:11

Hi Goran,

Thank you for covering this topic. My question is: which logs do I need to check if I have issues with UCCE call routing?



Goran Selthofer Tue, 02/04/2014 - 02:05

Hi Jackson!

First of all I want to thank you for participating!

The most important thing to know first is THE CALL FLOW!

Knowing your call flow in detail will reveal all the nodes and processes that you should or can troubleshoot in the logs.

Now, basically, there are different types of nodes: Central Control, Peripheral Gateways, different peripherals, and CTI services (server and desktops). Each of those has its own specifics for setting and collecting traces.

Therefore, we have published the following Tech Note to help partners/customers with setting and collecting logs:

More details on that, and much more on serviceability, are given in the following guides:

Please let me know if this is sufficient for you!

Once again, thanks for participating!


lohjintiam Mon, 02/03/2014 - 23:40

Hi Goran,

What is the best/proper way to troubleshoot replication issue between Rogger/Logger A & B?

Is there an easier way to monitor the communication between A & B?



Goran Selthofer Wed, 02/05/2014 - 05:17

Hi JT!

Thank you for the question!

…and… it is a very good one… so it requires a very long answer

OK, so the usual confusion on this topic comes from the fact that users often think this should be similar to Microsoft SQL Replication. Thus, users expect to see something like a GUI or a visual presentation of that replication.

However, MSSQL replication is not used here. Therefore, we need to understand the architecture before we can think of ‘monitoring’ it. Also, to be very clear from the beginning, there is no ‘easy way’ of ‘monitoring’ it, as there is no ‘tool’ for that.

Now, first, we need to separate Router from Logger: they get their data in different ways, hence they sync it in different ways, and that is why they are to be observed separately.

Routers have MDS (the Message Delivery Service process). Loggers do not have that process. However, Loggers use the MDS of the ‘same side’ Router. The Logger on one side will never talk to the Router on the other side.

MDS is a sync zone, meaning every bit of data which comes to the Router on one side is replicated through MDS to the Router on the other side. UCCE architecture utilizes two types of networks, and MDS uses the PRIVATE network for that communication. It is a very active process, since Routers sync their MEMORY. Therefore, that needs to be a perfect sync.

However, the data which the Router gets, it commits to the local DB, and since the Router doesn’t have a DB, that means the Router commits it to the same-side Logger’s DB. That is how the Logger gets data. So, a data bit which came from PGA to RTR (how is not relevant at this point) ends up first in MDS on the Router A side (assuming PGA has an active link with RTRA), which replicates it to Router B, so it ends up in both routers’ memory. Now, EACH router commits that to its own respective Logger.

Bottom line here is the following:

  • Routers sync their memory and that cannot be easily ‘monitored’, but RTR processes are designed in such a way that if there is a difference they will for sure complain, even up to the point that the process on one side will not be able to start, or will restart, if it cannot get in sync. So:
    • MDS, though, would be a good starting point to check if something goes wrong, as it will report process or peer disconnects.
    • Also, the RTTEST tool can be used to check if any failover happened, when, and from which side the sync was done.
  • Loggers do get their data from their respective Routers, but Loggers can also ‘sync directly’. This kind of sync is done via a socket connection by the RECOVERY (RCV) process, and it can be monitored via RCV logs (in a basic logical fashion: are there any errors or unusual behavior, or not). So:
    • RCV process logs for checking that all is healthy on that side.
    • ICMDBA tool to quickly see if replication of new data is happening (Space Used Summary option from the Data menu when the Logger DB is selected) by monitoring Max date.

[Screenshot: 2014-02-04 12_10_57-bru-vaas-vc - vSphere Client.png]

Maybe not the answer you hoped for, but I have tried to give an overall perspective for other users reading it later as well…



lohjintiam Wed, 02/05/2014 - 23:43

Hi Goran,

Great explanation!

In the same line, I'm trying to get some clarification regarding the automated truncation process in both the UCCE and CVP databases.

a) there are default retention days for certain UCCE tables (some 14, 100, 1095, etc.)

b) if I'm not mistaken, CVP is also 1095 days

My questions:

a) Will ICMDBA start to auto-truncate the tables once the threshold (80%) has been passed? How will it select which tables/data need to be truncated first?

b) If it's compulsory for me to keep all data for at least 1096 days, can those retention periods be changed to reflect that? Depending on SQL DB & disk space, of course.

c) Can we disable this automated truncation process?

d) How does it work in the CVP report server?



Goran Selthofer Thu, 02/06/2014 - 10:40

Hi JT,


Again very interesting questions!

OK, so I will have to limit this answer to the UCCE side and leave CVP for other people in other sessions.

But I think the CVP part is already well described in the CVP SRND/Guides.

Here is the story about PURGING in UCCE.

There are 2 categories with a total of 3 types of PURGE which can happen from the UCCE point of view:

Category 1: Scheduled Purge
1. Daily scheduled purge

based on these RETENTION parameters:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Purge\Retain

Tables are usually purged at 00:30 every day, controlled by this parameter:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Purge\Schedule\Schedule

Category 2: Emergency Purge
There are 2 parameters to control this under this path:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\cim\Distributor\RealTimeDistributor\CurrentVersion\Recovery\CurrentVersion\Configuration\Purge\Automatic

1. AdjustmentPercentage Purge on 80%
2. PercentFull on 90%

Both are set to purge 1% when the DB reaches 80% or 90% respectively.


The reason for this is very simple: ICM processes are in charge of filling data into the DB, hence ICM needs to keep the DB under 80% in order to compensate for data bursts while at the same time ensuring proper performance of the processes interacting with the DB.

Now, how the PURGE is done is very simple: purge the oldest data first, starting from tables whose names start with the letter A.
So, usually it will be the oldest data in the Agent tables that is purged first.

Here is the drawback of that approach:
It is set to purge 1% of data, just enough to bring DB usage under 80%, so to 79%. That means that if purging a certain number of rows from an Agent table drops DB usage to 79%, the purge stops. However, if data is still coming into the DB and usage reaches 80% again, the PURGE is triggered again with the same logic: start from A and purge 1%. So, if your DB is bouncing between 80% and 70%, it can easily happen that your Agent_ tables are purged completely, making reporting which depends on those tables impossible.
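
To make that drawback concrete, here is a toy Python model of the behaviour described above. It is only a sketch under loud assumptions: the table names, the sizes, and the `emergency_purge` helper are invented for illustration, data age inside a table is ignored, and the real purge is done by ICM code, not by anything like this.

```python
def emergency_purge(tables, capacity, threshold=0.80, step=0.01):
    """Toy model: once usage reaches the threshold, free `step` of capacity,
    taking tables in alphabetical order (the 'start from A' rule)."""
    used = sum(tables.values())
    if used < threshold * capacity:
        return tables  # below the threshold, nothing to purge
    to_free = round(step * capacity)
    for name in sorted(tables):
        taken = min(tables[name], to_free)
        tables[name] -= taken
        to_free -= taken
        if to_free == 0:
            break
    return tables

# Hypothetical database: capacity of 1000 "units", currently 80% full.
tables = {
    "Agent_Half_Hour": 30,
    "Route_Call_Detail": 400,
    "Termination_Call_Detail": 370,
}
for _ in range(5):
    emergency_purge(tables, capacity=1000)
    tables["Termination_Call_Detail"] += 10  # new data keeps arriving

print(tables)
```

In this made-up run the Agent table is empty after three cycles even though it was never the table that was growing, which is exactly the reporting gap warned about above.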

The 90% purge works the same way; however, when the DB reaches 90%, no new data will be allowed into it.

So, you can argue with this but you have to keep in mind one SIMPLE RULE:

This is an EMERGENCY action.

Your task as a system admin or system architect is to design the system in such a way as to AVOID reaching an 80% full DB at any time.

So, the answer to your a) question is above. Keep in mind that it is not the ICMDBA tool that is doing that but the code itself.
The answer to your b) question is also above (keys for retention). Of course, you should use the ICMDBA tool here, with its option to estimate your DB size based on the required retention periods, and then ensure you already have that disk space before increasing retention times.
Answer for c) - NO. Definitely NO!


lohjintiam Thu, 02/06/2014 - 13:15

Hi Goran,

That clarifies several doubts

To confirm

a) There is no dependency between scheduled & emergency purge, i.e. data past the retention period will still be purged regardless of how full/empty the database is?

b) Is there a reference/link/doc that states all the current default retention period?

c) What is the trend seen for financial customers in relation to the retention period? Higher retention for interval/halfhour tables & lower retention for detail/event based tables?

d) If data grows faster than the initial calculation, I would still be able to expand the database (subject to disk space availability)? This link is also applicable for the current version?



Goran Selthofer Fri, 02/07/2014 - 00:12

Glad to hear that JT!

to answer your questions:

a) There is no dependency between scheduled & emergency purge, i.e. data past the retention period will still be purged regardless of how full/empty the database is?


b) Is there a reference/link/doc that states all the current default retention period?

You can find it in Admin Guide:

I don't think HDS is mentioned there but for 'All Other Historical Tables' in HDS I believe it is 1095 days.

c) What is the trend seen for financial customers in relation to the retention period? Higher retention for interval/halfhour tables & lower retention for detail/event based tables?

Answer: CORRECT. However, mind that in some countries there are also rules enforced by law dictating how long data should be kept.

d) If data grows faster than the initial calculation, I would still be able to expand the database (subject to disk space availability)? This link is also applicable for the current version?


However, PLEASE do not take that as 'a primary line of success', meaning: 'I will just put in now what I think is good, as we can expand it later anyhow.' That decision might cost your customer some data loss, since by the time you are engaged again to expand it, there has almost certainly been a problem already and data has started to drop.

Probably daily we have at least one TAC case opened asking 'where is my data'. This is because the estimation of retention periods against DB size was done improperly during deployment. So, the customer wanted to retain 3 years of data and retention periods were set according to that WISH, and that is nothing more than a WISH. In order for that to become reality, the DB size also needs to follow that WISH. Well, the DB size was left at 40 GB and then 'suddenly' everyone is wondering 'why am I losing data, since I have configured the retention period to 3 years?'
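
A back-of-the-envelope check of that mismatch, as a Python sketch. The daily growth figure is purely hypothetical (the ICMDBA estimate option is the proper way to size a real system), and `days_of_history` is an illustrative helper, not a Cisco tool. Only the 40 GB size, the 3-year wish, and the 80% ceiling come from this thread.

```python
def days_of_history(db_size_mb, daily_growth_mb, max_fill_pct=80):
    """Whole days of data that fit before the DB hits the purge ceiling."""
    usable_mb = db_size_mb * max_fill_pct // 100
    return usable_mb // daily_growth_mb

wanted_days = 3 * 365                                    # the "wish": 3 years
fits = days_of_history(40 * 1024, daily_growth_mb=100)   # assumed 100 MB/day

print(f"wanted {wanted_days} days, room for roughly {fits}")
```

With these assumed numbers the database has room for well under one year, so setting the retention keys to 3 years changes nothing until the disk and DB size are grown to match.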

I hope I have given you a clue as to why that is.

Also, if reporting is so important to the customer, we do recommend HDS on both sides, regular backups, and DB maintenance.



lohjintiam Mon, 02/10/2014 - 01:03

Hi Goran,

In relation to the log reading that Cisco TAC does, is there a reference/list of the common errors that appear in the respective processes?

For example

a) Connection to Central Controller side A failed

b) Connection to Central Controller side B failed

c) Connectivity with duplexed partner has been lost due to a failure of the private network, or duplexed partner is out of service

d) others

Other logs typically have a key identifier showing whether a particular entry is info, warning, error, fatal, etc. Something like this would definitely speed up the troubleshooting process.



Goran Selthofer Mon, 02/10/2014 - 12:06

Hi JT!

Good questions!

However, the last thing Cisco TAC does when reading logs is use some ‘magic cheat sheet’ to decode all traces.

No, we simply work from experience and read traces knowing or getting ‘good’ examples, or simply looking for ‘error’, ‘exception’, ‘fail’, ‘timeout’ keywords, and take it from there.
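
That keyword-first pass can be sketched in a few lines of Python. This is only an illustration of the idea, not a TAC tool: `suspicious_lines` is a made-up helper, and the two sample lines are borrowed from the CTI Server log quoted elsewhere in this thread.

```python
import re

# The four trigger words mentioned above, matched case-insensitively.
KEYWORDS = re.compile(r"error|exception|fail|timeout", re.IGNORECASE)

def suspicious_lines(lines):
    """Yield (line_number, text) for every line containing a trigger word."""
    for number, text in enumerate(lines, start=1):
        if KEYWORDS.search(text):
            yield number, text.rstrip()

sample = [
    "14:43:48:898 cg1A-ctisvr Trace: ProcessSetAgentStateRequest - sessionID 5",
    "14:43:48:900 cg1A-ctisvr Trace: CSTAUniversalFailureConfEvent: "
    "InvokeID=0x2f0b1596 Error=GENERIC_UNSPECIFIED_REJECTION",
]
hits = list(suspicious_lines(sample))
for number, text in hits:
    print(number, text)
```

A hit is only a place to start reading: you still need the call flow and a ‘good’ example to judge whether the line matters.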

It is a long and hard process to read logs, and the more you do it the more it starts to make sense - like the Matrix :)

So, bottom line: No, there is no reference/list of common ‘process’ errors except for what is already published, maybe for the Router, here:

Also, as described in one of the other posts, you can use the checke tool to see the peripheral error mapping (code to description):

  • Peripheral Error Code Descriptions

A quick way to obtain the description of a UCCE Peripheral Error Code is to log on to a UCCE system, open a command prompt, navigate to the C:\icm\bin directory, and run checke followed by the peripheral error code that you have identified. In this example we would use c:\icm\bin>checke 12005

Now, although most of the processes are not complete from a serviceability point of view when it comes to documenting/listing all possible errors, the intention of the BU is to write as much detail as possible directly into the logs, to give more clues about what is happening.

Examples of Error messages in logs:

Failed to update the database.

The Update succeeded at the controller but was not propagated back to the Distributor.

Check the status of UpdateAW on the Distributor.


Failed to update the database.

Another user has changed the configuration data. Re-retrieve the data and try save again.

If the problem persists, you need to reload your local database. You can do this using

the Initialize Local Database tool.



Cisco TAC

david.macias Tue, 02/04/2014 - 06:25


Troubleshooting Finesse issues seems to be a huge pain. Do you have any tips for that? For example, sudden logout errors, sudden failover messages, etc.

Thank you.

Goran Selthofer Wed, 02/05/2014 - 08:40

Hi David!

Thank you for being part of the event!

Indeed! Finesse can be a bit tricky as it is a fairly new product... compared to CTIOS or CAD.

However, luckily we have our internal engineers in CAP, BU and TAC, who are creating more and more use cases in internal and external knowledge databases about Finesse.

As a result, the following pages have been made externally visible in the same way as we have them internally:

I definitely advise that you check those!


Problem Solving process:

Client Error: Client requests constantly result in "503 Service Unavailable" Error:

Replication issues:

Or, here is a useful tip which you might not find there yet:

How to check the Health of your Finesse Server

The SystemInfo API doesn't require authentication and will provide you with either an "IN_SERVICE" or "OUT_OF_SERVICE" status.

Point your browser to the following URL:


The status will only show IN_SERVICE when all Finesse components are online.
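
If you want to script that check rather than eyeball it in a browser, a minimal Python sketch could look like the one below. It assumes you have already fetched the SystemInfo response body by whatever means suits your environment; `finesse_health` is a made-up helper name, and the sample fragment is hypothetical (the real payload carries more fields).

```python
def finesse_health(body):
    """Classify a SystemInfo response body by the status token it contains."""
    if "OUT_OF_SERVICE" in body:
        return "OUT_OF_SERVICE"
    if "IN_SERVICE" in body:
        return "IN_SERVICE"
    return "UNKNOWN"  # neither token found: treat as a failed check

# Hypothetical response fragment, for illustration only.
print(finesse_health("<status>IN_SERVICE</status>"))
```

Returning UNKNOWN for anything unexpected keeps an empty or error response from being mistaken for a healthy server.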


To avoid repeating links, I will post the link for troubleshooting Finesse Agent Login Trace with the Use of Logs in the next reply below, since that question is more specific to that part...

I hope I have given you a clue, but if anything else is needed, you know where to find us.



david.macias Fri, 02/07/2014 - 08:52


Going back to the Finesse issue(s): I'm experiencing an issue where phonebook changes aren't reflected on the agent desktop. Any thoughts on how to troubleshoot this?

Thank you.

Goran Selthofer Fri, 02/07/2014 - 10:17


Unfortunately, the Finesse version is not shared, nor are more details, e.g. phonebook changes are not reflected, but is that consistent for both servers (if it is a duplex server deployment) or only for one... all changes from a certain time, all the time, or intermittent... etc.

In the above post I shared the Problem Solving process, which contains lots of questions that might help to isolate the issue.

Being asked to answer these is probably not very popular, as some think it is a waste of time, but this is how TAC resolves more than 65% of cases, believe it or not.

Those questions actually come from the well-known Kepner-Tregoe Problem Analysis methodology and are used in troubleshooting different issues, not only in IT. Every Cisco TAC engineer is required to pass KT training in order to be able to use it.

OK, so back to the issue: I will assume you are not on the 10.0 release, hence it might be that you are hitting a known issue:

CSCul20619    CCE and CCX: PhoneBook update not shown on desktop after DB restart

There are currently some issues with seeing this defect from outside, but it is marked as external, so it will be visible in the future. Anyway, the workaround is to restart Cisco Tomcat.

Please check if that resolves it for you and let me know. (Note: restart Cisco Tomcat outside production hours.)

Now, if you want to troubleshoot Finesse for that issue, then here is what you usually do for logs:

Substitute your primary Finesse server IP address in this URL for collecting the logs.


Capture the Web Services logs, but in this sequence:

All on primary Finesse Server:

1)      Agent logs out of Finesse

2)      Stop Tomcat Service

3)      Start Tomcat Service

4)      Make some changes to the phone book (note exactly what)

5)      Agent logs into Finesse

6)      Agent attempts to make a call and the options window opens showing the available phone books.

7)      Collect Web Services logs from the time Tomcat is restarted until just after the attempt to make a call, when missing or incomplete phone books are observed.

Be careful, this is service impacting, so do it after hours. Also note that the Tomcat restart might itself resolve the issue, as mentioned above, so you might not be able to reproduce it.

How to collect the Error and Desktop logs for review:

1.  When the agent sees the issue on the desktop, have the agent hit "Send Error Report" on the desktop. This will send the client-side logs to the Finesse server.
2.  Use the CLI command to collect all Finesse logs - file get activelog desktop recurs compress
3.  Collect CTI Server logs from the time of the Finesse Tomcat restart to the time the agent sees the issue on the desktop. (Healthcheck)

I hope this helps!

Have a great weekend!



chitrangad.pathak Tue, 02/04/2014 - 22:58


Thank you for initiating this discussion. I would like some help with advanced troubleshooting, such as how to read the logs once they are collected. Is there any tool available for partners to do that?

Also, when working with Agent State Trace, what exactly happens at the logging level?


Chitrangad Pathak

Goran Selthofer Wed, 02/05/2014 - 09:05

Hello Chitrangad!

Nice to see you here, thanks for posting the question!

Well, honestly, there is no 'tool' which is used by TAC to read UCCE traces. Far from it. We use text editors with coloring schemes when reading logs, and that is a long manual process.

One exception is a basic Call Flow tool which is distributed with CVP software. That one is promising, but it is still not widely used as it requires pre-created log templates and currently there are only a few. However, it works very well for CVP SIP tracing.

However, back to UCCE logs: we are not using any special tools, and that is why TAC would generally ask you to provide as much info as possible (like ANI, time stamps, Agent ID, Extension, etc.) in order to analyze logs, as it is not enough just to send logs to TAC. So, unfortunately there are no magic buttons or crystal balls (YET!) which TAC or partners and customers can use when reading logs.

That being said, the foundation for reading logs is to really understand processes and tasks, to know the exact call flow and expected behavior, and to gather as much info as possible about BAD but also GOOD examples.

Now, related to the second part of your question, and to make it more current by bringing Finesse into the same story, and also as I promised David in the above post, I invite you to read a great example written by my good friend and colleague Linda, who has created the following:

Finesse Agent Login Trace with the Use of Logs

I think this should give you a pretty good overview of what is happening there...

Otherwise, there is also another example from her which is yet to be published and it is about:

How to Identify CTI Server errors in Finesse Logs


This document will show you how to quickly identify CTI Server peripheral errors in the Finesse Logs.



Cisco recommends that you have knowledge of Cisco Finesse, the Voice Operating System (VOS) CLI command prompt, and UCCE CTI Server messages.

Components Used

The information in this document is based on Cisco Finesse Version 9.1(1).

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

Finesse Error-Desktop-webservices Log File

First, locate any CTI Server errors in the Error-Desktop-webservices log. This log will help you identify whether the Finesse Server is receiving error messages from CTI Server. In this log we see the following message at 14:43:50.838.

0000063104: Jun 19 2013 14:43:50.838 -0400: %CCBU_pool-6-thread-57-3-CMD_FAILED: %[ENTITY_ID=2005][ERROR_DESCRIPTION=errorCode 70][command_name=LOGIN]: Received failed command response

The error message indicates that ENTITY_ID=2005, i.e. agentID 2005, encountered an error when trying to log in.

Finesse Desktop-webservices Log File

Open the Desktop-webservices log for the same timestamp at which the error was found in the Error-Desktop-webservices log.

Search the log file to locate the error message. Search on the timestamp of the error (14:43:50) or the failure code, i.e. failureCode=70. The webservices log will provide the "peripheralErrorCode"; in this example the peripheral error code received from the backend CTI server is 12005.

0000063103: Jun 19 2013 14:43:50.836 -0400: %CCBU_CTIMessageEventExecutor-0-6-DECODED_MESSAGE_FROM_CTI_SERVER: %[cti_message=CTIControlFailureConf [failureCode=70, peripheralErrorCode=12005, text=null]CTIMessageBean [invokeID=19976, msgID=35, timeTracker={"id":"ControlFailureConf","CTI_MSG_NOTIFIED":1371667430836,"CTI_MSG_RECEIVED":1371667430836}, msgName=ControlFailureConf, deploymentType=CCE]][cti_response_time=0]: Decoded Message to Finesse from backend cti server
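
When there are many such lines, pulling the two codes out with a regular expression saves some scrolling. The Python below is a minimal sketch: `failure_codes` is a hypothetical helper, and the sample is a shortened copy of the log line above.

```python
import re

# Capture the two numeric fields of interest from a webservices log line.
FIELD = re.compile(r"(failureCode|peripheralErrorCode)=(\d+)")

def failure_codes(line):
    """Return the failure and peripheral error codes found in a log line."""
    return {name: int(value) for name, value in FIELD.findall(line)}

line = (
    "%[cti_message=CTIControlFailureConf [failureCode=70, "
    "peripheralErrorCode=12005, text=null]CTIMessageBean [invokeID=19976, msgID=35"
)
codes = failure_codes(line)
print(codes)
```

The peripheral error code it returns (12005 here) is the value to carry over into the CTI Server log search and the checke lookup.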

CTI Server Log

Open the CTI Server log for the same timestamp at which the error was found in the Error-Desktop-webservices log, and locate the peripheral error code 12005 at approximately the same time stamp. In this example search on 14:43.

14:43:48:898 cg1A-ctisvr SESSION 5: MsgType:SET_AGENT_STATE_REQ (InvokeID:0x4e08 PeripheralID:5001 AgentState:LOGIN

14:43:48:898 cg1A-ctisvr SESSION 5:         AgentWorkMode:AWM_UNSPECIFIED NumSkillGroups:0 EventReasonCode:50004 ForcedFlag:1

14:43:48:898 cg1A-ctisvr SESSION 5:         AgentServiceReq:0 AgentInstrument:"2005" AgentID:"2005" )

14:43:48:898 cg1A-ctisvr Trace: ProcessSetAgentStateRequest - sessionID 5

14:43:48:898 cg1A-ctisvr Trace: *** AddToAssociateAgentList();           ADDED: SessionID=5 AgentID=2005 PeripheralID=5001

14:43:48:898 cg1A-ctisvr Trace: CSTASetAgentState: InvokeID=0x2f0b1596 Dev=2005 AgentMode=LOG_IN AGID=2005 SG=-1(0xffffffff))

14:43:48:898 cg1A-ctisvr Trace: PrivateData: EventReasonCode=50004 WorkMode=0 NumAdditionalGroups=0 PositionID= SupervisorID= ClientAddress=

14:43:48:900 cg1A-ctisvr Trace:

14:43:48:900 cg1A-ctisvr Trace: CSTAUniversalFailureConfEvent: InvokeID=0x2f0b1596 Error=GENERIC_UNSPECIFIED_REJECTION

14:43:48:900 cg1A-ctisvr Trace: PRIVATE_DATA: PeripheralErrorCode=0x2ee5(12005)

14:43:48:900 cg1A-ctisvr SESSION 5: MsgType:CONTROL_FAILURE_CONF (InvokeID:0x4e08 FailureCode:CF_GENERIC_UNSPECIFIED_REJECTION

14:43:48:900 cg1A-ctisvr SESSION 5:         PeripheralErrorCode:12005 )

14:44:06:483 cg1A-ctisvr Trace:

The CTI Server log shows that CTI Server received a SET_AGENT_STATE_REQ for agent 2005 using AgentInstrument 2005. This message was sent to CTI Server from Session 5, i.e. the Finesse Server. CTI Server responded to the request with PeripheralErrorCode:12005.
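
Request and response can also be tied together by InvokeID rather than by timestamp alone. Note that 0x4e08 is 19976 in decimal, which matches the invokeID=19976 in the Finesse webservices log line. The Python below is a sketch of that pairing; `pair_by_invoke_id` is a made-up helper and the regular expression is fitted only to the SESSION lines shown in this example.

```python
import re

# Matches the first line of a SESSION message, e.g.
# "SESSION 5: MsgType:SET_AGENT_STATE_REQ (InvokeID:0x4e08 ..."
MSG = re.compile(r"SESSION (\d+): MsgType:(\w+) \(InvokeID:(0x[0-9a-fA-F]+)")

def pair_by_invoke_id(lines):
    """Group message types by (session number, decimal InvokeID)."""
    dialogs = {}
    for line in lines:
        match = MSG.search(line)
        if match:
            session, msg_type, invoke_hex = match.groups()
            key = (int(session), int(invoke_hex, 16))
            dialogs.setdefault(key, []).append(msg_type)
    return dialogs

log = [
    "14:43:48:898 cg1A-ctisvr SESSION 5: MsgType:SET_AGENT_STATE_REQ "
    "(InvokeID:0x4e08 PeripheralID:5001 AgentState:LOGIN",
    "14:43:48:900 cg1A-ctisvr SESSION 5: MsgType:CONTROL_FAILURE_CONF "
    "(InvokeID:0x4e08 FailureCode:CF_GENERIC_UNSPECIFIED_REJECTION",
]
dialogs = pair_by_invoke_id(log)
print(dialogs)
```

Any key whose list ends in CONTROL_FAILURE_CONF is a request that CTI Server rejected, and the decimal InvokeID lets you jump straight back to the matching Finesse line.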

To determine which session belongs to your Finesse server, you can use procmon. Using procmon we can verify that Session 5 has a client ID of Finesse.  Where is the IP address of our primary Finesse Server.

C:\icm>procmon ucce cg1a ctisvr


Session   Time Ver Flags   ClientID         AgentID AgentExt Signature


       5 67:44:42 16   AUX R Finesse     Finesse  (

       7 66:53:16 16   AUX R Finesse     Finesse   (


Note: Detailed instructions on how to use Procmon can be found here.

Peripheral Error Code Descriptions

A quick way to obtain the description of a UCCE Peripheral Error Code is to log on to a UCCE system, open a command prompt, navigate to the C:\icm\bin directory, and run checke followed by the peripheral error code that you have identified.  In this example we would use c:\icm\bin>checke 12005

C:\icm\bin>checke 12005

Error 12005


   Level 1 = Login could not be performed - Possible causes are Invalid Instrument; Media Termination Problem or other CM issue

   Level 2 = AddCallObserver failed - Please see PIM log for more details

   Level 3 =


This command will provide a description of the error code and potential causes for the error.  In our example agent 2005 attempted to log in with an Invalid Instrument.



Jackson Braddock Wed, 02/05/2014 - 16:12

Goran, thanks for the great detailed answer!  One more question for you. If my CIM (Cisco Interaction Manager) is integrated with UCCE, what is the best point to start troubleshooting activity routing issues? Thanks again!


Goran Selthofer Thu, 02/06/2014 - 05:05

Hi Jackson,

You are more than welcome! Feel free to use rating system below each answer so that others can benefit from useful answers...

Now, my favorite topic - "routing issues in CIM-ICM environment" - thanks for asking!!!

OK, this one is very simple, let me explain:

- CIM can work standalone but can also work integrated with ICM

- in case CIM is standalone, CIM will make all routing decisions

- in case CIM is integrated with ICM, CIM will extend routing decisions to ICM. That is why the CIM-side service which makes this possible is called EAAS - EXTERNAL Agent Assignment Service.

- So, EAAS talks with the PIM on the MRPG side of ICM. This makes life much easier, as they use MRI (the MR Interface), hence they have some standardized behavior.

- Now, exactly that PIM on your MRPG is THE border line at which to start troubleshooting routing issues.

- AND it is VERY SIMPLE - here is how you do it:

* First enable MR tracing on that MR PIM. Let's take as an example that the ICM instance name is ACME and the MRPG node is PG2A, where PIM1 is the MR PIM talking to CIM.

So, open a command line on the MRPG box, procmon to that PIM, and enable all MR tracing:

> procmon acme pg2a pim1

>>>>ltrace (this will list traces currently enabled)

>>>>trace *.* /off (first you want to disable everything which is enabled currently to avoid noise in logs)

>>>>trace mr* /on (this enables ALL MR traces)

mr_msg_config                         1       On
mr_msg_comm_session            2       On
mr_heartbeat_messages            3       On
mr_msg_incoming_mr                4       On
mr_msg_outgoing_mr                 5       On
mr_msg_incoming_inrc               6       On
mr_msg_outgoing_inrc                7       On
mr_msg_outgoing_csta               8       On
mr_function_call                         9       On
mr_ECC_variables                      10      On

>>>>trace *heart* /off (this DISABLES HEARTBEATS as they would be too noisy in the logs)

mr_heartbeat_messages            3       Off


So, in a few seconds, with the above commands, you have enabled MR tracing.

To make a long story short:

- when a new email comes in, RX will pull it into CIM and then, after going via the DB to generate an ActivityID, it will eventually hit the CIM Workflow and there it will reach the integrated queue. That will make EAAS send the activity to ICM for routing.

What exactly happens is that EAAS sends a NEW_TASK message to ICM via the MR PIM:

08:56:30 pim1     Trace: Application->PG:
Message = NEW_TASK; Length = 73 bytes
   DialogueID = (1) Hex 00000001
   SendSeqNo = (1) Hex 00000001
   MRDomainID = (5002) Hex 0000138a
   PreviousTask = -1:-1:-1
   PreferredAgent = Undefined
   Service = (0) Hex 00000000
   CiscoReserved = (0) Hex 00000000
   ScriptSelector: EIM_SS
ECC Variable Name:
Value: 1041

NEW_TASK needs to provide MRD ID and ActivityID.

If there are no available agents for this activity ICM will fail to route it and will send NEW_TASK_FAILURE_EVENT:

09:56:30 pim1     Trace: MR_Peripheral::On_Router_DialogFail:
DIALOG_FAIL  RCID=5001 PID=5001 FailureType=2 NumOfEvents=1 DID=1 DIDRelSeqNo=0 ReasonCode=11
09:56:30 pim1     Trace: Function==>MR_Peripheral::Send_ToApp_NewTaskFailureEvent
09:56:30 pim1     Trace: PG->Application:
Message = NEW_TASK_FAILURE_EVENT; Length = 12 bytes
   DialogueID = (1) Hex 00000001
   SendSeqNo = (1) Hex 00000001
   ReasonCode = (209) Hex 000000d1

However, if all is good, the expected behavior is that ICM sends a DO_THIS_WITH_TASK message with the exact Agent ID back to the application.
After this, CIM is in charge of assigning the activity to that agent.

Here is the example: activity 1057 routed to agent 5010 (SkillTargetID from the t_Agent table on the ICM side), which is 1004 on the CIM side (USER ID from the EGPL_USER table).

13:51:15 pim1     Trace: Application->PG:
Message = NEW_TASK; Length = 73 bytes
   DialogueID = (2) Hex 00000002
   SendSeqNo = (1) Hex 00000001
   MRDomainID = (5002) Hex 0000138a
   PreviousTask = -1:-1:-1
   PreferredAgent = Undefined
   Service = (0) Hex 00000000
   CiscoReserved = (0) Hex 00000000
   ScriptSelector: EIM_SS
ECC Variable Name:
Value: 1057


13:51:15 pim1     Trace: PG->Application:
Message = DO_THIS_WITH_TASK; Length = 90 bytes
   DialogueID = (2) Hex 00000002
   SendSeqNo = (1) Hex 00000001
   IcmTaskID = 150881:202: 1
   SkillGroup = (5013) Hex 00001395
   Service = Undefined
   Agent = (5010) Hex 00001392
   AgentInfo: 1004
   Call Variable 1:
   Call Variable 2:
   Call Variable 3:
   Call Variable 4:
   Call Variable 5:
   Call Variable 6:
   Call Variable 7:
   Call Variable 8:
   Call Variable 9:
   Call Variable 10:
ECC Variable Name:
Value: 1057

So, bottom line, start from the MR logs to see if NEW_TASK and DO_THIS_WITH_TASK are present for the same activity. If yes, then the ICM job is done and the issue is on the CIM side. If DO_THIS_WITH_TASK is not there, then it fails on the ICM side.
This will point you further to the side you need to investigate more.
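The check described above can be sketched in a few lines of Python. This is only a rough illustration, assuming the MR PIM trace looks like the excerpts pasted in this post ("Message = ...;" followed by an indented "DialogueID = (n)" line). The function name and outcome strings are made up for the example, and DialogueID values may be reused after a restart, so run it on one log window at a time.

```python
import re

MESSAGE_RE = re.compile(r"Message = (\w+);")
DIALOGUE_RE = re.compile(r"DialogueID = \((\d+)\)")

def classify_dialogues(log_text):
    """Map each DialogueID that started with NEW_TASK to the outcome seen in the log."""
    outcomes = {}
    current_message = None
    for line in log_text.splitlines():
        message = MESSAGE_RE.search(line)
        if message:
            # Remember the message type until its DialogueID line shows up.
            current_message = message.group(1)
            continue
        dialogue = DIALOGUE_RE.search(line)
        if not dialogue or current_message is None:
            continue
        dialogue_id = int(dialogue.group(1))
        if current_message == "NEW_TASK":
            outcomes.setdefault(dialogue_id, "no response from ICM")
        elif current_message == "DO_THIS_WITH_TASK" and dialogue_id in outcomes:
            outcomes[dialogue_id] = "routed by ICM, check CIM side"
        elif current_message == "NEW_TASK_FAILURE_EVENT" and dialogue_id in outcomes:
            outcomes[dialogue_id] = "failed on ICM side"
        current_message = None
    return outcomes

# Shortened copy of the trace lines shown earlier in this post.
SAMPLE = """\
08:56:30 pim1     Trace: Application->PG:
Message = NEW_TASK; Length = 73 bytes
   DialogueID = (1) Hex 00000001
09:56:30 pim1     Trace: PG->Application:
Message = NEW_TASK_FAILURE_EVENT; Length = 12 bytes
   DialogueID = (1) Hex 00000001
13:51:15 pim1     Trace: Application->PG:
Message = NEW_TASK; Length = 73 bytes
   DialogueID = (2) Hex 00000002
13:51:15 pim1     Trace: PG->Application:
Message = DO_THIS_WITH_TASK; Length = 90 bytes
   DialogueID = (2) Hex 00000002
"""

for dialogue_id, outcome in classify_dialogues(SAMPLE).items():
    print(dialogue_id, outcome)
```

Any dialogue left as "no response from ICM" is one where NEW_TASK was seen but neither reply appeared in the log window you fed in.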

I hope this helps!


oabulaban Thu, 02/06/2014 - 08:29

Hi Goran,

I have a question about redundancy and failures... how does UCCE handle it if the connection between side A and side B gets disconnected (both public and private WAN links)? And how can one recover the data after they reconnect?

also, how is the "RecoveryKey" calculated, and is it possible to reset it manually?

Sent from Cisco Technical Support Android App

Goran Selthofer Thu, 02/06/2014 - 11:13


Thanks for your questions!

I hope you find this event useful!

Ok, since there are more and more questions coming I will have to limit my answers to shorter ones

Ok, so your first question is answered in SRND guide in this chapter:

Check under "Response to failures of both networks"

However, I also invite you to read all other scenarios there as it really describes what happens when only Private network fails, or only Public network fails... or when only one Logger fails, or PG...

So, it will tell you that the PG can buffer some data, and that the Logger can be down for up to 12 hours; if it is down longer, you will need to do a MANUAL sync of the DBs, etc.

I am sure you will get very useful information from that document!

Now, your question about RecoveryKey calculation. I will use explanation I learned from BU while working on some case:

RecoveryKeys are numbers automatically generated by the CallRouter and once the data is replicated to the HDS are no longer used.
So, the system automatically generates the Recovery Key in all of the ICM historical tables as they are written from the "temp" tables into the real ICM tables by the Recovery process.  This process generates the key based on the date/time of the Logger (down to the nano-second) and it figures out the number of seconds that have passed since the starting date of 1/1/1995.  This gives us a Julian Date in Seconds from our starting point to be able to compare dates/times.

This seconds value is multiplied by 1,000 to create the actual key, based on the fact that we don't expect to be able to write more than 1,000 records per second into any one table.  Each time a table is written to, the key is incremented for the next record within the same table.  This key is set when the Logger HistLogger process starts on the Logger; it even spits out a message telling you that this is the new recovery key for all historical data.

The recovery keys are kept unique by this concept on a table-by-table basis: every table starts with the same key value, then it is incremented in that table by one on every record written in that table.
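To make the arithmetic concrete, here is a small Python sketch of the calculation as described above. It is only an illustration of the explanation, not the actual Logger code: the function names are invented, and time zone handling and rounding of the sub-second part are assumptions.

```python
from datetime import datetime, timedelta

# Starting date named in the explanation; time zone handling is an assumption.
RECOVERY_KEY_EPOCH = datetime(1995, 1, 1)
RECORDS_PER_SECOND = 1000  # the multiplier described in the explanation

def initial_recovery_key(logger_start):
    """Approximate the base key set when the Logger process starts."""
    elapsed_seconds = int((logger_start - RECOVERY_KEY_EPOCH).total_seconds())
    return elapsed_seconds * RECORDS_PER_SECOND

def approximate_start_time(recovery_key):
    """Reverse the calculation: roughly when was the base for this key set?"""
    return RECOVERY_KEY_EPOCH + timedelta(seconds=recovery_key // RECORDS_PER_SECOND)

base = initial_recovery_key(datetime(2014, 2, 6, 11, 13, 0))
next_three_keys = [base + n for n in range(3)]  # each record written to a table adds one
print(base, next_three_keys, approximate_start_time(base))
```

Because the base only moves forward with the clock and each table increments from it independently, keys within one table stay unique and ordered.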

When the data is then replicated from the Logger to the HDS, it is not written based on the recovery key, it is written based on the table itself--in alphabetical order taking each new item in their recovery key order within the table and writing to the HDS.

So, answer to your question - can you manually reset those keys - is ABSOLUTELY NO.

I hope this helps!



Kishore Bhupal Thu, 02/06/2014 - 09:49

Hi Goran,

I've been trying to get Finesse up and running, but I'm getting the invalid user id/password error when I try to log in. I've entered valid parameters in the Finesse admin page, and have also verified the user mapping of the awdb in SQL Server Management Studio (I'm using Windows authentication).

Is this related to awdb connection?


Kishore B

Goran Selthofer Thu, 02/06/2014 - 11:17

Hi Kishore!

Thanks for the question!

I had my colleague Zaid Salama preparing this while still being very busy with his own work so I want to thank him for that!

Ok, so here we go:

Regarding this question, I believe we will need more information on the Finesse version, UCCE version, etc. However, from a first look at the description, I would expect that the issue is related to the AW. If you are sure that the username and password are correct, the user has the correct privileges, and the connectivity between both sides is good, it might be that you are using NTLMv2 authentication on the AW.

The documentation defect "CSCuj95347 Document Finesse JDBC driver cannot authenticate using NTLMv2" confirms that NTLMv2 is not supported by Finesse, as Finesse uses a third-party driver and that driver doesn't support NTLMv2. The resolution is to disable NTLMv2 on the AW, which can be done as follows:

1) Disable NTLMv2 on the AW server hosting the AWDB and reboot the AW server.

2) Administrative Tools > Local Security Policies > Security Settings > Local Policies > Security Options

3) Network Security: LAN Manager authentication level has "Send NTLMv2 response only" - Choose "Send LM & NTLM responses" or any other option that does not require NTLMv2-only responses.

4) Network Security: Minimum session security for NTLM SSP based (including secure RPC) clients has "Require NTLMv2 session security" - UNCHECK this

5) Network Security: Minimum session security for NTLM SSP based (including secure RPC) servers has "Require NTLMv2 session security" - UNCHECK this

6) Reboot AW server.

Hope this helps!

Omar Deen Thu, 02/06/2014 - 21:13

Hi Goran,

Thank you for doing this

I've always wondered... why doesn't TCD, or any other table within an ICM database, hold the information as to who disconnected the call? I find it cumbersome to have to pull CDR logs and try to correlate them to TCD records just to find out who disconnected the call. I'm sure there's a good reason for this and I'd certainly like to hear your input on it.

Additionally, is there an easier way to correlate TCD and CDR records? Can RTMT play a role here?

Similar to what was asked here:


Goran Selthofer Thu, 02/06/2014 - 23:42

Hi Omar!

Welcome to the party!

Yeah, "who disconnected call" investigation was always part of the call control side of things... historically, ICM worked with ACDs and never had to worry about call control hence that part was always left out since from ICM point of view it was not something to be concerned of as it is irrelevant for ICM calculations.

Well, times are changing so if there is real demand for this I believe that pressure will and should come from customer's side by contacting local Cisco Account Managers and asking them to open Product Enhancement Request (PER) for that. Of course, BU will need them to provide some business case / justification. As it is less of a technical than capacity and priority issue. Keep in mind that it will require developer hours and possible protocol changes. That is why there is no action on that side as risks are higher than gain (considering call control part already has that info).

OK, but now, your real question about how to correlate TCD and CDR.

Not sure if you have seen this, but more than 3 years ago I already answered that question here:

I hope that will give the complete answer as I have outlined Cradle to Grave mapping there.

In case someone is not able to access that, I will paste it here:

Mapping CDR to TCD


The UCManager CDR does have a mapping to ICM Termination Call Detail record, but you need to do some conversion.

The CDR globalCallID_CallManagerID and globalCallID_CallId are combined to create the TCD PeripheralCallKey.

The globalCallID_CallManagerID is moved into the high order byte. To shift it over properly, multiply it by the hexadecimal value 1000000 and add it to the globalCallID_CallId.


(globalCallID_CallManagerID * 0x1000000) + globalCallID_CallId = PeripheralCallKey

These IDs are not unique because the same PeripheralCallKey and CallID are re-used in redirect, transfer and conference scenarios.

Also, this only works within a single cluster. So in a multiple cluster environment, you need to map Cluster CDRs to a specific PeripheralID.
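As a quick illustration of the conversion, here is a short Python sketch. The function names are invented for the example, and the reverse split assumes the globalCallID_CallId value is below 0x1000000, otherwise it would spill into the high order byte.

```python
HIGH_BYTE_SHIFT = 0x1000000  # multiplier that moves the CallManagerID into the high order byte

def peripheral_call_key(call_manager_id, call_id):
    """Combine the two CDR fields into the value stored as TCD PeripheralCallKey."""
    return call_manager_id * HIGH_BYTE_SHIFT + call_id

def split_peripheral_call_key(key):
    """Reverse direction: recover (globalCallID_CallManagerID, globalCallID_CallId)."""
    return divmod(key, HIGH_BYTE_SHIFT)

key = peripheral_call_key(call_manager_id=1, call_id=100)
print(key)                             # 16777316
print(split_peripheral_call_key(key))  # (1, 100)
```

In a multiple cluster environment you would still need to pair the result with the right PeripheralID before matching it against TCD rows.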

Cradle to Grave Call Tracking in ICM


The RouterCallKeyDay, RouterCallKey, and RouterCallKeySequenceNumber will track a call from its first route until its final call leg.

The RouterCallKeyDay and RouterCallKey combine to provide common attribute across the calls.

The RouterCallKeySequenceNumber gives you some sense of order of when calls were created. (gselthof: so note, 'some sense' is not guaranteed order!!!)

In a multi-peripheral environment, this requires routing between peripherals. This means calls to the IVR need to be translation routed, and calls to other agent clusters need to be routed as well.

Identifying Routed Agent TCDs


You will want to filter out the TCDs created for the CVP call legs, and the TCDs generated for internal agent-to-agent calls.

Use the AgentSkillTargetID to identify the agent, SkillGroupSkillTargetID to identify the SkillGroup, and CallTypeID to identify the Call Type / program.

If all three of these values are filled in, you know you got a call that was routed to an agent.

Sometimes more than one TCD will meet these three criteria for the same PeripheralCallKey. In those cases, the one with the lowest RouterCallKeySequenceNumber will identify the first call answered by the agent.
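The filtering rule above can be sketched like this in Python, working on TCD rows that were already fetched as dictionaries. Treating "filled in" as "not NULL (None)" is an assumption, and the function name is made up for the example.

```python
REQUIRED_COLUMNS = ("AgentSkillTargetID", "SkillGroupSkillTargetID", "CallTypeID")

def first_routed_agent_legs(tcd_rows):
    """Per PeripheralCallKey, keep the routed-to-agent TCD with the lowest sequence number."""
    first_legs = {}
    for row in tcd_rows:
        # Skip CVP legs and internal calls: at least one of the three values is missing.
        if any(row.get(column) is None for column in REQUIRED_COLUMNS):
            continue
        key = row["PeripheralCallKey"]
        best = first_legs.get(key)
        if best is None or row["RouterCallKeySequenceNumber"] < best["RouterCallKeySequenceNumber"]:
            first_legs[key] = row
    return first_legs

rows = [
    {"PeripheralCallKey": 16777316, "RouterCallKeySequenceNumber": 0,
     "AgentSkillTargetID": None, "SkillGroupSkillTargetID": None, "CallTypeID": 5000},
    {"PeripheralCallKey": 16777316, "RouterCallKeySequenceNumber": 2,
     "AgentSkillTargetID": 5010, "SkillGroupSkillTargetID": 5013, "CallTypeID": 5000},
    {"PeripheralCallKey": 16777316, "RouterCallKeySequenceNumber": 1,
     "AgentSkillTargetID": 5011, "SkillGroupSkillTargetID": 5013, "CallTypeID": 5000},
]
print(first_routed_agent_legs(rows)[16777316]["AgentSkillTargetID"])  # 5011
```

Since PeripheralCallKey values are re-used, in practice you would run this per day or per RouterCallKeyDay rather than over the whole table.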



The CallDispositionFlag is the best indicator to find out if a call was handled or not. There are a bunch of CallDispositions. The CallDispositionFlag distills the results down to 7 categories.

You can find details on what the CallDispositionFlags are in the schema help or schema guide.

I hope this helps!


Maheshwar Tayal Fri, 02/07/2014 - 04:00

Hi Goran, Thanks for details, Can you please confirm if SIP refer is supported for call transfer with UCCE considering we have CUBE as well as 3rd party SBC

And My 2nd Question is for Blending support, inbound and outbound voice calls for same agent. Do we need specific license for same.


Goran Selthofer Fri, 02/07/2014 - 09:25

Hi Maheshwar!

Thanks for participating!

Let me try to address this by pointing to this link:

That is where we confirm when SIP refer transfer is supported. Please check.

For the second question, if it is about Outbound Option (Dialer) then I believe yes, you need special licenses for that. You can check ordering guide here:

I am not from Pre-Sales or Sales side but I can see this is being mentioned there: Unified CCE Agent Licenses for Voice Applications with Outbound Option

Unified CCE Outbound Option requires purchase of at least Unified CCE Enhanced or Premium Agent voice application licenses and the appropriate number of Dialer Port Licenses.


Have a great weekend!



Omar Deen Fri, 02/07/2014 - 09:23

Thanks for the in-depth reply!

oabulaban Sun, 02/09/2014 - 00:38

Thanks Goran for your clarification and links for failover and RecoveryKey

A couple of things related here:

- From my understanding about your explanation of how RecoveryKey is generated, does this mean that the RecoveryKey of ANY UCCE system is similar? As it depends on time, this means that today's RecoveryKey is always bigger than yesterday's, even if those were 2 different clusters at 2 different customers?

--> if that's the case, then why in a technology refresh system I can't migrate the HDS table AFTER the logger? This is based on the upgrade guide of UCCE, where all scenarios of upgrade have the HDS being upgraded first (or at the same time as the logger)

- In case side A and side B historical data (from loggers and also hds) are not equal due to some failure happening at some point (we checked icmdba and number of records is different), what is the best way to fix that?

Goran Selthofer Mon, 02/10/2014 - 11:20


Ok, let me get to the answers quickly:

- Yes. But they are independent between systems. One system will not create exactly the same keys as another system. However, it is true that the RecoveryKey can only increase, looking from its own initial base on that system. RecoveryKeys between different customers should not even be compared, as that is not something which can or should be used anyway.

- Install/Upgrade Guide:

If you complete the upgrade of the main Administration & Data Server within the Logger purge window (usually 14 days), you can replace the temporary Administration & Data Servers with the upgraded Administration & Data Servers for reporting. The data replication process fills in any missing data.

So that means that since you would probably need your AW for some tasks during the upgrade, that you migrate it before/at the same time. However, if you don’t want then you can setup NEW TEMPORARY one for that and then migrate real AW/HDS later as said above.

- The recommendation is that you don’t bother with data holes, as that is why you have an HDS on each side. If one side has a data hole, you can just point to the other side and take reports from the side which has the data. That is anyhow limited to the time you need reports for that particular missing period. You collect the reports and you are done. Not sure why you would really need to keep them in total sync, as they are there to compensate for those data holes. If you do want to keep them in total sync, then:

Stop ICM services on HDS1 and do a full backup of the HDS1 database.

Start services on HDS1 and copy the backup file over to HDS2.

Stop services on HDS2.

Delete the HDS DB on HDS2 and recreate it with the same DATA and LOG sizes as on HDS1, so that you can restore the HDS1 backup on the HDS2 box (ICMDBA has a 32GB limit, so once you create the DB with ICMDBA, use SQL Studio to expand the files to match the HDS1 settings).

Restore the HDS1 backup on HDS2.

Truncate the recovery table on HDS2 and start services on HDS2.

Now you have both with the same data.



Shalid Kurunnan... Sun, 02/09/2014 - 21:25

Dear Goran,

Thank you for initiating the session and well explanation of the solutions.

having a query on CAD.

Frequently, the agent log statistics in Supervisor Desktop don't display anything; they show blank. Could you please let us know how we can troubleshoot this issue? We frequently restart the Cisco Enterprise service and the Recording and Statistics service, and if the issue is not resolved we go for the Chat service and Sync service.

But sometimes the issue doesn't get resolved, so we kindly request that you explain how it works and what is responsible for this.



Goran Selthofer Mon, 02/10/2014 - 11:23

Hi Shalid,

Thanks for the question!

OK, so CAD is one of the components which is integrated with CTI/CTIOS levels on ICM side. However, CAD is managed on its own with separate NT services and tools (like PostInstall).

Now, you didn’t send the version of your CAD (as there has been some changes in replication) but in general, Recording and Statistics Service is responsible.

Also, not sure if you are using Flat Files or SQL replication for RASCAL as that works totally in a different way. Also, people often mix LDAP and RASCAL replications.

Be Aware! LDAP and RASCAL are separate and independent databases.

  • In current versions, RASCAL uses XML files (flat files) or SQL on the UCCE PG as the datastore (Informix in UCCX). It stores data in three tables: FCRasRecordLog, FCRasCallLogWeek and FCRasStateLogToday. Flat files are in \Program Files\Cisco\Desktop\databaseTeamName folder on both the Primary and Secondary CAD Servers.
  • Sync Service uses LDAP (OpenLDAP), which syncs with the ICM AWDB to pull in agent, team and skill config information for the CAD Logical Call Center (LCC). Additionally, workflow group and phonebook customization is also stored within LDAP.

I imagine that you might have issues with RASCAL replication in your environment and that with the restarts you are just triggering the working side back. Due to the limitations of this ask/answer session, I would invite you to open a case with Cisco TAC when you experience such an issue so that it can be troubleshot. In brief, Flat Files are NOT a guaranteed mechanism for retaining statistics.

As a first aid, I can offer often used procedure done to re-establish broken or corrupted replica:

Login as LOCAL USER ADMIN on PGA side.

Start Post Install.

UNCHECK Rascal replica (select ‘Off’ for Recording and Statistics Replication and click Apply)

Follow prompts.

Once that is done then close Post Install and launch it again.

Now select ON for the same option and follow prompts.

This should re-establish replica if everything else is good.



oabulaban Mon, 02/10/2014 - 00:17

Hi Goran,

When it comes to UCCE (ICM + CVP in comprehensive mode), what is the proper way to configure the system to be able to conference & warm transfer to an IVR, while maintaining the CTI data?

Note that we usually go for "send back to originator" configuration only when it comes to the network vru label. Is this correct?

Also, would that method require additional resources (media resources, licenses, ..) ?


Goran Selthofer Mon, 02/10/2014 - 13:21


Hmmm… interesting questions… although this topic is limited to UCCE side of a story in order to avoid CVP as a potentially big session on its own, let me try to address this with quick answer here:

First, let’s clarify what UCCE means just to avoid confusion. UCCE is not ICM+CVP. UCCE is ICM+CUCM. And ICME is ICM+TDM ACD.

However, I do acknowledge that we have the same issue within some of our documents when referring to one or another.

Ok, now, let’s see about the transfers!

Not sure about the exact call flow and requirements there but here is where we specify which transfers are supported:

Transfer and queue calls with Unified CVP

Now, specifically for this there is a special chapter:

Unified ICME Warm Consult Transfer/Conference to Unified CVP

As far as ‘send to originator’ option goes, this is what needs to be known:

Three types of DNs work with Send To Originator: VRU label returned from ICM, Agent label returned from ICM, and Ringtone label.

Send To Originator does not work for the error message DN because the inbound error message is played by survivability and the post-route error message is a SIP REFER. (Send To Originator does not work for REFER transfers).

Note: For Send To Originator to work properly, the call must be TDM originated and have survivability configured on the pots dial peer.

On top of that, there have been lots of discussions around this in the past on the Cisco forums. One of the best contributors there is Geoff, and here you can find some example steps coming from his kitchen:



Cisco TAC

oabulaban Mon, 02/10/2014 - 13:28

Thank you very much for your replies

Sent from Cisco Technical Support Android App

Jay Schulze Tue, 02/11/2014 - 09:43

Hello Goran,

We upgraded a couple clients onto our HCS environment. Since we have had a couple outages where the A side loses connection with the B side. Normally this is related to some network interruption and it appears that way in the logs. However when I look in the system event viewer on the call server I see the following:

Log Name:      System

Source:        Tcpip

Date:          2/10/2014 6:39:07 PM

Event ID:      16501

Task Category: None

Level:         Information

Keywords:      Classic

User:          N/A

Computer:      USPHXXXXX


Computer QoS policies successfully refreshed. Policy changes detected.


The Advanced QoS Setting for inbound TCP throughput level successfully refreshed.  Setting value is not specified by any QoS policy. Local computer default will be applied.

I can match these up before every outage. As you guys know, after 8.5 Cisco switched from packet scheduler based QoS to group policy. So I'm wondering if anyone else has seen this in 9.0. The first time I thought maybe it was a coincidence, but I have since seen it on other outages on completely separate instances. The thing I wonder is whether this is just an effect of an outage, but I see this before it loses connection to the call server's duplexed partner, so I believe it may actually be the cause. Any info you could provide on these messages I would appreciate, because it is the first time we are seeing it, with the upgrade to UCCE 9.0.

Goran Selthofer Wed, 02/12/2014 - 06:30

Hi Jay!

Thanks for joining!

So, here is more info from the MS site, for anyone reading this who wants to have all the info in one place:

Event ID 16501 — QoS Policy Update

Quality of Service (QoS) policies are applied to a user or computer account by using Group Policy. The QoS policies are applied to a Group Policy object (GPO), which is then linked to an Active Directory container, such as a domain, site, or organizational unit (OU), that contains the user or computer account.

Event ID 16504 — Advanced QoS Settings

"Advanced Quality of Service (QoS) settings provide additional controls for IT administrators to manage computer network use and DSCP markings. Advanced QoS settings apply only at the computer level, whereas QoS policies can be applied at both the computer and user levels."

OK, honestly, I cannot say I have seen this in 9.0  yet but I did see that in 8.5. Maybe 3-4 times so far.

One was related to the following known issue but that got resolved with 8.5.4 and 9.0.1:

but that is only if you also see something like this in the event viewer logs:

Tcpip  16710    None    "QoS failed to read or validate the ""Local Port"" field for the computer QoS policy ""CISCO-ICM-QOS-RPORT-5000***""."

Other cases got resolved by following the recommendations below to the letter (hard to pinpoint which one helped, as sometimes multiple changes are done at the same time):



*** Ensure full redundant supported design is in place: Visible and Private should not share the same NIC, vNIC, vSwitch, switch...:

*** Upgrade to last NIC drivers from NIC vendor or for UCS compatible as per:

*** Disable TOE, TCP Chimney, RSS and IPV6 on all servers. Disable LRO for ESXi ...:


(for CSS)

*** Ensure that the old ICM Security Template policy supplied for Windows 2003 is not applied to Windows 2008. That security hardening on Windows 2008 is neither needed nor supported.

Please refer Security Best Practices Guide for Cisco UCCE 8.x,

"Note: Account policies are overwritten by the domain policy by default. Applying the Cisco Unified ICM Security Template does not take effect. These settings are only significant when the machine is not a member of a domain. Cisco Recommends that you set the Default Domain Group Policy with these settings."

This is not directly related, as the note above is about account policy, but it is about applying Domain Group Policy, which in turn can also carry the QoS features mentioned above.

*** Upgrade of BIOS:

Now, generally, for troubleshooting some of those 'connectivity' issues, you can start by using the Client/Server tools within UCCE or, much better, ICMNetGen. You will find them all within c:\icm\bin.

I am attaching a document to explain the usage of ICMNetGen and the Client/Server tool.



Jay Schulze Wed, 02/12/2014 - 10:54

Thanks Goran. Some really good info in your post. And we've seen lots of problems if your VM settings are incorrect. So that is always the first thing we visit when approaching an issue such as this.

I have to think it's some type of defect. Interesting the defect in previous release. Although we don't see that message. Just one no one has seen yet. And tough to catch because the problems masks itself as a network failure.

We are going to try and re-run setup and disable all the policies. See if that takes care of the issue. If it does, then we know 100% this is the cause and we'll have to work with TAC on the defect.

But thank you for taking a look and for the docs. Really appreciate the NetGen tool. Have never used that.

Omar Deen Tue, 02/11/2014 - 11:45

Hi Goran,

I understand that MS SQL Server will consume as much memory as possible since it wants to cache as much as possible into memory; however, I'm having a difficult time interpreting the Serviceability Best Practices Guide. I have a customer that's concerned with the memory utilization on their AW-HDS and Rogger, but I need to be able to decipher the formula in the guide to really dig into this more.

  1. Is memory utilization a good indicator to monitor strenuous events the server might be experiencing? I say no, but I'd like to hear the expert's opinion.
  2. Referencing the image below:


Let's take for example a Rogger that has 6GB of memory (6442450944 bytes). I will use the numbers displayed on the Rogger as of this writing:

Committed: 6442450944 bytes

Utilized: 6123421696 bytes

Free: 319029248 bytes

According to the image, anything less than 20% is crossing the threshold for available memory on the server.

(6442450944 - 6123421696) = 319029248, and 319029248 / 6442450944 * 100 = ~4.95, or 5%, which matches exactly what's in Task Manager. So the server is significantly lower than the 20% threshold, but again, how can this be a real indicator if SQL Server will take as much as it can? Am I misinterpreting all this? What exactly would be a good indicator, from an infrastructure perspective, that would tell me that a server is in fact healthy? I've attempted to leverage the counters that are collected by the node manager and saved to c:\icm\log, but again, I'm not sure what exactly is a good indicator to prove a healthy system.

Thank You

Goran Selthofer Thu, 02/13/2014 - 10:04

Hi Omar!

Sorry for the delay but you got me with these calculations now!

Ok, I am not sure where you got the figures for that calculation, as I was trying to use the same approach in my lab.

I will paste a screenshot here so that it is clear which figures I used.

Now, one very important thing there. When it is mentioned like "Measurement Counter: Memory – Committed Bytes" that means Windows Perf Monitor counter.

Also, check page 104 in that document, under "8.1 Platform Health Monitoring Counters" there is a table for health monitoring. It lists Performance Objects. So, those are Windows Perf counters as well. You can use those for health check but with note below about SQL.

Now, calculation. Here is my snapshot:

2014-02-13 18_35_24-bru-vaas-vc - vSphere Client.png

MEM physical = 6143 MB and that is 6441402368 bytes

MEM Sat = 80% of MEM physical and that is 4914 MB which is 5153121894 bytes.

My MEM 95% is Committed Bytes = 4833980416 bytes

So Mem p = 4833980416 / 5153121894 * 100 = 93.8

If you want to check Indicator Counter: Memory - Available Bytes with 20% threshold then my calculation is:

Counter: Memory - Available Bytes = 1756 MB
Total Memory: 6143 MB
Current threshold: 1756/6143*100=28.6%
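For anyone who wants to repeat these two calculations with their own Perf Monitor counters, here is a small Python sketch. The function names are made up for the example; the 80% saturation figure and the 20% available-memory threshold are the ones discussed in this thread.

```python
MB = 1024 * 1024

def committed_pct_of_saturation(committed_bytes, physical_bytes, saturation=0.80):
    """Committed Bytes as a percentage of MEM Sat (80% of physical memory)."""
    return committed_bytes / (physical_bytes * saturation) * 100

def available_pct(available_bytes, physical_bytes):
    """Available Bytes as a percentage of physical memory (compare against 20%)."""
    return available_bytes / physical_bytes * 100

physical = 6143 * MB  # 6441402368 bytes, as in the lab snapshot

print(f"{committed_pct_of_saturation(4833980416, physical):.1f}")  # 93.8
print(f"{available_pct(1756 * MB, physical):.1f}")                 # 28.6

# The Rogger figures from the question: 319029248 bytes free out of 6442450944.
print(f"{available_pct(319029248, 6442450944):.1f}")               # 5.0
```

Feed it the Memory - Committed Bytes and Memory - Available Bytes counters rather than Task Manager figures, since those are the counters the guide refers to.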

In the end, SQL is indeed taking as much as it can and that is of no concern, as per Microsoft (KB article 321363):

"When you start Microsoft SQL Server, SQL Server memory usage may continue to steadily increase and not decrease, even when activity on the server is low. Additionally, the Task Manager and the Performance Monitor may show that the physical memory that is available on the computer steadily decreases until the available memory is between 4 MB and 10 MB.

This behavior alone does not indicate a memory leak. This behavior is typical and is an intended behavior of the SQL Server buffer pool.

By default, SQL Server dynamically grows and shrinks the size of its buffer pool (cache), depending on the physical memory load that the operating system reports. As long as sufficient memory (between 4 MB and 10 MB) is available to prevent paging, the SQL Server buffer pool will continue to grow. As other processes on the same computer as SQL Server allocate memory, the SQL Server buffer manager will release memory as needed. SQL Server can free and obtain several megabytes of memory each second. This allows for SQL Server to quickly adjust to memory allocation changes."

On top of that, if there is a legitimate concern about low memory and paging happens often, then adding extra memory to the VM is totally fine.



Omar Deen Thu, 02/13/2014 - 10:28

Wow, comprehensive and detailed. Thank you for the reply Goran!

Goran Selthofer Fri, 02/14/2014 - 09:01

hehehe... thanks Omar! Please feel free to use those stars below the answer to let me know about the value of it!

and again, many thanks for engaging and asking those questions so that others can also benefit from it!

Kishore Bhupal Tue, 02/11/2014 - 23:51

Hi Goran,

My question is about the AGPG failover mechanism..we faced a problem in one of our recent implementation where AGPG side B would not take over if  we shut the services on side A, finally we had to involve TAC to sort it out.

He changed a set of registry settings on both the sides and it started working.

I wanted to know what all registry settings would I have to look into in such situations.



Goran Selthofer Wed, 02/12/2014 - 07:00

Hi Kishore!

Here is the simplest answer: NONE!

Well, what I mean to say is that there is no requirement to tweak registry settings for 'green field' deployments to make PGs work in duplex.

This is done by running PG Setup.

I am not sure which exact case you are referring to and which exact registry settings were changed, but I can assume the following was the problem:

- ports used by the MDS process were not matching.

Now, this can happen if you try to deploy more than 2 PGs on the same box (yeah, I know... customers will always say "No, we did not do it" ).

But I am just giving one possible example. In that case, since only 2 are supported per box, MDS ports might get reused. Also, if you go back and forth in the installer changing PG numbers and/or sides, you can end up in a similar situation due to the fact that the previous setup was not yet finished.

Again, normally there should be no registry tweaks needed to make duplex work, but you can ping me the case number and I will take a look.


Goran Wed, 02/12/2014 - 00:21

Hi  Goran,

Thank you for initiating this discussion.

     My question is what monitoring tools can be used to monitor UCCE in a perfect manner, whether Cisco products or, if you have a suggestion, third-party tools.



Goran Selthofer Wed, 02/12/2014 - 06:41

Hi Mahmoud!

Many thanks for joining this session!

Huh, honestly that would be very much welcomed by TAC as well, but unfortunately there is no 'perfect' solution so far for monitoring UCCE as such.

However, there are lots of tools out there already which let you monitor some parts of the system or solution, like:

- OPS Console in CVP

- Router Log Viewer on AW

- Script Editor in monitoring mode.

- Monitoring connections in CCMP

- Monitors and tools within CIM (Cisco Interaction Manager)

- CUIC reporting

...mentioning of course only available tools within product capable of 'real-time' monitoring or data capturing/reporting if you like...

All of those will give you some info which might be of interest to you, but certainly those are not what you want as the final solution in your NOC.

Now, we do supply SNMP MIBs and so far that has been the best bet for monitoring UCCE. I will leave it to other customers/partners here to comment on whether and how valuable they find it. But certainly, embedding UCCE SNMP within an already deployed SNMP monitoring infrastructure is going to help.

For all above reasons, the last initiative is to try to align with Cisco Prime:

So, keep your eye on that in the future as we do have some plans to provide UCCE/CVP 'monitoring' integration there as well!



Goran Selthofer Fri, 02/14/2014 - 09:04

Thank you ALL for being part of this discussion! I enjoyed it and hope you did as well!

Hopefully, we all and the rest of audience who reads this later will benefit from some of the questions/answers provided during this session!

All the best!


