I think this is a load-related issue. The servers are up, their lights are on, and you can even ping their IP addresses, but are you sure about the state of the port each one is listening on? That port is controlled by the application running on the server, not by the server itself.
As you know, a server that is merely running does not prove that the particular port the CSS regularly health-checks is alive. Do your server monitors check the ports as well?
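If your monitors only ping, a quick way to see the difference is to test the TCP port itself. Below is a minimal Python sketch of such a check. It is not how the CSS performs its keepalive, just an illustration of "host answers ping" versus "port accepts connections". The function name and the 3-second timeout are my own choices.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout.

    A host can answer ping while this still returns False, because the
    listening port belongs to the application, not to the machine.
    """
    try:
        # create_connection performs the full TCP handshake or raises OSError
        # (connection refused, timeout, unreachable host, and so on).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For a mail server you would call something like port_is_open("192.0.2.10", 25), where the address is only a placeholder for your real server IP and 25 assumes SMTP is the service being health-checked.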
The first step is to check Layer 2 and Layer 3 connectivity and make sure both are okay. Then check the health of the servers and whether they have enough resources for the application to run comfortably. A packet capture on that server segment with a utility like Ethereal would give far more detail on who is initiating the TCP sessions and who is failing to respond to, or resetting, those sessions.
The root cause could be that the health check the CSS runs every 5 seconds is failing, so the services go down (only occasionally, in your case). It could be that the mail server is busy processing mail and cannot respond to the CSS health-check queries in time.
As for your question on how to increase the timeout values, I think you are asking how to increase the CSS health-check timer values from the default 5-3-3.
I am against it. In reality there seems to be an issue with the servers that needs attention, and increasing the timers just sweeps it under the carpet instead of resolving it permanently. That said, I have increased those timers in a few special cases, such as banking environments where the mainframe or database server takes time to respond to queries during peak hours.
Try configuring the following under each service:
1/ keepalive frequency - Specifies the keepalive message frequency. The default is 5 seconds (range is 2 - 255 seconds).
2/ keepalive maxfailure - Specifies how many times the service can fail to respond to a keepalive message before it is considered offline. The default is 3 failures (range is 1 - 10).
I would recommend increasing the maxfailure value to 5 or 6 first, before changing the frequency from its default.
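To get a feel for what raising maxfailure buys you, here is a rough back-of-the-envelope calculation in Python. It assumes a simplified model that I am making up for illustration: a probe is sent every "frequency" seconds and the service is marked down after "maxfailure" consecutive misses, so the server has to be unresponsive for about frequency x maxfailure seconds. It ignores any separate retry timer and the time a probe waits for a response, so treat the numbers as estimates, not as exact CSS behaviour.

```python
def seconds_until_marked_down(frequency: int = 5, maxfailure: int = 3) -> int:
    """Estimate how long a service must stay unresponsive before it goes down.

    Simplified model (an assumption, not documented CSS behaviour):
    one probe every `frequency` seconds, down after `maxfailure` misses.
    """
    # Ranges taken from the keepalive options described above.
    if not 2 <= frequency <= 255:
        raise ValueError("frequency must be between 2 and 255 seconds")
    if not 1 <= maxfailure <= 10:
        raise ValueError("maxfailure must be between 1 and 10")
    return frequency * maxfailure


# Compare the default with the suggested maxfailure values of 5 and 6.
for maxfailure in (3, 5, 6):
    estimate = seconds_until_marked_down(frequency=5, maxfailure=maxfailure)
    print(f"frequency=5, maxfailure={maxfailure}: about {estimate} s")
```

Under this model the default tolerates roughly 15 seconds of silence, while maxfailure 6 tolerates roughly 30 seconds without making the probes themselves any less frequent, which is why I would touch maxfailure before frequency.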
It has been more than a week since I added "flow permanent port1 445", and I have not received the annoying email about a missed poll even once. I used to get 2-3 per day before this, so I think this resolved the problem.