I have an issue with a resilient pair of CSMs (WS-X6066-SLB-APC) that have been running on a live production system for over two years without a problem. We have a resilient pair of 6509s with a CSM and an SSL module in each.
About a week ago, the Standby CSM started to complain that it couldn't see any heartbeats from the Active CSM with the following messages:
%CSM_SLB-4-REDUNDANCY_WARN: Module 3 FT warning: Standby is Active now (no heartbeat from active unit)
%CSM_SLB-6-REDUNDANCY_INFO: Module 3 FT info: State Transition Active -> Standby
%CSM_SLB-6-REDUNDANCY_INFO: Module 3 FT info: State Transition Standby -> Active
This caused confusion and the Active CSM remained active with the following message:
%CSM_SLB-4-REDUNDANCY_WARN: Module 3 FT warning: Active/Active Collision staying active
Problems then started with the configured serverfarms as ARP error messages were logged with incorrect MAC addresses being reported. The serverfarms were also taken out of service as the health probes failed. The overall effect was that Production went offline!
Because it is a Production network, the emphasis at the time was to get it back up as soon as possible. Within minutes of viewing the logs (and the phone ringing red hot), I shut down the Standby CSM in SW2 with the 'no power enable module 3' command. The Active CSM immediately stopped reporting any issues, and after about 5 minutes the serverfarms became operational again and service was restored.
It took a week to agree an outage with the customer, but I eventually got a 2-hour window to investigate the problem. I swapped out the Standby CSM for another, but the same thing happened when I enabled the power. The FT VLAN was up and active, and to prove the two were communicating, I promoted the Standby CSM by giving it a priority of 30. They both negotiated their new status and the Standby CSM successfully became active, with CSM1 agreeing. I then reverted to the original configuration and the errors started again.
I have had to disable the power to the Standby CSM for the system to stabilize again. We have another resilient pair with identical FT configuration working perfectly at another Data Centre. Can anyone suggest what's gone wrong?
Whilst I've not yet managed to arrange an outage to re-enable the power to the Standby CSM, I have run 'sh mod csm X tech ft' several times and the HB (tx rx) counter is not incrementing at all. Is this expected?
I ran a 'sh ver' and calculated the number of days the switch had been up at the point where the CSMs went active/active, and it was 828 days.
The CSMs are running 4.1(1). The bug mentions it was first seen in 4.1(2), but I think it is still worth considering. The bug also mentions upgrading the firmware to at least 4.1(9.5), and we will now plan an outage to upgrade both CSMs.
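In case anyone wants to repeat the uptime calculation, here is a rough Python sketch of how the "uptime is ..." line from 'sh ver' could be converted to days. The sample string is made up for illustration (it is not the actual output from my switch), the function name is my own, and treating a year as 365 days is an assumption, so the result may be off by a day or so.

```python
import re
from datetime import timedelta

# Days per unit. Counting a year as 365 days is an assumption.
UNIT_DAYS = {"year": 365, "week": 7, "day": 1}


def uptime_to_timedelta(line: str) -> timedelta:
    """Convert an 'uptime is N years, N weeks, ...' line to a timedelta."""
    total = timedelta()
    # Pick up every "<number> <unit>" pair, singular or plural.
    for value, unit in re.findall(r"(\d+)\s+(year|week|day|hour|minute)s?", line):
        n = int(value)
        if unit in UNIT_DAYS:
            total += timedelta(days=n * UNIT_DAYS[unit])
        elif unit == "hour":
            total += timedelta(hours=n)
        else:
            total += timedelta(minutes=n)
    return total


# Hypothetical sample line, chosen only to illustrate the arithmetic.
sample = "SW1 uptime is 2 years, 14 weeks, 0 days, 3 hours, 12 minutes"
print(uptime_to_timedelta(sample).days)  # 828
```

To work out the uptime at the moment of the failure rather than now, subtract the time elapsed since the first log message from the result.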
Do you think that just a power cycle of the active CSM would temporarily resolve the situation?