CSM Redundancy (FT) issues

Oct 30th, 2007

Hi,

I have an issue with a pair of resilient CSMs (WS-X6066-SLB-APC) that have been running on a live production system for over two years without a problem. We have a resilient pair of 6509s with a CSM and an SSL module in each.

About a week ago, the Standby CSM started to complain that it couldn't see any heartbeats from the Active CSM with the following messages:

%CSM_SLB-4-REDUNDANCY_WARN: Module 3 FT warning: Standby is Active now (no heartbeat from active unit)

%CSM_SLB-6-REDUNDANCY_INFO: Module 3 FT info: State Transition Active -> Standby

%CSM_SLB-6-REDUNDANCY_INFO: Module 3 FT info: State Transition Standby -> Active

This caused an active/active conflict, and the Active CSM remained active with the following message:

%CSM_SLB-4-REDUNDANCY_WARN: Module 3 FT warning: Active/Active Collision staying active

Problems then started with the configured serverfarms: ARP error messages were logged reporting incorrect MAC addresses, and the serverfarms were taken out of service because the health probes failed. The overall effect was that Production went offline!

Because it is a Production network, the emphasis at the time was to get it back up as soon as possible. Within minutes of viewing the logs (and the phone ringing red hot), I shut down the Standby CSM in SW2 with the 'no power enable module 3' command. The Active CSM immediately stopped reporting any issues and, after about 5 minutes, the serverfarms became operational again and service was restored.

It took a week to agree an outage with the customer, but I eventually got a 2-hour window to investigate the problem. I swapped out the Standby CSM for another, but the same thing happened when I enabled the power. The FT VLAN was up and active and, to prove the two were communicating, I promoted the Standby CSM by giving it a priority of 30. They both negotiated their new status and the Standby CSM successfully became active, with CSM1 agreeing. I then reverted to the original configuration and the errors started again.

I have had to disable the power to the Standby CSM for the system to stabilize again. We have another resilient pair with identical FT configuration working perfectly at another Data Centre. Can anyone suggest what's gone wrong?

The FT config is as follows:

CSM1: (ACTIVE)

module ContentSwitchingModule 3

ft group 1 vlan 80

priority 20

failover 3

preempt

!

CSM2: (STANDBY)

module ContentSwitchingModule 3

ft group 1 vlan 80

priority 10

failover 3

preempt

!
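The priority test described above (raising the Standby to 30 so that it took over) can be modelled roughly as follows. This is a simplified sketch only, assuming that a unit with 'preempt' configured takes over when its priority is higher than that of the current active unit and both units are exchanging heartbeats. The names FtUnit and expected_active are made up for illustration and are not part of any Cisco tool, and the model says nothing about what happens when heartbeats are lost.

```python
from dataclasses import dataclass


@dataclass
class FtUnit:
    """One member of an FT group (hypothetical model, not a Cisco API)."""
    name: str
    priority: int
    preempt: bool


def expected_active(current_active: FtUnit, peer: FtUnit) -> FtUnit:
    """Return the unit expected to be active while heartbeats are exchanged.

    Assumption: the peer only takes over if it has preempt configured
    and a strictly higher priority than the current active unit.
    """
    if peer.preempt and peer.priority > current_active.priority:
        return peer
    return current_active


csm1 = FtUnit("CSM1", priority=20, preempt=True)
csm2 = FtUnit("CSM2", priority=10, preempt=True)

# Original configuration: CSM1 stays active.
print(expected_active(csm1, csm2).name)

# Test configuration from the post: CSM2 promoted to priority 30.
print(expected_active(csm1, FtUnit("CSM2", priority=30, preempt=True)).name)
```

Under these assumptions the model agrees with what was observed during the outage window: the pair negotiates correctly whenever both sides can hear each other.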

Thanks

Phil.

Gilles Dufour Tue, 10/30/2007 - 07:18

Phil,

You have to determine whether the active CSM is actually sending those heartbeats.

Apparently, when the standby becomes active, there is no problem, so the standby can send heartbeats correctly.

Get a sniffer trace of the active port-channel and see whether heartbeats are coming out.

Capture 'sho mod csm X tech ft' several times and see whether the HB (tx rx) counters are increasing on both the active and the standby.

Check this for several minutes while capturing the sniffer trace.
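The counter comparison can be done by eye, but as a rough sketch it amounts to the check below. It assumes you have already copied the HB tx and rx values by hand from each capture into (tx, rx) pairs, in the order they were taken. The function name heartbeat_status and the sample numbers are hypothetical, and the sketch deliberately does not try to parse the command output itself.

```python
from typing import Dict, List, Tuple


def heartbeat_status(samples: List[Tuple[int, int]]) -> Dict[str, bool]:
    """Report whether the HB tx and rx counters grew across the captures.

    samples: (tx, rx) counter values, oldest first, taken from
    successive captures on one CSM.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    first_tx, first_rx = samples[0]
    last_tx, last_rx = samples[-1]
    return {
        "tx_increasing": last_tx > first_tx,
        "rx_increasing": last_rx > first_rx,
    }


# Made-up values: this unit is sending heartbeats but not receiving any.
print(heartbeat_status([(100, 90), (115, 90), (130, 90)]))
```

Run it once per CSM. A unit whose tx counter is flat is not sending, and a unit whose rx counter is flat while its peer's tx counter grows points at the path between the two.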

Gilles.

p.bailey Tue, 10/30/2007 - 08:45

Gilles,

Thanks for the quick response. I'll give that a go as soon as I can.

Are there any known issues with the CSMs that I should be aware of, where the heartbeats suddenly stop being sent?

Phil.

Gilles Dufour Tue, 10/30/2007 - 12:28

Has your CSM been up for more than 828 days?

CSCsk43903: CSM goes active-active over 828 days

It's the only recent DDTS I know of where the CSM would stop sending HBs.

Gilles.

p.bailey Wed, 10/31/2007 - 00:56

Gilles

Whilst I've not yet managed to arrange an outage to re-enable the power to the standby CSM, I have run 'sh mod csm X tech ft' several times on the active CSM and the HB (tx rx) counter is not incrementing at all. Is this correct?

I ran a 'sh ver' and calculated the number of days the switch had been up at the point where the CSMs went active/active, and it was 828 days.
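A minimal sketch of that kind of uptime-to-days conversion is shown below. It assumes the uptime text uses the words "years", "weeks" and "days", and it treats a year as 365 days, so the result is approximate. The function name uptime_days and the sample string are made up for illustration and are not the actual output from this switch.

```python
import re

# Approximate day counts per unit; hours and minutes are ignored.
DAYS_PER_UNIT = {"year": 365, "week": 7, "day": 1}


def uptime_days(uptime: str) -> int:
    """Convert an uptime string such as '2 years, 14 weeks' to whole days."""
    total = 0
    for count, unit in re.findall(r"(\d+)\s+(year|week|day)s?", uptime):
        total += int(count) * DAYS_PER_UNIT[unit]
    return total


# Hypothetical example string, chosen only to illustrate the arithmetic.
sample = "2 years, 14 weeks, 2 hours, 5 minutes"
print(uptime_days(sample))  # 2*365 + 14*7 = 828
```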

The CSMs are running 4.1(1), whereas the bug says it was first seen in 4.1(2), but I still think it is worth considering. The bug also mentions upgrading the firmware to at least 4.1(9.5), so we will now plan an outage to upgrade both CSMs.

Do you think that just a power cycle of the active CSM would temporarily resolve the situation?

Regards

Phil.
