Cisco ACE4710 Fault Tolerance Problem

s.mansha · 12-11-2011

Hi,

I have a pair of 4710s (ACE01 and ACE02) in a fault-tolerant configuration, set up according to the standard configuration guidelines. I have the following problem:

(1) I reload ACE01, and ACE02 seems to take control.

(2) After the reload completes, ACE01 does not accept SSH logins, so I have to log in via an async router. When I then run the 'sh arp' command on ACE01, it hangs for about 2 minutes and returns the following message:

ace01/Admin# sh arp
Context Admin
rpc call failure. retval = -998
ace01/Admin#

(3) About 4 or 5 minutes after ACE01 comes back up, I lose SSH connectivity to ACE02. When I log in to ACE02 via the async router, I get the following message:

Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs

(4) To get out of this state, I have to break the fault-tolerant link and shut down the primary network link (by shutting down the switchport that the ACE units connect to), then reload both devices again. After that I can log in via SSH.
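For a sense of scale, the counter in the Arpmgr message in step (3) can be turned into a per-second rate. This is only a minimal sketch: the function name is made up for illustration, and the regular expression assumes the message always has the wording shown above.

```python
import re

# Message copied from the ACE02 console output in step (3).
MESSAGE = "Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs"


def arp_drop_rate(message: str) -> float:
    """Return dropped ARP packets per second from an 'Arpmgr busy' message.

    Hypothetical helper; the pattern is an assumption based on the one
    message quoted in this thread.
    """
    match = re.search(r"(\d+) arp pkts were dropped over last\s*(\d+) secs", message)
    if match is None:
        raise ValueError("not an Arpmgr drop message")
    dropped = int(match.group(1))
    seconds = int(match.group(2))
    return dropped / seconds


rate = arp_drop_rate(MESSAGE)
print(f"{rate:.0f} dropped ARP packets per second")
```

That is roughly 8,780 dropped ARP packets every second, far more than normal ARP traffic on a small VLAN would be expected to produce.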

Please could someone help me? I don't understand what is going on. I have googled the above messages, and the results suggested that the problem might be related to a bug on the switches that the ACE units connect to (2960s). I have since upgraded the switches, but still no luck.

Here is the basic fault tolerant config.

interface gigabitEthernet 1/1
switchport access vlan 711
no shutdown
interface gigabitEthernet 1/2
shutdown
interface gigabitEthernet 1/3
shutdown
interface gigabitEthernet 1/4
description Fault Tolerant (ea-ste10-ace02)
ft-port vlan 999
shutdown

ft interface vlan 999
ip address 1.1.1.1 255.255.255.252
peer ip address 1.1.1.2 255.255.255.252
no shutdown

ft peer 1
heartbeat interval 300
heartbeat count 10
ft-interface vlan 999

ft group 1
peer 1
priority 200
associate-context ea
inservice
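As a rough check on the failover timing implied by the "ft peer 1" settings above, the sketch below multiplies the two heartbeat values. It assumes that the heartbeat interval is expressed in milliseconds and that the peer is declared down after "heartbeat count" consecutive missed heartbeats; both points should be confirmed against the ACE documentation for the software version in use. The function name is made up for illustration.

```python
# Values copied from the "ft peer 1" section above.
HEARTBEAT_INTERVAL_MS = 300  # assumption: the unit is milliseconds
HEARTBEAT_COUNT = 10         # assumption: missed heartbeats before the peer is declared down


def peer_down_detection_seconds(interval_ms: int, count: int) -> float:
    """Approximate time in seconds before a silent peer is declared down."""
    return interval_ms * count / 1000.0


print(peer_down_detection_seconds(HEARTBEAT_INTERVAL_MS, HEARTBEAT_COUNT))
```

Under those assumptions the detection time is about 3 seconds, much shorter than the 4 to 5 minutes after which the symptoms appear.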

Software Version: A5(1.0)

I also have: 1 Admin context and 1 user context

I would really appreciate some help or guidance, as I am struggling and have a deadline to meet: we are going live with this system soon.

Regards

Sajjad

Surya ARBY · 12-11-2011

HA cannot work because the interface carrying the FT VLAN is down.

Your ARP storm probably comes from a split-brain event.

s.mansha · 12-12-2011

Hi,

Sorry, the config I posted was slightly incorrect: the FT VLAN interface was not shut down, and there wasn't a split-brain event.

I have also downgraded to version A4(2.1a) to rule out a fault in A5(1.0), and I am still getting the same problem.

Surya ARBY · 12-12-2011

What do the various "show ft xxx" commands show on both units when the issue occurs?

s.mansha · 12-13-2011

Hi,

Here is the 'show ft group det' output after ACE01 has been reloaded. The behaviour seems to be normal, but the problem is still there.

ea-ste10-ace01/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_ACTIVE
My Config Priority           : 200
My Net Priority              : 200
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Peer Config Priority         : 100
Peer Net Priority            : 100
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:32 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace01/Admin#

ea-ste10-ace02/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_STANDBY_HOT
My Config Priority           : 100
My Net Priority              : 100
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_ACTIVE
Peer Config Priority         : 200
Peer Net Priority            : 200
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:57 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace02/Admin#
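The comparison between the two outputs above can also be done mechanically. The following is a minimal sketch, not a Cisco tool: the function names are made up for illustration, the parser simply splits each line on its first colon, and the sample strings are short excerpts of the two outputs in this post.

```python
def parse_ft_group(text: str) -> dict:
    """Parse 'Key : Value' lines from 'show ft group detail' output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def roles_consistent(unit_a: dict, unit_b: dict) -> bool:
    """True when each unit's view of its peer matches the peer's own state
    and exactly one unit is active (so no split brain is visible)."""
    mirrored = (
        unit_a["My State"] == unit_b["Peer State"]
        and unit_b["My State"] == unit_a["Peer State"]
    )
    states = [unit_a["My State"], unit_b["My State"]]
    one_active = states.count("FSM_FT_STATE_ACTIVE") == 1
    return mirrored and one_active


# Excerpts from the ACE01 and ACE02 outputs above.
ACE01 = """My State                     : FSM_FT_STATE_ACTIVE
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Last State Change time       : Tue Dec 13 10:48:32 2011"""

ACE02 = """My State                     : FSM_FT_STATE_STANDBY_HOT
Peer State                   : FSM_FT_STATE_ACTIVE
Last State Change time       : Tue Dec 13 10:48:57 2011"""

print(roles_consistent(parse_ft_group(ACE01), parse_ft_group(ACE02)))
```

For the outputs posted here the check passes: ACE01 is active, ACE02 is standby hot, and each unit agrees with the other's view. If both units reported FSM_FT_STATE_ACTIVE, the same check would fail, which is what a split-brain condition would look like.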

Francesco Casotto · 12-13-2011

Hi Sajjad,

It would seem to be the result of an ARP flood/storm (triggered by a loop?). During the issue you could confirm it by executing:

show processes cpu

and checking the CPU usage of arp_mgr (ARP is handled on the control plane in the ACE). You could also capture a trace on the switch, monitoring the port connected to ACE01, to see what the traffic actually is. It could also help to enable "mac-address-table notification mac-move" on the switches to detect loops.

Should the above not clarify the issue, I would suggest opening a TAC SR to get this investigated further.

Cheers,

Francesco