
Cisco ACE4710 Fault Tolerance Problem

s.mansha
Level 1

Hi,

I have a pair of 4710s (ACE01 and ACE02) in a fault-tolerant configuration, set up following the standard configuration guidelines. I have the following problem:

(1) I reload ACE01 and ACE02 seems to take control.

(2) After the reload completes, ACE01 does not accept SSH logins, so I have to log in via an async router. When I then run the 'sh arp' command on ACE01, it hangs for about 2 minutes and returns the following message:

ace01/Admin# sh arp
Context Admin
rpc call failure. retval = -998
ace01/Admin#

(3) About 4 or 5 minutes after ACE01 comes back up, I lose SSH connectivity to ACE02. When I log in to ACE02 via the async router, I get the following message:

Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs

(4) To get out of this state, I have to break the fault-tolerant link and shut down the primary network link (by shutting down the switchport that the ACE units connect to), then reload both devices again. Only then can I log in via SSH.

Please could someone help me? I don't understand what is going on. I have googled the above messages, and the results suggest the problem might be related to a bug on the switches that the ACE units connect to (2960s). I have since upgraded the switches, but still no luck.

Here is the basic fault tolerant config.

interface gigabitEthernet 1/1
  switchport access vlan 711
  no shutdown
interface gigabitEthernet 1/2
  shutdown
interface gigabitEthernet 1/3
  shutdown
interface gigabitEthernet 1/4
  description Fault Tolerant (ea-ste10-ace02)
  ft-port vlan 999
  shutdown

ft interface vlan 999
  ip address 1.1.1.1 255.255.255.252
  peer ip address 1.1.1.2 255.255.255.252
  no shutdown

ft peer 1
  heartbeat interval 300
  heartbeat count 10
  ft-interface vlan 999

ft group 1
  peer 1
  priority 200
  associate-context ea
  inservice
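
For reference, the heartbeat settings above determine how long a unit waits before declaring its peer down. A minimal sketch of that arithmetic, assuming the heartbeat interval is expressed in milliseconds (as I understand Cisco's ACE documentation to describe it); the function name is hypothetical and only for illustration:

```python
def failure_detection_ms(interval_ms, count):
    """Time without heartbeats before the peer is declared down.

    Assumption: 'heartbeat interval' is in milliseconds and the peer is
    considered lost after 'heartbeat count' consecutive missed heartbeats.
    """
    return interval_ms * count


# Values from the 'ft peer 1' section of the posted config.
detection_ms = failure_detection_ms(interval_ms=300, count=10)
print(f"{detection_ms} ms ({detection_ms / 1000:.0f} s)")
```

Under that assumption, the posted values give a detection window of roughly 3 seconds.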

Software Version: A5(1.0)

I also have: 1 Admin context and 1 user context

I really would appreciate some help/guidance as I am struggling and I have a deadline to meet as we are going live soon with this system.

Regards

Sajjad

5 Replies

Surya ARBY
Level 4

HA cannot work because the interface carrying the FT VLAN (gigabitEthernet 1/4) is shut down.

Your ARP storm probably comes from a split-brain event.

Hi,

Sorry, the config I posted was slightly incorrect: the FT VLAN interface was not shut down, and there wasn't a split-brain event.

I have also downgraded to version A4(2.1a) in order to rule out faults in A5(1.0), and I am still getting the same problem.

What do the various "show ft xxx" commands show on both units when the issue occurs?

Hi,

Here is the 'show ft group detail' output from both units after ACE01 has been reloaded. The behaviour seems to be normal, but the problem is still there.

ea-ste10-ace01/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_ACTIVE
My Config Priority           : 200
My Net Priority              : 200
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Peer Config Priority         : 100
Peer Net Priority            : 100
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:32 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace01/Admin#

ea-ste10-ace02/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_STANDBY_HOT
My Config Priority           : 100
My Net Priority              : 100
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_ACTIVE
Peer Config Priority         : 200
Peer Net Priority            : 200
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:57 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace02/Admin#
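
A quick way to compare the two outputs above is to parse the "Field : value" lines and check that the units agree on who is active. This is only a rough sketch: the function names are hypothetical, and the sample text is abbreviated from the output posted above.

```python
def parse_ft_detail(text):
    """Return the 'Field : value' pairs of 'show ft group detail' as a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, since timestamps contain more colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def check_pair(unit_a, unit_b):
    """List inconsistencies between the two units' views of the FT group."""
    problems = []
    if unit_a["My State"] == unit_b["My State"] == "FSM_FT_STATE_ACTIVE":
        problems.append("both units are active (split brain)")
    if (unit_a["My State"] != unit_b["Peer State"]
            or unit_b["My State"] != unit_a["Peer State"]):
        problems.append("units disagree about each other's state")
    return problems


# Abbreviated from the output posted above.
ace01 = parse_ft_detail("""
My State                     : FSM_FT_STATE_ACTIVE
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Last State Change time       : Tue Dec 13 10:48:32 2011
""")
ace02 = parse_ft_detail("""
My State                     : FSM_FT_STATE_STANDBY_HOT
Peer State                   : FSM_FT_STATE_ACTIVE
Last State Change time       : Tue Dec 13 10:48:57 2011
""")

print(check_pair(ace01, ace02) or "FT states are consistent")
```

For the output in this thread the check finds nothing wrong, which matches the poster's observation that the FT state looks normal while the problem persists.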

Hi Sajjad,

This looks like the result of an ARP flood/storm (possibly triggered by a loop). While the issue is occurring, you could confirm it by executing:

show processes cpu

and checking the CPU usage of arp_mgr (ARP is handled on the control plane in ACE). It may also be worth capturing traffic on the switch port connected to ACE01 to see what the traffic actually is. Enabling "mac-address-table notification mac-move" on the switches could also help to detect loops.
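
To put the reported drop count in perspective, the console message can be turned into a per-second rate. A small sketch, assuming the message keeps the format quoted earlier in the thread; the regular expression and function name are my own and not part of any Cisco tooling:

```python
import re

# Matches the counter and window in the Arpmgr message; \s* tolerates the
# missing space in "last60 secs" as it appears on the console.
ARP_DROP_RE = re.compile(r"(\d+) arp pkts were dropped over last\s*(\d+) secs")


def arp_drop_rate(message):
    """Return dropped ARP packets per second, or None if the line does not match."""
    match = ARP_DROP_RE.search(message)
    if match is None:
        return None
    dropped, seconds = int(match.group(1)), int(match.group(2))
    return dropped / seconds


msg = ("Arpmgr busy, Possible ARP flood, "
       "526801 arp pkts were dropped over last60 secs")
rate = arp_drop_rate(msg)
print(f"about {rate:.0f} ARP packets dropped per second")
```

That works out to roughly 8,780 dropped ARP packets per second, far more than ordinary ARP traffic on a single access VLAN would normally produce, which is consistent with a storm or loop.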

Should the above not clarify the issue I would suggest opening a TAC SR to get this investigated further.

Cheers,

Francesco
