Cisco ACE4710 Fault Tolerance Problem

Unanswered Question
Dec 11th, 2011

Hi,


I have a pair of 4710s (ACE01 and ACE02) in a fault-tolerant configuration, set up according to the standard configuration guidelines. I have the following problem:


(1) I reload ACE01 and ACE02 seems to take control.

(2) After the reload completes, ACE01 does not accept SSH logins, so I have to log in via an async router. When I then run the 'sh arp' command on ACE01, it hangs for about 2 minutes and I get the following message:

ace01/Admin# sh arp
Context Admin
rpc call failure. retval = -998
ace01/Admin#


(3) About 4 or 5 minutes after ACE01 comes back up, I lose SSH connectivity to ACE02. When I then log in to ACE02 via the async router, I get the following message:

Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs

(4) To get out of this state, I have to break the fault-tolerant link and shut down the primary network link (by shutting down the switchport that the ACE units connect to), then reload both devices. After that I can log in via SSH again.


Please could someone help me? I don't understand what is going on. I have googled the above messages, and the results suggest it might be related to a bug on the switches that the ACE units connect to (2960s). I have since upgraded the switches, but still no luck.


Here is the basic fault tolerant config.


interface gigabitEthernet 1/1
  switchport access vlan 711
  no shutdown
interface gigabitEthernet 1/2
  shutdown
interface gigabitEthernet 1/3
  shutdown
interface gigabitEthernet 1/4
  description Fault Tolerant (ea-ste10-ace02)
  ft-port vlan 999
  shutdown

ft interface vlan 999
  ip address 1.1.1.1 255.255.255.252
  peer ip address 1.1.1.2 255.255.255.252
  no shutdown

ft peer 1
  heartbeat interval 300
  heartbeat count 10
  ft-interface vlan 999

ft group 1
  peer 1
  priority 200
  associate-context ea
  inservice
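
For reference, the heartbeat settings in the `ft peer` block determine how long a unit waits before declaring its peer down: roughly the interval multiplied by the count. The following is only a small sketch in Python, assuming the ACE treats `heartbeat interval` as a value in milliseconds; the function name `heartbeat_timeout_ms` is made up for illustration.

```python
import re

# The "ft peer" block copied from the config in this post.
FT_PEER_CONFIG = """\
ft peer 1
  heartbeat interval 300
  heartbeat count 10
  ft-interface vlan 999
"""


def heartbeat_timeout_ms(config: str) -> int:
    """Return interval * count, i.e. the time without heartbeats
    before the peer is considered down (assumes interval is in ms)."""
    interval = int(re.search(r"heartbeat interval (\d+)", config).group(1))
    count = int(re.search(r"heartbeat count (\d+)", config).group(1))
    return interval * count


print(heartbeat_timeout_ms(FT_PEER_CONFIG))  # 3000
```

Under that assumption, the values above give a detection time of about 3 seconds.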


Software Version: A5(1.0)

I also have: 1 Admin context and 1 user context


I really would appreciate some help/guidance as I am struggling and I have a deadline to meet as we are going live soon with this system.


Regards

Sajjad

Surya ARBY Sun, 12/11/2011 - 23:59

HA cannot work because the interface carrying the FT VLAN is shut down.


Your ARP storm probably comes from a split-brain event.

s.mansha Mon, 12/12/2011 - 06:34

Hi,

Sorry, the config I posted was slightly incorrect: the FT VLAN interface was not shut down, and there wasn't a split-brain event.

I have also downgraded to version A4(2.1a) to rule out a fault in A5(1.0), and I am still getting the same problem.

Surya ARBY Mon, 12/12/2011 - 06:41

What do the various "show ft xxx" commands show on both units when the issue occurs?

s.mansha Tue, 12/13/2011 - 04:08

Hi,

Here is the 'show ft group det' result from both units after ACE01 has been reloaded. The behaviour seems to be normal, but the problem is still there.


ea-ste10-ace01/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_ACTIVE
My Config Priority           : 200
My Net Priority              : 200
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Peer Config Priority         : 100
Peer Net Priority            : 100
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:32 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace01/Admin#


ea-ste10-ace02/Admin# sh ft group detail

FT Group                     : 1
No. of Contexts              : 1
Context Name                 : ea
Context Id                   : 1
Configured Status            : in-service
Maintenance mode             : MAINT_MODE_OFF
My State                     : FSM_FT_STATE_STANDBY_HOT
My Config Priority           : 100
My Net Priority              : 100
My Preempt                   : Enabled
Peer State                   : FSM_FT_STATE_ACTIVE
Peer Config Priority         : 200
Peer Net Priority            : 200
Peer Preempt                 : Enabled
Peer Id                      : 1
Last State Change time       : Tue Dec 13 10:48:57 2011

Running cfg sync enabled     : Enabled
Running cfg sync status      : Running configuration sync has completed
Startup cfg sync enabled     : Enabled
Startup cfg sync status      : Startup configuration sync has completed
Connection sync enabled      : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace02/Admin#
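
One way to cross-check the two outputs above is to parse each into key/value pairs and confirm that each unit's view of itself matches what its peer reports about it, and that exactly one unit claims to be active. The following is only a sketch in Python: the names `parse_ft_detail`, `compare_peers` and `MIRRORED_FIELDS` are made up for illustration, and the sample text is abbreviated from the output above.

```python
ACE01_OUTPUT = """\
My State                     : FSM_FT_STATE_ACTIVE
My Config Priority           : 200
My Net Priority              : 200
Peer State                   : FSM_FT_STATE_STANDBY_HOT
Peer Config Priority         : 100
Peer Net Priority            : 100
"""

ACE02_OUTPUT = """\
My State                     : FSM_FT_STATE_STANDBY_HOT
My Config Priority           : 100
My Net Priority              : 100
Peer State                   : FSM_FT_STATE_ACTIVE
Peer Config Priority         : 200
Peer Net Priority            : 200
"""

# Fields where one unit's "My ..." value should equal the other's "Peer ..." value.
MIRRORED_FIELDS = [
    ("My State", "Peer State"),
    ("My Config Priority", "Peer Config Priority"),
    ("My Net Priority", "Peer Net Priority"),
]


def parse_ft_detail(text: str) -> dict:
    """Split each 'Key : Value' line on the first colon; skip other lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def compare_peers(a: dict, b: dict) -> list:
    """Return a list of inconsistencies between the two units' views."""
    problems = []
    for mine, theirs in MIRRORED_FIELDS:
        if a.get(mine) != b.get(theirs):
            problems.append(f"first unit '{mine}' != second unit '{theirs}'")
        if b.get(mine) != a.get(theirs):
            problems.append(f"second unit '{mine}' != first unit '{theirs}'")
    active = sum(1 for unit in (a, b) if unit.get("My State") == "FSM_FT_STATE_ACTIVE")
    if active != 1:
        problems.append(f"{active} unit(s) report FSM_FT_STATE_ACTIVE, expected exactly 1")
    return problems


print(compare_peers(parse_ft_detail(ACE01_OUTPUT), parse_ft_detail(ACE02_OUTPUT)))
```

For the outputs shown here the list comes back empty, which matches the observation that the FT state itself looks normal. Two units both reporting ACTIVE would show up as a problem entry.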

Francesco Casotto Tue, 12/13/2011 - 10:00
User Badges:
  • Cisco Employee,

Hi Sajjad,


It would seem to be the result of an ARP flood/storm (triggered by a loop?). During the issue you could confirm this by executing:


show processes cpu


and checking the CPU usage of arp_mgr (ARP is handled on the control plane in ACE). You could also take a trace by monitoring the switch port connected to ACE01 to see what the traffic actually is. It could also help to configure "mac-address-table notification mac-move" on the switches to detect loops.
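
To put the "Arpmgr busy" message from the original post into perspective, the drop counter can be turned into an average rate. This is only a rough sketch in Python: the regular expression is written against the single message quoted in this thread (including its missing space in "last60"), and the name `arp_drop_rate` is made up for illustration.

```python
import re

# Matches the counter and the window length; \s* tolerates "last60" and "last 60".
ARP_DROP_PATTERN = re.compile(r"(\d+) arp pkts were dropped over last\s*(\d+) secs")

MESSAGE = "Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs"


def arp_drop_rate(message: str) -> float:
    """Return the average number of ARP packets dropped per second."""
    match = ARP_DROP_PATTERN.search(message)
    if match is None:
        raise ValueError("not an Arpmgr drop message")
    dropped, seconds = int(match.group(1)), int(match.group(2))
    return dropped / seconds


print(f"{arp_drop_rate(MESSAGE):.0f} ARP packets dropped per second")
```

That works out to roughly 8,780 dropped ARP packets per second on average, which looks far more like looping traffic than ordinary ARP activity.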


Should the above not clarify the issue I would suggest opening a TAC SR to get this investigated further.


Cheers,

Francesco
