I have a pair of 4710s (ACE01 and ACE02) in a fault-tolerant configuration, set up following the standard configuration guidelines. I have the following problem:
(1) I reload ACE01, and ACE02 appears to take control.
(2) After the reload completes, ACE01 does not accept SSH logins, so I have to log in via the async router. When I run the 'sh arp' command on ACE01, it hangs for about 2 minutes and then returns the following message:
ace01/Admin# sh arp
rpc call failure. retval = -998
(3) About 4 or 5 minutes after ACE01 comes back up, I lose SSH connectivity to ACE02 as well. When I log in to ACE02 via the async router, I get the following message:
Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs
(4) To get out of this state, I have to break the fault-tolerant link, shut down the primary network link (by shutting down the switchport that the ACE units connect to), and reload both devices. Only then can I log in via SSH again.
Please could someone help me? I don't understand what is going on. I have googled the above messages, and the results suggest it might be related to a bug on the switches that the ACE units connect to (2960s). I have since upgraded the switches, but still no luck.
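For a sense of scale, the counter in the ACE02 message from step (3) works out to roughly 8,780 dropped ARP packets per second. Below is a minimal Python sketch of that arithmetic. It assumes the message text is exactly as pasted above; the regular expression and the function name arp_drop_rate are my own, not anything from the ACE CLI.

```python
import re

# The message as pasted from the ACE02 console (note "last60" has no space).
MESSAGE = "Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs"


def arp_drop_rate(message):
    """Return dropped ARP packets per second, or None if the text does not match."""
    # \s* tolerates both "last60" and "last 60".
    match = re.search(r"(\d+) arp pkts were dropped over last\s*(\d+) secs", message)
    if match is None:
        return None
    dropped = int(match.group(1))
    seconds = int(match.group(2))
    return dropped / seconds


print(round(arp_drop_rate(MESSAGE)))  # 8780
```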
Here is the basic fault tolerant config.
interface gigabitEthernet 1/1
  switchport access vlan 711
interface gigabitEthernet 1/2
interface gigabitEthernet 1/3
interface gigabitEthernet 1/4
  description Fault Tolerant (ea-ste10-ace02)
  ft-port vlan 999
ft interface vlan 999
  ip address 22.214.171.124 255.255.255.252
  peer ip address 126.96.36.199 255.255.255.252
ft peer 1
  heartbeat interval 300
  heartbeat count 10
  ft-interface vlan 999
ft group 1
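As I understand it, the heartbeat settings above determine how long a unit waits before declaring its peer down. A minimal sketch of that calculation follows. It assumes the heartbeat interval is in milliseconds, which is my reading of the ACE documentation and not something I have confirmed on the box. The function name is just for illustration.

```python
# Values taken from the "ft peer 1" section of the config.
HEARTBEAT_INTERVAL_MS = 300  # "heartbeat interval 300"; the millisecond unit is an assumption
HEARTBEAT_COUNT = 10         # "heartbeat count 10"


def peer_down_detection_seconds(interval_ms, count):
    """Time without heartbeats before the peer is declared down, in seconds."""
    return interval_ms * count / 1000


print(peer_down_detection_seconds(HEARTBEAT_INTERVAL_MS, HEARTBEAT_COUNT))  # 3.0
```

If that assumption holds, each unit would declare its peer down after about 3 seconds without heartbeats.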
Software Version: A5(1.0)
I also have 1 Admin context and 1 user context.
I would really appreciate some help or guidance, as I am struggling and have a deadline to meet: we are going live with this system soon.