01-05-2012 02:00 PM - edited 02-21-2020 04:31 AM
Hi,
Got a pair of 5510s in an Active/Standby failover config. They've been in and running for about 9 months, on 8.3(1).
About 3 weeks ago I started seeing the units failing over on an almost daily basis. Failover had hardly ever happened before then.
Can anyone suggest the best troubleshooting approach, please?
I only have two interfaces monitored: "inside" and "outside". The failover cable is a direct cable (i.e. not through a switch; I know this is not the recommended approach, but as failover does actually work, I don't think this is an issue). I have set up email traps to alert me when failovers happen. The sequence of events appears to be that I lose communication between the two units on the outside interface:
<161>Jan 05 2012 14:22:01: %ASA-1-105005: (Secondary) Lost Failover communications with mate on interface outside
<161>Jan 05 2012 14:22:08: %ASA-1-103001: (Secondary) No response from other firewall (reason code = 1).
<161>Jan 05 2012 14:22:20: %ASA-1-103001: (Secondary) No response from other firewall (reason code = 4).
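For anyone wanting to reproduce the e-mail alerting, a minimal sketch of how syslog-to-email can be configured on an ASA (the SMTP server and addresses below are placeholders, not the actual setup):

```
! Sketch only: e-mail syslogs of severity "errors" and above.
! SMTP server and addresses are placeholders.
logging enable
logging timestamp
smtp-server 192.0.2.25
logging from-address asa@example.com
logging recipient-address noc@example.com level errors
logging mail errors
```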
About 19 messages come through in total; some are "mirrored" (i.e. the Primary and Secondary both send effectively the same message, each saying it cannot communicate with the other).
The ASAs are connected through a pair of trunked Cisco 2960 switches. I can't tell you exactly what is plugged in where (i.e. whether they are both in the same switch, or one in each).
I must admit I suspect a network fault rather than the ASAs themselves, but I'm not quite sure how to go about troubleshooting it.
I do see lots of errors from "sh failover" though...
Stateful Failover Logical Update Statistics
Link : stateful Ethernet0/3 (up)
Stateful Obj xmit xerr rcv rerr
General 343913 0 788701 18875
sys cmd 24252 0 24251 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 178732 0 331045 9438
UDP conn 128365 0 356049 9437
ARP tbl 12469 0 76279 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKE upd 26 0 684 0
VPN IPSEC upd 69 0 393 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
Logical Update Queue Information
Cur Max Total
Recv Q: 0 25 1748693
Xmit Q: 0 2048 511460
Any suggestions please?
Rgds
Simon B
01-06-2012 06:22 AM
It seems symptomatic of a congested (or faulty) outside network. You could:
a. Not monitor the outside interface for failover purposes - probably not a good idea, since in the event of a true failure, failover would not occur.
b. Change the failover polltime / holdtime parameters to make the process more forgiving of delays due to congestion or whatever other outside network problems are preventing the units from getting the expected poll response.
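For option (b), a minimal sketch of relaxed timers (the values are illustrative; the holdtime must stay within the platform's allowed multiple of the polltime):

```
! Illustrative values only - poll less aggressively and wait longer
! before declaring the mate or a monitored interface failed.
failover polltime unit 5 holdtime 25
failover polltime interface 10 holdtime 50
```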
There is "debug fover" but that would probably not be useful in and of itself since failover seems to be working as intended per your log messages.
Of course, if the outside network really is that unreliable, that's at least as big a problem as the ASAs failing over.
01-06-2012 07:13 AM
Thanks for your response....
To be honest, after I'd sat down and written the post above, it cleared my brain, and having thought about it a bit more I came to pretty much the same conclusion as you.
FYI, the ASAs actually sit in a commercial data centre which is also our ISP. On the one hand, I wouldn't expect any network issues at their LAN level, but on the other hand they are susceptible to all sorts of DoS attacks etc., so if there is any interference on the public network, it could cause what I am seeing.
I have a spare interface I can utilise for a "dummy" network to use as another monitored interface, so I can continue monitoring the outside interface as well as the "inside", but increase the criteria to "2 failed interfaces" before failing over.
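That plan, sketched in config ("dummy" is just an assumed nameif for the spare port, not a real interface name from this setup):

```
! Monitor both data interfaces plus the spare "dummy" interface,
! but require two failed interfaces before triggering failover.
! "dummy" is an assumed nameif for the spare port.
monitor-interface inside
monitor-interface outside
monitor-interface dummy
failover interface-policy 2
```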
Thanks again for confirming my suspicions
Rgds
Simon
01-09-2012 04:32 AM
Hmmm, the plot thickens.
Done some more investigating over the weekend, and it seems a bit more serious than I initially thought.
It seems that when failover happens, it's not due only to a monitored interface not responding - the partner is actually reloading/rebooting itself completely.
I had a failover event on Sunday morning, my Primary shows a device uptime which equates to the time of failover.
I had another failover this morning at about 07:30, this time the Secondary reloaded/rebooted.
Currently I'm running with Primary Active and Secondary Standby.
I cleared the failover statistics, but I can see the error count on the Secondary increasing steadily - no transmit errors on the Primary, just receive errors on the Secondary.
Link : stateful Ethernet0/3 (up)
Stateful Obj xmit xerr rcv rerr
General 238 0 16113 767
sys cmd 238 0 238 0
up time 0 0 0 0
RPC services 0 0 0 0
TCP conn 0 0 6926 509
UDP conn 0 0 7893 258
ARP tbl 0 0 1048 0
Xlate_Timeout 0 0 0 0
IPv6 ND tbl 0 0 0 0
VPN IKE upd 0 0 2 0
VPN IPSEC upd 0 0 6 0
VPN CTCP upd 0 0 0 0
VPN SDI upd 0 0 0 0
VPN DHCP upd 0 0 0 0
SIP Session 0 0 0 0
This is pointing me towards a dodgy port on the Secondary ASA, or maybe a bad cable. I'm pretty sure the units are connected with a direct cable (no switch).
I guess the first thing to do is change the connected failover port?
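If it does come to moving the stateful link to a spare port, it would be something along these lines ("folink", the port, and the addresses are assumptions, and reconfiguring the failover link is disruptive, so it belongs in a maintenance window):

```
! Sketch only - move the stateful failover link to a spare port.
! Interface name, port and IPs are assumptions; this is disruptive.
no failover
failover link folink Ethernet0/2
failover interface ip folink 10.1.1.1 255.255.255.252 standby 10.1.1.2
failover
```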
I'm still nervous about this reload/reboot business though, anyone got any suggestions?
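A few read-only checks that may help pin down the unexpected reloads (no config changes involved):

```
! Read-only checks, no config changes:
! - why and when each unit changed failover state
show failover history
! - any saved crash dump from an unplanned reload
show crashinfo
! - uptime, to correlate with the failover times
show version | include up
```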
Rgds,
Simon
01-09-2012 06:12 AM
Dodgy indeed. If it's a bad port on the ASA itself, it'll take a hardware replacement to set it right. I hope you have Smartnet.
You're running 8.3(1), so you're pretty current. Of course, the TAC would probably steer you to 8.4(2). You must have enough memory or you wouldn't be on 8.3 - you do have the recommended amount, right? One thought: if you upgraded memory recently to get there, the DIMM may be seated imperfectly. I've seen memory cause reboots like that (along with old, buggy code, but yours is pretty recent).
01-09-2012 06:30 AM
Thanks again for rapid response
Both units were installed back in April 2011 and I believe they have 1024 MB. They have not been touched since the initial install, so a loose memory module is unlikely, I would have thought.
I've seen another post somewhere on here where a guy was experiencing a similar issue with the receive errors and basically did a couple of hard reboots, which cleared it.
I'm just concerned that both units seem to be reloading, unless that's symptomatic of a dodgy failover link.
(Unfortunately I don't have Smartnet. I do have unused ports, so for now I can try changing ports if necessary.)
Will be doing the hardware reset later today, will post back the outcome.
Rgds
S
01-16-2012 12:15 PM
Just an update for anyone who may be interested. We power-cycled the Secondary unit about a week ago, and the Primary failed over about 24 hours later. But this time the Secondary has stayed Active for a week, the longest since we started seeing the issue (around mid-December).
Tonight we have power-cycled the Primary as well; we will manually fail over to the Primary and see what happens.
Rgds
S