ASA 5510 Failing over far too often

simonbilton
Level 1

Hi,

Got a pair of ASA 5510s in an Active/Standby failover config. They've been in and running for about 9 months, on 8.3(1).

About 3 weeks ago I started seeing the units failing over on an almost daily basis. Failover had hardly ever happened before then.

Can anyone suggest the best troubleshooting approach, please?

I only have two interfaces monitored - "inside" and "outside". The failover cable is a direct cable (i.e. not through a switch - I know this is not the recommended approach, but as failover does actually work, I don't think this is an issue). I have set up email traps to alert me when failovers happen. The sequence of events appears to be that I lose communication between the two units on the outside interface:

<161>Jan 05 2012 14:22:01: %ASA-1-105005: (Secondary) Lost Failover communications with mate on interface outside

<161>Jan 05 2012 14:22:08: %ASA-1-103001: (Secondary) No response from other firewall (reason code = 1).

<161>Jan 05 2012 14:22:20: %ASA-1-103001: (Secondary) No response from other firewall (reason code = 4).

About 19 messages come through in total; some are "mirrored" (i.e. the Primary and the Secondary each send effectively the same message, saying it cannot communicate with the other).
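For reference, a minimal sketch of the interface-monitoring and email-trap setup described above - the SMTP server and e-mail addresses are placeholders, and the "alerts" level is just one that covers the level-1 failover messages shown above:

! Monitor only the data interfaces of interest
monitor-interface inside
monitor-interface outside

! E-mail syslog traps for failover events - server and addresses are placeholders
smtp-server 192.0.2.25
logging from-address asa-pair@example.com
logging recipient-address noc@example.com level alerts
logging mail alerts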

The ASAs are connected through a pair of trunked Cisco 2960 switches. I can't tell you exactly what is plugged in where (i.e. whether they are both in the same switch, or one in each).

I must admit I suspect a network fault rather than the ASAs themselves, but I'm not quite sure how to go about troubleshooting it.

I do see lots of errors from "sh failover" though...

Stateful Failover Logical Update Statistics

        Link : stateful Ethernet0/3 (up)

        Stateful Obj    xmit       xerr       rcv        rerr
        General         343913     0          788701     18875
        sys cmd         24252      0          24251      0
        up time         0          0          0          0
        RPC services    0          0          0          0
        TCP conn        178732     0          331045     9438
        UDP conn        128365     0          356049     9437
        ARP tbl         12469      0          76279      0
        Xlate_Timeout   0          0          0          0
        IPv6 ND tbl     0          0          0          0
        VPN IKE upd     26         0          684        0
        VPN IPSEC upd   69         0          393        0
        VPN CTCP upd    0          0          0          0
        VPN SDI upd     0          0          0          0
        VPN DHCP upd    0          0          0          0
        SIP Session     0          0          0          0

        Logical Update Queue Information
                        Cur     Max     Total
        Recv Q:         0       25      1748693
        Xmit Q:         0       2048    511460

Any suggestions please?

Rgds

Simon B

6 Replies

Marvin Rhoads
Hall of Fame

It seems symptomatic of a congested (or faulty) outside network. You could:

a. Not monitor the outside interface for failover purposes - probably not a good idea, since in the event of a true outside failure, failover would not occur.

b. Change the failover polltime / holdtime parameters to make the process more forgiving of delays due to congestion or whatever outside network problems are preventing the units from getting the expected poll response (see the sketch below).
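For example, something along these lines - the values are only placeholders (the interface holdtime has to be at least five times the poll interval), so tune them to what the outside network can realistically deliver:

! Defaults are roughly 5s poll / 25s hold per monitored interface and
! 1s poll / 15s hold for the unit; relaxing them tolerates brief congestion
failover polltime interface 15 holdtime 75
failover polltime unit 2 holdtime 10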

There is "debug fover" but that would probably not be useful in and of itself since failover seems to be working as intended per your log messages.

Of course, if the outside network really is that unreliable, that's at least as big a problem as the ASAs failing over.

Thanks for your response....

To be honest, sitting down and writing the post above cleared my brain, and after thinking about it a bit more I came to pretty much the same conclusion as you.

FYI, the ASAs are actually sat in a commercial data centre which is also our ISP, so on the one hand I wouldn't expect any network issues at their LAN level, but on the other hand they are susceptible to all sorts of DoS attacks etc., so if there is any interference on the public network, it could cause what I am seeing.

I have a spare interface I can utilise for a "dummy" network to use as another monitored interface, so I can continue monitoring the outside interface as well as the "inside", but increase the criteria to "2 failed interfaces" before failing over (roughly as sketched below).
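As a sketch only - the port, the name "dummy" and the addressing are hypothetical, and note that both units need layer-2 connectivity on that segment for interface health monitoring to work:

! Spare port brought up purely as an extra monitored interface (placeholders)
interface Ethernet0/2
 nameif dummy
 security-level 100
 ip address 10.99.99.1 255.255.255.252 standby 10.99.99.2

monitor-interface dummy

! Require two monitored interfaces to be down before failover is triggered
failover interface-policy 2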

Thanks again for confirming my suspicions

Rgds

Simon

Hmmm, the plot thickens.

Done some more investigating over the weekend, and it seems a bit more serious than I initially thought.

It seems that when failover is happening, it's not due only to a monitored interface not responding - the partner is actually reloading/rebooting itself completely.

I had a failover event on Sunday morning, my Primary shows a device uptime which equates to the time of failover.

I had another failover this morning at about 07:30, this time the Secondary reloaded/rebooted.

Currently I'm running with Primary Active and Secondary Standby.

I cleared the failover statistics, but I can see the error count on the Secondary increasing steadily - no transmit errors on the Primary, just receive errors on the Secondary.

Link : stateful Ethernet0/3 (up)

        Stateful Obj    xmit       xerr       rcv        rerr

        General         238        0          16113      767

        sys cmd         238        0          238        0

        up time         0          0          0          0

        RPC services    0          0          0          0

        TCP conn        0          0          6926       509

        UDP conn        0          0          7893       258

        ARP tbl         0          0          1048       0

        Xlate_Timeout   0          0          0          0

        IPv6 ND tbl     0          0          0          0

        VPN IKE upd     0          0          2          0

        VPN IPSEC upd   0          0          6          0

        VPN CTCP upd    0          0          0          0

        VPN SDI upd     0          0          0          0

        VPN DHCP upd    0          0          0          0

        SIP Session     0          0          0          0

This is pointing me towards a dodgy port on the Secondary ASA, or a bad cable maybe. I'm pretty sure the units are connected with a direct cable (no switch).

I guess the first thing to do is change the connected failover port?
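For reference, a rough sketch of the checks and the change involved - the name "STATEFUL", the port Ethernet0/2 and the addressing are placeholders, and disabling failover ("no failover") on both units while re-cabling may be the safer route:

! Check the physical counters on the stateful link on BOTH units;
! CRC/input errors here would point at the cable or the port itself
show interface Ethernet0/3

! Re-point the stateful link to a spare port (configure on the active unit)
failover link STATEFUL Ethernet0/2
failover interface ip STATEFUL 192.168.254.1 255.255.255.252 standby 192.168.254.2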

I'm still nervous about this reload/reboot business though, anyone got any suggestions?

Rgds,

Simon

Dodgy indeed. If it's a bad port on the ASA itself, it'll take a hardware replacement to set it right. I hope you have Smartnet.

You're running 8.3(1), so you're pretty current. Of course, TAC would probably steer you to 8.4(2). You must have enough memory or you wouldn't be on 8.3 - you do have the recommended amount, right? One thought: if you upgraded the memory recently to get there, a DIMM may be seated imperfectly. I've seen memory cause reboots like that (along with old, buggy code, but yours is pretty recent).
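A quick way to confirm the installed RAM without opening the boxes is "show version", which lists it in the hardware line - a minimal check, and the exact wording of the output may vary:

! Hardware line reads something like "ASA5510, 1024 MB RAM, ..."
show version | include RAM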

Thanks again for rapid response

Both units were installed back in April 2011 and I believe they have 1024 MB. They have not been touched since the initial install, so a loose memory module is unlikely, I would have thought.

I've seen another post somewhere on here where a guy was experiencing a similar issue with the receive errors and basically did a couple of hard reboots, which cleared it.

I'm just concerned that both units seem to be reloading, unless it's symptomatic of a dodgy failover link.

(Unfortunately I don't have Smartnet. I do have unused ports, so for now I can try changing the port if necessary.)

Will be doing the hardware reset later today, will post back the outcome.

Rgds

S

Just an update for anyone who may be interested. We power-cycled the Secondary unit about a week ago; the Primary failed over about 24 hours later. But this time the Secondary has stayed Active for a week, the longest since we started seeing the issue (around mid-December).

Tonight we have power-cycled the Primary as well; we will manually fail over to the Primary and see what happens.
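For the record, the manual switchover itself is a one-liner, run on whichever unit should change role:

! On the current standby (the Primary, in this case) to make it active:
failover active

! Or, on the current active unit, to force it to stand by:
no failover active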

Rgds

S
