Surprises on Replacing failed primary unit

Answered Question
Mar 17th, 2010

Dear friends,

I have done failover for firewalls umpteen times, but yesterday it failed for some reason.

I replaced the failed primary unit with a fresh one, expecting it to detect the secondary unit as active and begin config replication from it. Instead, it wiped out the secondary unit's config. I don't think I got the sequence wrong, but let me share what I did:

1. Put the four or five lines of failover configuration (except the failover command) and did a no shut on the failover interface (management0/0)

2. Ran the failover command

Instead of getting the config from the active unit, the new unit started pushing its own config to the other unit. To recover, I had to reload the active unit so that it came back with its saved config. After that I reloaded the fresh unit, and this time failover happened as expected.

I think I should have forced a reload of the new unit before trying to establish failover.

Has anyone done this in a fail-proof way during production hours? If yes, can you please share the steps?

I did not ask for downtime because I was confident, but I ended up bringing down the ASA for 5 minutes because of the unexpected failover action.

Thanks a lot

Gautam

Jennifer Halim Wed, 03/17/2010 - 06:04

One thing I can think of is that the replacement primary unit is actually configured as "failover lan unit secondary". By default, every ASA comes up as the secondary failover unit. With both the original secondary and the replacement set as secondary, you would experience this issue.
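One way to check the configured role on the replacement before enabling failover is to look at its failover configuration. This is only a sketch; the line to look for is taken from the commands quoted in this thread.

```
! On the replacement unit, before entering "failover"
show running-config failover
! Look for a "failover lan unit primary" line.
! If the role shows as secondary, or the line is missing, fix it first.
```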

gautamzone Wed, 03/17/2010 - 06:09

I made sure that I entered "failover lan unit primary" on the unit before enabling failover.

Thanks

Gautam

Kureli Sankar Wed, 03/17/2010 - 18:55

Gautam,

I believe you may have reloaded the secondary/active unit by itself when it was the only unit. Did you?

I will hold my explanation for later.

-KS

gautamzone Wed, 03/17/2010 - 21:03

Dear Kureli,

I reloaded the secondary/active unit only after I lost the running config.

After I reloaded the secondary/active unit, I disconnected the failover cable so that the primary unit would not send its config to the secondary/active unit again when it came back up.

I then reloaded the primary unit, and only then was everything OK.

Thanks a lot

Gautam

Kureli Sankar Thu, 03/18/2010 - 15:28

Gautham,

Please confirm the following:

Did you issue "sh fail" on the lone secondary/active unit and make sure it said this unit active and other unit failed?

Here is what may have happened.

1. The primary unit failed and you requested an RMA or whatever.

2. The secondary unit went active.

3. Failover was disabled on the secondary/active unit ("no failover" in the running config only).

4. You then set up the replacement primary unit with the correct failover lines, issued no shut on the mgmt interface, and enabled failover.

5. You then enabled failover on the secondary unit as well. It would have reported that it detected an active mate.

6. The secondary took the blank config from the primary and wrote it to its own running config.

7. You disconnected the cables.

8. You then reloaded the secondary, which came up with the failover line enabled.

9. You added the primary back and it received the config from the secondary/active as expected.

I verified the above with a pair of ASAs.

-KS

gautamzone Fri, 03/19/2010 - 01:36

Dear Kureli,

Thanks a lot for the efforts you took. I really appreciate it.

Here's the exact sequence of steps that happened:

1. When the primary unit failed, the secondary became active. I don't remember whether sh fail showed "secondary- not detected" or "secondary - failed".

2. I replaced the faulty primary unit with another primary unit, did a no shut on the m0/0 failover interface, and entered all the failover commands except the "failover" command itself.

3. I made sure that the new primary unit ran the same code. I checked only the main code version, not the ASDM version, and the ASDM versions did turn out to be different on the two boxes.

4. After powering up the box and connecting the cables, I entered failover. It then reported that the SSL license was not the same on both units and disabled failover.

5. I applied for an activation key from [email protected] and got the SSL license from them.

6. The next day I went back to the customer and installed the license key. After installing the license key, I entered failover. It gave me the message "No response from mate".

7. I then said no failover to disable failover on the new primary unit.

8. I then went to secondary active unit and said failover as failover was disabled

9. I then went back to primary unit and said failover

10. This is where blank config replication started !!

11. Reloaded secondary unit to undo the blank running config

12. Went to Primary unit and disconnected the failover cable. Rebooted the primary unit and connected the failover cable.

13. Secondary came up as active, primary then came up, and this time primary honored the secondary as active and did config replication

14. All was well then!!
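Steps 3 to 5 above show the pairing being refused over a license mismatch, and the ASDM versions differing. A pre-check along these lines, run on both units before the change window, might catch that earlier. Both are standard ASA show commands; what to compare is an assumption drawn from this thread, not a complete checklist.

```
! Run on the surviving unit and on the replacement, then compare the output
show version
! Compare the software version, the ASDM (Device Manager) version,
! and the licensed features listed for each unit.

show activation-key
! Compare the licensed features, including the SSL VPN license
! that blocked the pairing in step 4.
```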

I'm still not sure why this happened, and it was a bit embarrassing to see it after 3.5 years of firewalling experience.

Anyway, I am willing to learn and improve from now on.

Next time, I would probably apply the failover configs, reload, and connect the failover cable while the unit is reloading.

I think the lesson is that a unit that reloads always honors the currently active unit and does not try to override its role.

This is what worked for me.

Thanks a lot

Gautam

Correct Answer
Kureli Sankar Fri, 03/19/2010 - 06:32

Gautam,

So, you hadn't given me everything that went on. The sequence of events is very critical.

What you saw is also expected. I just verified that as well.

7. I then said no failover to disable failover on the new primary unit.

8. I then went to secondary active unit and said failover as failover was disabled

9. I then went back to primary unit and said failover  ---> you did this too quickly

10. This is where blank config replication started !!

The above is also expected. You issued "failover" on the primary unit during the negotiation process. When both units come up during negotiation, the primary unit becomes active. You should have waited until the secondary unit gave you "No response from Mate", verified its "sh fail" output, and only then enabled "failover" on the primary. Things would have worked fine. I have verified this as well.

The key here is to issue "sh fail" on the secondary unit and make sure it shows this unit as "Active" and not "Negotiation".

Also, there is no need to reload the secondary when it is by itself. Understand that when the secondary unit becomes active, it assumes the primary unit's MAC address and IP address and would proxy ARP for it. If you reload it while it is alone, it would come up using its own MAC as the active MAC. Then you will introduce a brand new primary with an all-new MAC address.

The takeaway is that whenever you replace a unit, whether it is primary or secondary, make sure the sitting unit's "sh fail" shows what you expect, which is "this unit active" and "other unit failed", before you introduce the other unit and enable failover on it.
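The ordering described above can be sketched as follows. This is an illustration only: the state strings in the comments are what "show failover" typically reports and are an assumption here, not output captured from the units in this thread.

```
! 1. On the surviving secondary, after re-enabling failover there
show failover
! Proceed only when this host is reported as "Secondary - Active"
! and the other host as failed or not detected.
! If the state is still "Negotiation", wait and check again.

! 2. Only then, on the replacement primary (configuration mode)
failover
```

Waiting at step 1 is the whole point: it keeps the blank primary out of the negotiation that would otherwise let it win the active role.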

Sorry you went through this, but it worked as designed and as expected.

-KS

JEFF SPRADLING Fri, 07/20/2012 - 11:03

I was just preparing to replace the primary ASA in an HA pair and could not find a solid answer to this question. I found that, indeed, the primary ASA started replicating its blank config to the secondary as soon as I connected the LAN failover cable.

Here are the steps to keep this from happening:

configure the primary for failover -

failover lan unit primary

failover lan interface LANFail GigabitEthernet0/2

failover replication http

failover link stateful GigabitEthernet0/3

failover interface ip LANFail 172.16.100.1 255.255.255.0 standby 172.16.100.2

failover interface ip stateful 172.16.101.1 255.255.255.0 standby 172.16.101.2    

Configure all interfaces with the primary IP (no standby needed at this point)

'no shut' on all active interfaces

no failover active         <------- (critical! Forces the primary to standby)

connect lan failover cable (the only one needed at this point)

Secondary will start replicating to primary.

Once the replication is complete (run show failover and ensure the primary is "standby ready"), you can connect the remaining cables and do a 'failover active' on the primary.

Hope this helps others...
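The final checks in the steps above could look roughly like this on the replacement primary. It is a sketch based only on this post; the state strings in the comments are what "show failover" typically reports and are an assumption, not captured output.

```
! On the replacement primary, after the LAN failover cable is connected
show failover
! Wait until this host is reported as "Primary - Standby Ready".
! States such as "Sync Config" or "Bulk Sync" mean replication is still running.

! Only then connect the remaining cables and take the active role back
failover active

! Confirm the roles afterwards
show failover
```

Taking the active role back is optional. The pair also works with the secondary left active, so the 'failover active' step can wait for a quiet period.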
