Can ISP Multihoming on routers affect Inbound NAT on downstream firewalls?

noemi.berry · ‎01-31-2015

Could inbound static NAT on an ASA firewall be affected if the upstream path were disrupted in a BGP failover?

Here's the situation:

- 2 routers running multihomed BGP/iBGP to 2 ISPs

- Downstream of the routers is a firewall that does inbound static NAT to reach about 20 servers with 20 IP addresses.

- All of a sudden, all 20 servers are unreachable from the Internet, major outage.

- The servers were still able to reach out (ping 8.8.8.8).

- The servers were still reachable through a site-to-site tunnel on the firewall, from another internal location.

- But: shutting down one of the ISP links on the routers resolved the outage within a few minutes.

Routing problem, right? No, because the firewall kept its site-to-sites up (no NAT), the servers were still able to reach out (outbound dynamic NAT), and monitoring systems at another data center could reach the servers over the site-to-sites (no NAT). Only the Internet (Pingdom) could not reach into the servers (inbound static NAT). But how could shutting down an ISP link fix that?

This scenario repeated after the ISP link was re-enabled, but it took about a week for the inbound outage to happen. Once again, shutting down an ISP link resolved it within minutes.

Is it possible that the ISP link shut-down, and the resulting minutes-long BGP reconvergence to ISP#2, could give the static-NAT on the firewalls a little kick in the rear? Timeouts, state re-sets? The upstream router state "should" be completely transparent to the firewall, but somehow an upstream change "fixed" the inbound-reachability problem downstream.

Another clue is that this ran fine for about a year, until the original FortiGate firewall was replaced with the ASA 5585. So the firewall is my prime suspect.

We're running on one ISP now, which has to be fixed, but need to understand the problem before touching anything. Any ideas?

thank you!

Jon Marshall · ‎01-31-2015

Are the public IPs you are using for your servers provider independent, or are you using a block from one ISP and then advertising that block to the other ISP as well?

Jon

noemi.berry · ‎01-31-2015

Option B (non-portable).

The link that was shut down is to the ISP that carries that block, so I can see that the routing would be interrupted for a minute or so... But how could that possibly fix an outage of reaching into servers behind a firewall? Seems impossible.

Jon Marshall · ‎01-31-2015

>No, because the firewall kept its site-to-sites up (no NAT), the servers were still able to reach out (outbound dynamic NAT),

I haven't used ASAs for a while, and I know the NAT is different from that of 8.2, but if you have static NAT statements for the servers then outbound traffic should use those translations as well as inbound traffic, i.e. they wouldn't use dynamic NAT.

If the addressing is non-portable, then is the ISP that doesn't own the block advertising it for you as well?

If they do, this would also mean that the ISP that owns the block would need to advertise your specific block as well, i.e. they couldn't just include it in their summary to upstream providers, or else all traffic would come in via the other provider, as it would be the more specific route.
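To illustrate the "more specific route" point, here is a minimal Python sketch of longest-prefix matching as a remote Internet router would apply it. The /16 aggregate, the /24 and the ISP labels are made-up values for illustration only (borrowing the fake 1.1.1.x addressing used elsewhere in this thread); they are not taken from anyone's real BGP tables.

```python
import ipaddress

# Hypothetical routes as seen by a remote Internet router.
# Assumption: the owning ISP only announces an aggregate, while the
# other ISP announces the customer's more specific block.
ROUTES = [
    (ipaddress.ip_network("1.1.0.0/16"), "ISP1 (block owner, aggregate only)"),
    (ipaddress.ip_network("1.1.1.0/24"), "ISP2 (announces the customer /24)"),
]


def best_route(destination, routes):
    """Return the (network, next_hop) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins, regardless of which ISP owns the address space.
    return max(matches, key=lambda pair: pair[0].prefixlen)


network, via = best_route("1.1.1.201", ROUTES)
print(f"1.1.1.201 -> {network} via {via}")
```

In this toy table, traffic for the server block follows the /24 through the non-owning ISP, which is why the owning ISP would have to announce the specific block too if inbound traffic is meant to arrive over both links.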

I'm not saying it isn't the firewall but I'm trying to understand how your IP addressing is meant to work in terms of connectivity from the internet.

Is the IP you use for dynamic NAT/PAT from the same subnet as the one used for the servers?

Jon

noemi.berry · ‎02-01-2015

>If you have static NAT statements for the servers then traffic outbound should use those translations as well as inbound traffic ie. they wouldn't use dynamic NAT.

OK I had to think about this, and comb through the configs, and (~shudder~) the manual. Still not sure.

Here's why: the static NAT for the affected servers is defined through objects, with no src or dest specified, just using defaults, like this:

(sanitized version w/fake IPs):

! Server public IP address
object network SERVER-PUBLIC-IP
 host 1.1.1.201
! Server private IP address
object network SERVER-PRIVATE-IP
 host 10.0.0.201
! Server static NAT
object network SERVER-PRIVATE-IP
 nat (INSIDE,OUTSIDE) static SERVER-PUBLIC-IP

So, yeah, it looks like the static NAT would also apply to inside-originated traffic reaching out (e.g. server 10.0.0.201 pings 8.8.8.8).

Also, the static-NAT is in Section 2; whereas the general dynamic NAT is "after-auto" and so in Section 3:

! Firewall's own public IP address
object network FIREWALL-OUTSIDE-IP
 host 1.1.1.4
! General vanilla dynamic NAT for everyone else
nat (INSIDE,OUTSIDE) after-auto source dynamic any FIREWALL-OUTSIDE-IP

So I think you're right, it DOES look like the static-NAT would take precedence for internal-originated traffic.
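As a sanity check on that reading, here is a rough Python model of the section ordering, assuming the usual ASA behavior that Section 1 (manual NAT) is evaluated before Section 2 (object NAT), which is evaluated before Section 3 (after-auto), with the first match winning. The `NatRule` class and `translate_outbound` function are hypothetical names made up for this sketch, and the model deliberately ignores the finer ordering rules inside a section.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class NatRule:
    section: int   # 1 = manual NAT, 2 = object (auto) NAT, 3 = after-auto manual NAT
    source: str    # real (inside) source, in CIDR form
    mapped: str    # translated (outside) address
    kind: str      # "static" or "dynamic"


# The two rules from the sanitized config above, listed out of order on purpose.
RULES = [
    NatRule(3, "0.0.0.0/0", "1.1.1.4", "dynamic"),       # after-auto catch-all PAT
    NatRule(2, "10.0.0.201/32", "1.1.1.201", "static"),  # object NAT for the server
]


def translate_outbound(src_ip, rules):
    """Return the first rule matching src_ip, evaluating lower sections first."""
    addr = ipaddress.ip_address(src_ip)
    # sorted() is stable, so rules within a section keep their listed order.
    for rule in sorted(rules, key=lambda r: r.section):
        if addr in ipaddress.ip_network(rule.source):
            return rule
    return None


server = translate_outbound("10.0.0.201", RULES)
other = translate_outbound("10.0.0.50", RULES)
print(f"10.0.0.201 -> {server.mapped} ({server.kind}, section {server.section})")
print(f"10.0.0.50  -> {other.mapped} ({other.kind}, section {other.section})")
```

Under those assumptions the server leaves as 1.1.1.201 via the static rule, and only hosts without an object NAT fall through to the dynamic rule on 1.1.1.4. The authoritative check on the firewall itself would still be the real rule order and hit counts, not this model.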

So that only deepens the mystery.

How is it that when the Internet could not reach into the servers, the servers could reach out? How is it that the routing was never disrupted during this outage, yet briefly disrupting the routing with an ISP-link shutdown and a short reconvergence (pretty fast, carrying only 18K routes or so, filtered up to a /22) would resolve a server-reach-in outage?

Twice, shutting down the ISP link seemed to fix the server-reach-in outage within minutes, while all other routing stayed up. I know this all sounds very unlikely, so I'm turning to you all for ideas on what could be missing (yes, TAC case, Cisco SE, Googling, RTFM, etc., all the usual resources :) -- and forum questions!

thanks....

Jon Marshall · ‎02-02-2015

Sorry, I missed your update to this thread.

It's come to something when you have to go back to the manuals :-)

I'm still getting to grips with the way the NAT works now so I'm glad you did it and not me.

It's difficult to say without knowing how your public addressing works and how you have your BGP configured.

Could you have another look at my last post about the IP addressing questions.

In addition, is the IP on the outside interface of your firewalls from the same range as you use for the servers?

Jon

noemi.berry · ‎02-02-2015

>In addition is the IP on the outside interface of your firewalls from the same range as you use for the servers

Whoops, missed this. Yes. It's a public /24 that has a single heavily-used IP address for the firewall's outside interface, including terminating tunnels. At the time of the outages, all the NATs used that one interface too; now the general dynamic NATs have been spread out to a few other IPs there.

Also, the upstream routers each have an interface in that same one public subnet, with a VRRP VIP for the firewall to point up to.

e.g:

ISP1 5.5.5.1 --------- 5.5.5.2 Rtr1 1.1.1.1 -------

------- 1.1.1.4 FW 10.0.0.1 --------- 10.0.0.201 server

During the outage, 1.1.1.4 was always reachable. 10.0.0.201 was always able to reach out. But no one could get into 10.0.0.201 until Rtr1's upstream (5.5.5.2) was shut down.

(sorry for the awful art, and I'm leaving out Rtr2)

Reza Sharifi · ‎01-31-2015

ISPs should not have an effect on your inbound NAT.

Are the firewalls configured in active/stand-by or active/active mode?

If this worked fine for a year with the FortiGate and the problem started once it was replaced with the ASA, I would also suspect the firewall. Have you opened a ticket with TAC about this? It could be a bug in the code, and an upgrade or downgrade could fix the issue. Is there anything in the firewall logs when this happens?

HTH

noemi.berry · ‎01-31-2015

A pair of ASAs HA'd in active/standby. Unfortunately no logs yet for failover state, working on it :)

noemi.berry · ‎01-31-2015

>ISPs should not have effect in your inbound NAT

OK I'm going to get really icky-sticky detailed here, I know this is a very obscure question (and that wasn't it :) ).

Is it remotely possible that a routing re-convergence on 2 upstream routers could somehow affect a downstream static NAT? (e.g. states get reset, timers expire). It seems way out-there!

But that IS what was experienced -- in a good way. Shutting down one ISP link seemed to fix the problem of all the servers being unreachable.

Again, during the server-unreachability-outage, the servers were able to ping *out*, and the IPSec tunnels to the firewall stayed up, and the servers were reachable from another data center. No routing outage.

The strange thing is that kicking the upstream routing (by shutting down one ISP link) *seemed* to fix the server-reachability outage, twice -- or was this one mondo coincidence?!

Unfortunately this happened months ago before I started, so I can only go from what was collected at the time, which did not include firewall logs, so just trying to piece everything together.