
DNS Issue after Upgrading FWSM from 2.3 to 4.1(3)

ben_johnson
Level 1

Hi,

We recently upgraded our FWSM from version 2.3 to version 4.1, without the intermediate hop to a 3.x version first.  We were advised by a Cisco representative that this intermediate upgrade to 3.x was no longer necessary.

Overall, the upgrade appears to have been successful in that all of the access-list entries continued to work as expected and general connectivity was maintained.  However, one major issue we did come up against was that our DNS servers were no longer able to resolve external addresses.  We were able to verify that external connectivity was still possible as we could ping IP addresses outside of our network.

We have VERY open rules for the DNS servers, which sit on an internal network, allowing them to initiate connections to the outside world on any protocol and any port.  We confirmed that those rules still existed after the upgrade.  We ended up with log files filled with the following types of messages (sanitized):

===== log extract:

Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 218.244.147.140/53 to 2.2.2.2/16196 due to DNS Response
Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 222.76.210.211/53 to 2.2.2.2/17487 due to DNS Response
Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 124.238.255.41/53 to 2.2.2.2/60578 due to DNS Response
Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 222.73.40.34/53 to 2.2.2.2/35361 due to DNS Response
Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 192.31.80.30/53 to 2.2.2.2/22946 due to DNS Response
Dec  7 05:58:58 uscfw %FWSM-2-106007: Deny inbound UDP from 192.43.172.30/53 to 2.2.2.2/45231 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 210.192.121.162/53 to 2.2.2.2/57119 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 199.71.0.63/53 to 2.2.2.2/60976 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 69.63.176.200/53 to 2.2.2.2/25847 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 192.12.94.30/53 to 2.2.2.2/2727 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 222.73.40.33/53 to 2.2.2.2/47323 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 219.142.74.45/53 to 2.2.2.2/24312 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 218.244.147.85/53 to 2.2.2.2/1861 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 59.151.39.138/53 to 2.2.2.2/16776 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 219.142.74.45/53 to 2.2.2.2/58687 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 202.108.45.136/53 to 2.2.2.2/37314 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 192.43.172.30/53 to 2.2.2.2/59260 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 69.63.176.200/53 to 2.2.2.2/57812 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 192.31.80.30/53 to 2.2.2.2/16606 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 222.73.40.42/53 to 2.2.2.2/47978 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 69.63.178.21/53 to 2.2.2.2/47019 due to DNS Response
Dec  7 05:58:59 uscfw %FWSM-2-106007: Deny inbound UDP from 222.73.40.34/53 to 2.2.2.2/60616 due to DNS Response
Dec  7 05:59:00 uscfw %FWSM-2-106007: Deny inbound UDP from 208.78.71.3/53 to 2.2.2.2/41135 due to DNS Response
Dec  7 05:59:00 uscfw %FWSM-2-106007: Deny inbound UDP from 218.244.147.85/53 to 2.2.2.2/11196 due to DNS Response
Dec  7 05:59:00 uscfw %FWSM-2-106007: Deny inbound UDP from 1.1.1.1/53 to 203.57.182.180/33579 due to DNS Response
Dec  7 05:59:00 uscfw %FWSM-2-106007: Deny inbound UDP from 1.1.1.1/53 to 203.57.182.199/33274 due to DNS Response

====== end: log extract

[edit:] The majority of these messages seem to refer to responses coming back in to our servers from the outside, and this holds for the logs across the whole period we were testing the upgrade.


Our servers are represented by 1.1.1.1 and 2.2.2.2 (not their actual addresses).

I've combed through the configurations to compare the 2.3 config with the committed 4.1 config and only found two lines that reference DNS:

  dns-guard

  inspect dns (this one was part of the global policy map "policy-map global" config)

Disabling either or both of those commands (roughly as sketched below) didn't change the behaviour; we were still unable to resolve outside names.  We eventually had to revert to version 2.3 because we were at the end of the planned outage window.  Our current firewall logs still show these "Deny ... due to DNS Response" messages in large volumes, but the difference is that our servers are now predominantly the _source_, not the destination, in the log messages.
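For reference, "shutting down" those two features amounted to roughly the following (the class name under the policy map is a guess on my part, not copied from our config):

  ! standalone DNS guard function
  no dns-guard

  ! DNS application inspection under the global policy
  policy-map global
   class inspection_default
    no inspect dns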

We're still intending to roll forward to version 4.1 at a later date, but I'd appreciate any advice others may have to offer with regard to this issue.  If you need any further info or config snippets then please let me know.

Many thanks,

Ben.

7 Replies

Kureli Sankar
Cisco Employee

Those messages are usually benign. dns-guard only allows one response per request.  In your case, though, you are saying that you were unable to get name resolution at all, which is strange.

http://www.cisco.com/en/US/docs/security/fwsm/fwsm40/system/message/logmsgs_external_docbase_0900e4b18059d73b_4container_external_docbase_0900e4b180ef4f45.html#wp1279764

What other data did you gather from the time of the problem?

Did you by any chance get captures on the ingress and egress interfaces?

Did you see responses arrive on the outside interface for queries sent from the inside? Did the inside clients receive those responses?
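If you get another window, captures along these lines on the FWSM would answer that (the ACL/capture names are placeholders and 1.1.1.1 stands in for your DNS server):

  access-list CAP_DNS extended permit udp host 1.1.1.1 any eq domain
  access-list CAP_DNS extended permit udp any eq domain host 1.1.1.1
  capture dns_in access-list CAP_DNS interface inside
  capture dns_out access-list CAP_DNS interface outside
  show capture dns_out

If NAT is in play, the outside capture would need to match the translated (global) address rather than 1.1.1.1.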

Was TAC involved during your MW?

-KS

Hi KS,

Unfortunately our MW was so short we didn't get an opportunity to do packet captures.  The TAC were not involved in the upgrade process.  Internal name resolution was working fine, but that's not saying much because the DNS servers are in the same firewall security zone as all internal clients.  It was only when DNS traffic had to cross the firewall that the issue presented itself.

A couple of digs were attempted from the DNS server during the period of the problem.

  dig @localhost some.server.outside -- this one worked

  dig @1.1.1.1 some.server.outside -- this one failed (1.1.1.1 being substituted for the DNS server's actual IP address)

After reverting to the old version/config both of the above tests worked.  I'm not a DNS guy so I don't really know what that means.
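For the next attempt, a couple of extra tests from the DNS server itself might help separate the firewall path from named's own recursion (just a sketch of what could be run - neither of these was run during the outage; 198.41.0.4 is a.root-servers.net):

  # non-recursive query straight to a root server, exercising only the path through the firewall
  dig @198.41.0.4 com. NS +norecurse

  # walk the delegation chain from the roots
  dig some.server.outside +trace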

I've scanned logs on the DNS server and nothing seems to stand out during the outage period that didn't also exist either side of the outage period.  Again, I'm not a DNS guy so I don't really know what I'm looking for.  I looked at the novell-named.run log file and the 'messages-20101207.bz2' log file.

Thanks,

Ben.

[edit] There are also definitely firewall rules allowing traffic to pass IN to our network from outside to communicate with our DNS servers on port 53 for both TCP and UDP.
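Roughly, those rules look like this (the ACL name and addresses are placeholders, not our real config):

  access-list outside_in extended permit udp any host 1.1.1.1 eq domain
  access-list outside_in extended permit tcp any host 1.1.1.1 eq domain
  access-list outside_in extended permit udp any host 2.2.2.2 eq domain
  access-list outside_in extended permit tcp any host 2.2.2.2 eq domain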

Ben,

I wish you had engaged TAC; please do next time.  Without captures, I am unable to say whether the responses actually came back in or not.

I am also unsure of how you reloaded the two units.  Did you copy the code onto both units and then:

1. reload the secondary/standby,

2. fail over to the secondary once it came up, and

3. then reload the primary?

Is that what you did, or did you reload both units at the same time? If the latter, did you make sure to reload the primary first, give it a minute, and then reload the secondary, so that the active MAC address of the failover pair remained unchanged? (A rough sketch of the staged sequence is below.)
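For reference, the staged sequence I have in mind looks roughly like this (the image filename and TFTP server are placeholders, and exact copy syntax can vary by release; upgrading via the maintenance partition is another option):

  ! on each unit, stage the new image first
  copy tftp://10.0.0.5/c6svc-fwm-k9.4-1-3.bin flash:

  ! on the standby unit, reload it so it comes up on the new code
  reload

  ! on the current active unit, once the standby is back up and synced,
  ! force a failover so the upgraded unit takes over
  no failover active

  ! finally, reload the former active (now standby) unit
  reload

Done this way, the active MAC and IP addresses follow the failover, so the neighbors never need to re-learn ARP.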

Please engage TAC next time around.  Also, make sure to plan for 2-3 MWs, open a pro-active case ahead of time, and get an engineer assigned prior to the MW.

Thanks,

KS

"I am also unsure of how you reloaded both the units.

Did you copy the code onto both units.

1. reloaded the secondary/standby

2. once secondary came up you failed to it

3. Then reloaded primary

Is  this what you did or did you reload both units at the same time? If so  did you make sure to reload the primary first, give it a min. and then  reload the secondary just so, the active MAC of the failover will be  unchanged?"

Hi KS,

We did not have two FWSM blades in failover mode; we had a single blade in the production chassis.  To perform the upgrade we used a second FWSM blade that we had purchased, set it up identically to the production blade, and performed the upgrade in our test 6500 chassis.  We then inserted the newly upgraded FWSM blade into the production chassis and switched all firewalled traffic through the new blade (roughly as sketched below), leaving the working production FWSM OS and config untouched.  This way we were able to fail back successfully to a working system at the end of our allotted maintenance window.
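For what it's worth, on the 6500 supervisor that cutover amounts to re-pointing the firewall VLAN group from one slot to the other, roughly like this (the slot and VLAN numbers here are made up, not our real ones):

  ! the VLANs carried down to the FWSM are grouped
  firewall vlan-group 1 10,20,30
  ! detach the group from the old production blade (slot 3)
  no firewall module 3 vlan-group 1
  ! attach it to the upgraded blade (slot 4)
  firewall module 4 vlan-group 1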

We will likely attempt the cutover again next Tuesday and will involve the TAC if that's the case.

Thanks for your help.

Ben.

Ben,

I wish you had done this instead:

1. On your new blade in the lab, load the 2.3 code and the config from 2.3.

2. Upgrade the existing production blade from 2.3 to 4.x.

This way the MAC address would not have changed for any of the globals or the interface IP addresses.

Please read my blog post about proxy ARP and gratuitous ARP here: https://supportforums.cisco.com/community/netpro/security/firewall/blog/2010/10/27/asapix-proxy-arp-vs-gratuitous-arp

It could be that some adjacent layer 3 devices still had the old blade's MAC address cached when you diverted traffic to the new (lab) FWSM that you put in place.

The ARP timeout is 4 hours by default.

Also, these syslogs clearly say that one response per DNS request was already allowed and the rest were being dropped, so I am not sure this could have been caused by ARP issues.
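Next time it would also be worth checking, right after the cutover, what the adjacent layer 3 devices have cached for the FWSM interface and global addresses - something along these lines (192.0.2.1 is just a placeholder for one of the FWSM addresses):

  ! on the adjacent IOS router / MSFC
  show ip arp 192.0.2.1
  clear arp-cache

  ! on the FWSM itself
  show arp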

-KS

Hi KS,

I'm planning to attempt the upgrade again this coming Tuesday (11-JAN-11).

Given the trouble we had last time, I'm somewhat hesitant to put 2.3 on the 'new' blade and rely on that as my fallback position, as this could give me the exact same issue - at least, in my mind it could.  The FWSM is the only gateway out of (and around) the network, so I can't afford to take it out for too long.

Regards,

Ben.

I can certainly understand.  I do see case 616214141 that you opened with us.  How long a window are we talking about?

Make sure that TAC gets on the FWSM about 30 minutes prior to the window so the engineer can get familiar with the setup and configure the captures, SPANs, and syslogs that are needed.

Again, my suggestion would be what I said earlier: keep the lab setup with the old code and upgrade the existing production blade to 4.1(x) so there won't be any MAC address change.

Please open a pro-active case ahead of time, specify the exact date and time of the maintenance window with the timezone, and reference the case above.

-KS
