Hi there. I have an interesting problem I'm hoping someone can help out with. I have a server dual-homed to two 4507Rs, which are connected to a failover pair of PIX 535s. This particular server receives a TCP data feed from a data source that sends about 3 Mbps continuously throughout the day.
We've been noticing that every 2 hours (plus or minus 10 minutes), the feed gets disconnected. The interesting thing is that all we have to do to recover the feed is stop/start the server application that is receiving the data.
After observing the condition for a couple of days, the interval is so consistent that we can practically set our watches to it. When the traffic stops flowing, I notice that one of the interfaces on the firewall increments a receive discard error (as observed through SolarWinds Orion). The interface in question is a fiber trunk link that carries traffic to the firewall for 3 different VLANs.
I was able to capture traffic right at the time the feed stopped receiving data, and I observed the following:
1. There were long periods where the source was sending data and sequence numbers were increasing, but my server wasn't sending anything back.
2. After a second or two of the behavior in #1, I started seeing a few TCP Previous Segment Lost messages and a large number of TCP Dup ACKs.
The packet sniffer is looking at the interface on the 4507R that the firewall is connected to. The traffic flows like this:
Server ---> 4507R ---> PIX 535 (inside) ----> PIX535 (DMZ) ----> 4507R ----> provider network device.
I'm really at a loss as to what to look at next. The server gets rebooted each night, so if this were related to something going awry on the server, I wouldn't expect to see it every day. The firewall is running 6.3(5), but we don't see any problems on any other feeds or connections that pass through the firewall. The predictability of the error is also puzzling. Can anyone offer some insight into where I could look next?
Thanks for your help!
The predictability of the problem and the way it is resolved makes me wonder if it could be related to a timeout configured on the PIX.
Does the output of 'show run | in timeout' show any timeouts configured with a value of 2 hours?
Also, do you have any syslogs from the time of the problem? These may contain some useful information.
Hey Mike, thanks for the response.
I looked at the timers and didn't see anything set to 2 hours. I haven't changed anything from the defaults either:
arp timeout 300
timeout xlate 3:00:00
timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 rpc 0:10:00 h225 1:00:00
timeout h323 0:05:00 mgcp 0:05:00 sip 0:30:00 sip_media 0:02:00
timeout sip-disconnect 0:02:00 sip-invite 0:03:00
timeout uauth 0:05:00 absolute
Unfortunately, I didn't have syslog set up correctly on that firewall when the problem last happened. I've fixed that and should be able to provide syslogs the next time it occurs.
Have you ever used trunked interfaces to the firewall? I've noticed that the virtual interfaces don't give me as much information as the "real" interface does. Does that mean any frame errors are recorded only on the physical interface?
I don't have a PIX running 6.3(5) handy to take a look, but I believe you are correct: certain information is only recorded on the physical interface, rather than on each individual subinterface.
The syslogs may have some additional information that will help point you in the right direction to solve this problem.
Also, it may be a stretch given how predictable the symptoms are, but with interface errors you always want to check the speed/duplex settings on both sides of the link. These settings must be identical on both ends (e.g., auto/auto on both sides, or 100/full on both sides), or interface errors can start to climb.
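For reference, hard-coding the Catalyst side of a copper link looks like this (interface name is just an example); whatever you choose, the PIX side has to be hard-set to the matching value rather than left at auto. On gigabit fiber links, the usual practice is to leave autonegotiation on at both ends instead.

```
interface FastEthernet3/1
 speed 100
 duplex full
```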
Thanks Mike. I'll double check the speed/duplex settings. We always try to set those manually to avoid any possible conflicts, but it wouldn't be the first time one had slipped through.
As for the syslog: I set trap logging to notifications, and right about the time we started seeing the problem, I saw a lot of these messages:
%PIX-4-410001: Dropped UDP DNS reply from outside:xx.xx.xx.xx/53 to webdmz:xx.xx.xx.xx/59323; packet length 524 bytes exceeds configured limit of 512 bytes
I think that message would explain why the receive discard errors on the physical interface increase, but would that have any effect on other TCP flows currently going through the device?
At a basic level, this should not affect any of your TCP flows, since DNS traffic is sent via UDP. However, a couple of situations come to mind where this could affect a TCP flow in your case:
1. If the PIX is receiving an unusually high amount of these messages during the time of the problem, it could indicate that the PIX's interface is becoming overwhelmed, which causes your TCP flow to fail because the interface simply does not have enough buffer space to process all of the packets it is receiving. I would guess that this is probably not the case, though, since you only need to restart the server to get the TCP data to start flowing again.
2. Another possibility would be if this TCP flow that is failing relies on DNS to find the server. If, for example, the client had a DNS cache that timed out after 2 hours and DNS replies were blocked when the client tried to resolve the server's IP address again, the TCP connection would fail since the client does not know the IP address of the server anymore. Again though, this fails to explain why restarting the server seems to fix the problem.
Do the addresses you sanitized from the syslog message correspond with a DNS server on the outside and a host that is participating in this failed TCP connection on the webdmz interface? If not, I'm afraid these syslog messages may be a red herring for you.
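As an aside: if those replies turn out to be legitimate (modern resolvers using EDNS0 routinely send UDP replies larger than the classic 512-byte limit), the cap in that syslog message is adjustable. I believe 6.3 lets you raise it through the DNS fixup; something along these lines should stop the drops, though double-check the exact syntax on your version:

```
fixup protocol dns maximum-length 768
```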
Unfortunately, the TCP connection is strictly IP-to-IP... no DNS queries at all on either side. I'm agreeing with the red herring diagnosis, although it is interesting that we always see the interface errors at the same time as the disconnect.
I see a lot of no buffer errors on the interface. I'm clearing the counters again to see how much they increment after the next outage.
Just for giggles, let's say this UDP DNS problem really is exhausting the buffers on that interface. What are my options? I could create a route to null for the source of those requests. I guess I could also separate some of those virtual interfaces out from that one physical interface. The actual traffic loads are almost negligible, though; I guess it's just the packet counts during that window that are killing it.
If that does turn out to be the problem, I would recommend trying to find out why you see such a large burst of traffic that is overwhelming the interface rather than simply routing it to a black hole. Also, if it does turn out to be a load issue, I would also recommend trying to alleviate some of the load on that link like you mentioned.
I know you mentioned that you have already gathered packet captures, but it might be useful to gather simultaneous, bi-directional packet captures on either side of the PIX during the time of the problem. This will give you a much clearer picture of what is going on and also indicate what, if anything, the PIX is doing to the TCP stream that is causing it to fail.
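Once you have both captures, a quick script can flag the symptom you described earlier. This is just a sketch, assuming you've exported the ACK numbers for the flow to a list (e.g., with tshark's `-T fields -e tcp.ack`); runs of identical ACKs coming from your server are the classic sign that it's missing a segment:

```python
def dup_ack_runs(acks, threshold=3):
    """Return (ack_value, count) for each run of >= threshold identical ACKs.

    A run of 3+ duplicate ACKs is what Wireshark labels "TCP Dup ACK"
    and is the receiver's signal that a segment was lost upstream.
    """
    runs = []
    count = 1
    for prev, cur in zip(acks, acks[1:]):
        if cur == prev:
            count += 1
        else:
            if count >= threshold:
                runs.append((prev, count))
            count = 1
    if acks and count >= threshold:
        runs.append((acks[-1], count))
    return runs

# Toy example: four ACKs for byte 200 in a row => one dup-ACK run.
print(dup_ack_runs([100, 200, 200, 200, 200, 300]))  # [(200, 4)]
```

Comparing where these runs start in the inside-interface capture versus the DMZ-side capture would tell you whether the segment disappears before, inside, or after the PIX.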
OK -- I've definitely confirmed it's something related to the DNS events occurring at exactly the same time. I had my syslog going, and I was watching the DNS server logs and the interface errors; they all came alive at the same moment.
That rules out the firewall as the source of the problem. Now on to the DNS server...
Thanks again for all your help Mike.