Imagine someone poses this scenario to you. There is no "known and right" answer here. It's just a brainstorming exercise.
A host sitting on a LAN segment can reach a server that sits on a different subnet across a L3 cloud.
The client can hit the WEB server, click on the "contact us" link, and it's fast.
However, when the client tries to download a file from the WEB server, it's really slow, and the transfer eventually dies.
The question was 'what do you look for'?
The connection shown was a client PC connected to a L2 switch, a router, a WAN connection, another router, a L2 switch, a Firewall, and then the server.
My approach and thinking was the following:
The client can reach the server, so the FW is not blocking it. There is also no asymmetric routing (at least as far as the FW segment is concerned) because the connection is established. If, say, the client's SYN packet was routed through one firewall and the server's SYN-ACK was routed back toward the client through a different stateful firewall (not shown on the drawing), the connection would never be established.
I also suggested running an extended PING test with an extended packet size of, say, 1400 BYTES, from the client to the server.
The answer given was that the RTT was about 50 or so ms and fluctuating a bit - say, between 50 and 75ms. Nothing significant. Only 1 lost packet here and there...nothing more.
My answer was that if there was anything wrong with the hardware or the integrity of the circuit (let's say a ton of errors on some interfaces), it would have been reflected in the stringent PING test that I recommended.
Moreover, there is nothing wrong with the network's routing, because there is no sign of latency. There is no reason to go from L3 hop to L3 hop checking the routing: if some sort of asymmetric routing were causing the problem by intermittently sending traffic through, say, a 56K connection, that would have shown up in the continuous, extended PING times.
Was my line of thinking logical or reasonable?
There are many things that could be causing this, but these are the 2 things that spring to mind:
1) WAN bandwidth limitation when you try to download, as this will probably entail a larger transfer of data than simply connecting to the web site. Yes, your pings can be useful in this test, but bear in mind TCP works very differently from ICMP.
2) The obvious one is the firewall. Is the download using additional ports (perhaps not, if the download starts)? Is the firewall getting overloaded? Is the download sensitive to delays which may be introduced by a busy firewall?
These are the first 2 things I would look at.
I mean, the bandwidth is the first thing I thought of, too, but I figured the PING packets were being sent continuously and they were very large (1400 bytes) and barely any packets were being dropped (just one or 2 very sporadically), so I figured the bandwidth can't be the issue because I would see a lot more damage to the PING test if it was an issue.
As far as the FW, good point...I didn't think of that at first, but I did later.
So, I am glad you told me what you would look at -- that's great. I wanted to know that, too. But I also want your honest opinion about my logic and approach. I mean, did it make sense? Was it logical?
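As a rough sanity check on the bandwidth argument above, here is a minimal sketch of how much traffic a 1400-byte ping test actually puts on the WAN. It assumes one echo request per second (a common default, but not stated in the thread), IPv4 with no options, and ignores L2 framing; the function name is hypothetical.

```python
# Rough offered load of a continuous ping test, at the IP layer.
# Assumptions (not from the thread): 1 ping per second, IPv4 without options.

ICMP_HEADER_BYTES = 8    # ICMP echo header
IPV4_HEADER_BYTES = 20   # IPv4 header without options


def ping_offered_load_bps(payload_bytes, pings_per_second=1.0):
    """Return the one-way load in bits per second generated by the pings."""
    packet_bytes = payload_bytes + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES
    return packet_bytes * 8 * pings_per_second


load = ping_offered_load_bps(1400)
print(f"1400-byte pings at 1/s offer about {load / 1000:.1f} kbit/s each way")
```

Under those assumptions the test offers only about 11 kbit/s in each direction, so a clean ping run may say little about how the link behaves when a file transfer tries to fill it.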
Nothing wrong with your approach at all. It makes perfect sense to test the end to end connectivity with ping packets. Ruling that out, though, doesn't necessarily mean there isn't a problem with the WAN and the app.
But from experience, whenever I see a firewall in the equation it's almost always worth looking at that sooner rather than later :-)
I understand and agree with you regarding the APP. In fact, that was my contention: that the APP is probably having a problem, maybe the web server itself, or maybe the server has to access some resources from another server and that is slow. Does that part make sense?
With regard to the WAN, I understand that TCP uses a sliding window and as congestion builds, the window shrinks until you have nothing - the session dies. But the fact that the PING packets are continuous and at 1400 bytes, I figured if the WAN was experiencing a problem so bad that it would kill a very slow FTP session altogether, those PING packets would be dropping -- at least a lot more than one or two every 50 or so.
As for the FW, I agree. It may not be robust enough. Just didn't think of that right off the top of my head.
Lastly, here is something I also didn't think of - QoS. Maybe QoS is placing FTP packets in a low priority queue.
What do you think?
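One way to put numbers on "one or two lost pings out of every 50" is the widely cited Mathis et al. approximation for steady-state TCP throughput, rate ≈ C · MSS / (RTT · sqrt(p)). The sketch below is only an estimate under stated assumptions: a 1460-byte MSS, a 60 ms RTT (the middle of the 50 to 75 ms range reported), 2% loss (one in 50), and that the TCP flow sees the same loss rate as the pings. None of these were measured for the download itself.

```python
import math

# Mathis et al. approximation: throughput ~ C * MSS / (RTT * sqrt(loss)).
# C = sqrt(3/2) is the constant usually quoted for this simplified model.
MATHIS_C = math.sqrt(1.5)


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Estimate the ceiling on a single TCP flow's throughput, in bits/s."""
    return MATHIS_C * mss_bytes * 8 / (rtt_seconds * math.sqrt(loss_rate))


# Assumed values: MSS 1460 bytes, RTT 60 ms, 1 lost packet in 50 (2%).
estimate = mathis_throughput_bps(1460, 0.060, 1 / 50)
print(f"Estimated TCP ceiling: {estimate / 1e6:.2f} Mbit/s")
```

With those inputs the estimate comes out under 2 Mbit/s, which suggests that a loss rate that looks harmless in a ping test could still hold a TCP download back, though on its own it would not explain a transfer that dies completely.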
"In fact, that was my contention: that the APP is probably having a problem, maybe the web server itself, or maybe the server has to access some resources from another server and that is slow. Does that part make sense?"
Absolutely it makes sense. The server NIC itself could be getting overloaded, although that is unlikely. The download may require resources from another server(s) as you say and this may well have an impact.
Your ping test of the WAN was a valid thing to do and it does indeed give you a quick view of the WAN throughput with the previously mentioned provisos.
QoS - yes this could well have an effect as well.
As I said originally it could be many different things, from the app, to the WAN, to the firewall. The way I approach these sorts of issues is to check the most obvious things first and, if you rule them out, look at the less obvious things next.
ok, a host with 192.168.10.10/24 is connected to a switch.
A server with 192.168.10.20/28 is connected to another port on the same switch.
All CAM/MAC tables are empty. All ARP tables are empty. New connections.
Can you go through the STEP BY STEP process of how the client will try to PING the server and will it work?
[EDIT] Please dont lab it up...do it in your head [EDIT]
"Can you go through the STEP BY STEP process of how the client will try to PING the server and will it work?"
Okay, I'm assuming the client and server ports are allocated to the same vlan. Also assuming you ping the IP address so we can rule out DNS lookups.
client knows its subnet is 192.168.10.0 255.255.255.0
server knows its subnet is 192.168.10.16 255.255.255.240
client wants to ping server. So the client compares the server IP address using its own subnet mask, i.e.
192.168.10.20 255.255.255.0 = 192.168.10.0/24 so client believes server is on the same subnet.
client arps out for server, which is a broadcast at L2. Because the server and the client are in the same vlan the server will receive the broadcast. The switch will also update its CAM table because it now knows which port the client is on.
The problem comes when the server tries to respond to the client. The server compares the client address with its own subnet mask. So 192.168.10.10 255.255.255.240 = 192.168.10.0/28. But the server knows it is on the subnet 192.168.10.16/28. So the server believes the client is on a different subnet and therefore has to use its default-gateway to respond to the client.
So the server arps out for its default-gateway. Now it depends on:
1) whether there is actually a default-gateway L3 routed interface for the server
2) if there is, does the L3 device have proxy-arp enabled? If it does, the ping may well work.
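The "is the other host on my subnet?" comparison that each side makes in the walkthrough above can be sketched with Python's standard `ipaddress` module. This only models the mask comparison, not ARP or the switch; the helper name `on_link` is made up for illustration.

```python
import ipaddress


def on_link(local_interface, remote_ip):
    """Return True if remote_ip falls inside the local interface's subnet.

    local_interface is an address with prefix length, e.g. "192.168.10.10/24".
    """
    iface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(remote_ip) in iface.network


client = "192.168.10.10/24"
server = "192.168.10.20/28"

# Each host applies its OWN mask to the other host's address.
print("client subnet:", ipaddress.ip_interface(client).network)
print("server subnet:", ipaddress.ip_interface(server).network)
print("client sees server as local:", on_link(client, "192.168.10.20"))
print("server sees client as local:", on_link(server, "192.168.10.10"))
```

The client's check succeeds and the server's fails, which is why the server falls back to its default gateway for the reply.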