I need to identify all the causes that could affect the performance of applications built on ASP.NET / Forms / VB etc., with Oracle 9i at the backend. The user-facing clients are thick clients and not web enabled.
Now, applications apart, consider two offices, main and branch, located about 200 km from each other and connected via a 10 Mb dedicated WAN link provided by our ISP.
The underlying technology of this link is not known, but chances are it's MPLS without any CoS aspects.
The application/DB servers are hosted in the main datacenter, and the users at the branch access them from there as well.
Ping to the servers returns a reply of <1 ms from within the main site. Over 1500 users access the application and generate the required reports from it just fine.
Over at the branch office, utilisation of the link between the two offices is at most about 50% at peak times. Pings to the app servers hosted at the main datacenter return an average of 7 ms.
Below is a tracert to a PC at the branch office from the main office:
Tracing route to branchpc.ds.company.local [10.30.1.54]
over a maximum of 30 hops:
  1    <1 ms    <1 ms    <1 ms  10.100.55.2
  2    <1 ms    <1 ms    <1 ms  10.100.10.252
  3    15 ms     7 ms     7 ms  172.31.1.2
  4     7 ms     8 ms     7 ms  10.30.7.10
  5    16 ms     7 ms    13 ms  10.30.1.54
Users from branch are complaining that the app access process is abysmally slow and I have confirmed this.
A report generation process which takes 5 secs at main is taking almost 15 secs at branch.
We cannot host the application/DB instance at the branch to improve performance because there is no manpower available there to maintain and administer the application(s). Access to all the applications from branch to main is slow; it is not specific to one application.
While the report is being generated, Task Manager in Win XP shows the task associated with that application going into a "Not responding" state, only to return to the "Running" state after a long delay, when the required window/report opens after about 10-15 secs.
What could be the causes? Where/how should I start investigating?
I understand that at the main site our client-to-server communication happens over a gigabit LAN, while at the branch it's only a 10 Mbps dedicated WAN pipe. Would this have an effect? Would it have an effect even when only 50% of the link is utilised and there is still sufficient bandwidth left?
"What could be the causes?"
You're likely seeing the impact of the additional latency to/from your branch site. You might also be seeing the impact of transient congestion.
If there's FIFO queuing across the WAN link, you might switch to FQ (also ensure it activates for the actual bandwidth, i.e. the 10 Mbps).
In cases like yours, the root problem is that the applications were not developed to work well with increased latencies. WAAS-type products can often mitigate much of the impact of the additional latency.
I do not agree with Joseph this time.
Latency, especially when in the order of a few ms, is rarely a cause of bad application performance.
Neither is it frequent that applications are responsible. In most cases they don't even know there is a network, or of which type; they are just generic applications.
What one has to do is arm oneself with patience and a sniffer, and capture around to find where and why packets are dropped or seriously delayed.
On the face of it what you state makes sense. I'll try FQ as suggested by Joseph and then run Wireshark to see if anything odd shows up.
Indeed, it makes sense that a few extra ms of latency shouldn't be a problem, until you consider that your traceroutes show between 7 and 16 ms (or so) vs. under one ms. If the "under" is only about half a ms, then the branch is 14 to 32 times slower for RTT. If the application is "chatty", all those few extra ms can add up to be noticeable vs. running the application across the LAN.
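As a rough, back-of-the-envelope sketch of how "chatty" the application would need to be for latency alone to account for the reported slowdown: the 5 s and 15 s report times and the 7 ms WAN ping come from this thread, while the 0.5 ms LAN RTT and the idea that every round trip happens strictly one after another are assumptions made only for illustration.

```python
# Back-of-the-envelope estimate: how many sequential round trips would
# explain the extra report time purely through added latency?
# Assumptions (not measured): LAN RTT of 0.5 ms, strictly sequential
# request/reply exchanges, identical bandwidth and server-side behaviour.

def implied_round_trips(lan_seconds, wan_seconds, lan_rtt_ms, wan_rtt_ms):
    """Return the number of sequential round trips implied by the slowdown."""
    extra_seconds = wan_seconds - lan_seconds
    extra_rtt_seconds = (wan_rtt_ms - lan_rtt_ms) / 1000.0
    return extra_seconds / extra_rtt_seconds


if __name__ == "__main__":
    trips = implied_round_trips(lan_seconds=5, wan_seconds=15,
                                lan_rtt_ms=0.5, wan_rtt_ms=7.0)
    print(f"about {trips:.0f} sequential round trips")
```

Under these assumptions the answer is on the order of 1,500 request/reply exchanges per report, which is a plausible count for a client that talks to a database in many small steps.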
In the "grand scheme of things", assuming there isn't a major network issue, users reporting horrible application performance could be true relative to their LAN experience.
You've described this as a "thick" application using Microsoft(?) architecture on the hosts(?) and Oracle on the backend. Microsoft applications usually, by default, are not well coded for non-LAN performance. (BTW, a Microsoft presentation, for developers, providing information how to better design for non-LANs: http://download.microsoft.com/download/f/2/1/f2146213-4ac0-4c50-b69a-12428ff0b077/Optimizing_Applications_for_Remote_File_Access_Over_WAN.pptx)
You also described a "report generation process" taking 3x as long, although something like that I would expect to be less latency sensitive and more bandwidth sensitive.
What I didn't mention before, and it's hard to be certain from one traceroute sample, is that the jump from 7 to 16 ms might also indicate/confirm that you do have transient congestion. Your interface stats don't show any packet drops, but the queue depths are such that packets could be delayed but not dropped. Watching/monitoring the active queue depth is often a good way to see if there are traffic bursts that create transient congestion. (BTW, the reason I suggested FQ is that it keeps a few heavy, bandwidth-demanding flows from too adversely impacting light-bandwidth flows. However, it's not a panacea.)
There's no harm in doing an in-depth analysis of your network by sniffing traffic, etc., and there might really be a problem. But what is often overlooked is that what works well on a LAN with sub-ms latency doesn't always work as well once the latency begins to climb, and this is the nature of the network.
Joseph, most insightful.
To explain to my most non-technical bosses that they need to invest in WAN optimisers or presentation servers, how can I explain in layman terms that it should be expected to have a certain 'not within office LAN' experience of a gigabit internal network vis a vis being on a 10Mb site network?
If they contend that not all 10Mb is utilised and there is ample room for the traffic to flow, how would I justify the delay given what you mentioned in your posting, report generation is more of a bandwidth hogger than latency.
Also, I noticed, if similar reports are being generated over a web interface rather than the thick client, they are comparatively much faster and nearer to the LAN experience.
Also Joseph, pls explain a bit more of the sentence "If the "under" is only about half a ms, then the branch is 14 to 32 times slower for RTT".
How do we reach this figure of 14 to 32 times?
"To explain to my most non-technical bosses that they need to invest in WAN optimisers or presentation servers, how can I explain in layman terms that it should be expected to have a certain 'not within office LAN' experience of a gigabit internal network vis a vis being on a 10Mb site network?"
There are two important components to network performance: one is bandwidth, which everyone fixates on; the second is latency. Even if you had a gigabit WAN link, its latency would likely be higher than a LAN's, and this impacts most network application "performance".
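To make the bandwidth-versus-latency point concrete, here is a minimal sketch of a toy model in which each request/reply exchange costs one round trip plus the time to put its bytes on the wire. The workload (1,000 exchanges of 500 bytes each) and the 0.5 ms LAN RTT are invented for illustration and are not measurements of the application discussed here.

```python
# Toy model: time for a sequence of request/reply exchanges.
# Each exchange costs one round trip plus the serialization time of its bytes.
# The workload figures are illustrative assumptions, not measurements.

def exchange_time_s(exchanges, bytes_per_exchange, rtt_ms, bandwidth_mbps):
    """Total seconds for `exchanges` sequential request/reply exchanges."""
    serialization_s = (bytes_per_exchange * 8) / (bandwidth_mbps * 1_000_000)
    return exchanges * (rtt_ms / 1000.0 + serialization_s)


if __name__ == "__main__":
    workload = dict(exchanges=1000, bytes_per_exchange=500)
    for label, rtt_ms, mbps in [
        ("LAN, 1 Gbps, 0.5 ms RTT", 0.5, 1000),
        ("WAN, 10 Mbps, 7 ms RTT", 7.0, 10),
        ("WAN, 1 Gbps, 7 ms RTT", 7.0, 1000),
    ]:
        total = exchange_time_s(rtt_ms=rtt_ms, bandwidth_mbps=mbps, **workload)
        print(f"{label}: {total:.2f} s")
```

With these made-up numbers the LAN case takes about 0.5 s, the 10 Mbps WAN about 7.4 s, and a hypothetical gigabit WAN with the same 7 ms RTT still about 7.0 s. In this model a hundredfold bandwidth increase barely moves the result, because almost all the time is spent waiting on round trips.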
One of the common key features of most WAN optimizers/accelerators is that they cache network data so that, other than the first access across the WAN, subsequent access is really LAN access. (I.e., since you can't reduce WAN latency, avoid incurring it.)
"If they contend that not all 10Mb is utilised and there is ample room for the traffic to flow, how would I justify the delay given what you mentioned in your posting, report generation is more of a bandwidth hogger than latency."
My guess would be that the report generation process, especially if it is a "thick" app, might still be doing more network I/O than is really necessary, but at least some of that network I/O might be passing a larger proportion of data than more interactive apps do, and this would also be impacted by the bandwidth available.
BTW, you note there's "ample room", but is there? When you say utilization of 50%, over what time intervals? Does this include just the 15 seconds when this report is running?
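The question about time intervals can be illustrated with a small sketch. The traffic pattern is invented (20 seconds of a saturated 10 Mbps link followed by 40 seconds at 2.5 Mbps, repeated for five minutes); it only shows how a long averaging interval can report 50% while shorter windows are fully saturated.

```python
# Illustration: an average over a long interval can hide short bursts.
# The per-second traffic pattern below is invented for the example.

LINK_MBPS = 10


def utilization_percent(samples_mbps, link_mbps=LINK_MBPS):
    """Average link utilization over the given per-second samples."""
    return 100.0 * sum(samples_mbps) / (len(samples_mbps) * link_mbps)


def peak_window_percent(samples_mbps, window_s, link_mbps=LINK_MBPS):
    """Highest average utilization over any window of `window_s` seconds."""
    return max(
        utilization_percent(samples_mbps[i:i + window_s], link_mbps)
        for i in range(len(samples_mbps) - window_s + 1)
    )


if __name__ == "__main__":
    # Five minutes of one-second samples: 20 s saturated, then 40 s at 2.5 Mbps.
    samples = ([10.0] * 20 + [2.5] * 40) * 5
    print(f"5-minute average: {utilization_percent(samples):.0f}%")
    print(f"worst 15-second window: {peak_window_percent(samples, 15):.0f}%")
```

Here the five-minute average is 50%, yet the worst 15-second window is 100%. A report that happens to run during such a window would see a congested link even though the monitoring graph shows "ample room".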
Also, WAN optimizers/accelerators commonly compress the data that does cross the WAN link to increase effective bandwidth.
"Also, I noticed, if similar reports are being generated over a web interface rather than the thick client, they are comparatively much faster and nearer to the LAN experience."
In theory, web applications, being designed for networks, should be structured to utilize networks more optimally than "thick" applications. If this is so in this case, it could explain the better performance. It also might confirm that there isn't a network issue, but a network application issue. Consider, how can one application perform better than another delivering the same results but using the same network?
"Also Joseph, pls explain a bit more of the sentence "If the "under" is only about half a ms, then the branch is 14 to 32 times slower for RTT".
How do we reach this figure of 14 to 32 times?"
If we assume LAN pings are .5 ms, and your WAN pings are 7 to 16 ms, then those pings are 14 to 32 times longer. (e.g. 16 ms divided by .5 ms = 32)
This means that any network I/O waiting on a reply will take 14 to 32 times longer.
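The same arithmetic as a short sketch; the 0.5 ms LAN figure is the assumption stated above (the pings only report "<1 ms"), and 7 ms and 16 ms are the lowest and highest WAN values in the traceroute.

```python
# Ratio of WAN round-trip time to an assumed LAN round-trip time.
LAN_RTT_MS = 0.5  # assumption: "<1 ms" ping taken as roughly half a millisecond


def slowdown(wan_rtt_ms, lan_rtt_ms=LAN_RTT_MS):
    """How many times longer one round trip takes over the WAN."""
    return wan_rtt_ms / lan_rtt_ms


if __name__ == "__main__":
    for wan_rtt_ms in (7, 16):  # lowest and highest values in the traceroute
        print(f"{wan_rtt_ms} ms / {LAN_RTT_MS} ms = {slowdown(wan_rtt_ms):.0f}x")
```

This prints 14x for 7 ms and 32x for 16 ms. If the real LAN RTT is lower than 0.5 ms, the ratios are correspondingly higher.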
Joseph, technically, pls relate to me the differences that would generally make a LAN link of 100Mbps different in efficiency and performance from a WAN link of equivalent dedicated bandwidth on the latest technology, say MPLS (everything going by default class)?
Everything else being equal, the LAN link will generally have less latency than a WAN link. This assumes two hosts are physically closer on a LAN than on a WAN.
Hi Joseph & paolo.
I ran a sniffer while I performed the required action on the client from my remote site. Following is the description of the IPs shown in the three attached Wireshark pictures, 1, 2 and 3, in that order.
a: 10.30.1.81: IP address of the workstation I am using to access the application at the main site, on which the thick client (built on VB 6 code) is loaded.
b: 10.100.8.115: IP address of the machine where the main application resides.
c: 10.30.1.81: IP address of the backend Oracle database server used to store the associated data.
For a whole half a minute, the following is repeated:
can-dch > nuts_dem : between the database server and the client on the other end.
can-dch is TCP port 1919, labelled 'Candle Directory Service', and nuts_dem is TCP port 4132, labelled 'NUTS Daemon'.
The task manager of the client shows the process in a 'Not Responding' state while these two transactions are under way.
What's happening here?? Pls advise!!
I assume "c:" should be 10.100.8.25.
It looks to me like you have a very "chatty" interaction between the host running the application and the servers. If you look at the time deltas, replies from the servers take about 8 to 10 ms or so (about the minimum times you saw with your ping tests).
As to the 'Not Responding state', my understanding is, this can just be an indication that the application is busy doing something (for a while) other than responding to Windows; in this case doing lots of network I/O. Since the application does eventually succeed, there's nothing wrong, per se.
This is only a guess, but the host application might be walking a database table, row by row, and issuing a network I/O for each row (attachment two?).
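To show why a row-by-row pattern would hurt so much over the WAN, here is a minimal sketch that counts only round-trip latency. The row count, the batch size of 100 and the 0.5 ms LAN RTT are hypothetical; `rows_per_round_trip` is just a parameter of this model, not a setting of any particular database driver, and how the actual client fetches its rows is not known from the traces.

```python
import math

# Latency-only model of fetching rows from a database server.
# Bandwidth and server processing time are deliberately ignored.


def fetch_time_s(rows, rows_per_round_trip, rtt_ms):
    """Seconds spent waiting on round trips to fetch `rows` rows."""
    round_trips = math.ceil(rows / rows_per_round_trip)
    return round_trips * rtt_ms / 1000.0


if __name__ == "__main__":
    rows = 1500  # hypothetical number of rows walked for one report
    for label, rtt_ms in (("LAN", 0.5), ("WAN", 8.0)):
        one_by_one = fetch_time_s(rows, 1, rtt_ms)
        batched = fetch_time_s(rows, 100, rtt_ms)
        print(f"{label}: row by row {one_by_one:.2f} s, "
              f"100 rows per round trip {batched:.2f} s")
```

With these numbers, walking 1,500 rows one at a time costs 0.75 s of waiting on the LAN but 12 s at 8 ms RTT, while fetching 100 rows per round trip costs 0.12 s on the WAN. That gap is the kind of difference an application-level change, or an optimizer that answers locally, is aiming at.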
From these packet traces, I'm still of the mind the primary issue is an application that's poorly architected for a WAN, even one with little WAN latency.
BTW, another approach to verify whether or not it's an application architecture issue: use a "network impairment tool" or "WAN emulator tool" between a host and the servers on the main LAN, and "dial in" similar WAN latency and 10 Mbps of bandwidth. If the application then behaves as it does on the actual WAN, you've confirmed it's especially sensitive to increased latency.
Thanks Joseph. Here's my queuing output from one of the routers.
Interface GigabitEthernet0/1 queueing strategy: fair
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: Class-based queueing
Output queue: 0/1000/64/0 (size/max total/threshold/drops)
Conversations 0/5/256 (active/max active/max total)
Reserved Conversations 1/1 (allocated/max allocated)
Available Bandwidth 9933 kilobits/sec
How do I make sure that it activates for the actual bandwidth of 10 Mbps?
Do I need to set my gigabit interfaces to 10 Mbps?
FQ looks like it might already be enabled and that the interface is running at 10 Mbps. What about the other side?
Since you mentioned MPLS, is it possible for any other sites to communicate directly with this branch or the main site? (Or is the 10 Mbps exclusive between the two sites?)