By the time you have finished reading this, I am sure you will have come across a most fascinating networking issue, one haunted by our friendly ghost Casper.
Having exhausted most of my resources trying to solve this eerie problem, I finally decided to involve our ISP, hoping that this would be the end of it...but it was not to be. So here I am, writing this at NetPro.
To cut a long story short, our ISP provided us with an EoATM 100 Mbps link between two locations, say A and B.
But ever since the line was handed over, we have had intermittent problems that required a switch reset, and we also felt that we were not getting the right speed: the data transfer rates (FTP copy and other stuff) were really not befitting a 100 Mbps link.
In order to make sure, this time the ISP guy brought some equipment to our premises and confirmed that speed at Layer 2 is indeed 100.
There are two Cisco 3800 series routers (2 gigabit interfaces each) across Sites A and B, and two media converters at each end converting fiber to UTP. The media converters are also set at 100 Mbps.
Now the strange thing is that when we configure the two routers (Site A and B) in 'bridging' mode and start a data transfer across the link, the speed picks up markedly (which is what should be normal at all times). There is also another 100 Mbps link provided to us by the same ISP between Buildings A and C, which works just fine, as it should.
The moment we put our routers at Site A and B in routing mode, we suffer delays and all data transfers slow down, and that is without bringing any core/edge switches into the picture.
Various things have been done to reach some conclusion:
1. IP router configurations have been reset and cut down to the bare minimum needed, with ip cef enabled and all QoS commands disabled.
2. Configurations have been checked with all combinations of Speed Auto/100 and Duplex Full/Auto, with the best results coming from Full Duplex/100, but still far below satisfactory.
3. Equipment which serves between Site A and C has been temporarily put between Site A and B, with same non-satisfactory results.
4. Earthing issues/electrical disruption in the room where the routers are located have been looked into. Routers on both sides have been changed to rule out hardware issues. We also tested the line with our routers moved into another room, which rules out electrical disturbance of any sort.
It seems that, despite Layer 2 showing us the full 100 Mbps, transfers at Layer 3 and above are unable to provide the required service. Opening applications across the two buildings is very slow, as most of our servers reside at Site A with the user base at Site B.
Currently the ISP engineer has provided us with a patched pure fibre link between Sites A and B, with no ISP equipment in between, and we have connected our two core switches in both buildings directly to the UTP interface of the media converters, but that is not a permanent solution. The ISP engineer is also trying hard to find this ghost problem. He says that he has found no problems on his side and that the only thing in the middle is an MPLS-enabled router. But even he is a bit baffled.
What else can we look at?
Thanks for taking time to read this whole ghost story. If you have read this all, I am sure you won't stop thinking ;)
This is a great story. I'm sure there are good engineers at work on it and they will make things work eventually. I would be ready to defend the design at L3, but I don't have even close to the minimum elements to defend my thinking as it applies to a real situation.
Good luck and keep us posted.
Sounds like Sniffer(Ethereal) time to me. Start on the LAN side, and if the issue is not obvious, insert a switch on the WAN side so you can SPAN the traffic between the 2 routers.
I would be looking for MTU size issues, or perhaps routing. Are you static routing or using a dynamic protocol?
Ethereal and something more perhaps. Reasonably all the basics of the design are correct like the things you are saying, but there is some packet loss at times of sustained traffic.
Ultimately, this is what slows down users.
Now when you use Ethereal and the like, sometimes the tool has a feature that analyzes the raw data and will tell you how many flows there are, and which ones are suffering. And if there is packet loss, it must be reflected somewhere in the interface counters on the routers/switches.
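To put a rough number on why a little packet loss hurts so much, here is a minimal sketch of the well-known Mathis approximation for the ceiling on a single TCP flow, roughly (MSS / RTT) * (1.22 / sqrt(loss)). The MSS, RTT and loss figures used below are illustrative assumptions, not measurements from this link, and the function name is made up for the example.

```python
import math

def mathis_ceiling_bps(mss_bytes, rtt_s, loss):
    """Approximate upper bound on one TCP flow's throughput (bits/s).

    Mathis et al. approximation: rate <= (MSS / RTT) * (C / sqrt(p)),
    with C ~= 1.22. This is a rule of thumb, not a measurement.
    """
    if rtt_s <= 0 or not (0 < loss <= 1):
        raise ValueError("rtt_s must be > 0 and loss must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (1.22 / math.sqrt(loss))

if __name__ == "__main__":
    # Assumed values: 1460-byte MSS, 10 ms round trip, varying loss rates.
    for p in (0.0001, 0.001, 0.01):
        mbps = mathis_ceiling_bps(1460, 0.010, p) / 1e6
        print(f"loss {p:.2%}: about {mbps:.1f} Mb/s per flow")
```

The point is only that the ceiling falls with the square root of the loss rate, so the drop counters are worth checking even when the loss looks small.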
Hi Dave, I am only using a single laptop on each side, connected to the LAN interface of each of the two routers.
Tried different MTU sizes and realised that 1500 still remains the one that works best.
Hmm...Static routing probably, having both the WAN interfaces within the same range. As I mentioned, the ISP has provided us with a direct fiber line with none of his equipment in the middle, still, the moment I push my routers on each side, the problem re-surfaces.
I didn't quite understand the SPAN issue that you mentioned, if I insert a switch on the WAN side? What would be the result of this experiment?
How can I try the dynamic protocol?
When you install your routers and things go slow, have you checked all interface counters for drops?
Run Iperf in UDP mode between two sites to see what you get (convert to L2 bps taking into account overhead). While Iperf is running check CPU utilisation on your routers.
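For the overhead conversion, something along these lines should do as a rough sketch. It assumes plain untagged Ethernet, IPv4 without options, and a 1470-byte UDP payload (check what your -l setting actually is); it does not account for any ATM cell overhead inside the provider's EoATM network. The function name is just for illustration.

```python
def l2_rate_bps(payload_bps, payload_bytes=1470, on_wire=False):
    """Scale an Iperf UDP payload rate up to an Ethernet-level rate.

    Per-datagram overhead assumed here:
      IPv4 header 20 + UDP header 8
      Ethernet header 14 + FCS 4
      optionally preamble/SFD 8 + inter-frame gap 12 (on_wire=True)
    """
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    overhead = 20 + 8 + 14 + 4
    if on_wire:
        overhead += 8 + 12
    return payload_bps * (payload_bytes + overhead) / payload_bytes

if __name__ == "__main__":
    reported = 90_000_000  # hypothetical Iperf result in bits/s
    print(f"frame rate:   {l2_rate_bps(reported) / 1e6:.2f} Mb/s")
    print(f"on-wire rate: {l2_rate_bps(reported, on_wire=True) / 1e6:.2f} Mb/s")
```

Compare the on-wire figure with the 100 Mbps line rate, and the frame-level figure with whatever the provider's test equipment reported at Layer 2.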
Can you post your routing, and your bridging config?
When routing, the LAN interfaces must be a different subnet than the WAN, so the far end router must either be told statically or learn dynamically the path to that network.
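To illustrate that point, here is a small sketch of a longest-prefix-match lookup. All the addresses are hypothetical (10.1.0.0/24 for the Site A LAN, 10.2.0.0/24 for the Site B LAN, 172.16.1.0/30 for the WAN) and are not taken from your configs; "connected" just marks a directly attached network.

```python
import ipaddress

def lookup(routes, dst):
    """Return the next hop for dst using longest-prefix match, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None  # no route: the router drops the packet
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Site A router with only its connected networks (hypothetical addressing).
router_a = [
    (ipaddress.ip_network("10.1.0.0/24"), "connected"),    # Site A LAN
    (ipaddress.ip_network("172.16.1.0/30"), "connected"),  # WAN link
]

if __name__ == "__main__":
    print(lookup(router_a, "10.2.0.10"))  # no path to the Site B LAN yet
    # A static route (or a dynamically learned one) supplies the missing path.
    router_a.append((ipaddress.ip_network("10.2.0.0/24"), "172.16.1.2"))
    print(lookup(router_a, "10.2.0.10"))
```

The same has to hold in the other direction on the Site B router, otherwise the return traffic has nowhere to go.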
A Sniffer packet capture will give you more clues as to the cause of your slowness. Packet loss and retransmission are usually to blame. If you do see packet loss, a packet capture on the WAN side would help determine whether it is a service provider or a router issue. Of course the question then would be what is different about the frames when routing vs bridging, but the Sniffer could also help you there. The only way to capture packets on the WAN side would be to pass the traffic through a switch, which allows you to mirror (SPAN) the traffic to a port that your Sniffer is connected to.
What throughput do you get in mb/s?
Dave, Attached is the running config of Routers in Bldg A and B.
PS: I have now even taken out the Access Lists configured on the BLD_B router shown here, but I cannot post the most recent config as it's down right now with the ISP engineers working on it.
1) Is the EoATM link layer-2? I would assume since you can bridge but you know what happens when you assume.
2) If the link is Layer-2, why are the provider facing interfaces on different networks? I assume the "TO-Block-A" and "TO-Block-D" interfaces face the provider...the configs show BLD-A provider interface is 172.16.1.1 while the BLD-D ISP interface is 188.8.131.52. These need to be on the same subnet. Once that is sorted out then the static routes need to be corrected to reflect the proper subnet.
The reason the bridging is working is because the layer-3 issues are not affecting the traffic.
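A quick way to sanity-check the addressing is a sketch like the following. The two addresses are the ones quoted above; the /30 mask is purely my assumption for illustration, so substitute the masks from the actual configs.

```python
import ipaddress

def same_subnet(a, b):
    """True if two interface addresses (with masks) share one IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network

if __name__ == "__main__":
    # Addresses as quoted above; the /30 mask is an assumption.
    print(same_subnet("172.16.1.1/30", "188.8.131.52/30"))
    # What a matching pair on one /30 would look like (hypothetical far end).
    print(same_subnet("172.16.1.1/30", "172.16.1.2/30"))
```

Once both provider-facing interfaces pass this check, the static routes on each side can point at the other router's address on that shared subnet.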