I have a quick question, or maybe not so quick. We have been having network issues for the past week or so. This all started around the time a hacker attack overwhelmed our Catalyst (Cisco 6509) with a Supervisor 720. We have one IT person on staff and we subcontract through another IT company. Neither of them seems to be able to figure it out. We have ~75 machines that can download data from the outside (~3 Mb/s). Then we have ~50 machines that start to download OK (~3 Mb/s) but then drop to 10 kb/s and just hang at that speed forever. Both sets of machines have the same external IP address. The problem seems to occur only when connecting to one or two servers on the outside (http://nomads.ncep.noaa.gov). It looks like they are being throttled in some way, but the provider insists that they are not throttling us and has checked. I believe them, since some internal machines work. Does it make sense for the issue to be with the server we are connecting to, since some machines work and others don't? I have also contacted the ISP (TW Telecom), who checked the paths and said everything seemed to be OK on their end. Do you have any ideas what could cause something like this? It really doesn't make much sense at this point. Is it possible that some unseen switch got thrown on the Catalyst while it was being overwhelmed, to save the network, and that we just can't see it? And how would we correct it if so?
Thanks in Advance!!!
This is a fairly broad question without a lot of detail on your setup. Answers to the questions below might help figure this out.
1. Are the users directly connected to the 6509, or are they on different switches connected to the 6509?
2. Does the 6509 handle all the Layer 3 routing?
3. What's the committed rate for your Internet connection? Can it burst? Is it shared with other departments?
4. What type of firewall is sitting between your users and the Internet connection?
5. Are all the users on the same subnet, or are they split up?
6. Are the ones having the issue in the same subnet as those that work, or are they separated?
7. Have you tried testing from the PCs with the issue to different download sites besides the one above?
8. You mentioned a hacker attack. Was this some sort of virus? Were the PCs affected?
There are a lot of unknowns here.
1. The machines are connected in several ways: some go through additional switches while others are connected directly. Neither configuration seems to make a difference. Some machines work and others don't.
2. Not sure, I'll have to check.
3. We have a 100 Mb/s connection with around 50% utilization at the high end; we are usually around 30%.
4. I am checking on the firewall type.
5-6. There are around 6 or 7 different subnets with each having machines that work and some that don't.
7. We have tried many other places and all seem to work. Even an FTP connection to the server I mentioned before seems to be OK. The problem only seems to be on port 80.
8. The hacker broke into one of our external machines. From there he launched attacks on servers outside of our network. To our knowledge he only got to the outside servers and not on any of the now slow machines. While the attack was going on it overwhelmed our network.
I just checked: we have a Cisco ASA 5510 in place right now, and the 6509 does all the Layer 3 routing. There should be no proxy set up on the network. I have done traceroutes on most of the internal machines and get consistent results, with just 2 hops before the outside: one to the Catalyst and another to the ASA.
Well, the fact that only port 80 is affected seems to indicate more of a service-type issue than a network-type issue. You also mentioned it only seems to occur when you are connecting to "one or two servers on the outside". Have you contacted the remote agency hosting those devices? Have you verified that the machines with the issue aren't infected with some sort of malware? If you go to another website to download a file from one of the affected machines, does it perform OK?
Yes, we have checked the machines and they all look OK. They are all Linux servers running CentOS. We have been in touch with the data providers, but they insist that the problems are on our end and have pretty much stopped responding to questions. According to them, they have searched through all of their logs and see nothing out of the ordinary. I have also tested these downloads at my house and a couple of other places with no issues. The provider said they do not throttle anyone using the HTTP service, but limit FTP to 120 connections. What kind of service-type issue would make one machine work and another not, when both are on the same subnet? We ordered two or three new servers in the last week and none of them work, but if I bring one of the machines to my house it works fine.
At this point you may want to try to capture a session from one of the problem systems going to the problem destination. It seems very odd that it only affects some of your internal systems going to only one or two external destinations.
Are your internal hosts automatically connecting to the remote system or are you only seeing this when manually connecting? Will one system be fine while the other is slow? Also, do you know if the external destination is using some sort of load balancing?
One thing that can slow data transfers down is a TCP zero window. However, this usually only happens when the system in question is very busy.
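To make the zero-window idea concrete: every TCP segment carries a 16-bit window field telling the peer how much more data the sender of that segment can accept, and a value of zero tells the peer to stop sending until buffer space frees up. The sketch below is only an illustration of where that field sits in the fixed 20-byte TCP header. The ports, sequence numbers, and window values are made up, and window scaling is ignored (the raw field is reported as-is).

```python
import struct

# Fixed 20-byte TCP header in network byte order:
# src port, dst port, seq, ack, offset/flags, window, checksum, urgent pointer
TCP_HEADER = struct.Struct("!HHIIHHHH")


def advertised_window(tcp_header: bytes) -> int:
    """Return the raw 16-bit window field from a TCP header."""
    fields = TCP_HEADER.unpack(tcp_header[:TCP_HEADER.size])
    return fields[5]


def is_zero_window(tcp_header: bytes) -> bool:
    """True when the sender of this segment says its receive buffer is full."""
    return advertised_window(tcp_header) == 0


# Hand-built example headers (hypothetical values, not taken from a real capture).
flags_ack = (5 << 12) | 0x010  # data offset of 5 words, ACK flag set
normal = TCP_HEADER.pack(80, 40000, 1000, 2000, flags_ack, 29200, 0, 0)
stalled = TCP_HEADER.pack(80, 40000, 1000, 2000, flags_ack, 0, 0, 0)

print(advertised_window(normal), is_zero_window(normal))    # 29200 False
print(advertised_window(stalled), is_zero_window(stalled))  # 0 True
```

In a packet capture of a slow transfer, repeated segments with a window of zero from one side would point at that side's buffers rather than at the path in between.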
The machines this is running on are not busy at all; they are doing very little. We have had to switch all of our operational data ingestion to one of our working systems. Actually, my desktop system has issues even loading the web page on the server http://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.2013091306/ . My machine will sit there and eventually just time out with an unknown network error. This happens on one of the Windows systems as well. The problem happens whether I connect to the server manually or automatically. I do know that they are load balancing on their end with two separate servers. I can start 2 separate instances of my data grab code on a working machine and a broken machine at the same time, and they always behave the same way. The slow machines are slow and the fast machines are fast. Is it possible some switch got thrown on the Catalyst when the network was bogged down, to slow traffic to certain places and maintain the integrity of the network?
For the TCP zero window, I was thinking the remote host might be sending you the "lower your window size" message, since it may be the busy one.
If some switch got put onto the network, it wouldn't affect just HTTP to a single site. Most network-wide problems would affect all traffic, since switches only deal with Layer 2 and Layer 3 (if they are Layer 3 capable). If you pull some data from a different site on one of the normally slow machines, how does it perform? Also, where would someone put this switch if they were trying to resolve the issue? If you have physical access to your network devices, you could check your physical connections. Have you tried moving one of the problem servers to a switch where another server is working? Once again, I don't know your full setup.
Most of the servers are plugged into a patch panel and then directly into the Catalyst, with very few switches in the way, except for the desktop machines, which are on 2 separate switches. The Catalyst is then plugged into the ASA. There is only one server room at this location, which is where we are having the problem. We have an offsite colocation facility where there are no issues whatsoever. If we pull data from any other server we see normal speeds, even on the slow machines! We have tried moving slow machines from a known slow port on the Catalyst to another port where a machine was fast, but it made no difference. The only place to put another switch would be before the ASA, bypassing everything, and that would mean bringing down the entire network. We are trying to avoid network disruptions as much as possible. In your opinion, would it be a good idea to reboot the Catalyst? Are there any tools or software you can use to diagnose where the slowdown is occurring?
From everything you've mentioned, I don't think the issue is with your Catalyst switch. Does each server pull the same data? Perhaps the issue could be with what you're actually downloading. Do you have automated scripts? Could a server that is working OK be switched to downloading the content that one of the problem servers is having trouble with?
Yes, each server pulls the same data, but in a primary/backup configuration, so if one machine is down the other automatically takes over. We started pulling the problem datasets on one of the working machines with no problems. I also want to let you know that these broken machines used to work in the past but then suddenly stopped. The problem is that neither of our ingestion machines works, so we had to place this download burden on a machine that wasn't meant for it. The scripts are all automated and use wget to pull the files. My concern is that the cause is internal to us. I would expect that if it were an issue on the provider's end, I would have problems at my house, and other places that use this data would be having issues too.
So you're saying you can take one of these problem systems to a different location and it works OK? Or have you simply tried the same process elsewhere, but haven't actually tested it with one of the problem systems?
I physically took one of the machines to an outside network (my house) and it worked fine. As another test, we then swapped the ports of a bad machine and a good machine. We concluded that the port on the Catalyst doesn't matter, since the machines still acted the same way.
OK. So the issue seems to be with certain servers using wget to a specific web site, only on your internal network. Do you have logging enabled for wget? Are you doing recursive gets? I'm not a wget expert, but perhaps the issue lies in wget itself. Since you switched the process from the original servers to a temporary server to keep the jobs going, the issue really appears to be server or server-app related.
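As a rough sketch of what "logging enabled" could look like, the Python helper below builds a wget command line that writes a verbose log to a file. The function names, the log file name `wget.log`, and the output name `download.bin` are all hypothetical; the wget options used (`--debug`, `--server-response`, `--output-file`, `--output-document`) are standard ones, though `--debug` only produces output if wget was built with debug support.

```python
import subprocess


def wget_command(url, log_path="wget.log", out_path="download.bin"):
    """Build a wget command that writes a verbose log to log_path."""
    return [
        "wget",
        "--debug",                        # protocol-level detail, if compiled in
        "--server-response",              # record the HTTP headers the server returns
        f"--output-file={log_path}",      # write log messages here instead of stderr
        f"--output-document={out_path}",  # save the payload under a short, fixed name
        url,
    ]


def run_logged(url, **kwargs):
    """Run wget with logging and return its exit status (0 means success)."""
    return subprocess.run(wget_command(url, **kwargs)).returncode
```

Running the same command on a fast machine and a slow machine and comparing the two logs (response headers, which server IP answered, where the transfer stalls) may show whether the two are really getting the same treatment.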
One easy test to eliminate the network would be to run the HTTP job on the problem server against the problem destination and, at the same time, do an HTTP download from another site that doesn't cause you issues. This is more about eliminating what it isn't than tackling the actual problem. However, it can help narrow your focus.
I tested this on the same machine with the usual results: the test file is OK, while the real file slows down to 5 KB/s.
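The side-by-side test described above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library, not the wget-based scripts actually in use; the function names and the 60-second cap are assumptions chosen for illustration. Both downloads run at the same time from the same machine, so they share the same host, switch port, and firewall path.

```python
import io
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def measure(stream, max_seconds=60.0, chunk_size=16 * 1024):
    """Read a file-like object until EOF or max_seconds; return (bytes, seconds)."""
    start = time.monotonic()
    total = 0
    while time.monotonic() - start < max_seconds:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        total += len(chunk)
    return total, time.monotonic() - start


def fetch_rate(url, timeout=30):
    """Download url (for up to 60 s) and return the average rate in KB/s."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        total, seconds = measure(resp)
    return total / 1024 / max(seconds, 1e-9)


def compare(urls):
    """Fetch all URLs concurrently and return {url: average KB/s}."""
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return dict(zip(urls, pool.map(fetch_rate, urls)))


# Offline sanity check on an in-memory stream (no network needed).
size, _ = measure(io.BytesIO(b"x" * 100_000))
print(size)  # 100000
```

Calling `compare([...])` with the problem URL and a known-good test file on a slow machine, and again on a fast machine, gives four numbers to compare. If only the problem URL on the slow machine comes out low, the local network path is an unlikely culprit for everything else.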
# Test File Works Fine
--2013-10-07 13:38:46-- http://ipv4.download.thinkbroadband.com/20MB.zip
Resolving ipv4.download.thinkbroadband.com (ipv4.download.thinkbroadband.com)... 184.108.40.206
Connecting to ipv4.download.thinkbroadband.com (ipv4.download.thinkbroadband.com)|220.127.116.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20971520 (20M) [application/zip]
Saving to: ‘20MB.zip’
100%[==========================================================================================================================================================================>] 20,971,520 2.74MB/s in 13s
2013-10-07 13:38:59 (1.59 MB/s) - ‘20MB.zip’ saved [20971520/20971520]
# Real File Broken
Connecting to 18.104.22.168:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘filter_nam.pl?file=nam.t00z.awphys00.grb2.tm00&all_lev=on&leftlon=0&rightlon=360&toplat=90&bottomlat=-90&dir=%2Fnam.20130930.1’
[ <=> ] 5,647,087 5.64KB/s
Regardless of the fact that it worked in the past, I would Google wget and cgi-bin. I saw some info on wget and having issues with cgi-bin. I'm not saying this is the issue, but you might want to check it out.