Hello. We have a server in our data center that is responsible for sending patches to our 55 regional offices, and the patch software has no way to throttle bandwidth while sending. As a result, the server saturates the primary WAN connection into the data center whenever it sends updates, since that link is only 40 Mb/s while the server is connected at gigabit.
There is no QoS in place yet, so to stop this from happening we simply hard-set the port to 10 Mb/s, full duplex, and set the server NIC to match. This solved the problem of link saturation, but now when there is heavy traffic in the data center (like during backups), the server drops packets and misses pings. Here are the stats from the port on the Catalyst 4948 it is plugged into:
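For reference, the hard-set on the switch side looks something like this (Gi1/4 per the port stats below):

```
interface GigabitEthernet1/4
 speed 10
 duplex full
```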
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 2247155445
Port Tx-Drops-Queue-1 Tx-Drops-Queue-2 Tx-Drops-Queue-3 Tx-Drops-Queue-4
Gi1/4 2247155445 0 0 0
The settings match on the server and the switch port, and we know it isn't a cable problem since it runs fine when running gig. Any ideas what might cause the output drops?
Try putting shaping on the interface that limits the traffic to 10 Mbps, and set both interfaces back to 1000/Full.
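Since the traffic you need to limit is the server's own uploads, an input policer on the server port would cap the rate without slowing the link itself. A rough sketch, assuming your IOS train supports MQC policing on the 4948 (class/policy names are made up, and the exact rate/burst syntax varies by version, so verify before applying):

```
class-map match-all PATCH-ALL
 match any
!
policy-map LIMIT-10M
 class PATCH-ALL
  police 10000000 187500 exceed-action drop
!
interface GigabitEthernet1/4
 speed 1000
 duplex full
 service-policy input LIMIT-10M
```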
From your description, the output drops are likely congestion-related drops from overflow of the port's transmit buffer/queue.
If this is the port that's actually connected to the "patch" server, and it is running at 10 Mbps, it would not be difficult for other LAN hosts with 100 Mbps or gig connections to overrun this port when sending to the "patch" server.
This is indeed the port connected to the patch server running 10 Mb. The odd thing is that there is no traffic to or from this server when the drops are occurring. The drops are occurring when other servers connected to the same switch are being backed up. This server is not even backed up.
It would be very odd, indeed, if there are drops when there is no traffic at all being sent to the "patch" server, especially if other ports do not show any drops.
What might be happening: a very, very low volume of traffic directed to the "patch" server might be encountering drops when the 4948 is busy with other traffic, assuming the 4948 somehow shares its buffer space across multiple ports and allocates even less for ports running at 10 Mbps. I'm unfamiliar with how the 4948 physically operates, both with regard to its software and hardware architecture.
OK, a little more info on this. I put a sniffer on the patch server port last night, and the problem is indeed related to the 10 Mb port being overrun with traffic. The strange thing is this: the port is seeing tons of traffic from several different servers, but the traffic is destined for the backup server. I'm not entirely sure how this is possible.
The servers that are being backed up are on subnet X, and the patch server and the backup server are both on subnet Y. For some reason, the port the patch server is plugged into is seeing all backup traffic destined for the backup server, which is on the same L2 vlan. I've checked the obvious things, like port mirrors, etc. I've got a hunch that this isn't related to the patch server port only, as we've had some strange things happen on the network lately that this type of broadcast traffic would explain.
Any ideas what could cause this?
Ah, good to know the drops are in fact likely from the port being overrun.
As to the cause of the port being overrun, do you think unicast flooding is possible within your topology?
It seems likely; specifically, the case study mentioned here:
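For context, the mechanism in that flooding scenario: the router's ARP entry (default 4 hours on IOS) outlives the switch's CAM entry (default 300 seconds), so once the backup server's MAC ages out of the CAM table, frames still being routed toward that MAC are flooded out every port in the VLAN, including a 10 Mb port. You can check both timers and the entry itself with something like the following (the MAC and IP are placeholders, and older IOS trains spell it mac-address-table):

```
show mac address-table aging-time
show mac address-table address 0011.2233.4455
show ip arp 192.0.2.50
```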
Would a good test be to set up a continuous ping from the server that is flooding to the backup server? That way there would be no need for the flooding, since the entries would remain in the appropriate tables?
If the problem does not occur, would the fix be to set the MAC aging higher than the ARP timer? Is there a Cisco-recommended setting for these?
Trying a continuous ping would seem to be an easy and simple test. If you do, note that the default CAM aging timer is about 5 minutes, so a ping interval of a few minutes, shorter than the CAM default, should do.
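If you would rather generate the keepalive from the network side than from a server, an IP SLA probe can send the periodic ping instead. A sketch, assuming your IOS train supports ip sla (older trains use the ip sla monitor syntax, and the address is a placeholder for the backup server):

```
ip sla 10
 icmp-echo 192.0.2.50
 frequency 240
!
ip sla schedule 10 life forever start-time now
```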
The timer approach sets the CAM timer to be equal to or greater than the ARP timer. Your reference recommends increasing the MAC timer; this reference, http://www.cisco.com/en/US/docs/solutions/Enterprise/Campus/HA_campus_DG/hacampusdg.html#wp1108684, recommends decreasing the ARP timer to be equal to or less than the CAM timer (or avoiding asymmetric routing altogether).
OK, I scheduled a ping to run from the server every 4 minutes, and we haven't seen the problem since I set it up 3 nights ago. The following statement from the doc you referenced says I should change the CAM timer to be equal to the ARP timer:
"The preferable method is to change the MAC aging time to 14,400 seconds."
Will this cause any adverse effects when it is implemented (any sort of outage)? Should I do this on all VLANs and switches in my data center, or only the ones we seem to be experiencing this with?
Actually, the document you referenced recommends increasing the MAC aging time, while mine says: "If you must implement a topology where VLANs span more than one access layer switch, the recommended work-around is to tune the ARP timer to be equal to or less than the CAM aging timer." Either timer approach should work; getting the two timers in sync is the work-around resolution.
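Either adjustment is a small change. A hedged sketch of both options (the VLAN number is a placeholder, and older IOS trains use mac-address-table with a hyphen):

```
! Option 1: raise MAC aging to match the 4-hour ARP timer
mac address-table aging-time 14400

! Option 2: lower the ARP timeout on the SVI to match the 300 s CAM aging
interface Vlan100
 arp timeout 300
```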
As to adverse effects, I don't have first-hand experience making this change. However, I wouldn't think it should cause an adverse effect. If no one else comments, you might post that as an explicit question.
I'm glad to read the scheduled ping seems to "cure" the problem. Seems to confirm what we believe is happening.