I have a point-to-point SONET OC-3, both ends terminated by a 7206VXR router. The circuit is only operating at about two-thirds of capacity (roughly 100 Mb/s) and I am getting a ton of output drops. I have included the "show interface" for the POS interface below; utilization was not high at the time of capture. Has anyone ever seen this before?
POS1/0 is up, line protocol is up
Hardware is Packet over Sonet
Internet address is 192.168.2.1/24
MTU 4470 bytes, BW 155000 Kbit, DLY 100 usec,
reliability 255/255, txload 2/255, rxload 1/255
Encapsulation HDLC, crc 32, loopback not set
Keepalive set (10 sec)
Last input 00:00:01, output never, output hang never
Last clearing of "show interface" counters 1d16h
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 43764
Queueing strategy: random early detection(RED)
30 second input rate 35000 bits/sec, 73 packets/sec
30 second output rate 1482000 bits/sec, 136 packets/sec
245406033 packets input, 2578976581 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
480392843 packets output, 2355422897 bytes, 0 underruns
0 output errors, 0 applique, 1 interface resets
0 output buffer failures, 0 output buffers swapped out
0 carrier transitions
Your "show interface" output shows: "Queueing strategy: random early detection(RED)"
It could be that the RED algorithm is proactively dropping packets on the output of the interface to avoid congestion. If you are absolutely sure that the link does not have bursts and transient congestion issues, perhaps you could check that your RED configuration parameters are not too aggressive and tune them a bit if necessary.
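To illustrate what "tuning" could look like (a sketch only — the interface name matches the output above, but the threshold values are assumptions, not taken from the poster's configuration), the legacy interface-level WRED thresholds can be raised so that early drops start later in the queue:

```
! Illustrative values only: raise the WRED minimum/maximum thresholds
! so random early drops begin later, and lower the drop probability.
! Syntax: random-detect precedence <prec> <min-thr> <max-thr> <mark-prob-denom>
interface POS1/0
 random-detect
 random-detect precedence 0 50 100 10
```

With a mark-probability denominator of 10, at most 1 in 10 packets is dropped even when the average queue depth reaches the maximum threshold; appropriate thresholds depend on your traffic mix.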
Initially I had FIFO as the queueing strategy and the packet loss was worse. The line has never exceeded 100 Mb/s, so I would think there should be zero packet loss unless the utilization exceeds that of an OC-3.
I will check with the provider; perhaps they are having issues on the line. I appreciate the response.
The output that you posted shows very low bandwidth utilization and packets/second counters. When your link experiences such a problem again and you see a 100 Mb/s maximum, make sure you check the packets/second counter as well. It could be that you have too many packets, which could indicate an attack, for example. Do you have any other issues on this router? High CPU utilization, for example?
Edit: Your interface does not show physical layer error symptoms. And you are dropping the packets on your output before they ever reach the provider, so I tend to believe this issue has to do with your equipment.
OK, I will try to get the output at a time of higher utilization and will note the packets/sec.
During the 100 Mb/s utilization, the CPU usage was approximately 7%. The circuit is strictly used for replication and backup, so an attack is extremely unlikely.
It is not uncommon to have drops during the times when backups are taken. At such times the traffic tends to burst, and you might not be able to capture the bursts in the output, which only shows a 30-second average. Try to pace the backup procedures if possible. In any case, TCP-based applications actually rely on drops to determine the rate at which they should operate. The fact that you see a difference when you enable RED points in that direction (with RED the application finds the proper rate at which to exchange its traffic sooner, and so avoids the excessive drops seen with FIFO). The fact that RED reacts shows that the queue on the output interface is stressed, even though from a bandwidth perspective there seems to be no problem. You could try a "show interfaces random-detect" to confirm that RED is actually dropping the traffic.
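If pacing at the router itself is an option, a minimal MQC sketch could shape the backup traffic below the OC-3 rate and absorb bursts in a deeper queue. This is an illustration only — the class name, the access list number, and the 90 Mb/s figure are assumptions, not from the original poster's configuration:

```
! Assumed: access-list 101 matches the backup servers' traffic.
class-map match-all BACKUP
 match access-group 101
policy-map PACE-BACKUP
 class BACKUP
  shape average 90000000      ! shape to ~90 Mb/s, safely under the OC-3 rate
interface POS1/0
 service-policy output PACE-BACKUP
```

Shaping delays bursts in the policy queue instead of letting them overflow the interface output queue, so the senders back off on latency rather than on drops.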
To add to the good advice given here: the "ton of drops" is actually less than 1 in 10,000 if you do the math (43,764 drops against 480,392,843 output packets is roughly 0.009%). That is very, very little.
What happens is that you have bursts of traffic, and no queueing technique can eliminate drops completely. Such is the nature of networking, and there is nothing wrong with that.
Good point from Paolo. Perhaps the ton of drops occurred before the counters were cleared. The output posted was not very indicative of the problem described, since it was captured at a time of very low traffic levels.
If you have gigabit paths from the machines that do the backup towards your OC3, then at the time the backup begins you might see tons of drops, because the gigabit path encourages ambitious transmission. The higher the bandwidth of that path, the more traffic you will see dropped at the bottleneck. It takes some time for your application to realize that drops are occurring somewhere and lower its rate; by that time there could be Mbits outstanding on the wires, and some of those Mbits will naturally be dropped. Because we are talking about lots of Mbits (that is on the order of 10^6, which we could loosely refer to as a ton, the same way we could refer to Kbits as kilograms :-) ), or even Gbits, it is normal to see the tons you are describing until the application learns its lesson and slows down.
Whenever you have procedures that can be controlled, it is a good idea to pace them and avoid unnecessary stress on your devices, especially if internal devices with gigabit connections will be sending traffic towards your precious WAN links. It is not uncommon to rate-limit some internal devices to avoid such issues.
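As a sketch of such a rate limit (the LAN interface name and the 90 Mb/s rate are assumptions for illustration), legacy CAR on the router's LAN-facing interface can cap what a gigabit-attached backup server may push toward the WAN:

```
! Illustrative CAR policy: limit inbound LAN traffic to ~90 Mb/s.
! Syntax: rate-limit input <bps> <normal-burst-bytes> <max-burst-bytes> ...
interface GigabitEthernet0/1
 rate-limit input 90000000 1687500 3375000 conform-action transmit exceed-action drop
```

The burst values here follow the common rule of thumb of roughly 0.25 to 0.5 seconds' worth of traffic at the configured rate; they should be tuned to the actual traffic pattern.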
I just wanted to know why the MTU size is "MTU 4470 bytes" in the output provided.
The default MTU size is 1500.
Are there any specific values that need to be configured on the interfaces for serial / OC-3, etc.?
The default for an OC-3 interface is 4470; ATM uses a similar size.
The default for the original Ethernet is 1500, the same as for the Internet in general.
Then we have Gigabit Ethernet, which also supports more than 1500.
So a large MTU is good, because a router connected to a gigabit host that can send large packets (jumbo frames) will not fragment them, and efficiency is improved.
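For reference, the MTU can be checked and, if needed, set explicitly; both ends of a point-to-point link should agree. The commands below simply restate the default already shown in the posted output:

```
! Verify the current MTU (exec mode):
!   show interfaces pos1/0 | include MTU
! Set it explicitly if needed (both ends of the link must match):
interface POS1/0
 mtu 4470
```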
That is correct; the output does not reflect the actual percentage of drops, because the high utilization had stopped.
As I had previously posted, when the traffic utilization is not 100 Mb/s, the drops stop. This changes the statistics that Paolo had mentioned, because the traffic counters continue to climb with no drops.
The backup machines only transmit at 100 Mb/s, so I don't see how this traffic would burst beyond the limitations of the OC-3.
The drops reported are total output drops, not a (weighted) time average (it is not like the bps counter). The only way for this number to decrease (besides you clearing the counters) is for the relevant counter to overflow.
As I said previously, the bandwidth may be low, but the queue might still be stressed: there can be a mix of small and large packets in it. A small packet can "waste" bandwidth by taking up a queue entry; if that entry were used by a larger packet, more transmission would be possible.
Anyway, you might be interested in checking with Cisco about the possibility of a bug like the one mentioned by another participant in this discussion.
Hi smarotta82, we actually found the same issue some days ago. Are you using an NPE-G2 with a 12.4.11T-train IOS? We have done tests and found that this IOS train has an issue with POS throughput (we also saw higher processor loading). You can try the 12.4.4XD train or 12.4.15T1 IOS.
I have reported this issue to Cisco, but they just asked me to contact local sales as I don't have a support contract. I am not sure their engineers know about this yet.
Which NPE are you using? Proven problems like the one reported by Calvin Chu tend to happen with certain hardware but not other hardware, due to architectural reasons.
Your issue could be a bug, but it might not be the same as the one already mentioned. That is why I said previously to check with Cisco about a bug like the one mentioned. Some bugs look similar but are not exactly the same, which means a different workaround or a different fixed software version might be needed.
The problem with your topology is that you cannot feed the OC3 with enough traffic to see how far the throughput can go even with drops. Unless your 7200 has a gigabit interface (this is not clear to me; you only mentioned the speed of the backup machines), in which case you could perhaps direct additional traffic towards the OC3 to confirm throughput issues. So far your throughput does reach the expected maximum given the speed of the feeding backup machines. So far I only see "drop issues", which may or may not be a bug (if, for example, a significant percentage of the traffic consists of small packets, the small packets might be filling your output queues). Another difference I see is the CPU clue: the previous post that mentioned a bug spoke of "higher processor loading" without stating how much, while you said you had monitored your CPU during your issue and it was rather low (7%).
I hope this will help you in your search for a similar bug and a workaround for your own issue.
p.s. If you find a solution to your issue, it would be kind of you to report back to us how it was resolved. Finally, let me mention that I find the somewhat unusual way you troubleshot this, by enabling RED on your router, pretty cool.
I think he has enough traffic, as he sees it peak at 100 Mb/s. Anyway, the NPE-G2 has 3 GE ports on board, so it shouldn't be a problem to generate some more traffic on the link if the topology supports it.
Regarding the "higher processor loading": I want to mention that we have a policy-map on the link and BGP, OSPF, etc. on the production router. The loading dropped from 8x% to 6x% after we changed the IOS; I can't say which part caused the extra loading. But on the test bed, the loading was very low when we confirmed the POS throughput issue. These were the major issues we found in the 12.4.11T train. I didn't mean that the loading caused the throughput issue.
I also tried searching for a Cisco bug before, but didn't find one. Maybe the keywords I used were not correct, or maybe Cisco just doesn't know about it yet. So I reported it to Cisco but got rejected.
I would suggest replicating the problem to confirm whether it is a link or an IOS issue, if you have spare equipment. Or do a quick IOS change on the router, since it is a backup link. 12.4.4XD10, 12.4.4XD7, and 12.4.15T1 were OK in our tests. Also, tell us the IOS version you are currently running, so I might be able to test here as well.
According to Kalvin, the issue he reported is resolved in the IOS version you are running. Nevertheless, Kalvin has shared a lot of useful information with us, and I believe you should rate his post the same way I did.
Thanks. One last thing I want to confirm is the IOS feature set. Are you running the Advanced IP Services feature set? If yes, then it is probably not the same issue as mine.