Strange packet loss on 29XX products and QNX OS

Jan 27th, 2010

Hi guys,

We've got a strange problem that we've been troubleshooting for weeks with Cisco 2960G, 2950 and 2940 switches, no matter which IOS release we use.  Our application runs under QNX 6.4.1, with Intel Pro 1000 adapters in every system.

Let's say we start with a small 8-port Cisco 2940 switch, which has one Gigabit port.

We have 4 systems on the 100 Mbps ports and one "receiver host" on the Gigabit port.  The 4 systems send a lot of data to the Gigabit receiver host, and the Gigabit port appears to run at around 15% utilization.

At the TCP/IP level, however, we are losing some packets as if they were never sent, and when I do a "show int gi0/1" the only error counters I see increasing steadily are "no buffer" and "ignored".  They increase by about a dozen every second while our application is running, and do not increase when it is not.  There are no other errors, no duplex problems or anything else.  Tests are done in an isolated environment with no other workstations or network noise.
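To put a number on "a dozen every second", one option is to capture the interface counters twice and compare them. The following is only a minimal sketch in Python: the two snapshot strings are made-up excerpts in the style of "show interfaces" output, and the helper names `parse_counters` and `rates_per_second` are hypothetical, not part of any Cisco tool.

```python
import re

# Hypothetical excerpts in the style of "show interfaces" output.
# The values are invented for illustration, taken 10 seconds apart.
SNAPSHOT_T0 = """\
     1200345 packets input, 912345678 bytes, 120 no buffer
     118 input errors, 0 CRC, 0 frame, 0 overrun, 118 ignored
"""
SNAPSHOT_T1 = """\
     1260345 packets input, 958345678 bytes, 240 no buffer
     236 input errors, 0 CRC, 0 frame, 0 overrun, 236 ignored
"""

# Input-side counters of interest; each appears as "<number> <name>".
COUNTERS = ("no buffer", "ignored", "CRC", "overrun")


def parse_counters(text):
    """Return {counter name: value} for every counter found in the text."""
    found = {}
    for name in COUNTERS:
        match = re.search(r"\b(\d+) " + re.escape(name) + r"\b", text)
        if match:
            found[name] = int(match.group(1))
    return found


def rates_per_second(before, after, interval_s):
    """Return the per-second increase of each counter present in both snapshots."""
    first, second = parse_counters(before), parse_counters(after)
    return {
        name: (second[name] - first[name]) / interval_s
        for name in first
        if name in second
    }


if __name__ == "__main__":
    for name, rate in rates_per_second(SNAPSHOT_T0, SNAPSHOT_T1, 10.0).items():
        print(f"{name}: {rate:.1f} per second")
```

If only "no buffer" and "ignored" move while CRC and overrun stay flat, that matches the picture described above: frames arriving intact but being dropped for lack of an input buffer.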

We contacted QNX and they told us the switch was the problem.  At first we didn't believe them, but they demonstrated it to us with a basic 3COM switch: there was no packet loss at all.

They also told us that Cisco's PFC (Priority Flow Control) might be causing the problem.  With this information, we tried disabling "flowcontrol" in both directions on every GigE port, but the problem remains.

Could the PFC really be the problem?  Where should I look?  Can I disable the PFC?

Thanks a lot,

JFG

Kevin Dorrell Thu, 02/18/2010 - 03:25

When you had flow control enabled, were the counters showing any "pause" frames?

I am a bit confused by this, because you say the main data flow is towards the Gi0/1 port.  Yet I think the "no buffer" and "ignored" errors relate to received frames, not transmitted ones.  That seems to imply that it is the incoming TCP ACKs that are getting lost rather than the bulk outgoing data.

So what happens if the "receiver host" sends you a TCP packet but you don't have a buffer to receive it into?  You ignore it.  If you have no buffer available, you can try sending a flow-control "pause" to hold the other end up.  But if the other end does not understand the "pause", then there is nothing you can do about it.

So the question is, does the host NIC understand "pause"?
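For reference, an 802.3x "pause" is just a small MAC Control frame: it goes to the reserved multicast address 01-80-C2-00-00-01 with EtherType 0x8808 and opcode 0x0001, and carries a 16-bit pause time counted in quanta of 512 bit times. Here is a minimal Python sketch of that layout. The function names and the source MAC are hypothetical, and the frame is only built in memory, not sent on a wire.

```python
import struct

PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
MIN_FRAME_NO_FCS = 60  # 64-byte minimum Ethernet frame minus the 4-byte FCS
BITS_PER_QUANTUM = 512


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build an 802.3x PAUSE frame (without FCS) asking the peer to pause."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta
    )
    # Zero-pad up to the minimum frame size.
    return header.ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_seconds(quanta: int, link_bps: int) -> float:
    """Convert a pause time in quanta to seconds for a given link speed."""
    return quanta * BITS_PER_QUANTUM / link_bps


if __name__ == "__main__":
    # Hypothetical, locally administered source MAC used only for illustration.
    frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
    print(frame.hex())
    print(f"max pause at 1 Gbps: {pause_seconds(0xFFFF, 1_000_000_000) * 1000:.2f} ms")
```

A quanta value of 0 tells the peer to resume immediately. Whether any of this helps depends on both ends actually honoring such frames, which is the question above.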

Kevin Dorrell

Luxembourg

jfgrenier Thu, 02/18/2010 - 04:09

Hello Kevin,

We found out a bit more about this problem...

QNX uses its own protocol (qnet), a kind of stripped-down layer 2 protocol that relies totally on link-layer flow control.  Basically, it needs "lossless ethernet" to work.

I also found out that on Cisco 2950 (XX50) products it was possible to enable both 'send' and 'receive' flowcontrol (802.3x).  This changed in the 2960 (XX60) products: only 'receive' flowcontrol can be enabled, so the switch accepts PAUSE frames but won't send them anymore...

So, that's our problem right now.  We need "lossless ethernet" to work flawlessly with the QNX qnet protocol.

Thanks for your reply!

Bye,

JFG

Kevin Dorrell Thu, 02/18/2010 - 05:28

Right ... so the switch has no way to tell the server "hold up, I don't have a buffer at the moment".  It looks like an intractable problem.  I get really uncomfortable when system designers require "lossless Ethernet".  The Ethernet standard was never designed to be lossless, just as IP was never designed to be reliable.  UDP is even jokingly called the "Unreliable Datagram Protocol".  All these connectionless protocols should be able to rely on the higher levels for loss recovery, but in this case it appears the higher level is not up to it.

Good luck.

Kevin Dorrell

Luxembourg
