Nexus 7010 intermittently dropping packets for certain SRC/DST IP

seabird505
Level 1

Hi,

We have a pair of Nexus 7010s acting as the data centre gateways. They connect to a pair of Nexus 5000 aggregators (using VPC), which in turn serve the end devices. The N7Ks have separate VRFs for the outside world (named EXTERNAL) and for the data centre (named INNER); there are other VRFs, but they are not relevant to this discussion. Inter-VRF routing is done by a firewall, which implements the data centre traffic policies (what's allowed and what's not). A traffic flow from a client to the server looks like this:

Client   --intranet-->   AGW --L3-->   N7K-02   --L2Trunk-->   FW   --L2Trunk-->   N7K-01   --L2Trunk-->   N5K-02   --L2Trunk-->   Server

The FW hangs directly off both N7Ks via multiple trunk links (which are in a VPC on the N7K end). As shown, a packet from the Access Gateway (AGW) hits the first N7K, gets routed by the FW and then reaches the second N7K. Via the VPC link it reaches the first N7K, then it goes through the second N5K and reaches the virtualized UCS server. This is the forward traffic path only.

Now the problem: intermittently, a SYN packet from the client to the server is dropped at the trailing N7K-01. I say dropped because it is not captured on N7K-01 on the link towards N5K-02, and a capture on N5K-02 confirms it is not receiving any. Out of 100 iterations of the client making a complete TCP connection to the server, about 5-8% of the connections have this fate. The client is configured with a very long TCP connect() timeout, so we sometimes see one, two, three or even more SYNs dropped before that particular iteration succeeds. Mostly it is one SYN getting dropped, but in one of the earlier reported cases the client reported a transaction time of 189 seconds, indicating that 6 SYNs (exponential TCP connect() backoff) of the same session were lost. Other packet types may also be getting dropped, but not in large numbers.
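To show where the "6 SYNs" estimate comes from: the 189-second figure fits six lost SYNs if the client's TCP stack starts with a 3-second retransmission timeout and doubles it after every retry. The 3-second starting value is an assumption (the client OS and its timers are not stated here), and the function names below are made up for illustration. A rough sketch of the arithmetic:

```python
def syn_backoff_delay(lost_syns, initial_rto=3.0):
    """Total seconds spent waiting when the first `lost_syns` SYNs are dropped.

    Assumes classic exponential backoff: the timeout starts at
    `initial_rto` seconds and doubles after every retransmission.
    The 3-second default is an assumption, not a measured value.
    """
    return sum(initial_rto * 2 ** i for i in range(lost_syns))


def lost_syns_for_delay(observed_seconds, initial_rto=3.0):
    """Largest number of lost SYNs whose cumulative wait fits in the observed time."""
    lost = 0
    while syn_backoff_delay(lost + 1, initial_rto) <= observed_seconds:
        lost += 1
    return lost


# Cumulative wait for 1..6 lost SYNs: 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds.
for n in range(1, 7):
    print(n, syn_backoff_delay(n))
```

Under that assumption, a connect time of roughly 3 seconds means one lost SYN, roughly 9 seconds means two, and so on, which makes it possible to read the loss count straight off the client's transaction times.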

While this may at first suggest network congestion or errors, we don't have congestion issues or packet loss in general in the data centre. This only happens from certain clients to certain servers, and it happens intermittently. The same client going to a different IP address on the same server is always successful, 100% of the time. Also, a different client going to the same server IP is always successful. Also, after upgrading the FW last week, which required a reboot of the device and severed its links with the N7Ks, the N7Ks now seem to exhibit this intermittent behaviour for a different set of client/server IP combinations.

Any help will be greatly appreciated. Thanks for your time.

Client app used in troubleshooting:      tnsping     (other protocols may suffer too, but we haven't done any testing on that)

Server:      TNS Listener TCP 1521 port
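For anyone without tnsping who wants to reproduce the test, a plain TCP connect loop against the listener port should show the same symptom, since only the handshake matters here. This is a rough sketch and not the tool we used; the function names, the 1-second "slow" threshold and the 200-second timeout are all assumptions chosen for illustration.

```python
import socket
import time


def time_connect(host, port, timeout=200.0):
    """Return seconds taken to complete one TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # The connection is closed as soon as the handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def probe(host, port, iterations=100, slow_threshold=1.0, timeout=200.0):
    """Run repeated connects and list the iterations that were slow or failed.

    Each entry is (iteration_index, elapsed_seconds_or_None). A connect that
    takes seconds on a LAN usually means at least one SYN was retransmitted.
    """
    slow = []
    for i in range(iterations):
        elapsed = time_connect(host, port, timeout)
        if elapsed is None or elapsed >= slow_threshold:
            slow.append((i, elapsed))
    return slow
```

Calling probe("<server-ip>", 1521) from an affected client should return a handful of slow entries out of 100 if the problem is present, and an empty list from an unaffected client.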

! On Nexus 7010
# show version
Software
  BIOS:      version 3.22.0
  kickstart: version 5.1(3)
  system:    version 5.1(3)
  BIOS compile time:       02/20/10
  kickstart image file is: bootflash:///n7000-s1-kickstart.5.1.3.bin
  kickstart compile time:  12/25/2020 12:00:00 [03/11/2011 18:42:56]
  system image file is:    bootflash:///n7000-s1-dk9.5.1.3.bin
  system compile time:     1/21/2011 19:00:00 [03/11/2011 19:37:35]
Hardware
  cisco Nexus7000 C7010 (10 Slot) Chassis ("Supervisor module-1X")
  Intel(R) Xeon(R) CPU         with 4115812 kB of memory.
  Processor Board ID JAF1414AADD
  Device name: <<<device-name>>>
  bootflash:    2000880 kB
  slot0:              0 kB (expansion flash)
plugin
  Core Plugin, Ethernet Plugin

Regards, Rashid.


seabird505
Level 1

For the benefit of others, here is what we found. The N7K was hitting the bug CSCtg95381.

Symptom:

Nexus 7000 may punt traffic to CPU; so that the traffic may experience random delay or drop.

Further looking, ARP is learned and FIB adjacency is in FIB adjacency table.

Conditions:

The problem is caused by race condition. Some hosts have not responded to the ARP refresh sent by N7k which in turn trigger to delete ARP entry due to expiry. As a result the route delete notification is sent to URIB from the process. However there is still traffic coming to given IP address as a result the next packet that hit glean resulting in triggering ARP and hope ARP is learnt from the host this time.

Workaround(s):

Clear ip route <host>.

This does not fully explain why it was working for certain client-server combinations, but the workaround is holding well for the end-points where it has been implemented.

Before the workaround, there was no host route for the destination server in the adjacency manager on N7K-01. The only thing there was the subnet route pointing towards the VLAN gateway address. After implementing the workaround, a new /32 route can be seen in the adjacency manager for the server.

The bug is fixed in releases starting with 5.1(5). We are planning to upgrade to 5.2(3a).

Regards, Rashid.

We ran into a situation similar to yours last week. I was ready to blame the other vendor until my boss saw the same behavior you did and confirmed that the 7K was the root cause.

The Nexus product is not yet ready for prime time.

Do you have any update on this? We experienced the same issue and have been waiting more than 3 weeks now for Cisco TAC to respond.

https://supportforums.cisco.com/thread/2148874

Hi, we are running into a similar problem: packets with a different source address are getting dropped by the Nexus 7K. We are running version 6.2.2 of the code, and we have confirmed that it is the Nexus 7K that is dropping the packets.

A few packets sent to the same destination but with a different source address get dropped. So do you think clearing the IP route for the destination would work here?

Any comments or suggestions are greatly appreciated!

 
