Nexus 7010 intermittently dropping packets for certain SRC/DST IP
We have a pair of Nexus 7010s acting as the data centre gateways, which connect to a pair of Nexus 5000 aggregators (using VPC), which in turn serve the end devices. The N7Ks have separate VRFs for the outside world (named EXTERNAL) and for the data centre (named INNER); there are other VRFs, but they are not relevant to this discussion. Inter-VRF routing is done by a firewall, which implements the data centre traffic policies (what's allowed, what's not). A traffic flow from a client to the server will look like this:
The FW hangs directly off both N7Ks via multiple trunk links (which are in a VPC on the N7K end). As shown, from the Access Gateway (AGW), a packet hits the first N7K, gets routed by the FW and then reaches the second N7K. Via the VPC link, it then reaches the first N7K. From there it goes through the second N5K and reaches the virtualized UCS server. This is the forward traffic path only.
Now the problem: intermittently, a SYN packet from the client to the server is dropped at the trailing N7K-01. I say dropped because it is not captured on N7K-01 on the link towards N5K-02, and a capture on N5K-02 confirms that it never arrives there. Over 100 iterations of the client making a complete TCP connection to the server, about 5-8% of the connections suffer this fate. The client is configured with a very long TCP connect() timeout, so we sometimes see one, two, three or even more SYNs dropped before that particular iteration succeeds. Mostly it is a single SYN that is dropped, but in one of the earlier reported cases the client reported a transaction time of 189 seconds, indicating that 6 SYNs of the same session (exponential TCP connect() backoff) were lost. Other packet types may also be getting dropped, but not in large numbers.
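The 189-second figure is consistent with six lost SYNs if the client's TCP stack starts with a 3-second SYN retransmission timeout and doubles it after every unanswered SYN. That initial value is an assumption (it is not stated in this post, and some stacks start at 1 second instead); the function names below are made up for illustration. A minimal sketch of the arithmetic:

```python
def syn_send_times(initial_rto=3.0, retries=6):
    """Return the times (seconds after the first SYN) at which each SYN is sent.

    Assumes plain exponential backoff: the timeout doubles after every
    unanswered SYN. initial_rto=3.0 is an assumed value, not one read
    from the client's actual configuration.
    """
    times = [0.0]
    rto = initial_rto
    for _ in range(retries):
        times.append(times[-1] + rto)
        rto *= 2
    return times


def lost_syns_for_delay(delay, initial_rto=3.0, max_retries=10):
    """Estimate how many SYNs were lost if connect() took `delay` seconds."""
    lost = 0
    for i, sent_at in enumerate(syn_send_times(initial_rto, max_retries)):
        if sent_at <= delay:
            lost = i  # the SYN sent at `sent_at` is the i-th retransmission
    return lost


print(syn_send_times())          # [0.0, 3.0, 9.0, 21.0, 45.0, 93.0, 189.0]
print(lost_syns_for_delay(189))  # 6
```

Under that assumption the seventh SYN goes out at 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds, so a 189-second transaction means the first six SYNs went unanswered.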
While this may at first suggest a network congestion or error issue, we don't have congestion or packet loss in general in the data centre. This only happens from certain clients to certain servers, and only intermittently. The same client going to a different IP address on the same server is always successful, 100% of the time. Likewise, a different client going to the same server IP is always successful. Also, after upgrading the FW last week, which required a reboot of the device and so severed its links with the N7Ks, the N7Ks now seem to exhibit this intermittent behaviour for a different set of client/server IP combinations.
Any help will be greatly appreciated. Thanks for your time.
Client app used in troubleshooting: tnsping (other protocols may suffer too, but we haven't done any testing on that)
Server: TNS Listener on TCP port 1521
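For anyone wanting to reproduce the test without tnsping, a rough sketch of an equivalent probe is below. It only exercises the TCP handshake to the listener port, not the TNS protocol itself. The function names, the 200-second timeout and the 1-second "slow" threshold are illustrative assumptions, not what was used in the original troubleshooting.

```python
import socket
import time


def probe(host, port, attempts=100, timeout=200.0):
    """Open and close a TCP connection `attempts` times.

    Returns a list with the connect time in seconds for each attempt,
    or None where the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            results.append(None)
    return results


def summarize(results, slow_threshold=1.0):
    """Count attempts that failed or took longer than `slow_threshold` seconds.

    A connect that takes seconds on a LAN suggests at least one SYN
    was retransmitted.
    """
    failed = sum(1 for r in results if r is None)
    slow = sum(1 for r in results if r is not None and r > slow_threshold)
    return {"attempts": len(results), "failed": failed, "slow": slow}
```

It could be run as `summarize(probe("192.0.2.10", 1521))`, where 192.0.2.10 is a placeholder for the server IP under test.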
! On Nexus 7010
# show version
BIOS: version 3.22.0
kickstart: version 5.1(3)
system: version 5.1(3)
BIOS compile time: 02/20/10
kickstart image file is: bootflash:///n7000-s1-kickstart.5.1.3.bin
For the benefit of others, here is what we found. The N7K was hitting the bug CSCtg95381.
Nexus 7000 may punt traffic to the CPU, so the traffic may experience random delay or drops.
On further investigation, the ARP entry is learned and the FIB adjacency is present in the FIB adjacency table.
The problem is caused by a race condition. Some hosts do not respond to the ARP refresh sent by the N7K, which in turn triggers deletion of the ARP entry on expiry. As a result, a route delete notification is sent to URIB from the process. However, traffic is still arriving for the given IP address, so the next packet hits the glean adjacency, which triggers ARP resolution in the hope that the ARP entry is learned from the host this time.
Workaround: clear ip route <host>.
This doesn't fully explain why it was working for certain client-server combinations, but the workaround is holding well for the end-points where it has been implemented.
When the problem occurs, there is no host route for the destination server in the adjacency manager on N7K-01. The only thing that's there is the subnet route pointing towards the VLAN gateway address. After implementing the workaround, a new /32 route can be seen in the adjacency manager for the server.
The bug is fixed in releases starting 5.1(5). Planning to upgrade to 5.2(3a).
Hi, we are running into a similar problem: packets with a different source address are getting dropped by the Nexus 7K. We are running version 6.2.2 of the code and we have confirmed that it is the Nexus 7K that is dropping the packets.
A few packets sent to the same destination but with a different source address get dropped. So do you think clearing the IP route for the destination would work here?