Cisco Support Community
New Member

Is there any function built in Nexus 5010 to detect intermittent link down?

Hi experts,

  I found that intermittent link-down events (20~40 seconds on average) occur about 1~10 times every month. SAP reports that many active connections are dropped, and a batch ping script I ran showed "Request timed out" for about 30 seconds.

Windows, SQL Server, and the Nexus 5010 do not show any errors. We run a cluster, and the cluster does not fail over.

I don't know which cables or NICs cause this issue. When it happens, almost all servers are unreachable, for example SQL server 1 -> SQL server 2 and IBM HS22-1 -> SQL server 1. However, some connections are sometimes not dropped. It varies each time.

PS: I ran this topology last year without any problems, but the intermittent link down started on 2011/1/7. Because there are no errors on the Nexus 5010, it is difficult to troubleshoot. Cisco TAC recommended yesterday that we implement virtual port channel.

Could I use "errdisable detect cause" to detect what caused the intermittent link down? Are there any error logs or switch parameters/status I can use to troubleshoot?

SAP network topology.JPG

18 REPLIES

Re: Is there any function built in Nexus 5010 to detect intermit

What other devices connect to the n5k?


New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Andrew,

  IBM HS22 with Nortel BNT 6-ports, HP dl980 with NC550SFP.

Cisco Employee

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Dennis,

What does the log of the NMS tool you use for this network show at the moment the connections go down?

Best regards,

Alex

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alexander,

  Our network team used network monitoring tools (I am not sure whether it is an NMS) and observed the same pattern. Many servers could not be reached via ping at that moment. However, no one knows why. MAC address flapping? Defective ports or links?

Re: Is there any function built in Nexus 5010 to detect intermit

are you running vPC?


New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Andrew,

  Not yet. Could vPC solve this problem? We will change the port channel from PACP to LACP on the DB connection and use vPC in Dept 10. Any thoughts?

Cisco Employee

Is there any function built in Nexus 5010 to detect intermittent

I did not understand whether ping to your network devices is OK during the interruption. You said that ping times out between servers and from your monitoring tools. Where do the monitoring tools reside? Which connections do they use? Is there any firewall or multicast in the network? If you have link flapping, you should see it in the Nexus log. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected? Are the servers in one VLAN? These answers could really help in locating the problem.

Best regards,

Alex

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex,

I did not understand whether ping to your network devices is OK during the interruption. You said that ping times out between servers and from your monitoring tools. Where do the monitoring tools reside?

>The monitoring tool is located outside the broadcast domain. The Catalyst 6513 switches are connected to many server-farm switches, and the monitoring tool runs on one of those server-farm switches (not shown in this diagram).

Which connections do they use? Is there any firewall or multicast in the network?

>No, no firewall or multicast.

If you have link flapping, you should see it in the Nexus log. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected?

>No, the whole broadcast domain is affected. However, sometimes 100% of connections are disconnected, and sometimes only 80~90%.

Are the servers in one VLAN? These answers could really help in locating the problem.

> Yes, one VLAN.

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex/Andrew,

  Here is my new strategy.

I am not sure how to troubleshoot using Wireshark/Microsoft Network Monitor, so I ran some tests in my test environment. Please check whether I did it correctly.

1. I installed Network Monitor on the DB and APP servers and had them ping each other continuously.

2. I filter by IP and protocol ( IPv4.Address == 192.168.28.99 AND IPv4.Address == 192.168.28.109 ) (ICMP)

3. I recorded the time when the link was down.

4. I checked the .cap files on both sides and found only echo requests being sent on both sides, with no replies received. The conclusion is that the servers, services, NIC drivers, teaming drivers and OS settings are OK. The source of the problem is an intermediate device (switches or links).

Cisco Employee

Is there any function built in Nexus 5010 to detect intermittent

I do not see how the connection to the switch reacts. For example, ping from SQL server 1 to Nexus 1, from SQL server 1 to Nexus 2, and from SQL server 1 to SQL server 2; these should run simultaneously. This is one way to see where the problem is. If ping is OK from SQL server 1 to Nexus 1 but fails to Nexus 2, we can search for the problem there. If you can ping Nexus 2 but cannot ping SQL server 2, then the connection between Nexus 2 and SQL server 2 should be verified, and we can focus on Nexus 2 and SQL server 2 only.
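The hop-by-hop reasoning in the reply above can be sketched in a few lines of Python. This is only an illustration: the hop labels and their order are assumptions taken from the path described in this thread, and the `reachable` results would have to come from pings that were actually running at the same moment.

```python
from typing import Dict


def localize_fault(reachable: Dict[str, bool]) -> str:
    """Return the first segment on the path where ping stops succeeding.

    `reachable` maps a hop label to whether a ping from SQL server 1 to that
    hop succeeded during the outage. The labels are hypothetical names.
    """
    path = ["nexus-1", "nexus-2", "sql-server-2"]  # assumed hop order
    previous = "sql-server-1"
    for hop in path:
        if not reachable.get(hop, False):
            # Everything up to `previous` answered, so suspect this segment.
            return f"{previous} <-> {hop}"
        previous = hop
    return "no fault on this path"


if __name__ == "__main__":
    print(localize_fault({"nexus-1": True, "nexus-2": True, "sql-server-2": False}))
```

A result such as `nexus-2 <-> sql-server-2` only narrows the search; the segment still has to be checked by hand (cable, port, NIC, teaming).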

Best regards,

Alex

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex,

  I revised my strategy after combining your recommendations with opinions from the Microsoft forum. If you have any other ideas, please let me know.

http://social.msdn.microsoft.com/Forums/en-US/sqldataaccess/thread/415ba445-c227-4bf2-9c00-25bd3ed114bf

1. I install network monitors on SQL server 1, SQL server 2, APP server 1 and APP server 2, and each server continuously pings the others together with the 6 intermediate devices (5010 * 2 plus Nortel BNT 6-port * 4).

2. I filter on protocol (ICMP) only, which minimizes overhead.

3. I record the time when the link is down (both in SAP and in my ping log file).

4. I check the .cap files on every host. If a node shows only echo requests being sent and no replies received, the conclusion is that the servers, services, NIC drivers, teaming drivers and OS settings are OK, and the source of the problem is the intermediate devices (switches or links).
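Step 3 above (recording when the link is down) could be automated roughly as follows. This is a minimal sketch, assuming the ping log has already been parsed into time-ordered `(timestamp, reply_received)` pairs; the parsing itself depends on the ping tool's output format and is not shown.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, bool]  # (time the ping was sent, whether a reply arrived)


def outage_windows(samples: List[Sample]) -> List[Tuple[datetime, datetime]]:
    """Group consecutive failed pings into (first_failure, last_failure) windows.

    `samples` must be sorted by time.
    """
    windows: List[Tuple[datetime, datetime]] = []
    start: Optional[datetime] = None
    last: Optional[datetime] = None
    for when, ok in samples:
        if not ok:
            if start is None:
                start = when  # first lost ping of a new outage
            last = when
        elif start is not None:
            windows.append((start, last))  # a reply arrived, outage is over
            start = last = None
    if start is not None:  # the log ended while pings were still failing
        windows.append((start, last))
    return windows


if __name__ == "__main__":
    t0 = datetime(2011, 12, 13, 13, 2, 0)
    demo = [(t0 + timedelta(seconds=i), not 3 <= i <= 5) for i in range(10)]
    for first, final in outage_windows(demo):
        print(first.time(), "to", final.time())
```

Running the same function on the logs from each host and each intermediate device gives windows that can be lined up against the times SAP reported.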

Cisco Employee

Is there any function built in Nexus 5010 to detect intermittent

OK, Dennis, let's see what the capture shows when the link is down. If you have a syslog server for the Nexus, its logs and those of the other network devices should be checked for the same intermittent moments.

Best regards,

Alex

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex,

  Sorry, I am back. The intermittent link down happened again.

  The intermittent link down issue has become worse. Please check netmoncap.zip on my FTP site to see if you can find out what is going on, thanks. I will check later, but I am an SAP system administrator and not a networking expert. I would appreciate your assistance.

ftp://ftp01.quantatw.com/

user: sapftp     password: wju123

Times when the intermittent link down happened on 2011/12/13:

1:02pm

1:04

1:06

1:13

1:18

1:24

1:30

1:34
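To see whether the outages listed above follow a fixed period, the gaps between them can be computed. The sketch below assumes every entry is p.m. on the same day, since only the first one carries "pm" in the post.

```python
from datetime import datetime

# Times copied from the post; assumed to all be p.m. on 2011/12/13.
times = ["13:02", "13:04", "13:06", "13:13", "13:18", "13:24", "13:30", "13:34"]
stamps = [datetime.strptime(f"2011-12-13 {t}", "%Y-%m-%d %H:%M") for t in times]

# Minutes between each outage and the next one.
gaps_min = [int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])]
print(gaps_min)  # [2, 2, 7, 5, 6, 6, 4]
```

The gaps are not evenly spaced, which may argue against a single fixed periodic timer, although eight events are too few to be sure.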

topology:

dl980-1 => nexus-1 => nexus-2 => dl980-2

tccap36 => nortel-1 or 2 => nexus-1 or 2 => dl980-1

ip list:

dl980-1(active DB): 192.168.28.11

dl980-2(passive DB): 192.168.28.12

tccap36(APP server 1): 192.168.28.110

tccap40(APP server 2): 192.168.28.115

Nexus 5010 ip: 192.168.28.251 192.168.28.252

Nortel ip: 192.168.28.25~28

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex,

  Not all connections are broken when the intermittent link down occurs, so I find it complicated to identify the source of the problem. Should I combine Wireshark with port mirroring? Because we use port aggregation, only Rx can be captured, right? Are there any Cisco documents that explain in detail how to troubleshoot a problem like this? The official Wireshark guide? Or >show techsupport? Any information will be appreciated.

Cisco Employee

Is there any function built in Nexus 5010 to detect intermittent

Hi Dennis,

The connections between the servers (192.168.28.11, 192.168.28.12, 192.168.28.110, 192.168.28.115) look OK in the second capture.

Best regards,

Alex

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex,

  Yes, I have discussed this issue with Microsoft twice. I look forward to your opinion.

--- first mail --

Hi Marty,

  What did you find out? I think intermittent link down happened between dl980-1 <=> nexus-1 or nexus-2

Because if you inspect tshark.cap on tccap40, traffic is OK from tccap40 to Nortel(95), Nortel(96), nexus(251) and nexus(252). That is to say, should we focus on the Nexus or the dl980 teaming driver? Any opinions?

Dl980-2 => dl980-1

Wireshark: missing and out of order

Tccap40 => dl980-1

Tccap40 => nexus(251) ok

          Nexus (252) ok

          Nortel(95) ok

          Nortel(96) ok

--- second mail --

Hi Ellis,

  Please confirm with Ted how to set up EtherChannel with PortFast or edge port correctly.

We will schedule the downtime for enabling portfast tomorrow.

Question: Why does my team lose connectivity for the first 30 to 50 seconds after the Primary adapter is restored (fallback)?

Answer: Because Spanning Tree Protocol is bringing the port from blocking to forwarding. You must enable Port Fast or Edge Port on the switch ports connected to the team.

The link below lists 4 steps; please make sure you did them all.

http://www.cisco.com/en/US/tech/tk389/tk213/technologies_configuration_example09186a008089a821.shtml

Here is the PortFast command for the Nexus 5010: the Ethernet interface must be configured as PortFast (use the spanning-tree port type edge trunk command).
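The "30 to 50 seconds" in the quoted answer matches the classic spanning-tree timers, and the arithmetic can be written out. This sketch assumes the IEEE 802.1D default values (forward delay 15 s, max age 20 s); the timers actually configured on these switches, and the spanning-tree mode they run, are not stated in this thread and may differ.

```python
FORWARD_DELAY_S = 15  # 802.1D default: time spent in listening, then again in learning
MAX_AGE_S = 20        # 802.1D default: how long stale BPDU information is kept

# A port that comes up walks through listening and learning before forwarding.
listening_plus_learning = 2 * FORWARD_DELAY_S

# Worst case: old information must first age out, then the port still
# has to go through listening and learning.
worst_case = MAX_AGE_S + listening_plus_learning

print(listening_plus_learning, worst_case)  # 30 50
```

An edge port (PortFast) skips the listening and learning wait, which is why the answer recommends it for ports facing teamed server NICs. It is also worth noting that 30 seconds is close to the ping time-out reported at the top of this thread, though that alone does not prove spanning tree is the cause.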

New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi Alex & all,

  I have uploaded the latest config of the Nexus 5010. Would you please check whether we have followed best practice (http://www.cisco.com/en/US/tech/tk389/tk213/technologies_configuration_example09186a008089a821.shtml ) to set up EtherChannel for the HP Network Configuration Utility (teaming)? I really appreciate your help. I uploaded "show running-config" and "show techsupport". The file name is config.zip, FYI.

ftp://ftp01.quantatw.com/

user: sapftp          password: wju123



New Member

Re: Is there any function built in Nexus 5010 to detect intermit

Hi experts,

  Additional info: we are running Broadcom Smart Load Balancing teaming on the IBM HS22. Every support person keeps telling me it is impossible to run active/active in one blade. However, please check this link (http://support.dell.com/support/edocs/network/P29352/English/teamsvcs.htm).

----

Smart Load Balancing (SLB)

Smart Load Balancing™ provides both load balancing and failover when configured for Load Balancing, and only failover when configured for fault tolerance. It works with any Ethernet switch and requires no trunking configuration on the switch. The team advertises multiple MAC addresses and one or more IP addresses (when using secondary IP addresses). The team MAC address is selected from the list of load balancing members. When the server receives an ARP Request, the software-networking stack will always send an ARP Reply with the team MAC address. To begin the load balancing process, the teaming driver will modify this ARP Reply by changing the source MAC address to match one of the physical adapters.

Smart Load Balancing enables both transmit and receive load balancing based on the Layer 3/Layer 4 IP address and TCP/UDP port number. In other words, the load balancing is not done at a byte or frame level but on a TCP/UDP session basis. This methodology is required to maintain in-order delivery of frames that belong to the same socket conversation. Load balancing is supported on 2-8 ports. These ports can include any combination of add-in adapters and LAN-on-Motherboard (LOM) devices. Transmit load balancing is achieved by creating a hashing table using the source and destination IP addresses and TCP/UDP port numbers. The same combination of source and destination IP addresses and TCP/UDP port numbers will generally yield the same hash index and therefore point to the same port in the team. When a port is selected to carry all the frames of a given socket, the unique MAC address of the physical adapter is included in the frame, and not the team MAC address. This is required to comply with the IEEE 802.3 standard. If two adapters transmit using the same MAC address, then a duplicate MAC address situation would occur that the switch could not handle.
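The per-session hashing described above can be illustrated with a small sketch. The hash used here (CRC32 over a text key) is a stand-in chosen for the example and is not Broadcom's actual algorithm; `pick_port` is a hypothetical helper name.

```python
import zlib


def pick_port(src_ip: str, dst_ip: str, src_port: int, dst_port: int, num_ports: int) -> int:
    """Map one TCP/UDP session to a team member index in [0, num_ports).

    The same 4-tuple always produces the same index, so all frames of a
    socket leave through the same physical adapter and stay in order.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_ports


if __name__ == "__main__":
    # Two sessions between the same hosts may land on different adapters
    # because their source ports differ.
    print(pick_port("192.168.28.110", "192.168.28.11", 50001, 1433, 2))
    print(pick_port("192.168.28.110", "192.168.28.11", 50002, 1433, 2))
```

Because the choice is made per session rather than per frame, a single long-lived connection never uses more than one adapter's bandwidth.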

Receive Load Balancing is achieved through an intermediate driver by sending Gratuitous ARPs on a client by client basis using the unicast address of each client as the destination address of the ARP Request (also known as a Directed ARP). This is considered client load balancing and not traffic load balancing. When the intermediate driver detects a significant load imbalance between the physical adapters in an SLB team, it will generate G-ARPs in an effort to redistribute incoming frames. The intermediate driver (BASP) does not answer ARP Requests; only the software protocol stack provides the required ARP Reply. It is important to understand that receive load balancing is a function of the number of clients that are connecting to the server via the team interface.

SLB Receive Load Balancing attempts to load balance incoming traffic for client machines across physical ports in the team. It uses a modified Gratuitous ARP to advertise a different MAC address for the team IP Address in the sender physical and protocol address. This G-ARP is unicast with the MAC and IP Address of a client machine in the target physical and protocol address respectively. This causes the target client to update its ARP cache with a new MAC address map to the team IP address. G-ARPs are not broadcast because this would cause all clients to send their traffic to the same port. As a result, the benefits achieved through client load balancing would be eliminated, and could cause out of order frame delivery. This receive load balancing scheme works as long as all clients and the teamed server are on the same subnet or broadcast domain.
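The effect of directed G-ARPs on client ARP caches can be modelled with plain dictionaries. This is a toy model only: the round-robin assignment of adapters to clients is an assumption made for the example (the text only says the driver redistributes on imbalance), and the function names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List

ArpCaches = Dict[str, Dict[str, str]]  # client name -> {ip: mac}


def directed_garp(caches: ArpCaches, team_ip: str, adapter_macs: List[str]) -> ArpCaches:
    """Send one unicast G-ARP per client, rotating the advertised adapter MAC."""
    for client, mac in zip(sorted(caches), cycle(adapter_macs)):
        caches[client][team_ip] = mac  # only this client updates its cache
    return caches


def broadcast_garp(caches: ArpCaches, team_ip: str, mac: str) -> ArpCaches:
    """A broadcast G-ARP reaches every client, so all of them learn the same MAC."""
    for cache in caches.values():
        cache[team_ip] = mac
    return caches


if __name__ == "__main__":
    caches: ArpCaches = {"client-a": {}, "client-b": {}, "client-c": {}}
    directed_garp(caches, "192.168.28.110", ["00:00:00:00:00:01", "00:00:00:00:00:02"])
    print(caches)
```

After `directed_garp` the clients point at different adapter MACs for the same team IP, which is the receive load balancing described above; after `broadcast_garp` they would all converge on one port, which is why the driver avoids broadcasting.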

When the clients and the server are on different subnets, and incoming traffic has to traverse a router, the received traffic destined for the server is not load balanced. The physical adapter that the intermediate driver has selected to carry the IP flow will carry all of the traffic. When the router needs to send a frame to the team IP address, it will broadcast an ARP Request (if not in the ARP cache). The server software stack will generate an ARP Reply with the team MAC address, but the intermediate driver will modify the ARP Reply and send it over a particular physical adapter, establishing the flow for that session.

The reason is that ARP is not a routable protocol. It does not have an IP header and therefore is not sent to the router or default gateway. ARP is only a local subnet protocol. In addition, since the G-ARP is not a broadcast packet, the router will not process it and will not update its own ARP cache.

The only way that the router would process an ARP that is intended for another network device is if it has Proxy ARP enabled and the host has no default gateway. This is very rare and not recommended for most applications.

Transmit traffic through a router will be load balanced as transmit load balancing is based on the source and destination IP address and TCP/UDP port number. Since routers do not alter the source and destination IP address, the load balancing algorithm works as intended.

Configuring routers for Hot Standby Routing Protocol (HSRP) does not allow for receive load balancing to occur in the adapter team. In general, HSRP allows for two routers to act as one router, advertising a virtual IP and virtual MAC address. One physical router is the active interface while the other is standby. Although HSRP can also load share nodes (using different default gateways on the host nodes) across multiple routers in HSRP groups, it always points to the primary MAC address of the team.
