
Is there any function built in Nexus 5010 to detect intermittent link down?

DennisLee1
Level 1

Hi experts,

  I have been seeing intermittent link-down events (20 to 40 seconds each on average) about 1 to 10 times every month. SAP reports that many active connections get disconnected, and when I ran a batch file that pings continuously, I saw "Request timed out" for about 30 seconds.
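To turn a ping batch log like this into a list of outage windows, a minimal sketch might look like the following. It assumes the log has already been reduced to time-ordered `(timestamp, reply_received)` samples; that input format, the function name `find_outages`, and the `min_seconds` threshold are all hypothetical and not part of the original batch file.

```python
from datetime import datetime, timedelta


def find_outages(samples, min_seconds=5):
    """Group consecutive failed pings into (start, end, duration_seconds) windows.

    samples: time-ordered list of (datetime, bool); the bool is True when the
    ping got a reply. This input format is an assumption for illustration.
    """
    outages = []
    start = None      # time of the first failed ping in the current run
    last_fail = None  # time of the most recent failed ping
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            # First successful ping after a run of failures closes the window.
            duration = (ts - start).total_seconds()
            if duration >= min_seconds:
                outages.append((start, ts, duration))
            start = None
    if start is not None:
        # The log ended while pings were still failing.
        duration = (last_fail - start).total_seconds()
        if duration >= min_seconds:
            outages.append((start, last_fail, duration))
    return outages
```

Running this over the logs of several hosts and comparing the window start times would show whether the hosts lose connectivity at the same moment.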

Windows, SQL Server, and the Nexus 5010 do not show any errors. We run a cluster, and the cluster does not fail over.

I don't know which cables or NICs cause this issue. When it happens, almost all servers are unreachable, for example SQL server 1 -> SQL server 2 and IBM HS22-1 -> SQL server 1. However, some connections are sometimes not dropped. It varies each time.

PS: I ran this topology last year without any problems, but the intermittent link-down events started on 2011/1/7. Because there are no errors on the Nexus 5010, it is difficult to troubleshoot. Yesterday Cisco TAC recommended that we implement virtual port channel (vPC).

Could I use "errdisable detect cause" to detect what caused the intermittent link down? Are there any error logs or switch parameters/status that I can use to troubleshoot?

SAP network topology.JPG

18 Replies

andrew.prince
Level 10

What other devices connect to the n5k?


Hi Andrew,

  IBM HS22 with Nortel BNT 6-ports, HP dl980 with NC550SFP.

Hi Dennis,

What does the log of the NMS tool you use for this network show at the moment when the connections are down?

Best regards,

Alex

Hi Alexander,

  Our network team used network monitoring tools (I am not sure whether it is the NMS) and observed the same pattern. Many servers could not be reached via ping at that moment. However, no one knows why. MAC address flapping? Defective ports or links?

Are you running vPC?


Hi Andrew,

  Not yet. Could vPC solve this problem? We will change the port channel from PAgP to LACP at the DB connection and use vPC in Dept 10. Any thoughts?

I did not understand whether the ping to your network devices is OK at the moment of interruption. You said that pings time out between servers and from your monitoring tools. Where do the monitoring tools reside? Which connections do they use? Are there any firewalls or multicast in the network? If you have link flapping, you should see it in the log of the Nexus. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected? Are the servers in one VLAN? These answers could really help in locating the problem.

Best regards,

Alex

Hi Alex,

I did not understand whether the ping to your network devices is OK at the moment of interruption. You said that pings time out between servers and from your monitoring tools. Where do the monitoring tools reside?

>The monitoring tool is located outside the broadcast domain. The Catalyst 6513s are connected to many server-farm switches, and the monitoring tool runs on one of the server-farm switches (not shown in this graph).

Which connections do they use? Are there any firewalls or multicast in the network?

>No, no firewall or multicast.

If you have link flapping, you should see it in the log of the Nexus. At the moment of the problem, are you able to ping the nearest switch from the server whose sessions were disconnected?

>No, the whole broadcast domain is affected. However, sometimes 100% of the connections are disconnected, and sometimes only 80~90%.

Are the servers in one VLAN? These answers could really help in locating the problem.

>Yes, one VLAN.

Hi Alex/Andrew,

  Here is my new strategy.

I am not sure how to troubleshoot using Wireshark/MS Network Monitor, so I ran some tests in my test environment. Please check whether I did it correctly.

1. I installed Network Monitor on the DB and APP servers and had them ping each other continuously.

2. I filtered by IP address and protocol: ( IPv4.Address == 192.168.28.99 AND IPv4.Address == 192.168.28.109 ) (ICMP, port 7)

3. I recorded the time when the link was down.

4. I checked the .cap files on both sides and found that each side only sent requests and neither side received any replies. My conclusion is that the servers, services, NIC drivers, teaming drivers and OS settings are OK, and that the source of the problem is a device in the middle (switches or links).

I do not see how the connection to the switch reacts. For example, ping from SQL server 1 to Nexus 1, from SQL server 1 to Nexus 2, and from SQL server 1 to SQL server 2; this should be done simultaneously. This is one way to see where the problem is. For example, if the ping from SQL server 1 to Nexus 1 is OK but the ping to Nexus 2 fails, then we can search for the problem there. If you can ping Nexus 2 but cannot ping SQL server 2, then the connection between Nexus 2 and SQL server 2 should be verified, and we can focus on Nexus 2 and SQL server 2 only.
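The hop-by-hop reasoning above can be sketched as a small helper. This is only an illustration: the function name `suspect_segment`, the host labels, and the assumption that traffic crosses the hops in the listed order are all hypothetical, and a ping to a switch's management address may not follow the same path as server traffic.

```python
def suspect_segment(path, reachable):
    """Return (last_reachable_hop, first_unreachable_hop), or None if all hops answer.

    path: hops in the order traffic is assumed to cross them, starting with
          the source host itself.
    reachable: dict mapping hop name -> True/False from simultaneous pings.
    """
    previous = path[0]  # the source host is reachable from itself by definition
    for hop in path[1:]:
        if not reachable.get(hop, False):
            return (previous, hop)
        previous = hop
    return None


path = ["sql-server-1", "nexus-1", "nexus-2", "sql-server-2"]
result = suspect_segment(
    path, {"nexus-1": True, "nexus-2": True, "sql-server-2": False}
)
print(result)
```

With those sample results the helper points at the segment between Nexus 2 and SQL server 2, which matches the second case described above.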

Best regards,

Alex

Hi Alex,

  I revised my strategy after combining your recommendations with opinions from the Microsoft forum. If you have any other ideas, please let me know.

http://social.msdn.microsoft.com/Forums/en-US/sqldataaccess/thread/415ba445-c227-4bf2-9c00-25bd3ed114bf

1. I install network monitors on SQL server 1, SQL server 2, APP server 1 and APP server 2, and from each of them I continuously ping the other servers together with the 6 middle devices (5010 * 2 plus Nortel BNT 6-port * 4).

2. I filter on the protocol (ICMP) only, which minimizes overhead.

3. I record the time when the link is down (both in SAP and in my ping log file).

4. I check the .cap files on every host. If a node only sends requests and never gets any replies, the conclusion is that the servers, services, NIC drivers, teaming drivers and OS settings are OK, and that the source of the problem is the middle devices (switches or links).
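The comparison in step 4 could be sketched like this, assuming each capture has been exported to a list of ICMP echo records. The record fields (`type`, `src`, `dst`, `ident`, `seq`) and the function names are hypothetical; this is not the export format of Network Monitor or Wireshark.

```python
def flow_key(packet):
    """Identify an echo exchange by requester, responder, ICMP id and sequence."""
    if packet["type"] == "request":
        return (packet["src"], packet["dst"], packet["ident"], packet["seq"])
    # A reply travels the other way, so swap src and dst to match its request.
    return (packet["dst"], packet["src"], packet["ident"], packet["seq"])


def classify_requests(sender_capture, receiver_capture):
    """Label every echo request seen leaving the sender.

    "answered":      the sender's capture also contains the matching reply.
    "reply_missing": the request shows up in the receiver's capture, but the
                     sender never saw a reply.
    "request_lost":  the request never shows up in the receiver's capture.
    """
    replied = {flow_key(p) for p in sender_capture if p["type"] == "reply"}
    arrived = {flow_key(p) for p in receiver_capture if p["type"] == "request"}
    verdicts = {}
    for p in sender_capture:
        if p["type"] != "request":
            continue
        key = flow_key(p)
        if key in replied:
            verdicts[key] = "answered"
        elif key in arrived:
            verdicts[key] = "reply_missing"
        else:
            verdicts[key] = "request_lost"
    return verdicts
```

If most requests during an outage come back as "request_lost" on both sides, that would support the conclusion that the packets are dropped between the hosts rather than by the hosts themselves.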

OK, Dennis, let's see what the capture shows when the link is down. The syslog server for the Nexus, and the other network devices, should be checked for the same moment of interruption.

Best regards,

Alex

Hi Alex,

  Sorry, I am back. The intermittent link down has happened again.

  The intermittent link-down issue has become worse. Please check netmoncap.zip on my FTP site to see whether you can find out what is going on, thanks. I will check it later too, but I am an SAP system administrator and not a networking expert. I would appreciate your assistance.

ftp://ftp01.quantatw.com/

user: sapftp     password: wju123

Times when the intermittent link down happened on 2011/12/13:

1:02pm

1:04

1:06

1:13

1:18

1:24

1:30

1:34
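To check whether these outages follow a regular interval, the gaps between the listed times can be computed with a short sketch. It assumes all of the times above are pm, so they are written in 24-hour form; the function name `gaps_in_minutes` is hypothetical.

```python
from datetime import datetime


def gaps_in_minutes(times, fmt="%H:%M"):
    """Return the gap in whole minutes between each pair of consecutive times."""
    parsed = [datetime.strptime(t, fmt) for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


# The times listed above, assumed to all be in the afternoon.
outage_times = ["13:02", "13:04", "13:06", "13:13", "13:18", "13:24", "13:30", "13:34"]
print(gaps_in_minutes(outage_times))  # prints [2, 2, 7, 5, 6, 6, 4]
```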

topology:

dl980-1 => nexus-1 => nexus-2 => dl980-2

tccap36 => nortel-1 or 2 => nexus-1 or 2 => dl980-1

ip list:

dl980-1(active DB): 192.168.28.11

dl980-2(passive DB): 192.168.28.12

tccap36(APP server 1): 192.168.28.110

tccap40(APP server 1): 192.168.28.115

Nexus 5010 ip: 192.168.28.251 192.168.28.252

Nortel ip: 192.168.28.25~28

Hi Alex,

  Not all connections are broken when the intermittent link down occurs, so I find it complicated to identify the source of the problem. Should I combine Wireshark with port mirroring? Because we use port aggregation, only Rx could be captured, right? Is there any Cisco document that describes in detail how to troubleshoot a problem like this? The official Wireshark guide? Or "show tech-support"? Any information will be appreciated.
