LACP drops packets after a couple of weeks

vignal-systems
Level 1

Hello,


We are facing a strange issue with the following configuration:


Master switch SG500 :
 routing vlans 1 (10.0.1.x/24) & vlan 6 (10.0.6.x/24)
 ip on vlan1 10.0.1.1
 ip on vlan6 10.0.6.1
 Default vlan 1
 TCAM approx. 400/1024 used for routing
 Fw version 1.4.1.3
 
Member switch SF300
 Default vlan 1
 Fw version 1.4.1.3

These switches are connected by an LACP LAG over 2 x Gigabit ports, using the MAC address load-balance algorithm, with a trunk on the LAG.

Mock-up test:
-StationA connected to the master switch, vlan1
-StationB connected to the member switch, vlan6
-Routing works via the master switch
-ACLs disabled for testing purposes

IMPORTANT: The issue appears after days and days without any problems

Symptom:
StationB cannot communicate with StationA.
StationB can ping the gateways 10.0.6.1 and 10.0.1.1 (both on the master switch) but not StationA (10.0.1.x) during the network outage.
After some minutes, traffic recovers automatically.

Network Analysis:
No logs or alerts on the switches.
We took Wireshark traces on both switches (VLAN port monitoring) and found something: for a TCP connection, the SYN from StationB and the SYN-ACK from StationA are visible on the master switch, but the SYN-ACK packet is not present on the member switch.
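
The comparison of the two traces can be automated once each capture is exported to a list of packet records. The following is only a minimal sketch: the `Pkt` record, the `missing_downstream` helper, and all addresses and ports are hypothetical and not taken from the real captures.

```python
from collections import Counter
from typing import NamedTuple


class Pkt(NamedTuple):
    """One captured TCP packet, reduced to the fields needed for matching."""
    src: str
    dst: str
    sport: int
    dport: int
    flags: str  # "S" for SYN, "SA" for SYN-ACK


def missing_downstream(upstream, downstream):
    """Return packets seen in the upstream capture but absent downstream."""
    # Counter subtraction keeps only packets with a positive remaining count.
    remaining = Counter(upstream) - Counter(downstream)
    return sorted(remaining.elements())


# Hypothetical records standing in for the two port-monitor captures.
master = [
    Pkt("10.0.6.20", "10.0.1.10", 50000, 3389, "S"),
    Pkt("10.0.1.10", "10.0.6.20", 3389, 50000, "SA"),
]
member = [
    Pkt("10.0.6.20", "10.0.1.10", 50000, 3389, "S"),
]

lost = missing_downstream(master, member)
for pkt in lost:
    print(pkt)
```

With records like these, `lost` holds the return packet that left the master switch but never showed up on the member switch.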

We suspect a bug in LACP that appears after many days of uptime. The first time we faced this issue was 2 months ago, and at that time we suspected an LACP bug due to a firmware mismatch between the switches.
We upgraded the firmware on both switches and everything worked like a charm for 44 days. Now the issue has come back without any change to the configuration.

It looks like LACP drops the SYN-ACK or return packets. Today we disconnected and reconnected the LACP links and that seems to have recovered the situation...


Have you seen something similar, or do you have any idea about this issue?

Aleksandra Dargiel
Cisco Employee

Hi,

One thing you should know about LAG and LACP on Small Business switches is that with the 1.3 firmware we changed the LAG behavior: even if there is an LACP failure, the first port stays up and becomes active, to cover the case where one link partner is temporarily not configured for LACP.

That would imply that the connection should not be lost. I wonder whether the problem you see might be connected to power-saving options on host A, or perhaps to an ARP request/response issue closer to host A's location. That would explain why there are no relevant logs on the switch.

Aleksandra

Hi,


First of all: thanks for reading and replying to me.


For information, when I posted my message (29 July) we physically disconnected and reconnected the LACP ports. Everything returned to normal until 26 August, when the problem reappeared. It grew until the company called me and we remotely (there is no on-site network team, so no modification was made to the switch configuration) did the same action (disconnect and reconnect the LAG), and the situation recovered again.


We are sure that there is no issue on host A such as power saving or ARP. We use the inter-VLAN routing of the SG500 to route between the RDS farms (vlan1) and the clients (vlan6). The issue appears slowly (1 host, then 5 hosts... 30 hosts...) and reaches approx. 1/4 of the connected stations within 36 hours.


When the issue is present we can see the packets (SYN / SYN-ACK or ICMP request / ICMP reply) on the SG500, but the SYN-ACK or ICMP reply packets disappear after the LAG, on the SF300 switch.

I suspect one of the following two options:

-SG500 routing or inter-VLAN tagging producing malformed packets (low probability)

-LACP algorithm issue.


I'll switch from the MAC algorithm to the IP/MAC algorithm to see if it makes any difference. But we must wait approx. 4 weeks to see if it has any impact...


As it impacts our factory when the issue occurs, I have only short windows for testing, so do you have any suggestions or tests I can do?
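
To illustrate what the load-balance setting changes, here is a rough sketch of hash-based link selection on a 2-port LAG. The hash actually used by the SG500 is not given in this thread, so the XOR fold below is only a stand-in, and the MAC and IP addresses are made up. It relies on one general fact: for routed traffic, the source MAC of every frame leaving the router is the router's own, so a MAC-only hash effectively depends on the destination station alone.

```python
def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def ip_bytes(ip: str) -> bytes:
    """Convert a dotted IPv4 address to 4 raw bytes."""
    return bytes(int(octet) for octet in ip.split("."))


def xor_fold(data: bytes) -> int:
    """XOR all bytes together (stand-in hash, NOT the vendor's algorithm)."""
    h = 0
    for b in data:
        h ^= b
    return h


def pick_link_mac(src_mac, dst_mac, n_links=2):
    """Select a LAG member from the MAC addresses only."""
    return xor_fold(mac_bytes(src_mac) + mac_bytes(dst_mac)) % n_links


def pick_link_ip_mac(src_mac, dst_mac, src_ip, dst_ip, n_links=2):
    """Select a LAG member from the MAC and IP addresses."""
    data = (mac_bytes(src_mac) + mac_bytes(dst_mac)
            + ip_bytes(src_ip) + ip_bytes(dst_ip))
    return xor_fold(data) % n_links


# Hypothetical addresses: routed frames from two vlan1 servers to one vlan6 station.
ROUTER_MAC = "00:00:5e:00:53:01"   # source MAC of every routed frame
STATION_MAC = "00:00:5e:00:53:20"
STATION_IP = "10.0.6.20"
SERVERS = ["10.0.1.10", "10.0.1.11"]

mac_links = [pick_link_mac(ROUTER_MAC, STATION_MAC) for _ in SERVERS]
ip_mac_links = [
    pick_link_ip_mac(ROUTER_MAC, STATION_MAC, server_ip, STATION_IP)
    for server_ip in SERVERS
]
print("MAC only:", mac_links)
print("IP/MAC  :", ip_mac_links)
```

Under these assumptions, the MAC-only hash sends all routed traffic for a given station over the same LAG member, while the IP/MAC hash can spread flows from different servers across both members. This does not explain the drops themselves; it only shows which traffic a single misbehaving member could affect in each mode.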

Hi,

I would take a look at the MAC addresses in your test packet captures and compare them to the MAC address table.
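
A minimal sketch of that comparison, assuming the MAC address table and the destination MACs from the capture have been copied into plain Python lists by hand. The helper names, port names and all addresses are hypothetical.

```python
import re


def normalize_mac(mac: str) -> str:
    """Lowercase a MAC address and strip ':', '-' and '.' separators."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def unknown_destinations(frames, mac_table):
    """Return (vlan, dst_mac) pairs from the capture with no MAC table entry."""
    known = {(vlan, normalize_mac(mac)) for vlan, mac, _port in mac_table}
    return [
        (vlan, dst) for vlan, dst in frames
        if (vlan, normalize_mac(dst)) not in known
    ]


# Hypothetical MAC table rows: (vlan, mac, port).
mac_table = [
    (6, "00:00:5e:00:53:20", "Po1"),
    (1, "00:00:5e:00:53:10", "gi5"),
]

# Hypothetical (vlan, destination MAC) pairs taken from the capture.
frames = [
    (6, "00-00-5E-00-53-20"),
    (6, "00-00-5E-00-53-21"),
]

suspicious = unknown_destinations(frames, mac_table)
print(suspicious)
```

Any pair returned here is a destination the switch has not learned in that VLAN, which is worth checking against the stations that lose connectivity.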

Just to be sure: when the LACP LAG is removed, you have not seen this issue?

And please double-check that you are running the latest firmware and boot code on both switches.

Both should be on 1.4.1 firmware and 1.4.0 boot code

Aleksandra

Regarding firmware and boot code:

For SG500 :
Firmware Version (Active Image):  1.4.1.3
Firmware MD5 Checksum (Active Image):  f73388df555545d4ac56b89a208493c9
Boot Version:  1.4.0.02
Boot MD5 Checksum:  accbdaec117726d0e5149babc5b2a0b0

For SG300 :
Firmware Version (Active Image):  1.4.1.3
Firmware MD5 Checksum (Active Image):  e6e1243d05a6228d03bfb562616b78bb
Boot Version:  1.3.5.06
Boot MD5 Checksum:  da44c9c583e5a8a274f911c4d16f501e

 

=> I don't see boot code 1.4 available for the Sx300 switches; am I wrong?

 

"Just to be sure when the LACP LAG is removed you have not seen this issue?"

I'm not sure I understand your question, but: when the issue appears, we don't make any change to the configuration of the switches. I just physically unplug and replug the RJ45 cables and the situation recovers.

Hi,

Yes, for the SG300 this is the latest boot code.

If we assume that there is a problem with the algorithm used by the LAG, then we should not see the problem when the LAG is removed. This is one way to approach the issue.

On the other hand, when you have the problem but before restarting the port, you may also check the interface and etherlike counters.

Those two are most probably just starting points for further troubleshooting.
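
For the counter check, one possible approach is to note the counters of both LAG member ports twice while the problem is present and compare the two snapshots. The sketch below assumes the values are typed in by hand; the port names and counter names are hypothetical and do not match any specific switch output.

```python
# Hypothetical names for the error counters worth watching.
ERROR_COUNTERS = ("fcs_errors", "alignment_errors", "rx_discards", "tx_discards")


def counter_deltas(before, after):
    """Per-port difference between two counter snapshots (after - before)."""
    return {
        port: {
            name: value - before.get(port, {}).get(name, 0)
            for name, value in counters.items()
        }
        for port, counters in after.items()
    }


def ports_with_errors(deltas, names=ERROR_COUNTERS):
    """Ports whose listed error counters increased between the snapshots."""
    return sorted(
        port for port, counters in deltas.items()
        if any(counters.get(name, 0) > 0 for name in names)
    )


# Hypothetical snapshots for the two LAG member ports.
before = {
    "gi1": {"rx_frames": 1000, "fcs_errors": 0, "rx_discards": 0},
    "gi2": {"rx_frames": 1000, "fcs_errors": 0, "rx_discards": 0},
}
after = {
    "gi1": {"rx_frames": 5000, "fcs_errors": 0, "rx_discards": 0},
    "gi2": {"rx_frames": 1000, "fcs_errors": 0, "rx_discards": 37},
}

deltas = counter_deltas(before, after)
suspects = ports_with_errors(deltas)
print(suspects)
```

A member port whose frame counters stay flat while its discard or error counters climb would be the first one to look at before unplugging anything.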

I would also suggest you open a ticket with the Small Business Support Center team so they can go through the tests with you and gather all the required information in case the developers need to be notified.

http://www.cisco.com/c/en/us/support/web/tsd-cisco-small-business-support-center-contacts.html

Regards,

Aleksandra

Ok,

We don't have an active support contract to open a ticket.

We didn't see this issue before implementing the LAG between the switches, but we also changed the core switch from an SG300 to an SG500 at the same time, so the comparison is not relevant because it is not exactly the same configuration. The SFPs and fibers are the same as previously used, with no errors on the link.

Today I cleared the counters and changed the LAG load-balance algorithm from MAC Address to IP/MAC Address: we will see if the issue occurs in this mode.

I will give you feedback ASAP.

 

Thanks.

4 weeks later:

After changing the LAG Load Balance Algorithm from "MAC Address" to "IP/MAC Address", the issue has not come back (yet).

I will wait a couple of months before proposing this as an answer.
