We have 10 Cisco SG200-50 switches running firmware 220.127.116.11. The switches are used in a small data center where we are a tenant and the manager. The switches normally work smoothly with no problems. When the switches fail we experience packet loss and the only way to fix them is to power cycle them.
As the manager of the data center we use a Cisco 7507 router to take internet bandwidth from multiple carriers, split our external IP addresses into different subnets, put those subnets into different VLANs (600-649), and deliver the VLANs to the customers of the data center. As a customer of the data center we give our external VLANs (600-624) to our routers and firewalls and add internal VLANs (10-19) for our internal subnets.
Switch - A is the root of our spanning tree with a bridge priority of 16384.
Switch - B has a bridge priority of 20480.
All other switches have a bridge priority of 32768.
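For reference, here is a minimal sketch of how these priorities should decide the root bridge under standard STP/RSTP rules: the bridge with the lowest bridge ID (priority first, then MAC address as the tie-breaker) wins. The MAC addresses and the `Bridge` and `elect_root` names are hypothetical, made up for illustration; only the priorities come from the setup described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int  # configured bridge priority
    mac: str       # hypothetical MAC address, for illustration only

    def bridge_id(self):
        # STP compares the priority first, then the MAC address as a number.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the lowest bridge ID (the STP root)."""
    return min(bridges, key=lambda b: b.bridge_id())


bridges = [
    Bridge("Switch - A", 16384, "00:00:5e:00:53:0a"),
    Bridge("Switch - B", 20480, "00:00:5e:00:53:0b"),
    Bridge("Switch - 104", 32768, "00:00:5e:00:53:68"),
    Bridge("Switch - 106", 32768, "00:00:5e:00:53:6a"),
]

root = elect_root(bridges)
print(root.name)

# If Switch - A stops sending BPDUs, the next-lowest bridge ID takes over.
backup = elect_root([b for b in bridges if b.name != "Switch - A"])
print(backup.name)
```

With these priorities Switch - A is elected root and Switch - B is the backup. Any newly attached bridge that advertises a priority lower than 16384 would take over the root role, which is one thing worth ruling out when an unexpected device is plugged into a trunk.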
As a customer we turn off Spanning Tree on some ports because we use Cisco Local Directors for load balancing.
The default VLAN Id on all the switches is 1.
When we are having a packet loss problem:
1. As a customer we have intermittent ping loss (like get 2, lose 5) when pinging from a computer on Customer 1's switches (like Switch - 104) to the 7507 router.
2. Switch - A's management interface is very slow or unusable.
3. Pings to Switch - A from a computer connected to Switch - B show an increasing response time until they go back to the normal 1 ms. (For example we will see response times of 1 ms, 1 ms, 1 ms, 5 ms, 10 ms, 14 ms, 20 ms, and then back to 1 ms. The response times will loop like this until we power cycle Switch - A.)
4. Switch - A is set up to send informational log data to a syslog server, but nothing relevant is logged.
Packet Loss Scenario 1:
We configured a Trunk port on Switch - B with some VLANs on it. We then configured a trunk port on a Cisco Catalyst 2950 and connected it to Switch - B. The CPU usage on the 2950 went to 100% with the spanning-tree process taking 80%. Unplugging the 2950 from Switch - B did not fix the problem. The 2950 supports STP and PVST while the SG200 supports STP and RSTP.
Workaround: change the Trunk port on Switch - B to a General port and only allow tagged frames. I think the trunk ports on the SG200 switches require allowing untagged packets. Why does the change from a Trunk port to a General port fix this problem?
Packet Loss Scenario 2:
I accidentally plugged 2 new computers into ports on Switch - 106 that were configured as Trunk ports allowing untagged traffic on VLAN 1. We started losing packets on Customer 1's switches and then the problem spread to Switch - A. Unplugging the computers from Switch - 106 did not fix the problem. Because the problem spread from the customer's switches to the data center's switches I am forbidden from using the SG200 switches as a customer until this issue is resolved. We had to replace our SG200 switches with our legacy Catalyst 3500 and 2900 switches.
Why are we having these packet loss problems?
Why does the packet loss problem spread from the customer switches to the data center's switches?
Why does unplugging the equipment that caused the problem not fix the problem?
Why is a power cycle necessary to fix the problem?
Does the default VLAN Id need to be different for each customer?
I'm not sure where to go next because I haven't been able to reproduce either scenario in a test environment. I think I will turn off as much extra stuff as possible (discovery protocols, smart ports, replace LAG trunk with single cable trunks) and turn the logging up to debug. But none of that fixes the problems we are experiencing; it just eliminates potential causes. There is also a new firmware update available, but I would like to be able to reproduce our problem before upgrading the firmware.
Since the issue persists throughout the network even after you disconnect the units you suspect are the point of failure, this points to one of two things:
Possible TCAM / MAC overflow
Possible spanning tree / storm control issue
You had indicated the switches' CPU and memory are getting maxed out. Of course there can be potentially thousands of causes for this, including spanning tree, storm control, MAC/TCAM, heavy data loads, etc. The SG200 switches are "light managed" switches, and the former 2900 and 3500 series are considerably more robust than the SG200 product.
Regarding your observation about General mode vs Trunk mode, there isn't really a huge technical difference. One could argue a General port may be more of a true 802.1q port. My feeling is that changing the port to General mode helps because, with the smart port negotiating, the Trunk port requires one untagged VLAN (the native VLAN) while the General port does not.
Additionally, for spanning tree, you should also verify the Edge port configuration. If PortFast is being applied to any port that links to another device such as a switch, router, etc., it must be disabled there; otherwise a BPDU will be received on that port and cause chaos on your network.
Trunk mode VLAN: by default sets egress to tagged, supports multiple VLANs, does not set PVID (native VLAN, ingress untagged), native VLAN cannot be a configured Trunk VLAN or 4095 (discard VLAN).
General mode VLAN: by default sets egress to tagged, supports multiple VLANs, does not set PVID (native VLAN, ingress untagged), native VLAN can be any defined VLAN. Setting the PVID removes the default VLAN (VID=1) for that port. PVID can be 4095 (discard VLAN). General mode allows a mix of tagged and untagged VLANs in the egress direction.
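To make the Trunk vs General difference concrete, here is a rough sketch of generic 802.1Q ingress classification. It is a simplified model, not the SG200's actual implementation: the `Port` class, the `classify_ingress` function, the `tagged_only` flag and the example VLAN memberships are all assumptions for illustration, and it assumes ingress filtering is enabled.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Port:
    pvid: Optional[int]        # VLAN given to untagged ingress frames
    member_vlans: Set[int]     # VLANs this port belongs to
    tagged_only: bool = False  # "admit tagged frames only" (assumed option)


def classify_ingress(port: Port, frame_vid: Optional[int]) -> Optional[int]:
    """Return the VLAN a frame lands in, or None if the port drops it.

    frame_vid is the 802.1Q tag on the frame, or None for an untagged frame.
    """
    if frame_vid is None:
        # Untagged frame: dropped if the port admits tagged frames only,
        # otherwise it is assigned to the port's PVID (native VLAN).
        if port.tagged_only or port.pvid is None:
            return None
        vid = port.pvid
    else:
        vid = frame_vid
    # Ingress filtering: drop frames for VLANs the port is not a member of.
    return vid if vid in port.member_vlans else None


# Trunk-style port: untagged traffic is accepted into native VLAN 1.
trunk = Port(pvid=1, member_vlans={1, 600, 601})
# General-style port restricted to tagged frames, as in the workaround.
general = Port(pvid=None, member_vlans={600, 601}, tagged_only=True)

print(classify_ingress(trunk, None))    # untagged frame on the trunk
print(classify_ingress(general, None))  # untagged frame on the general port
print(classify_ingress(general, 600))   # tagged frame for VLAN 600
```

In this model the trunk-style port puts untagged frames into VLAN 1, while the tagged-only General port drops them and still passes tagged VLAN 600 traffic. That would be consistent with the workaround keeping stray untagged traffic out of the default VLAN, though it does not by itself explain the spanning-tree CPU load on the 2950.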
Since the resolution is power cycling the switch, this would indicate the switches may be under too much load, either from a networking error (spanning tree) or simply from too much traffic. Also, the 18.104.22.168 firmware has been pulled by the business unit. The 22.214.171.124 is the current supported software.
Please mark answered for helpful posts