6509/SupII/MSFCII and 4006/SupII Performance problems

Unanswered Question
Jul 9th, 2007

I'm running into a performance issue, and I'm at a loss as to how to troubleshoot it any further.


I have two switches (a 6509 with Sup II/MSFC II and a 4006 with Sup II) that I use for a backup environment. I have a client on the 4006 that NFS-mounts from an NFS server on the 6509. The backup server is also on the 6509. The backup flow is NFS server (6509) to client (4006) and back to the backup server (6509). Crazy, I know, but that's for another day. The switches are connected via a 2xGEC as follows:


6509:WS-X6516-GBIC:4/9 to 4006:SUPII:1/1

6509:WS-X6516-GBIC:4/13 to 4006:SUPII:1/2


The servers in question are connected as follows:


4006:

CLIENT:WS-X4448-GB-RJ45:2/19


6509:

NFS Server: WS-X6316-GE-TX:9/5

BACKUP SVR: WS-X6316-GE-TX:9/9


It used to be that I had the backup server in the same group of 8 ports the client was connected to (i.e., on WS-X4448-GB-RJ45:2/17). Performance between those two servers was good, but it was horrible between the backup server and many other servers that mostly live on the 6509 or on other switches that transit the 6509. So, based on flow analysis of the traffic, we moved the backup server to the 6509 as described above, and that resolved the backup problems for all other systems except for the one client mentioned above.


Now, I'm aware of the shared 1 Gbps per 8-port grouping and how oversubscription limits manifest themselves in the counters. But this time I'm not seeing any queue/error/pause counters increment on any of the ports involved (NFS server, backup server, client, and the 2xGEC on each end). I've also ensured that only one client per 8-port group is being backed up at any given time. If I move the client to the 6509 (something I can't do permanently), then performance is good again. If I move the backup server back to 4006:2/17, then performance to the client is good again. The 2xGEC is about 6%-9% utilized on one of the links while the backup is occurring (the other is idle, as expected).
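As a rough illustration of why counters and average utilization can look clean while an oversubscribed port group still hurts throughput, here's a back-of-the-envelope sketch. The burst length, offered rate, and buffer size below are made-up assumptions for illustration, not values from either switch:

```python
# Back-of-the-envelope look at why a shared-1-Gbps port group can drop
# microbursts even when average utilization looks low. All numbers here
# are illustrative assumptions, not measurements from this network.

LINE_RATE_BPS = 1_000_000_000      # shared 1 Gbps for the 8-port group
PORTS_PER_GROUP = 8                # WS-X4448-GB-RJ45 grouping

# Worst-case oversubscription if all 8 ports burst at line rate:
oversub_ratio = PORTS_PER_GROUP * 1_000_000_000 / LINE_RATE_BPS
print(f"oversubscription ratio: {oversub_ratio:.0f}:1")

# Suppose a 5 ms burst arrives at 2 Gbps offered into the 1 Gbps group,
# with a hypothetical 512 KB of shared buffer available:
burst_seconds = 0.005
offered_bps = 2_000_000_000
excess_bits = (offered_bps - LINE_RATE_BPS) * burst_seconds
buffer_bits = 512 * 1024 * 8
dropped_bits = max(0, excess_bits - buffer_bits)
print(f"excess during burst: {excess_bits / 8 / 1024:.0f} KB, "
      f"dropped: {dropped_bits / 8 / 1024:.0f} KB")
```

The point is that a few milliseconds of burst can overflow a small shared buffer and stall TCP, yet barely move a 5-minute utilization average.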


Any help or pointers on how to diagnose congestion internal to the switch using something other than "show counters" and "show ports", or other techniques for finding internal 4006/Sup II architecture bottlenecks, would be greatly appreciated.


Casey

dgahm Mon, 07/09/2007 - 12:42

Casey,

You've probably already checked for duplex mismatches, but that is the most common cause of slow file transfers. A sweep ping from the gateway router to the servers will reveal that problem quickly, or check for CRC/FCS errors on the ports.
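A sweep ping works here because a duplex mismatch typically shows loss that grows with packet size and rate. Below is a minimal sketch of checking a sweep's results; the ping summaries are canned, illustrative strings, not output from this network:

```python
import re

def parse_loss(ping_output: str) -> float:
    """Pull the packet-loss percentage out of a ping summary line."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    if m is None:
        raise ValueError("no loss summary found")
    return float(m.group(1))

# Canned summaries for a sweep of growing payload sizes (illustrative).
# On a healthy full-duplex link, loss stays ~0% at every size; with a
# duplex mismatch, loss tends to climb as packets get larger.
sweep = {
    100:  "5 packets transmitted, 5 received, 0% packet loss",
    1000: "5 packets transmitted, 4 received, 20% packet loss",
    1400: "5 packets transmitted, 2 received, 60% packet loss",
}
for size, summary in sweep.items():
    loss = parse_loss(summary)
    flag = "  <-- suspicious" if loss > 0 else ""
    print(f"{size:5d}-byte payload: {loss:.0f}% loss{flag}")
```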


What code version are you running on the 4006 side? I have seen some serious issues with CatOS 7.3(2) when trunking over the Sup ports, though it was primarily a problem when also using SPAN sessions. If you have a couple of unused gig ports on a line card, I would recommend moving your 4006 EtherChannel there.


Another common performance killer with your topology is dual NIC servers that are homed to both switches and configured for load balancing instead of fault tolerance.


Do you have a Sniffer/Ethereal/WireShark analyzer you can use to capture the slow transfer? This would tell you if packets are being dropped.
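An analyzer flags dropped packets largely by spotting resent TCP sequence ranges. Here's a minimal sketch of that duplicate-sequence heuristic; the flow data is a made-up toy example, not a real capture:

```python
def find_retransmissions(segments):
    """Flag TCP segments that resend bytes already seen.

    `segments` is a list of (seq, payload_len) tuples for one direction
    of a flow. A segment whose entire byte range falls at or below the
    highest byte seen so far is counted as a retransmission. This is a
    simplified version of what an analyzer's "TCP Retransmission" mark
    does; toy data below is illustrative only.
    """
    highest = 0                      # highest byte number seen so far
    retrans = []
    for seq, length in segments:
        if length > 0 and seq + length <= highest:
            retrans.append(seq)      # entirely old data: a retransmit
        highest = max(highest, seq + length)
    return retrans

# Toy capture: 1460-byte segments, one resent after a presumed drop.
flow = [(0, 1460), (1460, 1460), (2920, 1460), (1460, 1460), (4380, 1460)]
print(find_retransmissions(flow))   # -> [1460]
```

In a real capture you would see the same pattern as duplicate sequence numbers or, from the receiver's side, duplicate ACKs for the missing segment.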


Please rate helpful posts.


Dave

cajalat Mon, 07/09/2007 - 15:46

Hi Dave,


I did check the usual list of suspects, specifically the ones you mentioned (duplex, CRC/FCS, etc.). As a matter of fact, after clearing counters on the ports of the servers in question, the 2xGEC on each end, and the client port, I saw no errors whatsoever. The only counters incrementing are the ones you'd expect to increment due to normal *cast traffic.


The code I'm running is 8.3.4 on the 6509 and 8.4(9)GLX on the 4006.


No dual NICs in the same VLAN. I have dual NICs on each of the servers, but in my topology the 2nd NIC is dedicated to backup: it sits on its own VLAN along with the NFS and backup servers (all are part of the same subnet/VLAN). The other NIC lives on an entirely different network/VLAN, where default routing points.


I haven't done the packet capture yet (I'll have to physically set up for that, which will take me some time), but that's my next step. I definitely don't have RSPAN; what I can do now to sniff traffic would likely muddy the water for this particular problem (i.e., I'd be sniffing on potentially the same port/GEC that could be part of the problem). So for me to sniff effectively, I will need to add a sniffer to a port that is part of the same grouping of 8 ports.


Are there other commands that can show me if there are any congestion points on the switch due to internal fabric limitations?


Casey

dgahm Mon, 07/09/2007 - 15:52

Casey,

"show system" will tell you what the peak backplane utilization has been.


Dave

cajalat Mon, 07/09/2007 - 16:26

Looking at the result I don't see any obvious problem:


from the 4006:

Modem    Baud   Traffic  Peak  Peak-Time
-------  -----  -------  ----  -------------------------
disable  9600   0%       20%   Sun Dec 10 2006, 05:03:23


from the 6509:

Modem    Baud   Backplane-Traffic  Peak  Peak-Time
-------  -----  -----------------  ----  -------------------------
disable  9600   25%                49%   Wed Jul 13 2005, 11:53:24

Fab Chan  Input  Output
--------  -----  ------
0         24%    19%
1         0%     0%
2         0%     0%
3         5%     15%
4         0%     0%
5         0%     0%
6         14%    10%
7         0%     0%
8         0%     0%
9         0%     0%
10        0%     0%
11        0%     0%
12        0%     0%
13        0%     0%
14        0%     0%
15        0%     0%
16        0%     0%
17        0%     0%


EDIT: Sorry, I forgot to mention that "show top" on the 4006 shows 6% utilization as the highest of any Gig port and 27% on the highest FE port. So there isn't much traffic, but the transfer is still slow when the server and client are on different switches.
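For what it's worth, a quick way to eyeball the per-fabric-channel figures from "show system" for hot spots is to parse them and flag anything near saturation. The 80% threshold below is just a rule-of-thumb assumption, not a documented limit:

```python
# Sanity-check the fabric-channel utilization pasted above: flag any
# channel whose input or output is near saturation. The 80% threshold
# is an assumed rule of thumb, not a Cisco-documented limit.
raw = """
0 24% 19%
3 5% 15%
6 14% 10%
"""  # only the non-zero channels from the output above

for line in raw.split("\n"):
    if not line.strip():
        continue
    chan, inp, out = line.split()
    inp, out = int(inp.rstrip("%")), int(out.rstrip("%"))
    status = "HOT" if max(inp, out) >= 80 else "ok"
    print(f"fab chan {chan}: in {inp}% out {out}% -> {status}")
```

By that yardstick, nothing in the output above looks stressed, which is consistent with the problem being local to a line-card port group rather than the fabric.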
