I am facing a weird issue with one of the line cards in a Cisco 6513 switch. All the servers, desktops and trunks connected to this line card are showing high response times (>500ms) locally, but the IP phones connected to the same line card are working fine and their response time is always 1ms. I am unable to find a root cause as I am not seeing any errors or logs. Please help me find the root cause.
Anyway, I have a plan to reboot the line card and test it, but I am looking for a fix that would help us avoid the reboot.
The affected module is WS-X6548-GE-TX.
The following commands show no errors:
#sh fabric status
#sh fabric error
#sh module all
#sh module <slot>
#sh interface counter errors
Cisco Internetwork Operating System Software IOS (tm) s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(18)SXF11, RELEASE SOFTWARE (fc1)
Compiled Fri 14-Sep-07 21:50 by kellythw
Image text-base: 0x40101040, data-base: 0x42DB89D0
ROM: System Bootstrap, Version 12.2(17r)SX5, RELEASE SOFTWARE (fc1)
BOOTLDR: s72033_rp Software (s72033_rp-IPSERVICESK9_WAN-M), Version 12.2(18)SXF11, RELEASE SOFTWARE (fc1)
CISCOCORE-6513 uptime is 4 weeks, 4 days, 18 hours, 35 minutes
Time since CISCOCORE-6513 switched to active is 4 weeks, 4 days, 18 hours, 34 minutes
System returned to ROM by reload at 04:33:29 UTC Fri Feb 22 2013 (SP by reload)
System image file is "sup-bootdisk:s72033-ipservicesk9_wan-mz.122-18.SXF11.bin"
cisco WS-C6513 (R7000) processor (revision 1.0) with 458720K/65536K bytes of memory.
Processor board ID SAL09507GV9
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
SuperLAT software (copyright 1990 by Meridian Technology Corp).
X.25 software, Version 3.0.0.
TN3270 Emulation software.
15 Virtual Ethernet/IEEE 802.3 interfaces
172 Gigabit Ethernet/IEEE 802.3 interfaces
4 Ten Gigabit Ethernet/IEEE 802.3 interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.
65536K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102
What supervisor are you running ?
Be aware that the 6548 module is quite heavily oversubscribed and you could be running into this, i.e. servers will definitely be sending a lot more data than an IP phone.
The 6548 has an 8Gbps connection to the switch fabric (note this is assuming a sup 720). So you have 48 1Gbps ports going in (in = from end devices) and only 8Gbps going out (out = to the switch fabric). Each group of 6 ports shares a 1Gbps connection to the switch fabric. So let's say on one of those groupings you had 6 servers with 1Gbps connections. If only three of these servers were sending out 400Mbps of data each you would have contention, because 3 x 400Mbps = 1.2Gbps and you only have 1Gbps to the switch fabric.
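To make the arithmetic above concrete, here is a minimal Python sketch. The 1Gbps per-group figure is simply the one quoted in this post, and the function name is made up for illustration, so check the numbers against Cisco's documentation for your module before relying on them.

```python
# Figure as described in the post above; verify it for your module.
GROUP_UPLINK_MBPS = 1000  # each port group shares 1Gbps to the fabric


def group_oversubscription(offered_mbps, uplink_mbps=GROUP_UPLINK_MBPS):
    """Return (total offered load in Mbps, load divided by the group uplink)."""
    total = sum(offered_mbps)
    return total, total / uplink_mbps


# Three of the six servers in one group each send 400Mbps, the rest are idle.
total, ratio = group_oversubscription([400, 400, 400, 0, 0, 0])
print(f"offered {total}Mbps, {ratio:.1f}x the group uplink")
```

Any ratio above 1.0 means the group is offering more than its share of the fabric connection can carry, so frames will be queued or dropped.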
If this is the problem there are a couple of solutions -
1) you can look into how the devices are spread across the port groupings and maybe you could move things around to make the traffic more even among port groupings. However you mention servers and trunks, so you could just end up with the same problem but with the phones not working either.
2) you can look to migrate some connections off that module onto another, but this supposes you have spare capacity elsewhere. If those trunk links are connected to other switches then you would definitely want to look into moving these, as the 6548 was never intended to be used as an uplink module between switches.
3) you could upgrade the module to something like a 6748 (or whatever the latest is as they keep introducing new ones) which has a much better connection to the switch fabric. Again this assumes a sup720 as the 6748 wouldn't work with a sup32.
However don't just rush into 3) without first doing some investigation work, e.g. what do the servers do, how much traffic are they churning out etc. Have you looked at the interface stats for each connection to see if you are getting drops etc ?
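If you do collect per-port traffic figures, something like the Python sketch below can show which port groupings are overloaded before you start moving cables. It assumes the groups are contiguous blocks of 6 ports sharing 1Gbps, as described above; that mapping, the function names and the sample rates are all assumptions for illustration, not values taken from your switch.

```python
from collections import defaultdict

PORTS_PER_GROUP = 6       # assumption taken from the post above; verify for your module
GROUP_UPLINK_MBPS = 1000  # assumed shared bandwidth per group


def port_group(port, ports_per_group=PORTS_PER_GROUP):
    """Map a 1-based port number to a 1-based group, assuming contiguous groups."""
    return (port - 1) // ports_per_group + 1


def overloaded_groups(rates_mbps, ports_per_group=PORTS_PER_GROUP,
                      uplink_mbps=GROUP_UPLINK_MBPS):
    """Sum per-port rates by group and return the groups that exceed the uplink."""
    totals = defaultdict(float)
    for port, rate in rates_mbps.items():
        totals[port_group(port, ports_per_group)] += rate
    return {g: t for g, t in sorted(totals.items()) if t > uplink_mbps}


# Hypothetical peak rates in Mbps, keyed by port number on the module.
sample = {1: 400, 2: 400, 3: 400, 7: 300, 8: 50, 13: 0.1}
print(overloaded_groups(sample))
```

With the sample data only the first group comes back, because ports 1 to 3 together offer 1200Mbps. Spreading those three servers across different groups would bring every group under its limit.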
I agree with Jon, it is most likely oversubscribed. This module was never intended as a server access module; it was meant mostly for user end station connectivity. The phones may not show an issue because, if they were set up right, they use QoS, which prioritizes phone traffic.
Jon, I appreciate your timely response and detailed explanation.
What I understand from your explanation is that the issue is around the traffic (oversubscription) on the line module. I suspect one of my trunk links, which comes from an ISP's point-to-point link. I know it used to hang my previous 3750 core switch completely, and now it is behaving differently since I am using a higher model. Every time, I get an explanation from the ISP that there was a flood in the ring loop.
Do you think this is the culprit? If yes, do you have any idea how we can prevent such broadcasts/floods from reaching our internal network?
It could be, but it could also be other devices connected to the actual module. What exactly is coming down that trunk link from the ISP ? Are you extending vlans to another site(s) ?
Bear in mind as well that if you haven't restricted which vlans can be on that trunk then a broadcast in any of your vlans on the 6513 will be sent down that link and the same in the opposite direction. But as I say, I'm not clear what that trunk link is used for.
You refer also to a ring loop, so what is the topology between you and your ISP ?
It could be an issue but if broadcasts/floods were really an issue then the phones should be affected as well, or perhaps not depending on how much data they are sending.
Difficult to say without further info.
You are right.
Do you think the #sh fabric utilization command helps to monitor the traffic on the line card?
Could you please help me with a few commands that could help in similar cases related to line cards?
Is there a way to monitor the line card for real-time values?
If I get a similar case next time, I want to be prepared to run a few diagnostics.
See this document for commands you can use to see what is happening with the switch fabric -
No problem. I also found this doc which might help as well -
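
On the question of real-time values: one rough approach is to take two readings of an interface's octet counter a known number of seconds apart and turn the difference into a rate. The Python sketch below only shows that arithmetic. How you collect the readings (CLI output, SNMP, etc.) is left open, and the function name and sample values are hypothetical.

```python
def rate_mbps(octets_before, octets_after, interval_s, counter_bits=64):
    """Average rate in Mbps between two octet counter samples.

    Allows for a single counter wrap between the samples; counter_bits is the
    width of the counter being read (an assumption you must set correctly).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = (octets_after - octets_before) % (1 << counter_bits)
    return delta * 8 / interval_s / 1_000_000


# Hypothetical readings taken 10 seconds apart on one port.
print(rate_mbps(1_000_000_000, 1_750_000_000, 10))
```

Here 750,000,000 octets in 10 seconds works out to 600Mbps. Repeating the calculation for every port in a grouping and adding the results gives a figure you can compare with the bandwidth that grouping shares.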