Cisco 6509 Supervisor failover - How long?

Unanswered Question
Oct 10th, 2007

Hi All,


Just a quick question...


How long would you expect a failover from the active Supervisor card to the standby Supervisor card to take?



We have a Cisco 6509 with the following:

5 2 Supervisor Engine 720 (Active) WS-SUP720-3B

6 2 Supervisor Engine 720 (Hot) WS-SUP720-3B


When the Active card was 'pulled' by an engineer, the switch rebooted and took a good 5 minutes or so to come back online.


I would have expected a much quicker failover myself...


Anything we need to check or are missing?


Many thanks


Jonathan


paul.matthews Wed, 10/10/2007 - 23:49

What redundancy mode have you set? Five minutes sounds like the old failover, where all the line cards get rebooted. SSO should be in the region of a second.

ankbhasi Thu, 10/11/2007 - 00:53

Hi Jonathan,


I believe you are already running SSO mode because your standby sup says it is HOT which is only possible when you run SSO.


Now, when you say you get 5 minutes of downtime, it looks like something is really going wrong, because with SSO it should be a matter of seconds.


Can you paste the output of "sh redundancy", "sh module" and "sh version" from your box with both sups in the chassis?
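
Once that output is captured, the SSO-relevant fields of "sh redundancy" can be checked mechanically. The following is only a minimal Python sketch: the function name `redundancy_summary` and the shortened sample text are illustrative assumptions, and it relies on the output keeping its usual "Key = value" layout.

```python
import re

def redundancy_summary(text):
    """Summarise the SSO-relevant fields of 'show redundancy' output.

    Hypothetical helper: assumes each field is printed as 'Key = value'.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^=]+?)\s*=\s*(.*?)\s*$", line)
        if match:
            key, value = match.groups()
            # The same key can appear once per supervisor, so keep every value.
            values.setdefault(key, []).append(value)

    configured = values.get("Configured Redundancy Mode", [None])[0]
    operating = values.get("Operating Redundancy Mode", [None])[0]
    states = values.get("Current Software state", [])
    return {
        "configured": configured,
        "operating": operating,
        "states": states,
        # SSO is only useful if both modes say sso and the peer is STANDBY HOT.
        "sso_ready": configured == "sso"
        and operating == "sso"
        and "STANDBY HOT" in states,
    }

# Shortened sample in the same layout as real 'show redundancy' output.
sample = """\
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Current Software state = ACTIVE
Current Software state = STANDBY HOT
"""
summary = redundancy_summary(sample)
print(summary["sso_ready"])
```

A configured mode that differs from the operating mode, or a peer that is not STANDBY HOT, is what would point to a switchover that is not stateful.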


Regards,


Ankur

paul.matthews Thu, 10/11/2007 - 01:07

Good points - it may also be worth describing how you measure when the failover is complete, as there are a number of points at which that may be deemed to have happened.


Are you saying complete is when routing protocols have fully converged?

jonathanaxford Mon, 10/15/2007 - 05:26

Hi Paul,


I suppose complete would be when everything has converged. For us here, that shouldn't take too long, as it's a single OSPF area with 4 switches in it, all directly connected to the 6509 in question.


What I expected to see was a slight 'blip' in the network while the standby SUP took over, but what we actually had was 5 or so minutes where everything lost connection...


Our Cisco reseller/support company are looking into this for us too, but I always like to get a heads-up if I can!


Cheers

jonathanaxford Mon, 10/15/2007 - 02:44

Hi Again,


The show redundancy output is:

Redundant System Information :
------------------------------
Available system uptime = 26 weeks, 2 days, 18 hours, 23 minutes
Switchovers system experienced = 2
Standby failures = 0
Last switchover reason = active unit removed

Hardware Mode = Duplex
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Maintenance Mode = Disabled
Communications = Up

Current Processor Information :
-------------------------------
Active Location = slot 5
Current Software state = ACTIVE
Uptime in current state = 4 days, 3 hours, 46 minutes
Image Version = Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(18)SXF8, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2007 by cisco Systems, Inc.
Compiled Sat 03-Mar-07 00:07 by tinhuang
BOOT = sup-bootdisk:s72033-advipservicesk9_wan-mz.122-18.SXF8.bin,1;sup-bootdisk:s72033-entservicesk9_wan-mz.122-18.SXF7.bin,1;
CONFIG_FILE =
BOOTLDR =
Configuration register = 0x2102

Peer Processor Information :
----------------------------
Standby Location = slot 6
Current Software state = STANDBY HOT
Uptime in current state = 4 days, 3 hours, 40 minutes
Image Version = Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(18)SXF8, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2007 by cisco Systems, Inc.
Compiled Sat 03-Mar-07 00:07 by tinhuang
BOOT = sup-bootdisk:s72033-advipservicesk9_wan-mz.122-18.SXF8.bin,1;sup-bootdisk:s72033-entservicesk9_wan-mz.122-18.SXF7.bin,1;
CONFIG_FILE =
BOOTLDR =
Configuration register = 0x2102


Will post the Show mod and Show ver in separate messages (ran out of space...)


Cheers!

jonathanaxford Mon, 10/15/2007 - 02:45

Show Module:


Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1   48  CEF720 48 port 1000mb SFP              WS-X6748-SFP       SAL1105FZ3J
  2   48  CEF720 48 port 1000mb SFP              WS-X6748-SFP       SAL1105FZ59
  3   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX     SAL1109J81F
  4   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX     SAL1104F66S
  5    2  Supervisor Engine 720 (Active)         WS-SUP720-3B       SAL1105FDV3
  6    2  Supervisor Engine 720 (Hot)            WS-SUP720-3B       SAL1104F35D

Mod MAC addresses                      Hw     Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
  1 001a.6dbb.9710 to 001a.6dbb.973f   1.8    12.2(14r)S5  12.2(18)SXF8 Ok
  2 001a.6dbb.8d38 to 001a.6dbb.8d67   1.8    12.2(14r)S5  12.2(18)SXF8 Ok
  3 001b.2ab4.7180 to 001b.2ab4.71af   2.5    12.2(14r)S5  12.2(18)SXF8 Ok
  4 001a.e2d4.8e34 to 001a.e2d4.8e63   2.5    12.2(14r)S5  12.2(18)SXF8 Ok
  5 0016.9df6.d630 to 0016.9df6.d633   5.3    8.4(2)       12.2(18)SXF8 Ok
  6 0016.4708.1100 to 0016.4708.1103   5.3    8.4(2)       12.2(18)SXF8 Ok

Mod  Sub-Module                  Model              Serial      Hw      Status
---- --------------------------- ------------------ ----------- ------- -------
  1  Centralized Forwarding Card WS-F6700-CFC       SAL1052CFAP 2.1     Ok
  2  Centralized Forwarding Card WS-F6700-CFC       SAL1052CHFD 2.1     Ok
  3  Centralized Forwarding Card WS-F6700-CFC       SAL1106G6J6 2.1     Ok
  4  Centralized Forwarding Card WS-F6700-CFC       SAL1107H3LA 2.1     Ok
  5  Policy Feature Card 3       WS-F6K-PFC3B       SAL1104FBAS 2.3     Ok
  5  MSFC3 Daughterboard         WS-SUP720          SAL1104FB1E 2.6     Ok
  6  Policy Feature Card 3       WS-F6K-PFC3B       SAL1104FBBQ 2.3     Ok
  6  MSFC3 Daughterboard         WS-SUP720          SAL1105FDN7 2.6     Ok

Mod  Online Diag Status
---- -------------------
  1  Pass
  2  Pass
  3  Pass
  4  Pass
  5  Pass
  6  Pass

jonathanaxford Mon, 10/15/2007 - 02:46

Show version:


Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(18)SXF8, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2007 by cisco Systems, Inc.
Compiled Sat 03-Mar-07 00:07 by tinhuang
Image text-base: 0x40101040, data-base: 0x42D98000

ROM: System Bootstrap, Version 12.2(17r)S4, RELEASE SOFTWARE (fc1)
BOOTLDR: s72033_rp Software (s72033_rp-ADVIPSERVICESK9_WAN-M), Version 12.2(18)SXF8, RELEASE SOFTWARE (fc2)

NPLYG28-C-1 uptime is 4 days, 4 hours, 17 minutes
Time since NPLYG28-C-1 switched to active is 4 days, 3 hours, 55 minutes
System returned to ROM by unknown reload cause - suspect boot_data[BOOT_COUNT] 0x0, BOOT_COUNT 0, BOOTDATA 19 (SP by power on)
System restarted at 06:25:43 GMT Thu Oct 11 2007
System image file is "sup-bootdisk:s72033-advipservicesk9_wan-mz.122-18.SXF8.bin"

This product contains cryptographic features and is subject to United
States and local country laws governing import, export, transfer and
use. Delivery of Cisco cryptographic products does not imply
third-party authority to import, export, distribute or use encryption.
Importers, exporters, distributors and users are responsible for
compliance with U.S. and local country laws. By using this product you
agree to comply with applicable laws and regulations. If you are unable
to comply with U.S. and local laws, return this product immediately.

A summary of U.S. laws governing Cisco cryptographic products may be found at:
http://www.cisco.com/wwl/export/crypto/tool/stqrg.html

If you require further assistance please contact us by sending email to
[email protected].

cisco WS-C6509-E (R7000) processor (revision 1.3) with 458720K/65536K bytes of memory.
Processor board ID SMC11030009
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
SuperLAT software (copyright 1990 by Meridian Technology Corp).
X.25 software, Version 3.0.0.
Bridging software.
TN3270 Emulation software.
17 Virtual Ethernet/IEEE 802.3 interfaces
196 Gigabit Ethernet/IEEE 802.3 interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.

65536K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102



Many thanks for looking into this for me...



paul.matthews Mon, 10/15/2007 - 23:53

You are running SSO - that should be nice and quick. There are two main areas where there could be an issue. There is the failover itself, and then there is the routing convergence afterwards.


So we need to figure out what your issue is (and how many issues!). Is this in a lab environment where you can test it, or is it live?


The first test I would try is to get three PCs attached to the switch: PC A and PC B in the same VLAN, PC C in a different VLAN. No HSRP or anything, and this switch being the default gateway for them all. Set a ping going indefinitely to PC A from both B and C and trigger a failover. See how many pings get dropped. That tells us if the basic SSO is working properly. I would expect no more than two responses to be lost on either PC.
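
To turn the ping results from that test into numbers, something like the following could be used. It is only a rough Python sketch: the function name `outage_report` and the (timestamp, replied) sample format are assumptions for illustration, so the ping output from the PCs would first need converting into that shape.

```python
def outage_report(samples):
    """Return (lost_count, longest_gap_seconds) for a list of ping samples.

    Each sample is a (timestamp_seconds, replied) pair, in time order.
    The gap is measured from the last reply before a run of losses to the
    first reply after it, so it is an upper bound on the real outage.
    A loss run that never recovers within the capture is not counted as a gap.
    """
    lost = sum(1 for _, replied in samples if not replied)
    longest_gap = 0.0
    last_reply = None
    in_outage = False
    for timestamp, replied in samples:
        if replied:
            if in_outage and last_reply is not None:
                longest_gap = max(longest_gap, timestamp - last_reply)
            last_reply = timestamp
            in_outage = False
        else:
            in_outage = True
    return lost, longest_gap

# Illustrative data: one ping per second, replies missing at t=4 and t=5.
samples = [(float(t), t not in (4, 5)) for t in range(10)]
lost, gap = outage_report(samples)
print(lost, gap)
```

Running it separately on the captures from PC B (same VLAN as PC A) and PC C (routed) gives one figure for Layer 2 forwarding and one for Layer 3, which helps show where the delay sits.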


If that is OK we know the SSO is working OK and we need to look at routing convergence - have you configured for either Cisco NSF or for Graceful restart at all?

http://www.cisco.com/en/US/products/sw/iosswrel/ps1829/products_feature_guide09186a00805e8fbd.html

jonathanaxford Tue, 10/16/2007 - 01:37

Hi Paul,


Thanks for the response. Unfortunately, this is in the production network. I am going to schedule in another failover test as it will be the only way to get to the bottom of it.


I want to actually be there myself so I can see what happens. I like your suggestion about the PCs; I will easily be able to set that up for the test.


I have not set up NSF or Graceful Restart. I have begun reading about it, thanks for the link.


One thing I wondered, though: if the majority of the VLANs are directly connected to the 6509, would there still be a big delay in the Layer 3 convergence?


thanks for all your help...

lamav Tue, 10/16/2007 - 04:54

Jonathan:


Have you considered STP convergence? Perhaps STP is adding a considerable amount of convergence time.


Are you running Rapid PVST+?


If not, do you have UplinkFast and BackboneFast enabled?


Check some of that stuff out, too.


HTH


Victor

jonathanaxford Wed, 10/17/2007 - 00:32

Hi Victor, Thanks for the response.


I was unaware that a stateful failover of the SUP cards would trigger STP reconvergence. We are running rapid-PVST. It is a good point though as we have had issues with STP convergence in the past....

paul.matthews Wed, 10/17/2007 - 06:52

Layer two will carry on; layer three will effectively be a complete reinitialisation. Local routes in the cache will continue to be forwarded, but bear in mind the OSPF neighbours will have lost the adjacency with the 6500 and will thus drop all the routes via it. 5 mins does seem an awfully long time for OSPF to converge though.

lamav Wed, 10/17/2007 - 09:49

Paul:


If his sup supports NSF and SSO then he shouldn't lose the neighbor relationships, right? Isn't that the main selling point for NSF, that it keeps the interfaces up and the neighbor relationships established during the stateful switchover to the redundant sup module?


Jonathan:


If your sup isn't supporting NSF and SSO then your STP probably did reconverge. Although NSF is a high availability technology for L3 forwarding, it does depend on the links and interfaces remaining in the "up,up" state, so L2 connectivity does play a role in NSF and SSO.

paul.matthews Wed, 10/17/2007 - 11:06

NSF is a bit of a funny one: to even have a chance, it has to be explicitly configured. The FIB is shared between the active and standby, but neighbour relationships are not. My understanding of the NSF-aware features of routing protocols is that it is not only the router itself that needs to be configured, but also the neighbours, and the main (oversimplified) effect is that they become more tolerant of a non-responsive neighbour and keep routes in the table rather than immediately dropping them.


Cached routes are in the FIB so forwarding will continue.


P.

jonathanaxford Wed, 10/17/2007 - 23:37

Hi,


By the looks of it, the 6509 is not configured for NSF. I will get this implemented before we run the test again to see if it has an effect.


Many thanks

paul.matthews Thu, 10/18/2007 - 00:11

Don't forget to make sure you look at the neighboring routers...


I would be tempted to plan the NSF config, but do a test first to try to see where delays are.


SSO should be sub-second, but then there are potential STP issues, and then OSPF itself.


I always like to know *why* something did not work, instead of just seeing it go away IYSWIM...


P.

jonathanaxford Tue, 10/23/2007 - 22:42

Definitely, I will get to the bottom of this. I also like to ensure that I understand everything so it can't bite me again next time!


I will schedule in the NSF config, and also plan for the next downtime and failover of the cards so I can take some exact timings and see what is going on.


Thanks for all your help on this one.


Cheers


Jonathan
