
Making IP route advertisements dependent on more than link status

STUART KENDRICK
Level 1

I'm trying to solve the following problem:

                  campus core
             /                 \
           /                     \
         /                         \
m-a-rtr ---- dc-a-esx ---- dc-b-esx ---- m-b-rtr
                subnet 10.1.5.0/24

-Where 'm-a-rtr' and 'm-b-rtr' are distribution layer boxes (C6K), i.e. all Layer 3 interfaces, with an HSRP relationship between them (10.1.5.1 is the virtual address)

-Where 'dc-a-esx' and 'dc-b-esx' are a redundant pair of data center switches (C6K), i.e. all Layer 2 interfaces, servicing hosts

m-a-rtr and m-b-rtr both advertise reachability for 10.1.5.0/24 to the campus core

m-a-rtr:

interface Vlan5
ip address 10.1.5.2 255.255.255.0
standby 5 ip 10.1.5.1
standby 5 timers 1 3
standby 5 priority 105
standby 5 preempt delay minimum 120

m-b-rtr:

interface Vlan5
ip address 10.1.5.3 255.255.255.0
standby 5 ip 10.1.5.1
standby 5 timers 1 3
standby 5 priority 100

m-x-rtr:

router eigrp 30
network 10.0.0.0
passive-interface default
no passive-interface VlanX

When dc-a-esx reboots, link goes down on its m-a-rtr interface, and m-a-rtr withdraws its advertisement for 10.1.5.0/24.  That's good.  When dc-a-esx returns to life, it brings up link on its m-a-rtr interface, m-a-rtr starts advertising reachability to 10.1.5.0/24.  But dc-a-esx isn't functional yet -- it is still booting.  Several minutes of black hole ensue.  [m-a-rtr's interface to dc-a-esx also goes HSRP Active, as it isn't hearing any Hellos from m-b-rtr.  But that doesn't much matter, since dc-a-esx isn't passing traffic and the hosts don't hear its gratuitous ARPs.]

==> How do I suppress advertisements for 10.1.5.0/24 from m-a-rtr until dc-a-esx is fully functional?

I have lived with this design for about a decade, feeding ~ten data centers equipped with redundant switches (Cat4507 and Cat6509), without getting burned.  I test Ethernet/IP HA monthly, by rebooting each box in turn, pinging everything in a data center during the process to verify connectivity -- works great.  I just deployed Cat6504s in one data center, and now I encounter ~two minutes of black hole.  In hindsight, I don't understand why this issue hasn't bitten me before.  Perhaps my existing C4K and C6K bring up link on their distribution layer interfaces at the very end of their boot sequences?  Seems hard to believe.

Poking around a bit today ... I've run into 'PBR Support for Multiple Tracking Options'. Seems like this would solve my problem.  But it also seems like a lot of work.  How do other folks solve this problem?
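For reference, here is a rough sketch of what that tracking approach might look like on m-a-rtr, assuming an IOS release with IP SLA, track lists, and EEM support. The object numbers, SLA ID, probe target (10.1.5.50, standing in for a host behind dc-a-esx), ACL and applet names, and the core-facing interface TenGigabitEthernet1/1 are all illustrative assumptions, not taken from the actual configs in this thread:

! Illustrative sketch only -- numbers, names, and addresses are assumptions.
! Syntax varies by release ('ip sla' vs 'ip sla monitor', 'track ... ip sla'
! vs 'track ... rtr'), so treat this as a starting point.
!
! Object 1: link state of the uplink to dc-a-esx
track 1 interface TenGigabitEthernet5/3 line-protocol
!
! Object 2: can we actually reach something behind dc-a-esx?
ip sla 10
 icmp-echo 10.1.5.50 source-interface Vlan5
 timeout 2000
 frequency 5
ip sla schedule 10 life forever start-time now
track 2 ip sla 10 reachability
!
! Object 3: treat dc-a-esx as usable only when both of the above are true
track 3 list boolean and
 object 1
 object 2
!
! Let HSRP fail over to m-b-rtr while the switch is not usable
! (for the decrement to cause a failover, m-b-rtr would also need
! 'standby 5 preempt')
interface Vlan5
 standby 5 track 3 decrement 10
!
! One possible way to suppress the 10.1.5.0/24 advertisement itself: an EEM
! applet that applies/removes an outbound distribute-list toward the core
! while track 3 is down (the core-facing interface name is hypothetical)
ip access-list standard DENY-VLAN5
 deny 10.1.5.0 0.0.0.255
 permit any
!
event manager applet VLAN5-SUPPRESS
 event track 3 state down
 action 1.0 cli command "enable"
 action 1.1 cli command "configure terminal"
 action 1.2 cli command "router eigrp 30"
 action 1.3 cli command "distribute-list DENY-VLAN5 out TenGigabitEthernet1/1"
!
event manager applet VLAN5-ADVERTISE
 event track 3 state up
 action 1.0 cli command "enable"
 action 1.1 cli command "configure terminal"
 action 1.2 cli command "router eigrp 30"
 action 1.3 cli command "no distribute-list DENY-VLAN5 out TenGigabitEthernet1/1"

The appeal of filtering (rather than shutting anything down) would be that the IP SLA probe keeps running, so track 3 comes back up on its own once dc-a-esx is genuinely forwarding, and only then does the advertisement return.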

--sk

Stuart Kendrick

FHCRC


10 Replies

Jon Marshall
Hall of Fame

stkendrick wrote:

I'm trying to solve the following problem:

                  campus core
             /                 \
           /                     \
         /                         \
m-a-rtr ---- dc-a-esx ---- dc-b-esx ---- m-b-rtr
                subnet 10.1.5.0/24


Stuart


Could you confirm the L2 path between m-a-rtr and m-b-rtr? From the looks of your schematic, for m-a-rtr to send a packet to m-b-rtr it has to go via dc-a-esx and dc-b-esx? Is this the case?

If so, this is not an optimal setup. Ideally you want the distribution switches to be directly connected, i.e. not connected via intermediate switches, and then have each server switch (dc-a-esx and dc-b-esx) dual-connected to both distribution switches. That way you remove the dependency of the communication between your distribution switches on the status of the server switches.

The problem with your setup is that a failure of an access-layer switch, i.e. one of the server switches, affects communication between your distribution switches. And that shouldn't happen.

But there could well be other factors that have led to this design that you have not mentioned so perhaps you could provide some more details.

Jon

Hi Jon,

I am having trouble maintaining the integrity of my ASCII diagram.  Here is another effort.

                  campus core
             /                 \
           /                     \
         /                         \
m-a-rtr ---- dc-a-esx ---- dc-b-esx ---- m-b-rtr
                subnet 10.1.5.0/24

To use words:  yes, for m-a-rtr to send a packet to m-b-rtr, it will typically cross the access layer.  In the case of subnet 10.1.5.0/24, that packet would start at m-a-rtr, traverse the L2 path comprised of dc-a-esx and dc-b-esx and land on m-b-rtr.  If that path were down, then m-a-rtr and m-b-rtr would reach each other across the L3 path which crosses the 'campus core'.

It seems to me that you are suggesting the following as a more desirable set-up:

              campus core
            /             \
          /                 \
    m-a-rtr ----------------- m-b-rtr
       |    \               /    |
       |     +-- dc-a-esx --+    |
       |                         |
       +------- dc-b-esx ----X---+

Where the direct line between m-a-rtr and m-b-rtr carries IGP traffic, and the 'X' marks a link which is disabled by STP. Am I understanding your vision accurately?

The uploaded 'high-level.pdf' offers another diagram of our existing approach, see the 'Example data center' in the lower right-hand corner.  The uploaded 'lan-man.pdf' offers a more detailed view; for an example data center, see the 'GA-116' section inside 'Bag 'o Stuff', lower-left hand edge.  The doubled blue lines indicate EtherChannelling.  "gbsr-hutch-a-esx" corresponds to 'dc-a-esx' in my ASCII diagram.

--sk

Stuart

So if I understand correctly, you basically have L2 uplinks from the server switches which are also used for HSRP traffic, and the distribution switches are interconnected via a L3 link?

One question though. The link between dc-a-esx and m-a-rtr - how is this configured, i.e. is it an EtherChannel or not, and more importantly, which module(s) have you used on dc-a-esx for the connection? Bear in mind that a 6500 boots up the modules in order, after the supervisor, so it may be as simple as switching which module the connection is on.

Edit - note the problem you have with your design is that, because there is no independent L2 path between the distribution switches, if dc-a-esx goes down you have to stop advertising from m-a-rtr. If there were an independent L2 path between the distribution switches, then if dc-a-esx went down you could continue to advertise from m-a-rtr, because packets arriving there could simply be switched to m-b-rtr, which could then forward the packets to the server.

Having said that, you are correct in that with your design STP is not an issue, whereas with the suggested solution STP would need to block one of the uplinks from the server switch to the distribution switch. Your design is valid now that I know you have a L3 link between the distribution switches. Could you clarify whether that L3 link is itself a direct link? From your description it sounds like it is not.

Jon

Hi Jon,

dc-x-esx are pure L2 boxes. And yes, the HSRP Hellos exchanged between the m-x-rtr pair traverse dc-x-esx.

The links between m-x-rtr and dc-x-esx (specifically, m-a-rtr ---- dc-a-esx and m-b-rtr ---- dc-b-esx) are not EtherChanneled.

m-a-rtr:

interface TenGigabitEthernet5/3
description To dc-a-esx
switchport
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 5
switchport mode trunk

dc-a-esx:

interface TenGigabitEthernet5/1
description To m-a-rtr
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
spanning-tree portfast edge trunk
spanning-tree guard none

The specific module location varies between DCs.  In some cases, the 'uplink' ports are on the Sup card.  In others, they are on a line card, as in the example above (in that particular case, the box is a Cat6513, with slots 1-9 populated).

Ah, OK, I can see that with the design you sketch out, I would not encounter my current problem -- when dc-a-esx returned to life, raising link on its m-a-rtr interface, m-a-rtr would *not* forward traffic across it, because it would put the interface in STP block, minimally until BPDUs (and HSRP Hellos) started traversing it.

In exchange, though, I would add STP into the mix.  To rephrase -- I have a choice in terms of complexity:  STP or PBR Tracking -- there is no free lunch, I have to pick which complexity I want to adopt.  Would you concur?

--sk

Stuart

You are correct in that you get STP instead of PBR.

Is it safe to assume the servers are dual-homed to both server switches?

Also, would it be possible to have the port on dc-a-esx that connects to m-a-rtr located on the last module in the switch, rather than slot 5, which I assume is the supervisor?

Jon

Hi Jon,

I would say that most hosts are dual-homed to the data center switches.  But it varies.  Some stakeholders are more concerned with uptime -- they do the dual NIC thing -- others are less concerned and connect only a single NIC to one or the other switch.

Universally, we use glass ports for the uplinks to the distribution layer, on account of distance; in general, this means the SFP ports on the Sup cards, which are generally in slot 1.  I suppose we could purchase separate SFP modules and slide them into the last slot in the chassis (although, in some cases, this would require buying new chassis, as the existing ones have no free slots).  But that's a lot of capital (8 data centers).  Three of these DCs use the C6K four-port 10GigE card for uplinks, so perhaps we could move the 10GigE card to the last slot in the box.

I see where you're heading with this ... if the chassis powers on the slots in order, then putting the uplink ports in the last slot would allow us to sustain the existing design, without adding STP or PBR Tracking.

It remains puzzling to me, though, that I've only encountered this problem in one data center -- and remember, we test this design monthly, or nearly so.

So, my options thus far:

(a) Put the uplinks in the last slot of the data center chassis, so that they get powered on just prior to the start of packet forwarding

(b) Add PBR Tracking on the distribution layer

(c) Change the design to dual home both data center switches to the distribution layer; add STP

Have any comments on the pros and cons of each approach?

--sk

Stuart

So, my options thus far:

(a) Put the uplinks in the last slot of the data center chassis, so that they get powered on just prior to the start of packet forwarding

(b) Add PBR Tracking on the distribution layer

(c) Change the design to dual home both data center switches to the distribution layer; add STP

Think you've summarised it nicely. There is something not right about why the traffic is being blackholed for so long though, and it would be worth doing some more investigation. Unfortunately I am only able to nip in and out of NetPro at the moment so it's a bit difficult. Cisco's recommendation in the campus is to use L3 between the distribution switches and have both uplinks from access-layer switches L2 and forwarding. However this also supposes that vlans are isolated to only one switch. In the DC it is often more common to use L2 between the distribution switches and rely on STP/RSTP because you are spanning vlans across multiple switches, hence my question about whether your servers were dual-homed.

Note that with option c) you would also need a direct L2 connection between the distribution switches, i.e. no intermediate switches between the distribution pair.

Personally I have designed and managed DC infrastructures with STP/RSTP and they have worked perfectly well. STP is often given a bad name, but configured correctly, and preferably using RSTP (which your switches should support), failover is very quick. Having said that, at the moment both your uplinks are forwarding, and if you need the throughput then, if you adopted option c), you would need to add extra configuration to the distribution switches to load-balance vlans across both uplinks.
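To illustrate roughly what that extra configuration could look like (a hedged sketch; the second vlan and the priority values are assumptions, not taken from this thread), per-vlan root placement under rapid-PVST is the usual way to keep both uplinks forwarding for some portion of the traffic:

! Illustrative only -- assumes option (c) with dual-homed server switches,
! rapid-PVST, and a second vlan (vlan 6) that does not appear in this thread.
!
! On both distribution switches:
spanning-tree mode rapid-pvst
!
! On m-a-rtr: root for vlan 5, backup root for vlan 6
spanning-tree vlan 5 priority 4096
spanning-tree vlan 6 priority 8192
!
! On m-b-rtr: root for vlan 6, backup root for vlan 5
spanning-tree vlan 6 priority 4096
spanning-tree vlan 5 priority 8192
!
! Align the HSRP active router with the STP root per vlan so traffic does
! not hairpin across the distribution interconnect.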

Note that with the advent of VSS you can now run L2 from the access layer to the distribution pair without having to block an uplink, but I'm assuming your switches are not VSS capable.

If I were you I would try option a) first. Option c) works well in a DC but it would require a fair bit of reconfiguration and cabling changes to implement correctly. If neither appeals, then option b), but to me it feels a bit like using duct tape to fix a problem. However, that could just be me, and if it works then by all means use it.

Jon

Hi Jon,

OK, this has been helpful for me; thank you for engaging so thoroughly.  [No, we're not ready for VSS.]

On a final note, what value, if any, do you see in the following, again in the context of data centers?

(a) Shrink HSRP timers below 1 / 3, i.e. into the sub-second range?

(b) BFD

(c) IP Dampening

(d) GLBP (replacing HSRP)

--sk

stkendrick wrote:

Hi Jon,

OK, this has been helpful for me; thank you for engaging so thoroughly.  [No, we're not ready for VSS.]

On a final note, what value, if any, do you see in the following, again in the context of data centers?

(a) Shrink HSRP timers below 1 / 3, i.e. into the sub-second range?

(b) BFD

(c) IP Dampening

(d) GLBP (replacing HSRP)

--sk

Stuart

Generally speaking anything that can reduce convergence time when an outage occurs in your DC is a good thing to have.

a) You can, but you need to be careful that you don't reduce it so much that HSRP packets get lost and you flip/flop between switches (a rough example follows this list)

b) BFD - yes if your devices support it

c) Not sure in what context

d) If you stay with the L2 uplinks both forwarding, then yes, it would allow you to load-balance per vlan on the distribution switches. Note that if you go with a design where one of the uplinks is blocked, there is not a huge amount of gain, because traffic has to traverse the L2 distribution interconnect to get to the other active router.
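For (a) and (b), a minimal sketch, assuming a platform and IOS release that support millisecond HSRP timers and BFD on the interfaces involved; the timer values and the core-facing interface name are illustrative assumptions, not taken from this thread:

! Illustrative values only -- verify platform support and tune to your topology.
!
! (a) sub-second HSRP: hello 250 ms, hold 750 ms
interface Vlan5
 standby 5 timers msec 250 msec 750
!
! (b) BFD on a routed link toward the core (BFD support on SVIs is
! platform-dependent, so a physical L3 interface is shown;
! TenGigabitEthernet1/1 is a hypothetical core-facing interface)
interface TenGigabitEthernet1/1
 bfd interval 300 min_rx 300 multiplier 3
!
router eigrp 30
 bfd interface TenGigabitEthernet1/1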

As I say, anything that helps convergence is a plus in the DC.

Jon

Hi Jon,

Thank you for your time and expertise.

--sk
