BGP Path decision algorithm

Answered Question
Apr 2nd, 2008

I had always thought the BGP path selection algorithm whittles down the alternative paths by elimination until only one candidate remains. But according to Wendell Odom's CCIE Certification Guide, page 445, that is not so. It appears that each of the 12 steps (or however many there are) is considered seperately. If any step fails to produce a unique clear winner, then all the paths are carried forward to the next step. That is, unless there is a clear winner, the “non-best” paths are not eliminated.

This has some strange consequences that I am I having a hard time getting my head around. One example is in my blog:

http://dorreke.wordpress.com/2008/04/02/gobsmacked-bgp-route-selection-algorithm/

Can anyone with BGP experience comment?

Kevin Dorrell

Luxembourg

I have this problem too.
0 votes
Correct Answer by ruwhite about 8 years 8 months ago

"No matter how it is implemented exactly, I believe the result should be equivalent to the bubblesort logic. "

This is correct.... There is no "carrying forward" of routes in the algorithm in any implementation, unless the routes are identical for the specific attribute being compared. No implementation that I'm aware of (having looked at gated, cisco, juniper, old procket, and several others) have ever implemented it in any other way. The results of each step are completely deterministic, all the time.

The only steps which vary from implementation to implementation are:

1. Cost community, which not all implementations support. Cisco only supports this with some insertion points. This doesn't make the result any less of a bubble sort, but it does mean that you may have different results on different implementations.

2. The final tie breaker is, on cisco implementations, the oldest route, rather than being based on the router id (though there is a knob for this). Some other implementations don't do this.

HTH

:-)

Russ

  • 1
  • 2
  • 3
  • 4
  • 5
Overall Rating: 5 (6 ratings)
Loading.
marikakis Wed, 04/02/2008 - 04:16

Hello Kevin,

From a first glance, it seems to me that the logic described in the book as you describe it (I do not have that book) is wrong. In the example described in your blog path A should win. I will look up the details of BGP path selection logic later (can't do it right now), but I tend to believe the logic described in the book goes against intuition about how tie-breakers work. Tie-breakers are there to distinguish among equals (seamingly equals up to some point in the comparison).

Kind Regards,

Maria

Danilo Dy Wed, 04/02/2008 - 04:19

Hi Kevin,

In my opinion 'A' still wins. Because 'B' already lost to 'A' in the LOCAL PREFERENCE. It is now between 'A' and 'C' and 'C' lost to 'A' in SHORTEST AS PATH.

If 'B' wins in this case, in my opinion it violates the algorithm and it kind of makes BGP best path manipulation obsolete. I can't even think how to do it in programming.

This is more clearer (I think) http://www.bind.com/BGP/pathsel.html

[EDIT] I'm interested if someone can test what the book says and prove it.

Regards,

Dandy

Kevin Dorrell Wed, 04/02/2008 - 05:29

Hi Dandy,

That is precisely my point. I think 'A' should win, and that 'B' should be lost in the LOCAL_PREF stage. But the book seems to suggest that because there was no overall winner at the LOCAL_PREF stage, that all three paths pass down to the AS_PATH stage.

The document you posted is clear in the list of steps, but it does not explicitly state whether the "less than best" paths are discarded at each step, or whether everything is carried forward to the next step until one step elects an overall winner.

Kevin Dorrell

Luxembourg

Danilo Dy Wed, 04/02/2008 - 05:45

Hi Kevin,

When I first read your post, I was also taken aback with a question "what if".

I agree that the documentation does not explicitly states about discarding the "less than best" paths at each steps and neither explictly states what the book says. But in my opinion the injected best path in the routing table will have to battle with the new learned path/route without the previous losing path/route unles their attributes changed.

I can only try it next month though :) if someone can try it earlier, I'd like to hear from them the result :)

Regards,

Dandy

royalblues Wed, 04/02/2008 - 04:30

Kevin,

AFAIK, When BGP receives multiple routes to a particular destination, it lists them in the reverse order that they were received, from the newest to the oldest.

BGP then compares the routes in pairs, starting with the newest entry and moving toward the oldest entry (starting at top of the list and moving down)

So in your case say if path A is the oldest one, these will be compared in the following manner..

PathC & B will be compared and this will result in PathC being selected (as per local preference)

This will then be compared with pathA which will result in PathA being chosen

The only change in the above algorithm would be when you have BGP deterministic-med configured wherein the paths from the same AS are grouped and compared

HTH

Narayan

marikakis Wed, 04/02/2008 - 04:35

Hello,

According to the following document:

http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080094431.shtml#bestpath

"BGP assigns the first valid path as the current best path. BGP then compares the best path with the next path in the list, until BGP reaches the end of the list of valid paths."

From this I understand that for a path to be considered best it must have won the battle between it and all the other candidate paths individually, either with direct comparison or implied comparison due to transitive property of comparison (if x wins y and y won z, then x also wins z).

Ok, you could argue that if a document is wrong, why wouldn't it be the one in the URL I mention. I can tell you one thing: I have used local preference for full routes received from 2 upstream neighbors. If the local preference wouldn't work as expected, we would be seeing outgoing traffic over the least prefered connection, but we didn't. This is the way it works.

Kind Regards,

Maria

royalblues Wed, 04/02/2008 - 05:02

The thing to understand here is when an update is received BGP would compare all the paths in its BGP table and use the algorithm to install the best route.

So in kevin's case when the update via pathC comes in, the algorithm would start from the beginning again for all the 3 prefixes and chose the one with the highest local preference (A&C) which will then be compared further for AS-path

It will never be a case when a new update would not result in local preference being not used for comparison since they were used before to break the tie for the earlier paths

Narayan

marikakis Wed, 04/02/2008 - 05:32

Hello,

The logic is not exactly "chose the one with the highest local preference (A&C) which will then be compared further for AS-path". It is rather a 2 path battle each time the comparison is done. One path is chosen arbitrarily as the current best. It is then compared to the next valid path from the beginning of the selection criteria list and one is winner. New winner is compared again to the next available path and so on. The algorithm does not know before each comparison whether the next path considered would be prefered due to e.g. weight attribute, so comparison criteria for 2 paths compared are checked from the beginning. The only thing we know is that current best path is the best from all previously considered.

Kind Regards,

Maria

Kevin Dorrell Wed, 04/02/2008 - 05:54

Maria and Narayan,

So from what you are saying, it is a bubble-sort, with the 11-step algorithm being applied to each pair in each iteration. That would make the algorithm the same as I had imagined it up to now.

I am suspecting that either the wording in the book is unfortunate, or he is describing some past buggy version of the software. But he is quite insistant.

"For example, imagine that R1 has five routes to 9.0.0.0/10, two with AS_PATH length 3 and the others with AS_PATH length 5. The decision process did not determine a best route before reaching step 4 (AS_PATH length). Step 4's logic can determine that two routes are better than the others because they have a shorter AS_PATH length. However, because this step does not produce the single winner, the process moves on to he next step, and all five routes are still considered as candidates to be the best route. Each step either produces a winner or moves on to the next step, examining all routes to that NLRI at the next step."

"Nick" put a comment on my blog entry referring to RFC-4271 section 9.1.2.2 which is pretty unequivocal that the the losers are discarded (sorry ... "removed from consideration" ... ) at each step. Whether you express that in terms of a bubble sort or as a bulk comparison, is immaterial.

(Maria, I thought the RFC citation should appeal to you as I know you are an avid reader of RFCs!)

BTW, a variation on Wendell's mnemonic for classicists: nothing and everthing ... NULLA OMNIA

Not reachable

U8 (OK, that one is a bit contrived)

Local preference

Locally injected

AS_PATH length

Origin code

MED

Neighbor (eBGP/iBGP)

IGP metric to NEXT_HOP

From then on, the contenders are candidates for multi-path.

Advertising RID

Just a a further thought: if you run the bubble sort applying all the steps at ech comparison, you get the algorithm as I (and most of the present company) had imagined it. If you run the bubble sort 11 times, applying one of the steps on each time, then you get the algorithm as described in the book. Maybe.

Kevin Dorrell

Luxembourg

marikakis Wed, 04/02/2008 - 06:11

Hello Kevin,

No matter how it is implemented exactly, I believe the result should be equivalent to the bubblesort logic. It could be an optimization for example to check all routes for weight and find the best immediately if lucky (although this might look easier for the human eye, it might not be a lot easier for the software to brute force the algorithm considering all candidates simultaneously, I don't know).

LOL about the RFC reading, you caught me again :-)

"5.1.5. LOCAL_PREF

LOCAL_PREF is a well-known attribute that SHALL be included in all UPDATE messages that a given BGP speaker sends to other internal peers. A BGP speaker SHALL calculate the degree of preference for each external route based on the locally-configured policy, and include the degree of preference when advertising a route to its internal peers. The higher degree of preference MUST be preferred. A BGP speaker uses the degree of preference learned via LOCAL_PREF in its Decision Process (see Section 9.1.1). ".

Note that "The higher degree of preference MUST be preferred. "

Kind Regards,

Maria

Correct Answer
ruwhite Wed, 04/02/2008 - 06:22

"No matter how it is implemented exactly, I believe the result should be equivalent to the bubblesort logic. "

This is correct.... There is no "carrying forward" of routes in the algorithm in any implementation, unless the routes are identical for the specific attribute being compared. No implementation that I'm aware of (having looked at gated, cisco, juniper, old procket, and several others) have ever implemented it in any other way. The results of each step are completely deterministic, all the time.

The only steps which vary from implementation to implementation are:

1. Cost community, which not all implementations support. Cisco only supports this with some insertion points. This doesn't make the result any less of a bubble sort, but it does mean that you may have different results on different implementations.

2. The final tie breaker is, on cisco implementations, the oldest route, rather than being based on the router id (though there is a knob for this). Some other implementations don't do this.

HTH

:-)

Russ

marikakis Wed, 04/02/2008 - 06:30

Hello Russ,

Nice seeing you around again (after the Networkers in Barcelona BGP session :-)

Kind Regards,

Maria

Kevin Dorrell Wed, 04/02/2008 - 06:43

Russ,

Thanks for joining in the discussion. Now we have a definitive answer, and I can sleep easy again.

I can second the comments about Barcelona: I enjoyed your presentation very much and learned some useful stuff in the process.

Thanks.

Kevin Dorrell

Luxembourg

royalblues Sat, 04/05/2008 - 06:09

Kevin/Dandy

I did a test in the lab and it works as desired :-)

R1#sh ip bgp

BGP table version is 9, local router ID is 40.40.40.1

Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,

r RIB-failure, S Stale

Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path

* i9.0.0.0 20.20.20.2 0 80 0 200 200 i

*>i 10.10.10.2 0 100 0 200 200 300 i

R1#

*Mar 1 00:36:52.703: %LINEPROTO-5-UPDOWN: Line protocol on Interface Serial0/2,

changed state to up

*Mar 1 00:36:54.311: BGP: 30.30.30.2 open active, local address 30.30.30.1

*Mar 1 00:36:54.379: BGP: 30.30.30.2 went from Active to OpenSent

*Mar 1 00:36:54.379: BGP: 30.30.30.2 sending OPEN, version 4, my as: 100

*Mar 1 00:36:54.459: BGP: 30.30.30.2 rcv message type 1, length (excl. header)

26

*Mar 1 00:36:54.459: BGP: 30.30.30.2 rcv OPEN, version 4

*Mar 1 00:36:54.459: BGP: 30.30.30.2 rcv OPEN w/ OPTION parameter len: 16

*Mar 1 00:36:54.463: BGP: 30.30.30.2 rcvd OPEN w/ optional parameter type 2 (Ca

pability) len 6

*Mar 1 00:36:54.463: BGP: 30.30.30.2 OPEN has CAPABILITY code: 1, length 4

*Mar 1 00:36:54.467: BGP: 30.30.30.2 OPEN has MP_EXT CAP for afi/safi: 1/1

*Mar 1 00:36:54.467: BGP: 30.30.30.2 rcvd OPEN w/ optional parameter type 2 (Ca

pability) len 2

*Mar 1 00:36:54.467: BGP: 30.30.30.2 OPEN has CAPABILITY code: 128, length 0

*Mar 1 00:36:54.471: BGP: 30.30.30.2 OPEN has ROUTE-REFRESH capability(old) for

all address-families

*Mar 1 00:36:54.471: BGP: 30.30.30.2 rcvd OPEN w/ optional parameter type 2 (Ca

pability) len 2

*Mar 1 00:36:54.475: BGP: 30.30.30.2 OPEN has CAPABILITY code: 2, length 0

*Mar 1 00:36:54.475: BGP: 30.30.30.2 OPEN has ROUTE-REFRESH capability(new) for

all address-families

*Mar 1 00:36:54.479: BGP: 30.30.30.2 went from OpenSent to OpenConfirm

*Mar 1 00:36:54.479: BGP: 30.30.30.2 went from OpenConfirm to Established

*Mar 1 00:36:54.483: %BGP-5-ADJCHANGE: neighbor 30.30.30.2 Up

*Mar 1 00:36:54.487: BGP(0): 30.30.30.2 initial update completed

*Mar 1 00:36:54.623: BGP(0): 30.30.30.2 rcvd UPDATE w/ attr: nexthop 30.30.30.2

, origin i, localpref 100, metric 0, path 200 200 200 300

*Mar 1 00:36:54.623: BGP(0): 30.30.30.2 rcvd 9.0.0.0/8 ----> Next hop does not get updated confirming the algorithm

*Mar 1 00:36:56.287: BGP: 40.40.40.2 multihop open delayed 13845ms (no route)

*Mar 1 00:37:06.243: BGP: Import timer expired. Walking from 1 to 1

*Mar 1 00:37:10.135: BGP: 40.40.40.2 multihop open delayed 11610ms (no route)

R1#

R1#sh ip bgp

BGP table version is 10, local router ID is 40.40.40.1

Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,

r RIB-failure, S Stale

Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path

* i9.0.0.0 30.30.30.2 0 100 0 200 200 200 300 i

* i 20.20.20.2 0 80 0 200 200 i

*>i 10.10.10.2 0 100 0 200 200 300 i

Narayan

ruwhite Mon, 04/07/2008 - 06:50

Thanks.... I should be at the Euro Networkers this coming year, as well....

:-)

Russ

Actions

This Discussion