OSPF Convergence Concept - Adv Router not-reachable

chansy131 · ‎01-31-2007

Hi,

I'm trying to write a program to simulate OSPF behavior when an adjacency or link goes down. As far as I know, once an OSPF router detects that a directly connected link has gone down, it constructs a new LSA that reflects its remaining adjacencies. The LSA is flooded to update the LSDBs of the other OSPF routers. The SPF tree is then recalculated, and the IP route table is updated accordingly.

To better understand the OSPF operation, I set up a three-router topology in this way:

[R1]--------------[R2]---------------[R3]

I took down the link between R1 and R2. R2 then advertised to R3 an LSA that includes only the R2-R3 adjacency. The LSA previously advertised by R1 keeps aging as usual and is marked "Adv Router is not-reachable" in the LSDB of R2 and R3.

I wonder how R2 and R3 determine that R1 is out of reach within such a short period between the link going down and the SPF recalculation. Obviously, R2 and R3 have no way of receiving any new LSA from R1. Since R2 and R3 do not delete R1's LSA, R2 and R3 *somehow* need to exclude R1's LSA from the SPF recalculation (by marking it as "Adv Router is not-reachable"?).

The event ordering that I could figure out is as follows:

1. R2 detects link down

2. R2 knows that there will be no path to R1 (How?? The current IP route table is already outdated; an IP table lookup doesn't make sense here), so R2 marks R1's LSA as "not-reachable"

3. R2 propagates an LSA that describes R2-R3 link only

4. R2 re-calculates the SPF by excluding the R1's LSA which is marked as "not-reachable"

Any insight is appreciated.

Chan

bwalchez · ‎02-06-2007

A common problem when using Open Shortest Path First (OSPF) is that routes in the database don't appear in the routing table. In most cases OSPF finds a discrepancy in the database, so it doesn't install the route in the routing table. Often, you can see the "Adv Router is not-reachable" message (which means that the router advertising the LSA is not reachable through OSPF) on top of the link-state advertisement (LSA) in the database when this problem occurs. There are several reasons for this problem, most of which involve misconfiguration or a broken topology. When the configuration is corrected, the OSPF database discrepancy goes away and the routes appear in the routing table.

http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a008009481a.shtml

keller999 · ‎07-08-2012

Okay, only about 5 years too late, but this brought up something I've been noodling as well. I have been using packet captures to try to better understand how LSAs are updated and LSDBs are updated. Here is my understanding of the options:

Intra-area link down

When a link goes down inside an area, there is no way for OSPF to directly say "hey guys, this link went down". Instead, the routers that are directly-connected to the downed link re-advertise their Type 1 Router LSAs with an updated list of networks that they are connected to. All the intra-area routers see this LSU with a higher seq than their current LSDB entry, and update accordingly.

So, the LINK isn't removed from the LSDBs really -- instead, all attached routers update their Router LSAs to indicate that they are no longer attached to the down link.
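Here's a rough sketch of that "newer Router LSA replaces the old one" step. It is a toy model, not real OSPF: the names `RouterLSA` and `install` are made up, sequence numbers are plain integers, and the checksum and age tie-breakers that the real comparison uses are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterLSA:
    adv_router: str      # router ID of the originator
    seq: int             # simplified sequence number (higher is newer)
    links: frozenset     # neighbor router IDs this router says it is attached to


def install(lsdb: dict, lsa: RouterLSA) -> bool:
    """Store lsa if it is newer than the copy we hold; return True if stored."""
    current = lsdb.get(lsa.adv_router)
    if current is None or lsa.seq > current.seq:
        lsdb[lsa.adv_router] = lsa
        return True
    return False  # same or older copy: keep what we have


lsdb = {}
install(lsdb, RouterLSA("R1", 1, frozenset({"R2"})))
install(lsdb, RouterLSA("R2", 1, frozenset({"R1", "R3"})))

# The R1-R2 link fails: R2 re-originates its own LSA with a higher
# sequence number and without R1 in the link list.
install(lsdb, RouterLSA("R2", 2, frozenset({"R3"})))

# A late copy of the old LSA is ignored because its seq is lower.
stale_accepted = install(lsdb, RouterLSA("R2", 1, frozenset({"R1", "R3"})))
```

Note that nothing in the LSDB says "link deleted". R2's entry just lists fewer links than before, and R1's entry is left exactly as it was.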

Inter-area link down

Of course, it's a different story in area 0 and other areas. ABRs have been converting the T1/T2 LSA's attached links into simple T3 LSAs that just indicate the individual links, not entire routers. The ABR receives a T1/T2 LSA for a ROUTER that indicates that a link is no longer attached. If there are no other routers that have the link attached, we need to purge the LSA from area 0 and the other areas.

Since there is no 'remove this link' LSA, the ABR instead sends out an updated T3 LSA for the link that has the Metric set to the maximum value (LSInfinity, 0xffffff). As far as I can tell from RFC 2328, area 0 routers that receive it don't add anything to the metric; they simply treat a T3 LSA with that metric (or one that has reached MaxAge) as unusable and drop the route to it. This max-metric LSA is also propagated down to the other area's internal routers (need to double check this).
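A small sketch of how a receiving router might treat such a T3 LSA. It assumes (this is my reading of RFC 2328 section 16.2, not something taken from the captures) that a summary-LSA whose metric is LSInfinity or whose age has reached MaxAge is skipped in the route calculation. `inter_area_cost` and its parameters are made-up names.

```python
from typing import Optional

LS_INFINITY = 0xFFFFFF   # largest value of the 24-bit summary-LSA metric
MAX_AGE = 3600           # seconds


def inter_area_cost(cost_to_abr: int, lsa_metric: int, lsa_age: int) -> Optional[int]:
    """Return the total cost to the summarized prefix, or None if unusable."""
    if lsa_metric == LS_INFINITY or lsa_age >= MAX_AGE:
        return None                      # treated as unreachable, no route installed
    return cost_to_abr + lsa_metric      # normal case: path to ABR + advertised metric


usable = inter_area_cost(cost_to_abr=10, lsa_metric=20, lsa_age=5)
withdrawn = inter_area_cost(cost_to_abr=10, lsa_metric=LS_INFINITY, lsa_age=5)
aged_out = inter_area_cost(cost_to_abr=10, lsa_metric=20, lsa_age=MAX_AGE)
```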

Intra-area router down

This one tripped me up, and it sounds like it did for you too. How do the routers know when a neighbor has gone missing? What causes the routers to determine "Adv Router is not-reachable"? I haven't been able to find documentation on this, but I'm pretty sure this is what is happening.

OSPF doesn't have a mechanism for other routers to force age out a dead router's T1/T2 LSAs. What I mean is, there is no mechanism to flood an LSA out to an area that says 'Router 4 is dead, please forget all about him'. Instead, the routers have to infer that the router is no longer reachable. After MaxAge has elapsed, the T1 LSA is timed out altogether.

The way OSPF gets away with this is by updating everyone about the links surrounding the now-down router. Say R1 and R2 are connected by a point-to-point link. R2 blows up, and R1 is still connected to the rest of the area. Now, to be fair, R1 doesn't know if R2 is truly down or if just its link to R2 is down. So it sends out a new Router LSA for ITSELF with the point-to-point link removed. If that was the only path to R2, all the other routers now discover when they run SPF that there is no path to R2 anymore! The links listed in R2's T1 LSA are still in the LSDB until MaxAge, but since SPF cannot calculate a path to them they are removed from the routing table. ABRs set the T3 LSAs to MaxMetric to force age them out across the entire AS.
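To make the "no path to R2 anymore" part concrete, here is a minimal reachability sketch. It is not a full SPF: link costs are ignored (plain breadth-first search instead of Dijkstra), the LSDB is just a dict of router ID to claimed neighbors, and R0 is a made-up router standing in for "the rest of the area". It does include the back-link check that the SPF calculation performs, where a link from V to W only counts if W's LSA also lists V.

```python
from collections import deque


def reachable(lsdb, root):
    """Return the set of router IDs that can be reached from root.

    lsdb maps router ID -> set of neighbor IDs taken from its Router LSA.
    A link V->W is followed only if W's LSA also lists V (back-link check).
    """
    seen = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in lsdb.get(v, set()):
            if w in seen or v not in lsdb.get(w, set()):
                continue
            seen.add(w)
            queue.append(w)
    return seen


# R0 -- R1 -- R2, as seen from R0.
lsdb = {"R0": {"R1"}, "R1": {"R0", "R2"}, "R2": {"R1"}}
before = reachable(lsdb, "R0")

# R2 dies. Its LSA stays in the LSDB and still claims a link to R1,
# but R1 re-originates its own LSA without the link to R2.
lsdb["R1"] = {"R0"}
after = reachable(lsdb, "R0")

# LSAs that are still stored but whose originator SPF can no longer reach.
not_reachable = set(lsdb) - after
```

In this model, "Adv Router is not-reachable" is simply every LSA in the LSDB whose originator did not end up in the SPF result. Nobody has to mark R2's LSA explicitly before SPF runs.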

Now, the funky stuff comes when you have a LAN with multiple routers attached. R1, R2, and R3 are on a LAN switch and are all neighbors. R3 is the DR, and is originating the T2 LSA that represents the shared LAN link. R3 lets the magic smoke out, and R1 and R2 drop neighbors. What now?

First of all, a new DR is elected (let's say R1). R1 is in charge of cleaning up the mess that R3 left behind. It sends out an updated version of its own T1 LSA that includes the fact that it is now a DR, the DR interface address, and routers connected to the LAN segment. It also originates a new T2 LSA with a Link ID of its own interface address, and lists connected routers. Other routers on the LAN segment (R2) also send out updated T1 LSAs indicating that they are connected to the new T2 made by R1 (Link ID has changed, remember?) and are no longer connected to the old T2 originated by R3.

It is notable that the old T2 LSA is NOT force flushed from the LSDBs. However, all the presently-attached routers (R1 and R2) have sent out updated T1 LSAs that say that they're not connected to the old T2 LSA from R3 (R3's interface IP used as the Link ID). Instead, they are connected to the new T2 LSA originated by R1 (R1's interface IP used as the Link ID). This makes sense if you think of a DR domain as a virtual router. Rather than redefining the old virtual router, we simply disconnect everything from it and make a new one that everyone who's left can hook onto. This leaves behind a 'ghost' T2 virtual router that will age out after MaxAge, but does not need to be force flushed.
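The ghost T2 can be shown with a toy graph where both routers and transit networks are vertices. This is only a sketch under assumptions: the 10.0.0.x interface addresses and the `net:` naming are invented for illustration, costs are ignored, and a link is followed only when both ends list each other.

```python
def reachable_vertices(router_lsas, network_lsas, root):
    """Walk the LSDB graph from root, following only links confirmed by both ends."""
    # Both LSA types become one adjacency map: vertex -> set of vertices it lists.
    graph = {**router_lsas, **network_lsas}
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for w in graph.get(v, set()):
            if w not in seen and v in graph.get(w, set()):
                seen.add(w)
                stack.append(w)
    return seen


# Before the failure: R3 is DR, so the T2 LSA is named after its interface.
router_lsas = {
    "R1": {"net:10.0.0.3"},
    "R2": {"net:10.0.0.3"},
    "R3": {"net:10.0.0.3"},
}
network_lsas = {"net:10.0.0.3": {"R1", "R2", "R3"}}

# R3 dies and R1 becomes DR. R1 and R2 re-originate their T1 LSAs and
# R1 originates a new T2. R3's T1 and the old T2 are left untouched.
router_lsas["R1"] = {"net:10.0.0.1"}
router_lsas["R2"] = {"net:10.0.0.1"}
network_lsas["net:10.0.0.1"] = {"R1", "R2"}

live = reachable_vertices(router_lsas, network_lsas, "R2")
ghosts = (set(router_lsas) | set(network_lsas)) - live
```

R3's T1 and the old T2 still point at each other, but no live router points at either of them, so SPF never walks into that corner of the LSDB.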

The only other situation that might occur is if R1 is the DR, R2 is the BDR, and R3 is a DROTHER. If R3 fails, R1 will simply send an updated T2 LSA that indicates that R3 is no longer connected to the transit network. R3's T1 LSA stays in the LSDB, but it's not connected to anything anymore.

What about if the entire shared DR LAN has no more routers on it? So, R3 has totally bit the dust but we have R1 and R2 left. Let's say that R1 is still the DR. If R2 dies, R1 will do a couple of things. It will send out an updated T1 LSA for itself indicating that it is either no longer attached to the DR LAN (interface is down) or that it is now a stub network (interface is still up, but no other routers on the LAN). It will also send out a MaxAge T2 LSA for its own DR LAN to age it out of everyone's LSDB. Yay, everything is cleaned up!

But if R1 dies, and R2 is left to clean up the mess, the only thing it can do is update its own T1 LSA to indicate that it is no longer attached to that DR LAN. If you visualize the topology, we are cutting off that section of the network, which will make any routes back there unreachable now. The now-dead T2 remains, but nothing that is live is connected to it anymore. Messy, but the job is done.

Conclusion or TL;DR

Assuming I understand the above correctly (which is based on me mucking around with a packet capture tool and doing show commands), OSPF is a really terrible housekeeper. And this is completely due to the design of only allowing the router that originated an LSA to update it. So, when the DR for a DR LAN segment dies, instead of simply updating an existing T2 LSA to say that a new router is now the DR, a new T2 LSA is created and all the routers that are left attach themselves to it, leaving the old T2 LSA to wither and die after MaxAge. The same is true with T1 LSAs for routers that have gone away -- no one can update them but the (now down) router. So, everyone detaches their links from them, and the LSA ages out after MaxAge.
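The "wither and die after MaxAge" part can be sketched as a simple clock. MaxAge (3600 seconds) and LSRefreshTime (1800 seconds) are the RFC 2328 constants; everything else here (the `tick` function, the 15-minute steps, tracking only one age per originator) is a simplification I made up for the example.

```python
MAX_AGE = 3600          # seconds, RFC 2328 MaxAge
LS_REFRESH_TIME = 1800  # seconds, RFC 2328 LSRefreshTime


def tick(ages, alive, seconds):
    """Advance the clock. ages maps originator -> LS age of its LSA in seconds."""
    for origin in list(ages):
        ages[origin] += seconds
        if origin in alive and ages[origin] >= LS_REFRESH_TIME:
            ages[origin] = 0        # a live originator floods a fresh copy
        elif ages[origin] >= MAX_AGE:
            del ages[origin]        # nobody refreshed it, so it ages out
    return ages


ages = {"R1": 0, "R2": 0, "R3": 0}
alive = {"R2", "R3"}                # R1 has died and can no longer refresh

for _ in range(3):                  # 45 minutes later
    tick(ages, alive, 900)
r1_still_stored = "R1" in ages

tick(ages, alive, 900)              # the one-hour mark
```

Only the originator ever resets the age, which is why a dead router's LSA hangs around for up to an hour after everyone else has stopped pointing at it.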

To come aaaaaaaaall the way back around to answer your original question! In your example, R2 knows that R1 is no longer reachable because the neighborship went down. Since it is no longer connected to R1, it updates its own T1 LSA to indicate that it is no longer attached to R1, and sends it to R3. R3 no longer has a valid path to get to R1, and marks it unreachable. R1's T1 LSA then ages out after MaxAge, since there's no router left who can actually withdraw it.