EIGRP Flapping

Tshi M · ‎04-09-2008

I have a Cisco 6506-E that is connected to other switches. Starting only yesterday, I began getting EIGRP errors between this switch and two others. See below:

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 106: 000116: Apr 9 21:50:24.920 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.97 (GigabitEthernet1/42) is up: new adjacency

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 785: 053253: Apr 9 21:50:25.538 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.98 (GigabitEthernet1/6) is up: new adjacency

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 784: 053252: Apr 9 21:50:24.942 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.98 (GigabitEthernet1/6) is down: peer restarted

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 782: 053250: Apr 9 21:50:23.086 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.110 (GigabitEthernet1/12) is down: peer restarted

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 783: 053251: Apr 9 21:50:23.094 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.110 (GigabitEthernet1/12) is up: new adjacency

4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 105: 000115: Apr 9 21:50:21.508 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.97 (GigabitEthernet1/42) is down: holding time expired
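The log lines above are not in chronological order, which makes the sequence of events hard to read. A minimal Python sketch for parsing and sorting them follows. It assumes the line layout shown above, an English locale for the month abbreviation, and a year supplied by the caller, since the IOS timestamp omits it. The names parse_nbrchange and sort_events are made up for this illustration.

```python
import re
from datetime import datetime

# Matches the IOS part of each line, for example:
# "Apr 9 21:50:24.920 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor
#  10.33.94.97 (GigabitEthernet1/42) is up: new adjacency"
NBRCHANGE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+\S+:\s+"
    r"%DUAL-5-NBRCHANGE:.*?Neighbor\s+(?P<ip>\S+)\s+\((?P<intf>[^)]+)\)\s+"
    r"is\s+(?P<state>up|down):\s+(?P<reason>.+?)\s*$"
)


def parse_nbrchange(line, year=2008):
    """Return a dict describing one neighbor change, or None if no match."""
    m = NBRCHANGE.search(line)
    if m is None:
        return None
    # The year is an assumption passed in by the caller.
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S.%f")
    return {
        "time": ts,
        "neighbor": m.group("ip"),
        "interface": m.group("intf"),
        "state": m.group("state"),
        "reason": m.group("reason"),
    }


def sort_events(lines):
    """Parse every matching line and return the events oldest first."""
    events = [e for e in map(parse_nbrchange, lines) if e is not None]
    return sorted(events, key=lambda e: e["time"])
```

Sorted this way, the first event in the output above is the hold-time expiry for 10.33.94.97 at 21:50:21.508, followed by the peer restarts on the other two neighbors.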

I did a continuous ping without a single loss:

ping

Protocol [ip]:

Target IP address: 10.33.94.97

Repeat count [5]: 10

Datagram size [100]: 1000

Timeout in seconds [2]:

Extended commands [n]: y

Source address or interface: 10.33.94.98

Type of service [0]:

Set DF bit in IP header? [no]:

Validate reply data? [no]:

Data pattern [0xABCD]:

Loose, Strict, Record, Timestamp, Verbose[none]:

Sweep range of sizes [n]:

Type escape sequence to abort.

Sending 10, 1000-byte ICMP Echos to 10.33.94.97, timeout is 2 seconds:

Packet sent with a source address of 10.33.94.98

!!!!!!!!!!

Success rate is 100 percent (10/10), round-trip min/avg/max = 28/30/32 ms

lamav · ‎04-09-2008

Etienne:

I think you're on the right track with the way you began troubleshooting, but perhaps you need to dig deeper. Let me explain.

According to the output you are showing us, at least one of the peers (10.33.94.97) went down as a result of a holding time expiration.

A hold time expiration means that one router is not seeing EIGRP packets from the configured neighbor within the hold time interval.

By default, the hold time is 3 times the hello interval.

Unless the default timers are changed, the hello interval is 5 seconds on high-speed links (LAN interfaces, point-to-point links, and multipoint links above T1 speed) and 60 seconds on low-speed multipoint (NBMA) links at T1 speed or below.

That means the hold times are 15 and 180 seconds, respectively.

It used to be with older versions of IOS that a router specifically needed a Hello packet to maintain the relationship, but now any EIGRP packet can sustain the neighbor relationship.

Take note, though, that in a stable network, EIGRP will only send Hello packets and wait for a topology change before sending triggered updates or queries. So it's the hello and hold time intervals that we can focus on for the purpose of troubleshooting the neighbor relationship.

I see that these routers are connected to each other via high speed GigabitEthernet links, which means the Hello interval is 5 seconds and the Holdtime is 15 seconds.

So, if you are going to initiate a PING test to test the integrity of the link, you would probably want to run it for more than 300 ms (that's roughly the sum of the average round-trip times of the 10 ping packets you sent as part of your test). To say anything about a 15-second hold time, the test has to run for at least that long.

Be aware that even a longer test may not prove much if the neighbor relationship only bounces, say, every few hours or so.
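The timer arithmetic can be checked with a short Python sketch. It assumes the default multiplier of 3 and that successful pings are sent back to back, so each one takes roughly one round trip (the same assumption behind the 300 ms estimate). The function names are made up for this illustration.

```python
HOLD_MULTIPLIER = 3  # default: hold time is three times the hello interval


def hold_time(hello_seconds):
    """Default EIGRP hold time in seconds for a given hello interval."""
    return HOLD_MULTIPLIER * hello_seconds


def ping_test_window(count, avg_rtt_ms):
    """Approximate duration, in seconds, of a run of back-to-back pings."""
    return count * avg_rtt_ms / 1000.0


def pings_to_cover(window_seconds, avg_rtt_ms):
    """Smallest number of back-to-back pings that lasts longer than the window."""
    return (window_seconds * 1000) // avg_rtt_ms + 1


print(hold_time(5), hold_time(60))   # 15 180
print(ping_test_window(10, 30))      # 0.3
print(pings_to_cover(15, 30))        # 501
```

So at a 30 ms average round trip, a test needs on the order of 500 pings just to span one default hold time on a Gigabit link, and far more to catch a bounce that happens only every few hours.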

Your best bet is to check for congestion (output/input queue drops) or any kind of errors (CRC, input errors, etc) on the inter-switch link interfaces. Also, check for duplex mismatches, which can cause collisions, latency and dropped packets.

Lastly, executing the sh ip eigrp neighbor command on both routers in the peer group and observing an incrementing "Q Count" is a telltale sign that your link is experiencing congestion.

HTH

Victor

Tshi M · ‎04-10-2008

I am not sure who rated the posting, but it was not me. Though I must say that I would have rated it the same. That being said, sh ip eigrp ne does not show any increment of the Q count. The only thing there is an RTO that is greater than 200 ms on one of the switches (see attached).

The interfaces are not showing any errors or drops.

Tshi M · ‎04-10-2008

I forgot to add the file.

m-abooali · ‎04-13-2008

Hi, this is Mike. I saw your post, and it seems that you might be able to help me out with the following design question.

We have four 6506s in two sites, two at each. The sites are connected using dark fiber as transit (we have a MUX to light the fiber, transceivers, etc.).

We also have 2x 6509 behind each pair of 6506s in the two sites, where servers and customers connect.

I will have PVLANs on these 6509s as well.

Now, I have been asked to run BGP between the two sites with the provider on the 6506s, and EIGRP on the LAN side, I assume on the 6509s?

I don't understand where BGP on the border 6506s should talk to the 6509s sitting behind them running EIGRP.

Should I run EIGRP on the 6509s, or something else? I cannot figure out the relationship.

If you guys could please share some ideas with me, I would certainly be grateful.

Regards,

Mike

lamav · ‎04-10-2008

Etienne:

Interesting results.

Let's go back to your original output.

Besides the holding time expiration, we can see that the neighbor relationships are periodically being restarted by the peer.

There is another situation in which a peer will restart its neighbor relationship and that is when there is a route that is "stuck-in-active" (SIA).

Do you have any associated SIA log messages in any of those 3 routers?

In EIGRP, a primary/successor route is considered to be in a 'Passive' state. If that primary route is lost and no 'feasible successor' exists, that router will place that route in an 'Active' state, start a 3-minute timer, and begin sending queries to all its neighbors.

If the neighbor doesn't have a route for that network, it will place that route in an 'Active' state, too, begin a 3 minute timer, and query its neighbors for an answer.

This process can go on as a domino effect until the edge of the router network is reached. That process can be limited, though, with route summarization, but that's another issue.

Everyone's queries should be responded to within 3 minutes.

If a router does not receive a response to its query from its neighbor, it will declare the route to be 'stuck-in-active' and restart the neighbor relationship with that peer and start the whole process over again.
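The domino effect described above can be pictured with a heavily simplified Python sketch. It is not the DUAL algorithm: it ignores metrics, the feasibility condition, split horizon, and reply handling. It only walks outward from the router that lost the route and stops at any router that can answer immediately. The topology, router names, and function name are made up for this illustration.

```python
from collections import deque

ACTIVE_TIMER_SECONDS = 180  # the 3-minute timer mentioned above


def routers_going_active(neighbors, origin, has_alternate):
    """Return the routers that place the route in Active state, in query order.

    neighbors: dict mapping a router name to a list of adjacent router names
    origin: the router that lost its successor route
    has_alternate: set of routers that still have a usable path and can
                   reply at once, which stops the query from spreading
    """
    active = []
    seen = {origin}
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        if router in has_alternate:
            continue  # replies right away; the query goes no further here
        active.append(router)
        for peer in neighbors.get(router, []):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return active


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(routers_going_active(chain, "A", has_alternate={"C"}))  # ['A', 'B']
print(routers_going_active(chain, "A", has_alternate=set()))  # ['A', 'B', 'C', 'D']
```

In the second call every router in the chain ends up waiting on someone else, which is the situation where a single slow or lost reply can leave a route stuck in active.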

No matter what actually caused the peer to restart the neighbor connection, your troubleshooting should focus on the router that actually felt the need, if you will, to restart the neighbor relationship.

If you reach a brick wall, you may have to start a debugging process, which you will most likely want to do during a maintenance window, since it is so processor intensive and can crash your router.

I will be in meetings all day with one of our clients, so I hope someone else can pick up where I left off if you still haven't resolved the issue.

HTH

By the way, anyone who sees a post that they think has useful information can rate it. It doesn't have to be rated by the person who is asking the question. What is peculiar, however, is that someone gave you -- the questioner -- a rating, and a '1' no less. Perhaps someone is confused. :-)

But that's not important because that's not why you're here anyway. :-)

Victor

Tshi M · ‎04-10-2008

Hi Victor,

Thanks for replying. No, no SIA messages were generated. I highly suspect layer 1 to be the problem here. Last night (NY time) I got a lot of bounces, but this morning things have been quiet since 7:16 EST.

Yes, I guess folks might be a bit confused by the rating process. Thanks again...

ruwhite · ‎04-10-2008

We see the peer restarting some, which means the other side sent an update with the init bit set. We see the peer coming up from time to time, which doesn't give us any real information. The only real indicator we have is this:

"4/9/2008 9:50 PM : DUAL-5-NBRCHANGE 105: 000115: Apr 9 21:50:21.508 EST: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 65204: Neighbor 10.33.94.97 (GigabitEthernet1/42) is down: holding time expired "

Which means the peers are losing any/all EIGRP packets long enough to dump the relationship. The ping is a good start, I would say.... Checking the logs and counters for interface problems is also good. Three other things I would try:

1. Check the other router, and see why the peer is resetting from that perspective. I don't know if you've already done this, but if you haven't, it would give a bit more information about what's going on.

2. Try pinging using a sweep range of sizes, to see if you are having problems with some sort of low MTU setting someplace. I suspect this isn't a problem, but it's easy to try, so it's a useful test.

3. Try pinging 224.0.0.10 a lot of times, on the order of 10000 times, with the timeout set to 1 second, or 0. Then look and see what percentage work. Try this several times, over different times of the day, etc.
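For the repeated ping runs in item 3, a small Python sketch can pull the loss figures out of each run so they can be compared across times of day. It assumes the summary line has the same form as the "Success rate is ..." line in the ping output earlier in this thread. The function name ping_loss is made up for this illustration.

```python
import re

# Matches e.g. "Success rate is 100 percent (10/10), round-trip ..."
SUCCESS = re.compile(r"Success rate is (\d+) percent \((\d+)/(\d+)\)")


def ping_loss(output):
    """Return sent/received/lost counts from IOS ping output, or None."""
    m = SUCCESS.search(output)
    if m is None:
        return None
    received, sent = int(m.group(2)), int(m.group(3))
    lost = sent - received
    loss_pct = 100.0 * lost / sent if sent else 0.0
    return {"sent": sent, "received": received, "lost": lost, "loss_pct": loss_pct}


sample = "Success rate is 100 percent (10/10), round-trip min/avg/max = 28/30/32 ms"
print(ping_loss(sample)["lost"])  # 0
```

Working from the raw counts rather than the rounded percentage matters on a 10000-packet run, where a handful of losses can still display as 99 or even 100 percent.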

Since this is Ethernet, it's possible that something in the middle is losing the packets, or even taking the link down, but because you have carrier, you don't see the link fail. There are many errors which could pop up and not show up in any interface counters....

Anyway, that would be the path I would take for the moment.

Oh--one more thing--what are the hello and hold timers set for here? Just curious what the link failure would have to be to cause a down condition.

:-)

Russ

Tshi M · ‎04-10-2008

I will give it a try. The settings are all default. I am baffled because this configuration has been running for quite some time now. I opened a TAC case, but they could not find anything wrong with the devices' configurations. Everything is pointing to layer 1. But where exactly is the big unknown...

ruwhite · ‎04-10-2008

Well, I hate to ask all sorts of questions, but when I was on the global escalation team, I did have the team slogan drilled into my head: "Ask questions until they go away!" :-)

More seriously, one other thing occurred to me: How often does this happen? I'd agree this is a layer 1/2 problem, but it's going to be a real hard one, I think, to chase down. We know one thing that helps us, though: When it happens, it happens for at least 15 seconds. That gives us some interesting possible ways of looking for the problem.

For instance, set up a continuous ping through the link, on something that gives you a timestamp with each ping received, on the console, and records the last several hours' worth of pings. When you see the log message, then check the pings: did they fail at the same time, or not?

If they did, then the link is probably going completely down, and you don't know about it in some way. At least that would give you a clue of where to look.
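The correlation check can be sketched in a few lines of Python. It assumes the ping results have already been collected as (timestamp, succeeded) pairs by whatever tool records them, and it looks back one default hold time (15 seconds) from the log message, since that is how long the EIGRP packets must have been missing. The function name failures_near is made up for this illustration.

```python
from datetime import datetime, timedelta


def failures_near(event_time, ping_results, window_seconds=15):
    """Return timestamps of failed pings in the window before a log event.

    event_time: datetime of the "holding time expired" message
    ping_results: iterable of (datetime, bool) pairs, True meaning a reply
    window_seconds: how far back to look; 15 matches the default hold time
    """
    start = event_time - timedelta(seconds=window_seconds)
    return [ts for ts, ok in ping_results if not ok and start <= ts <= event_time]


event = datetime(2008, 4, 9, 21, 50, 21)
# One ping per second for the 30 seconds before the event (made-up data):
# pings sent 10 seconds or less before the event are marked as failed.
pings = [(event - timedelta(seconds=s), s > 10) for s in range(30, 0, -1)]
print(len(failures_near(event, pings)))  # 10
```

An empty result for a given log message points toward the second case in this post, where unicast pings keep working while the EIGRP packets are lost.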

If they don't coincide (the EIGRP neighbor relationship fails, but the pings stay up), then I would try pinging the 224.0.0.10 address every five seconds or so for some time period long enough to "pass through" one of the failures. I wouldn't know how to tell you to do this; I'd put a network packet device on a span port or something, as a guess. Again, make certain you can see the timestamps. If you see the failure, look for ping failures. There are conditions where unicast will work, but not multicast.

Ah--another way to see if this is the problem, maybe. If there are only two neighbors on the wire, then convert them to EIGRP unicast neighbors. If the problem goes away, then you have some form of intermittent multicast thing going on here.

Finally, if none of these are right, then my next guess is still an MTU issue. Here, I'd set up a ping at max MTU every 5 seconds for long enough to see the problem, perhaps a day or two, and with timestamps, as above. You can again check the correlation, and perhaps pin the problem down to one specific area to look at.

BTW, I once saw something like this, and it turned out to be a randomly injected host route to one of the two interfaces in the network. I.e., if the two interfaces are 10.1.1.1/24 and 10.1.1.2/24, and some other router injects 10.1.1.1/32 at the .2 router, then EIGRP goes haywire as long as that route is in the table. This was a redistribution case, where some server running RIP was freaking out and leaking a host route for one of the interfaces running EIGRP into the network. We filtered out the host routes at the redistribution point, and the problem went away. Just to give you an idea of how twisty these sorts of problems can be.
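A check for that kind of stray host route can be sketched in Python with the standard ipaddress module. It assumes the connected interface addresses and the learned prefixes have already been copied out of the routing table as strings, and it simply flags any learned /32 that falls inside a connected subnet. The function name and the sample prefixes beyond the 10.1.1.x example are made up for this illustration.

```python
import ipaddress


def conflicting_host_routes(interface_addrs, learned_routes):
    """Return learned host routes that fall inside a connected subnet.

    interface_addrs: connected interface addresses, e.g. ["10.1.1.2/24"]
    learned_routes: prefixes learned from routing protocols, as strings
    """
    connected = [ipaddress.ip_interface(a).network for a in interface_addrs]
    hits = []
    for prefix in learned_routes:
        net = ipaddress.ip_network(prefix)
        if net.prefixlen != net.max_prefixlen:
            continue  # not a host route
        if any(net.network_address in subnet for subnet in connected):
            hits.append(str(net))
    return hits


routes = ["10.1.1.1/32", "10.2.0.0/16", "192.168.5.5/32"]
print(conflicting_host_routes(["10.1.1.2/24"], routes))  # ['10.1.1.1/32']
```

A hit does not prove this is the cause, but a host route for the neighbor's address learned from somewhere other than the connected link is worth a close look.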

Thankfully, I don't do Global Escalation any longer, so my days of troubleshooting these sorts of things are in the past.... Now, I just get to troubleshoot even weirder problems--and I get to create problems for other folks to troubleshoot! :-)

:-)

Russ

Tshi M · ‎04-10-2008

The occurrences are so sporadic that I don't know where to start.

I enabled debug eigrp ne AS ne_IP 3 hours ago and am still waiting for the problem to reoccur. I will try the above tomorrow, New York time.

Thanks for the posting.