ARP/CEF/BGP problem with 12008

Feb 20th, 2009

Hello,

I have a strange problem with several BGP peerings on the same 12008 router, apparently caused by underlying L2 issues.

We are connected by means of a port-channel to a switch not owned by our company.

Please see the output below:

sho proc cpu

CPU utilization for five seconds: 3%/0%; one minute: 22%; five minutes: 18%

sho ip bgp summ | i 195.69.145.117

195.69.145.117 4 41420 142685 143545 0 0 0 10:17:47 Active

(The BGP session has been stuck for 10 hours now.)

sho arp | i 195.69.145.117

Internet 195.69.145.117 0 000c.db1f.a400 ARPA Port-channel1

ping 195.69.145.117

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 195.69.145.117, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

sho ip cef 195.69.145.117

195.69.145.117/32, version 26316038, epoch 0, connected, cached adjacency 195.69.145.117

0 packets, 0 bytes

Flow: AS 0, mask 23

via 195.69.145.117, Port-channel1, 0 dependencies

next hop 195.69.145.117, Port-channel1

valid cached adjacency

clear ip arp 195.69.145.117

ping 195.69.145.117

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 195.69.145.117, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms

sho arp | i 195.69.145.117

Internet 195.69.145.117 0 000c.db1f.a400 ARPA Port-channel1

sho ip bgp summ | i 195.69.145.117

195.69.145.117 4 41420 142688 143555 68197271 0 0 00:00:51 8

(And the BGP session is established again!)
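The last column of "show ip bgp summary" is what changes between the two outputs above: IOS prints the BGP state name (here Active) while the session is down, and the number of prefixes received (here 8) once the session is Established. A minimal Python sketch of a parser that makes that distinction, assuming the ten-column layout shown in this thread. BgpNeighbor and parse_bgp_summary_line are hypothetical names, and rows whose state contains a space (such as "Idle (Admin)") are rejected rather than handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BgpNeighbor:
    neighbor: str
    remote_as: int
    up_down: str
    state: str                # "Established", or the state name IOS printed
    prefixes: Optional[int]   # prefixes received; only known when Established


def parse_bgp_summary_line(line: str) -> BgpNeighbor:
    """Parse one neighbor row of IOS 'show ip bgp summary'.

    Assumed columns:
    Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
    """
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"unexpected column count {len(fields)}: {line!r}")
    neighbor, _version, remote_as, *_counters, up_down, last = fields
    if last.isdigit():
        # A number in the last column means the session is up.
        return BgpNeighbor(neighbor, int(remote_as), up_down, "Established", int(last))
    # Otherwise the column holds the FSM state, e.g. Active, Idle, Connect.
    return BgpNeighbor(neighbor, int(remote_as), up_down, last, None)


stuck = parse_bgp_summary_line(
    "195.69.145.117 4 41420 142685 143545 0 0 0 10:17:47 Active")
up = parse_bgp_summary_line(
    "195.69.145.117 4 41420 142688 143555 68197271 0 0 00:00:51 8")
print(stuck.state, stuck.up_down)  # Active 10:17:47
print(up.state, up.prefixes)       # Established 8
```

Polling this output periodically and logging the transitions would give timestamps for when a peer drops back to Active.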

The router is running gsr-p-mz 12.0(32).S11, and the Bug Toolkit apparently has no references to problems like this.

We had an EtherChannel/MAC problem with a 12008 before, but it was resolved a long time ago.

Is there a way to determine whether this is actually a problem on the router itself or a problem with the non-Cisco LAN infrastructure the router is connected to? I've noticed that the ARP entry's age is 0 minutes both before and after the clear, which implies constant traffic coming from the peer's IP, maybe because the BGP peering is "Active". I suspect it has something to do with the EtherChannel and ARP/CEF on the router.

Does somebody have any idea what is happening here?

Thanks & Kind Rgds,

Dirk Versavel

Giuseppe Larosa Fri, 02/20/2009 - 05:24

Hello Dirk,

The ARP entry's IP and MAC addresses did not change after clear ip arp, so the old entry was neither wrong nor overwritten by a third-party device.
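That before/after comparison can be made mechanical when collecting many samples. A minimal Python sketch, assuming the six-column "show arp" layout quoted in this thread (Protocol, Address, Age, Hardware Addr, Type, Interface); ArpEntry, parse_arp_line and same_binding are hypothetical names. It checks whether the IP-to-MAC binding on the same interface survived a clear ip arp.

```python
from typing import NamedTuple, Optional


class ArpEntry(NamedTuple):
    protocol: str
    address: str
    age_min: Optional[int]  # None when IOS shows "-" (the router's own addresses)
    mac: str
    encap: str
    interface: str


def parse_arp_line(line: str) -> ArpEntry:
    """Parse one row of IOS 'show arp'.

    Assumed columns: Protocol Address Age(min) Hardware-Addr Type Interface
    """
    protocol, address, age, mac, encap, interface = line.split()
    return ArpEntry(protocol, address, None if age == "-" else int(age),
                    mac.lower(), encap, interface)


def same_binding(before: ArpEntry, after: ArpEntry) -> bool:
    """True if the IP still maps to the same MAC on the same interface."""
    return ((before.address, before.mac, before.interface)
            == (after.address, after.mac, after.interface))


before = parse_arp_line("Internet 195.69.145.117 0 000c.db1f.a400 ARPA Port-channel1")
after = parse_arp_line("Internet 195.69.145.117 0 000c.db1f.a400 ARPA Port-channel1")
print(same_binding(before, after))  # True
```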

What is the utilization of the port-channel member links?

Is the load evenly distributed across them?
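One reason the distribution matters: an EtherChannel normally picks a member link per flow by hashing header fields, so all traffic for a given source/destination pair rides the same member. If one member misbehaves, a single peer can fail completely while others keep working. The toy Python sketch below uses an XOR of the two IPv4 addresses modulo the number of links; this is an illustrative scheme, not necessarily the algorithm the 12000 series uses, and the local address and peer list are hypothetical.

```python
import ipaddress
from collections import Counter


def member_link(src: str, dst: str, n_links: int) -> int:
    """Toy per-flow hash: XOR two IPv4 addresses, then take the result modulo
    the number of member links. The same address pair always yields the same link."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) % n_links


LOCAL = "195.69.145.10"                                # hypothetical router address
PEERS = [f"195.69.145.{i}" for i in range(100, 170)]  # hypothetical ~70 peers

# Each peer is pinned to one member link; the spread over all peers is what
# a "fair distribution" refers to.
print(member_link(LOCAL, "195.69.145.117", 2))  # 1
print(Counter(member_link(LOCAL, p, 2) for p in PEERS))
```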

What happens if you try an extended ping with a different source address when the direct ping fails?

If you see traffic arriving while the session is stuck, it should only be the BGP attempts to restart the session.

It looks like the forwarding path from your neighbor to your router works, but traffic from your router, in particular the BGP session packets, is lost somewhere.

What type of line cards are involved?

How much memory is installed in the line cards?

How many BGP routes are present in the routing table?

And how many entries are in the CEF table?

The latter have to be replicated on all line cards.

If the traffic volume allows: what happens with only one active member link in the bundle?

Hope to help

Giuseppe

d.versavel Fri, 02/20/2009 - 06:53

Hi Giuseppe,

Thanks for your valuable contributions to this forum. I suspect that bundling two distributed-CEF line cards into a port-channel puts a heavy burden on the CPU.

Moreover, my BGP peerings sometimes fall back to Active, and then the CPU goes through the roof.

Maybe there is no other solution but upgrading the system to redundant 10Gig connections without port channeling.

You can find all info in the attached text file.

Thanks and have a nice weekend,

Dirk

d.versavel Mon, 02/23/2009 - 00:27

Hi Giuseppe,

The system our router is connected to uses an ARP sponge.

The problem is apparently that our router sometimes sends only unicast ARP requests (no broadcasts) and does not reply to ARP requests in a timely manner. At that point the ARP sponge takes over and answers with its own MAC address, actively breaking the BGP session(s).
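The sponge behavior described above can be illustrated with a toy model: the sponge watches ARP requests for an address and, once too many go unanswered within a time window, starts replying with its own MAC; in this model, any sign of life from the real owner releases the address. The threshold and window values are hypothetical and are not taken from any real ARP sponge implementation; ToySponge is an illustrative name.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToySponge:
    """Toy ARP sponge for a single IP address (hypothetical parameters).

    If `threshold` ARP requests for the address go unanswered within
    `window` seconds, the sponge starts answering them with its own MAC.
    """
    threshold: int = 5
    window: float = 10.0
    sponged: bool = False
    pending: deque = field(default_factory=deque)  # times of unanswered requests

    def on_request(self, t: float) -> None:
        self.pending.append(t)
        # Forget requests that fell out of the observation window.
        while self.pending and t - self.pending[0] > self.window:
            self.pending.popleft()
        if len(self.pending) >= self.threshold:
            self.sponged = True

    def on_owner_seen(self) -> None:
        # In this model, any ARP or IP traffic from the real owner releases the address.
        self.pending.clear()
        self.sponged = False


sponge = ToySponge()
for t in range(5):          # the owner stays silent for five requests
    sponge.on_request(float(t))
print(sponge.sponged)       # True: the sponge now answers for the peer
sponge.on_owner_seen()
print(sponge.sponged)       # False
```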

Cheers,

Dirk

Giuseppe Larosa Mon, 02/23/2009 - 02:15

Hello Dirk,

In this case you can fix it with a static ARP entry on your router for the eBGP neighbor, and the other neighbor can do the same.
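A static entry uses the IOS global-configuration command arp <ip-address> <mac-address> arpa. With many peers it is easier to generate those lines than to type them; the Python sketch below renders one line per peer, assuming the dotted MAC format IOS displays. The peer table is hypothetical apart from the single IP/MAC pair quoted in this thread, and the other side would still need a matching entry for its own router.

```python
import ipaddress
import re
from typing import Dict, List

# IOS displays MAC addresses as three dotted groups of four hex digits.
MAC_RE = re.compile(r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}")


def static_arp_lines(peers: Dict[str, str]) -> List[str]:
    """Render one IOS 'arp <ip> <mac> arpa' line per {peer IP: MAC} pair."""
    lines = []
    # Sort numerically by address (this also rejects invalid IPs) for stable output.
    for ip, mac in sorted(peers.items(), key=lambda kv: ipaddress.IPv4Address(kv[0])):
        mac = mac.lower()
        if not MAC_RE.fullmatch(mac):
            raise ValueError(f"MAC not in dotted IOS format: {mac!r}")
        lines.append(f"arp {ip} {mac} arpa")
    return lines


# Hypothetical peer table; only this pair is quoted in the thread.
peers = {
    "195.69.145.117": "000c.db1f.a400",
}
print("\n".join(static_arp_lines(peers)))
```

A static entry pins the MAC so that a sponge reply cannot overwrite it, at the cost of manual updates whenever a peer changes hardware.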

Hope to help

Giuseppe

d.versavel Mon, 02/23/2009 - 04:47

Hi Giuseppe,

This would indeed be a solution, but there are about 70 peers that we don't maintain, so this option is not really scalable. However, it would be suitable for testing purposes. For instance, our peering organization claims that when they use arping against our MAC address, they receive a reply to only 50% of the requests sent.

Cheers,

Dirk
