Re: Core BGP Router not-responding

leelove01 · ‎09-17-2010

We use Nagios to monitor our 4 core routers, 2 of which are our main BGP routers. Periodically we get an alarm in Nagios stating that a router is not responding, and within a few minutes another alert clears the previous one, stating all is OK. Every time I log into the device I don't see anything going on. It's up and appears to be operational. The only thing I find is that the CPU seems to spike around the same time from the BGP Router process. We are receiving a total of 623763 prefixes from our BGP neighbors. Has anyone seen anything like this before? Could this be a problem with Nagios or the BGP process? It doesn't appear to be a memory issue, as we appear to have plenty of free memory. Any help would be great. Thanks!

BGP table version is 33649554, main routing table version 33649554
328279 network entries using 38408643 bytes of memory
623763 path entries using 32435676 bytes of memory
1136755/71241 BGP path/bestpath attribute entries using 181880800 bytes of memory
677951 BGP AS-PATH entries using 32045892 bytes of memory
4091 BGP community entries using 186952 bytes of memory
1 BGP extended community entries using 24 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 284957987 total bytes of memory
12064 received paths for inbound soft reconfiguration
BGP activity 924004/593348 prefixes, 5688171/5062032 paths, scan interval 60 secs

Neighbor        V    AS  MsgRcvd  MsgSent   TblVer InQ OutQ Up/Down State/PfxRcd
1.x.x.x         4 26482  6451091 14376090 33649554   0    0 9w6d          289913
2.x.x.x         4 26482    99567    99456 33649554   0    0 9w6d               4
3.x.x.x         4 26482    99569    99456 33649581   0    0 9w6d               8
4.x.x.x         4 26482    99579    99467 33649581   0    0 5w1d              15
5.x.x.x         4 26482   100265    99446 33649581   0    0 9w6d              36
6.x.x.x         4 29779   300772   293277 33649542   0    0 2w3d               1
7.x.x.x         4 64512   100266    99457 33649542   0    0 6w0d               1
8.x.x.x         4 64513     4572     4127 33649542   0    0 1d19h              0
9.x.x.x         4 10913 53153649    99456 33649542   0    0 9w6d          321716

----------------------------------------------------------------------------------------------------------------------------

sh memory
               Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor  462F7650   902826416   630503748   272322668   270728408   266354900
      I/O   8000000    67108864    21605604    45503260    45357096    45501532

------------------------------------------------------------------------------------------------------------------------------

cisco WS-C6504-E (R7000) processor (revision 2.0) with 983008K/65536K bytes of memory.
Processor board ID FOX103304ZT
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
1 Enhanced FlexWAN controller (2 Serial).
1 Virtual Ethernet interface
50 Gigabit Ethernet interfaces
2 Serial interfaces
1917K bytes of non-volatile configuration memory.
8192K bytes of packet buffer memory.

65536K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102

Jon Marshall · ‎09-17-2010

If your BGP neighborships are not dropping and you have no problems with intermittent connectivity, then I wouldn't worry about it, to be honest. If the CPU is spiking, it could just be normal BGP behaviour, i.e. the BGP scanner running routinely.

If the CPU is handling the BGP scanner at the time your network monitoring software is polling the device, it may just be that the 6500 is temporarily too busy to answer.
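One way to test this idea is to line up the Nagios alert times against the times of the CPU spikes. Below is a minimal Python sketch, assuming you have collected both sets of timestamps yourself (for example from the Nagios log and from periodic CPU polling). The function name, the sample timestamps, and the 30-second window are illustrative assumptions, not values from this thread.

```python
from datetime import datetime, timedelta


def alerts_near_spikes(alerts, spikes, window=timedelta(seconds=30)):
    """Return the alerts that fall within `window` of any CPU spike."""
    return [a for a in alerts if any(abs(a - s) <= window for s in spikes)]


# Hypothetical sample data: replace with timestamps from your own logs.
alerts = [
    datetime(2010, 9, 17, 10, 0, 10),
    datetime(2010, 9, 17, 11, 30, 0),
]
spikes = [
    datetime(2010, 9, 17, 10, 0, 0),
    datetime(2010, 9, 17, 10, 1, 0),
]

matched = alerts_near_spikes(alerts, spikes)
print(f"{len(matched)} of {len(alerts)} alerts coincide with a CPU spike")
```

If most alerts coincide with a spike, the busy-CPU explanation looks more likely; if few do, the cause is probably elsewhere.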

If your CPU were spiking all the time, you were low on memory, and traffic forwarding or route peerings/information were suffering, then that would be something to worry about.
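As a rough check on the memory side, the figures from the original post's output can be put in proportion. This small Python sketch only does the arithmetic on the numbers already posted (processor pool total and free bytes, and the "BGP using ... total bytes" line); the variable names are my own.

```python
# Values copied from the "sh memory" and BGP summary output in the first post.
processor_total = 902826416   # Processor pool, Total(b)
processor_free = 272322668    # Processor pool, Free(b)
bgp_bytes = 284957987         # "BGP using 284957987 total bytes of memory"

free_pct = 100 * processor_free / processor_total
bgp_pct = 100 * bgp_bytes / processor_total

print(f"Processor memory free: {free_pct:.1f}%")
print(f"BGP share of processor memory: {bgp_pct:.1f}%")
```

Roughly 30% of the processor pool is free, and BGP accounts for a little under a third of the pool, which fits the poster's observation that memory does not look like the constraint.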

Jon

gatlin007 · ‎09-18-2010

It could be that the BGP process has all the CPU at the time Nagios sends a ping and the switch is too busy to respond. If this occurs regularly, it would indicate an underlying problem. It may be possible to make the monitoring rule more tolerant of packet loss.
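To illustrate what "more tolerant" means in practice, here is a minimal Python sketch of the logic: only raise an alarm after several consecutive failed checks, so a single missed ping during a CPU spike is ignored. In Nagios this is normally configured with the `max_check_attempts` directive rather than written by hand. The function name and the sample results are assumptions made for illustration.

```python
def hard_down(check_results, max_check_attempts=3):
    """Return True once `max_check_attempts` consecutive checks have failed.

    `check_results` is an iterable of booleans, True meaning the check passed.
    """
    consecutive_failures = 0
    for ok in check_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= max_check_attempts:
            return True
    return False


# Hypothetical check history: isolated misses, never three in a row.
flapping = [True, False, True, False, False, True]
# Hypothetical check history: a real outage with three misses in a row.
outage = [True, False, False, False]

print(hard_down(flapping))
print(hard_down(outage))
```

The trade-off is detection time: with three attempts, a genuine outage is reported only after the third failed check.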

Take a close look at spanning-tree topology changes and IGP routing between the monitoring server and the target network device. If there is an L2 network involved, enable spanning-tree syslog traps to gain situational awareness. If there is an L3 network involved, take a close look at the 'age' of the routes for both the monitoring server and the target network device.

Also consider possible congestion on key links between the monitoring server and the target network device. A queuing strategy may be helpful here.

Chris

leelove01 · ‎09-18-2010

Thank you for the responses. I am also afraid that it's hinting at an underlying problem that I'm not seeing. We have another core BGP router that has even more BGP routes from peers and less free memory, but it is not having this problem. So I'm thinking that something else is going on that is causing this problem. It might not even be related to BGP; I just happen to see the BGP Router process spiking when I'm logging in. Thanks for the information. I will come up with a strategy to address this based on your input and hopefully resolve it. If not...I'll be back. Thanks!