I have two 7204VXR routers (IOS 12.2-46f, SERVICE PROVIDER feature set), each connected to a different service provider, with BGP running over both links in a multihomed inbound configuration.
HSRP is implemented on the internal interfaces and tracks the external interfaces to adjust the priority.
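For reference, an HSRP setup of this kind looks roughly like the sketch below. The interface names, addresses, group number, priority, and decrement are placeholders for illustration, not the actual values from these routers.

```
! Illustrative sketch only: all names and values are placeholders
interface FastEthernet0/0
 description Internal LAN
 ip address 192.0.2.2 255.255.255.0
 ! Virtual gateway address shared by both routers
 standby 1 ip 192.0.2.1
 ! Higher priority than the peer's default of 100
 standby 1 priority 110
 standby 1 preempt
 ! Lower the priority by 20 if the external POS interface goes down
 standby 1 track POS1/0 20
```

With values like these, the peer router also needs preempt enabled so that it takes over when the tracked interface fails.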
Five days ago, the primary router suddenly restarted. After the restart, all inbound traffic to our web servers failed and kept failing. I troubleshot the router over a couple of days, replacing all of its hardware modules (it has a POS STM external interface), and finally had to shut down its internal interface so that all traffic would go through the second router.
I noticed that while the primary was BGP peering with its ISP neighbor, it used about 81% of its memory (it has 256 MB of RAM) with about 250,000 BGP entries. Cisco Output Interpreter flags this as a dangerous threshold. Late-night tests revealed that if I put the primary router back online and disable the secondary router, inbound traffic flows successfully. But when both routers are live, memory utilization on the primary jumps to 85%, and traffic sometimes flows and sometimes suffers extreme loss. (I actually had to downgrade the IOS on the primary from the original IP PLUS IPSEC 3DES feature set to the SERVICE PROVIDER set to free up a little memory so that traffic could flow, even if with some loss.)
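Figures like these are normally read from standard IOS show commands such as the ones below. This is only a list of commands, with no output included.

```
! Reason for the last restart and the installed memory
show version
! Total, used, and free memory, plus per-process usage (including BGP Router)
show processes memory
! Number of BGP network and path entries and the memory they consume
show ip bgp summary
```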
I have been told that I need a memory upgrade to at least 512 MB of RAM.
My specific question is: is 80% memory utilization, coming mostly from the BGP Router process and its entries, overloading the router to the point that it cannot function properly? And if so, will a memory upgrade fix it?
And is it possible for a router that ran successfully non-stop for over 5 years to suddenly crash, reboot, and stop passing traffic because its BGP entries increased, in one day, to a level it cannot handle?
I will be doing a hardware upgrade, but I would like to know for sure what caused the problem.