I'm doing some testing in my lab related to the high-availability work we are doing at my job. Following best practices, I have a 3-tier design consisting of:
Aggregation (2x 3550-48 (Agg01 & Agg02))
Access (2x 2950)
Everything is connected using Rapid-PVST+, and everything is linked in the traditional triangle style.
Core and Aggregation are running OSPF on loopbacks, while end-user routes are being distributed via iBGP (for scalability). Additionally, the two aggregation switches are running HSRP for a single VLAN.
All of my failovers work as expected, including STP, HSRP and OSPF route changes.
My problem is the route convergence times using iBGP. When Agg01 goes completely offline, the Core pulls the OSPF routes quickly, but the iBGP routes stick around for ~3 minutes, blackholing the traffic in the meantime as Agg01 is not available. Eventually the hold timers expire, and the Core gets the new routes from Agg02. Unfortunately, 3 minutes is a really long time.
Does anybody have any experience with iBGP hold timers in such an environment? I want to set them very low, such as 5-10 seconds, but I am concerned about very high CPU utilization.
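
To put rough numbers on the trade-off, here is a small sketch of the timer arithmetic as I understand it. It assumes the session uses the smaller of the two peers' hold times (per RFC 4271) and that keepalives are sent at one third of the hold time, which matches the common 60/180 second defaults. The function names and the peer count of 2 are my own placeholders, and this says nothing about actual CPU cost on a 3550.

```python
# Sketch of BGP hold/keepalive arithmetic. Assumptions (not measured on real gear):
#   - negotiated hold time = min(local, peer), as described in RFC 4271
#   - keepalive interval = hold time / 3 (the usual ratio, e.g. 60s / 180s)
KEEPALIVES_PER_HOLD = 3


def negotiated_hold_time(local_hold: int, peer_hold: int) -> int:
    """Return the hold time both peers end up using, in seconds."""
    for value in (local_hold, peer_hold):
        # RFC 4271 only allows 0 (keepalives disabled) or at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    return min(local_hold, peer_hold)


def keepalive_interval(hold: int) -> float:
    """Seconds between keepalives under the hold/3 assumption."""
    return hold / KEEPALIVES_PER_HOLD


def keepalives_per_minute(hold: int, peers: int) -> float:
    """Keepalives this router sends per minute across all peers."""
    if hold == 0:
        return 0.0
    return peers * 60 * KEEPALIVES_PER_HOLD / hold


if __name__ == "__main__":
    peers = 2  # hypothetical: the Core peering with Agg01 and Agg02
    for hold in (180, 10, 5):
        print(
            f"hold={hold:>3}s  "
            f"keepalive every {keepalive_interval(hold):.1f}s  "
            f"worst-case detection ~{hold}s  "
            f"keepalives sent/min={keepalives_per_minute(hold, peers):.1f}"
        )
```

With only a couple of peers the keepalive rate itself stays small even at a 5 second hold time, so my concern is more about how the platform handles it than about the raw message count.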
Is modifying BGP hold timers the right answer? Am I missing something entirely?
Thanks for your thoughts,