All virtual routers on our backup CSS are flapping between backup and master within 1 to 5 seconds, at different times of the day. Several services also flap between alive and down during the same windows. The master CSS shows no symptoms, and no users are complaining.
From our backup CSS log.sys file:
MAR 19 12:31:09 5/1 61114 VRRP-4: Virtual router 3: master on interface 172.20.2.93
MAR 19 12:31:10 5/1 61120 VRRP-4: Virtual router 3: backup on interface 172.20.2.93
MAR 19 12:31:10 5/1 61121 VRRP-4: Master is 172.20.2.92
MAR 19 12:31:10 5/1 61123 NETMAN-2: Enterprise:Service Transition:cwh-ott-nt-053-tourismpartners -> down
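To see the flap cadence across the whole log rather than one excerpt, the role-change lines can be parsed programmatically. A minimal sketch (the regex and helper name are my own, not CSS tooling; the log lines carry no year, so one is supplied as a parameter):

```python
import re
from datetime import datetime

# Matches the VRRP role lines in the log.sys excerpt above, e.g.
# "MAR 19 12:31:09 5/1 61114 VRRP-4: Virtual router 3: master on interface 172.20.2.93"
LINE_RE = re.compile(
    r"^(?P<mon>[A-Z]{3}) (?P<day>\d+) (?P<time>\d\d:\d\d:\d\d) .*"
    r"Virtual router (?P<vrid>\d+): (?P<role>master|backup) on interface (?P<ip>[\d.]+)"
)

def parse_transitions(lines, year=2002):
    """Return (timestamp, vrid, role) tuples for every VRRP role change."""
    out = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(
                f"{year} {m['mon']} {m['day']} {m['time']}", "%Y %b %d %H:%M:%S"
            )
            out.append((ts, int(m["vrid"]), m["role"]))
    return out

log = [
    "MAR 19 12:31:09 5/1 61114 VRRP-4: Virtual router 3: master on interface 172.20.2.93",
    "MAR 19 12:31:10 5/1 61120 VRRP-4: Virtual router 3: backup on interface 172.20.2.93",
]
events = parse_transitions(log)
# Time spent as master before falling back to backup:
print((events[1][0] - events[0][0]).total_seconds())  # 1.0
```

Running this over the full log.sys would show whether the flaps cluster at particular times of day, which would help correlate them with other events.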
Our sniffer on the front end (spanning the backup CSS port) confirms that when the issue occurs, the master's VRRP multicast packets are reaching the backup CSS as expected, yet the backup CSS begins sending conflicting VRRP multicast packets (confirming that it believes it is the master). The only other unusual event I can see in that time frame is that the backup CSS fails to reply to several pings from an upstream Foundry ServerIron load balancer (which pings every 400 ms). This leads me to believe our backup CSS has an internal problem, i.e., it cannot keep up with the VRRP and ping packets it receives.
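Since the sniffer already has both routers' advertisements, decoding the VRRP v2 payload (RFC 2338/3768 fixed-header layout) shows whether each box is advertising the VRID and priority you expect. A sketch with a hand-built sample packet rather than a real capture:

```python
import struct

def parse_vrrp_v2(payload: bytes):
    """Decode the fixed header of a VRRP v2 advertisement (RFC 2338).

    Byte 0: version (high nibble) / type (low nibble, 1 = advertisement)
    Byte 1: virtual router ID, byte 2: priority, byte 3: address count
    Byte 4: auth type, byte 5: advertisement interval (seconds)
    """
    ver_type, vrid, prio, naddr, auth, interval = struct.unpack_from("!6B", payload)
    return {
        "version": ver_type >> 4,
        "type": ver_type & 0x0F,
        "vrid": vrid,
        "priority": prio,
        "addr_count": naddr,
        "interval": interval,
    }

# Hand-built sample: VRRPv2 advertisement, VRID 3, priority 110,
# one virtual address, 1 s interval (checksum bytes zeroed for brevity)
sample = bytes([0x21, 3, 110, 1, 0, 1]) + b"\x00\x00" + bytes([172, 20, 2, 93])
print(parse_vrrp_v2(sample))
```

If the backup keeps originating its own advertisements while the master's higher-priority packets are visible on the wire, the backup is failing to process them rather than failing to receive them, which would be consistent with the missed pings.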
Any idea or suggestion on how to identify the root cause?
-There are two other suspicious sets of log entries: services going up and down, which I'll investigate next because they are all http:get keepalives (on several different circuits/VLANs), and these "duplicate IP" errors:
MAR 21 17:06:53 5/1 152326 IPV4-4: Duplicate IP address detected: 220.127.116.11 00-10-58-03-49-05
MAR 21 17:06:53 5/1 152327 IPV4-4: Incoming CE 0x3c01f00, incoming (0 based) SLP 0xf
Maybe the CSS has a problem with http:get keepalives? I've increased the logging level for vrrp to "debug" but have not seen anything new yet.
-I don't know about IMM Queue full error messages because we had disabled this trap a few months ago due to too many of these errors filling up our log files. I've re-enabled that trap for now to see if they coincide with the vrrp problems. I haven't seen any new ones yet.
-We're running version 5.03 Build 15.
-We have approx. 80 services defined with icmp keepalives and approx. 10 with http:get keepalives.
-Our platform is CSS11153.
-CPU utilization on the backup CSS is between 10% and 20%; on the master CSS it is between 5% and 15%. We're trying to set up historical reporting on CPU utilization over time to see whether it changes when the problems occur.
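Once the CPU samples are being collected, correlating them with the flap timestamps is straightforward. A minimal sketch with made-up sample data (the polling itself, e.g. via SNMP, is outside this sketch; the function name and window are my own):

```python
from datetime import datetime, timedelta

def cpu_around(flap_time, cpu_samples, window_s=30):
    """Return the CPU samples taken within window_s seconds of a flap."""
    w = timedelta(seconds=window_s)
    return [(t, pct) for t, pct in cpu_samples if abs(t - flap_time) <= w]

# Hypothetical samples: (timestamp, cpu%) pairs collected every 10 s
base = datetime(2002, 3, 19, 12, 30, 0)
samples = [(base + timedelta(seconds=10 * i), 12 + (i % 3)) for i in range(20)]
flap = datetime(2002, 3, 19, 12, 31, 9)   # first flap time from log.sys

for t, pct in cpu_around(flap, samples):
    print(t.time(), f"{pct}%")
```

If the CPU readings around each flap stay in the normal 10-20% band, that would point away from simple CPU exhaustion and toward something like a queue or forwarding-path problem.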
PS: the primary CSS still appears to be functioning properly.
I have run into problems with TCP-based keepalives failing with similar symptoms. In my case, the return packets from the web server would either be lost or be forwarded out another interface (even on different VLANs).
You could take a trace and see if traffic sent to the CSS is being ignored.
The problem was related to a software bug associated with the Gig port that cropped up when the attached ethernet switch was rebooted.
Shutting down and bringing the CSS gig port back up would clear the problem.
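For reference, bouncing the port from the CSS CLI looks roughly like this; the prompt and port name below are placeholders (port naming differs across CSS models), so verify the syntax against your WebNS release:

```
CSS11150# configure
CSS11150(config)# interface 2/1
CSS11150(config-if[2/1])# admin-shutdown
CSS11150(config-if[2/1])# no admin-shutdown
```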