
PIX 535 Failover - Failover Communication Event - Short Outage

jordan.bean
Level 1

We have a pair of PIX 535s with Gigabit Ethernet interfaces for the inside/outside networks.  We're running them in an active/passive configuration.  At peak times we're pushing about 60 Mbps.  Over the past few months we've had two or three events where traffic stopped being passed for up to about a minute.  One event occurred during business hours and another occurred after-hours when usage was minimal.  The interface is connected to an SFP port on a Catalyst 3560G, which doesn't show any logs or errors.

Logically, we believe the issue must be with the PIX itself (a hardware or OS issue), the Gigabit Ethernet card in the PIX, or the SFP in the Catalyst 3560G.  We plan to start by failing over to the secondary and seeing if the events subside, but we'll have to wait a while to know whether that resolves the issue.  If not, we then plan to upgrade to 7.0(8)6 to make sure we're not exposed to any DoS vulnerabilities.  We're hesitant to make any more changes than necessary because we plan to decommission these units soon.
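For what it's worth, forcing that failover is just the standard failover command pair; a minimal sketch, assuming normal 7.x syntax:

! On the current standby (Secondary) unit, take the active role:
failover active

! Or, on the current active (Primary) unit, give up the active role:
no failover active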

Any thoughts?

PIX# show log

Nov 28 2011 20:27:16 PIX : %PIX-1-105005: (Primary) Lost Failover communications with mate on interface inside

Nov 28 2011 20:27:16 PIX : %PIX-1-105008: (Primary) Testing Interface inside

Nov 28 2011 20:27:16 PIX : %PIX-1-105009: (Primary) Testing on interface inside Passed

Feb 22 2012 15:55:33 PIX : %PIX-1-105005: (Primary) Lost Failover communications with mate on interface inside

Feb 22 2012 15:55:34 PIX : %PIX-1-105008: (Primary) Testing Interface inside

Feb 22 2012 15:55:34 PIX : %PIX-1-105009: (Primary) Testing on interface inside Passed

PIX# show failover

Failover On

Cable status: Normal

Failover unit Primary

Failover LAN Interface: N/A - Serial-based failover enabled

Unit Poll frequency 15 seconds, holdtime 45 seconds

Interface Poll frequency 15 seconds

Interface Policy 1

Monitored Interfaces 4 of 250 maximum

Version: Ours 7.0(6), Mate 7.0(6)

Last Failover at: 17:45:22 CDT Aug 10 2011

        This host: Primary - Active

                Active time: 16953870 (sec)

                Interface outside (207.71.25.99): Normal

                Interface inside (192.168.1.1): Normal

                Interface dmz1 (172.29.1.1): Normal

                Interface admin1 (172.30.30.222): Link Down (Waiting)

        Other host: Secondary - Standby Ready

                Active time: 0 (sec)

                Interface outside (207.71.25.98): Normal

                Interface inside (192.168.1.2): Normal

                Interface dmz1 (172.29.1.6): Normal

                Interface admin1 (172.30.30.223): Link Down (Waiting)

Stateful Failover Logical Update Statistics

        Link : Unconfigured.

TSPIX# show int inside

Interface GigabitEthernet1 "inside", is up, line protocol is up

  Hardware is i82543 rev02, BW 1000 Mbps

        (Full-duplex), 1000 Mbps(1000 Mbps)

        MAC address 0003.47df.847a, MTU 1500

        IP address 192.168.1.1, subnet mask 255.255.255.248

        29514684464 packets input, 15965565724242 bytes, 257280 no buffer

        Received 15643 broadcasts, 0 runts, 0 giants

        0 input errors, 0 CRC, 38 frame, 176271 overrun, 0 ignored, 0 abort

        0 L2 decode drops

        31542559822 packets output, 24913481903406 bytes, 0 underruns

        0 output errors, 0 collisions

        0 late collisions, 0 deferred

        input queue (curr/max blocks): hardware (2/0) software (0/0)

        output queue (curr/max blocks): hardware (0/100) software (0/0)

TSPIX# show ver

Cisco PIX Security Appliance Software Version 7.0(6)

Compiled on Tue 22-Aug-06 13:22 by builders

System image file is "flash:/pix706.bin"

Config file at boot was "startup-config"

TSPIX up 270 days 22 hours

failover cluster up 270 days 22 hours

Hardware:   PIX-535, 1024 MB RAM, CPU Pentium III 1000 MHz

Flash i28F640J5 @ 0x300, 16MB

BIOS Flash DA28F320J5 @ 0xfffd8000, 128KB

0: Ext: GigabitEthernet0    : address is 0003.47df.8478, irq 255

1: Ext: GigabitEthernet1    : address is 0003.47df.847a, irq 12

2: Ext: Ethernet0           : address is 0003.479b.01cb, irq 255

3: Ext: Ethernet1           : address is 0002.b3d5.2e93, irq 255

Licensed features for this platform:

Maximum Physical Interfaces : 14

Maximum VLANs               : 150

Inside Hosts                : Unlimited

Failover                    : Active/Active

VPN-DES                     : Enabled

VPN-3DES-AES                : Enabled

Cut-through Proxy           : Enabled

Guards                      : Enabled

URL Filtering               : Enabled

Security Contexts           : 2

GTP/GPRS                    : Disabled

VPN Peers                   : Unlimited

This platform has an Unrestricted (UR) license.


6 Replies

mirober2
Cisco Employee

Hi Jordan,

Both failovers (November and February) occurred because the Primary unit stopped receiving failover hello messages from its mate on the inside interface. This would trigger a failover event and since you aren't using stateful failover, all connections would have to rebuild and you would see a brief outage while everything recovered.

If you don't see any link down events on the switch for the ports that connect to the PIX's inside interfaces, the interfaces would have stayed up and the hello packets were likely dropped somewhere in the path. This can certainly happen due to a bad SFP, but I would also look at the PIX interfaces and switch ports to see if there were any CRC, overrun, or underrun errors at the time.
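A quick way to spot-check both sides for those counters; the Catalyst port number below is only a placeholder for whichever SFP port faces the PIX:

! On the PIX:
show interface inside

! On the Catalyst 3560G:
show interfaces GigabitEthernet0/25 counters errors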

Another possibility is that the PIX was too busy to process failover traffic. This could be because of high CPU utilization, memory block depletion, or high traffic bursts/loops. Whatever the problem was, it was extremely brief as you can see from the logs (the interface was marked as failed and recovered within the same second).
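For reference, the usual PIX 7.x show commands for ruling out CPU load, memory block depletion, and traffic bursts are:

show cpu usage
show blocks
show traffic
show proc cpu-hog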

-Mike

Thanks, Mike.  Our PIX 535s are connected to SFPs on two Cat 3560s.  I checked the Cat 3560 interfaces and we have 0 errors.  On the PIXes, we're seeing some "no buffer" and overrun errors.  I'm surprised we're seeing these because we're using gigabit interfaces and only hit about 60 Mbps at peak times.  We're running PIX OS 7.0.  We've been hesitant to upgrade because it's been stable for the most part and we're trying to get these units decommissioned.

Over about 15-16 hours:

PIX# show int outside

Interface GigabitEthernet0 "outside", is up, line protocol is up

  Hardware is i82543 rev02, BW 1000 Mbps

        (Full-duplex), 1000 Mbps(1000 Mbps)

       91005761 packets input, 80904499890 bytes, 50458 no buffer

        Received 21397 broadcasts, 0 runts, 0 giants

        0 input errors, 0 CRC, 0 frame, 15988 overrun, 0 ignored, 0 abort

TSPIX#  show xlate

1753 in use, 6679 most used

Our config is pretty straightforward.  We only have an inbound ACL on the outside interface.  It's about 600 lines long.  And we have about 100 static NATs.
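For illustration only (the addresses and names below are made up), the entries are the usual PIX 7.x shape:

static (inside,outside) 207.71.25.100 192.168.1.10 netmask 255.255.255.255
access-list outside_in extended permit tcp any host 207.71.25.100 eq www
access-group outside_in in interface outside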

Do I need to try to clean up this config to simplify packet processing?  Is there anything we can do to resolve these overrun errors?  I assume that if a failover hello/response packet gets dropped because of the no buffer/overrun condition, then we'll see the error and experience the behavior we've been seeing.

Hi Jordan,

The overruns can certainly cause that behavior. There are 2 main things that can cause overruns:

1) Packets arrive at the interface at a rate faster than the interface buffers can handle

2) The PIX's CPU is too busy to pull packets out of the interface's receive buffer

For #1, this is caused by the packets/sec rate rather than the throughput (Mbps) of the link. One example is a short burst of many small packets. In this case, the packets/sec rate is too high for the interface to put all of the packets into the Rx buffer, but the throughput is still very low since the packets are very small. If this is the problem, there is not much you can do from a PIX 7.0 standpoint, other than reducing the load on the interface.

For #2, you would need to see if the CPU utilization of the PIX is very high, or if you see CPU hogs in the output of 'show proc cpu-hog' at the time when the overruns are increasing. If this is the cause, the fix would be to reduce the load on the CPU or get a software fix for the hogs.
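One way to correlate the two is to snapshot the counters together whenever the overruns jump; a rough sketch, assuming output filters are available on your build:

show interface inside | include overrun
show cpu usage
show proc cpu-hog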

You should also consider using stateful failover so users don't need to rebuild all of their connections if a failover occurs. This would help reduce the time of the outage:

http://www.cisco.com/en/US/docs/security/asa/asa70/configuration/guide/failover.html
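A minimal sketch of adding a stateful failover link alongside the serial failover cable, assuming a spare interface is available (the interface name and addresses are placeholders):

interface Ethernet1
 no shutdown
failover link statelink Ethernet1
failover interface ip statelink 10.1.1.1 255.255.255.252 standby 10.1.1.2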

-Mike

It's going to be hard for us to catch the CPU hogs since the events happen rarely.  Currently these PIXes are providing NAT for internal IPs.  We're working to phase those out so that we can decommission them entirely.  That process will take some time.  Is there anything we can do in the meantime?  For example, would an OS upgrade to 8.2, so we can enable "flow control send on", help any?  Would replacing them with ASAs help?  I'd like to determine what our options are until we can get the firewalls removed entirely.  Thanks.

Hi Jordan,

The CPU hog information is stored in a buffer and timestamped so as long as you don't reload the PIX you can go back and look at the output of 'show proc cpu-hog' and see the last few hogs that occurred. You would need to keep an eye on these and check after the overruns increase again (assuming they are not constantly increasing) to see if there were any recent CPU hogs. If this turns out to be the case, I would recommend upgrading to the latest 7.2 or 8.0 image, or opening a TAC case for further investigation.

Flow control would help with issue #1 that I mentioned above, but this is not supported on any PIX version, so you would need an ASA for that.

If the switch(es) that connect to the PIX support Netflow, you may want to consider setting that up with a collector so you can identify the source of the packet bursts (assuming #1 was the issue).

Otherwise, you could consider tuning the interface poll/hold times so that the PIX waits longer before failing over. This might help since the interface communication problem seems to be very brief:

http://www.cisco.com/en/US/docs/security/asa/asa70/configuration/guide/failover.html#wp1073912
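The relevant commands, in 7.x syntax (the values shown are only placeholders; per the 'show failover' output above, the unit timers are already at 15/45 with a 15-second interface poll):

failover polltime unit 15 holdtime 45
failover polltime interface 15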

-Mike

We did some clean-up on the PIX.  We disabled HTTP inspection to minimize perfmon stats and also removed some ACLs, which dropped our ACL length from about 680 lines to 300, in hopes that would cut down on CPU utilization (though we didn't really have CPU utilization issues before, and I'm sure the load is already minimal for this platform).  We also upgraded to 7.2(4)38, and I think that's what has made the difference so far.  After a few days we're running clean with no errors.  Software queues are 0/0 and hardware queues are 0/36.  No CPU hogs, and CPU runs at about 8% during peak hours.
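For anyone finding this later, a sketch of what that HTTP inspection change typically looks like, assuming the default global_policy/inspection_default names:

policy-map global_policy
 class inspection_default
  no inspect http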
