The Firewall Services Module (FWSM) is positioned as an aggregation edge firewall. Its architecture is primarily designed to service a high number of low-bandwidth flows. When the FWSM is used to protect environments involving a few high-bandwidth flows (such as network backup applications), the observed performance on such flows is frequently lower than expected. This guide goes over the existing limitations and provides several ways to improve single TCP flow performance.
TCP Performance Considerations
Even without an FWSM in the path, the maximum throughput of a single TCP flow is capped by the combination of the TCP receive window size and the Round Trip Time (RTT) between the endpoints. The TCP window size advertised by an endpoint indicates how much data the other side can send before expecting a TCP ACK. Since the sender cannot transmit more data than the receiver’s advertised TCP window size during an RTT interval (i.e. the time it takes for the first block of data to arrive at the receiver and for the TCP ACK to come back to the sender), the maximum throughput of a TCP flow can be calculated as follows:
Maximum Throughput [bps]= (TCP Window Size [bytes] /RTT [seconds]) * 8 [bits/byte]
In this and the following calculations we assume that the send buffer of the transmitting endpoint can accommodate at least the size of the TCP receive window of the other side. Inversely, to calculate the appropriate TCP window size to take the maximum advantage of the available bandwidth, the following formula can be used:
Optimal TCP Window Size [bytes] = (Minimum Link Bandwidth [bps] / 8[bits/byte]) * RTT [seconds]
For instance, assume that host A is transmitting data to host B and host B has advertised an 8 Kbyte receive window. The RTT between the two hosts is 500 msec (0.5 sec). The maximum throughput of the TCP flow would be (8000 bytes / 0.5 sec) * 8 bits/byte = 128 Kbps. If the actual bandwidth of the link between the hosts is 10 Mbps, the optimal TCP window size would be (10,000,000 bps / 8 bits/byte) * 0.5 sec = 625 Kbytes. Notice that the link is severely underutilized when the receiver uses a TCP window of 8 Kbytes. To achieve maximum utilization, it should use a window of 625 Kbytes instead. However, here lies a problem. Per RFC 793, the length of the window size field in the TCP header is 16 bits. Hence, the maximum achievable window size value is 65535 bytes. RFC 1323 introduces a new TCP option called Window Scale that allows expanding the window size by using a fixed multiplier. For instance, host B will advertise a window scale of 4 during the three-way handshake with host A to imply that any TCP window size advertised by host B should be multiplied by 2^4 = 16. Now, host B can advertise a TCP window value of 39063 that host A (provided it supports Window Scaling) will multiply by 16 to get the actual TCP window size of 625008 bytes, which allows the transfer to occur at the maximum possible speed.
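The arithmetic above can be reproduced with a short Python sketch. The function names are our own and are only illustrative; the numbers are the ones from the example (8000-byte window, 0.5 sec RTT, 10,000,000 bps link). The 14-bit ceiling on the shift count comes from RFC 1323.

```python
import math

MAX_UNSCALED_WINDOW = 65535  # 16-bit window field (RFC 793)
MAX_WINDOW_SHIFT = 14        # largest shift count allowed by RFC 1323


def max_throughput_bps(window_bytes, rtt_seconds):
    """Maximum Throughput [bps] = (window [bytes] / RTT [sec]) * 8."""
    return window_bytes / rtt_seconds * 8


def optimal_window_bytes(link_bps, rtt_seconds):
    """Optimal Window [bytes] = (bandwidth [bps] / 8) * RTT [sec]."""
    return link_bps / 8 * rtt_seconds


def min_window_shift(window_bytes):
    """Smallest Window Scale shift whose scaled maximum covers the window."""
    shift = 0
    while shift < MAX_WINDOW_SHIFT and (MAX_UNSCALED_WINDOW << shift) < window_bytes:
        shift += 1
    return shift


def advertised_value(window_bytes, shift):
    """Value placed in the 16-bit window field for a given shift count."""
    return math.ceil(window_bytes / (1 << shift))


if __name__ == "__main__":
    print(max_throughput_bps(8000, 0.5))          # throughput with an 8 Kbyte window
    window = optimal_window_bytes(10_000_000, 0.5)
    shift = min_window_shift(window)
    print(window, shift, advertised_value(window, shift))
```

Running it with other RTT or bandwidth values is a quick way to check whether Window Scaling is needed at all for a given path.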
Another issue that significantly affects TCP throughput is packet loss. Since an endpoint can only learn about one lost TCP segment per RTT, loss significantly slows down the transfer. Furthermore, any data sent after the lost segment has to be retransmitted even if it successfully arrived at the receiver. When Window Scaling is used and the RTT is high, the amount of needlessly retransmitted data can be tremendous. RFC 2018 introduces a mechanism for Selective Acknowledgement (SACK). It allows the receiver to request retransmission of only certain TCP segments while acknowledging the receipt of subsequent data. This is accomplished by embedding the left and right edges (sequence numbers) of the successfully received data in TCP ACK retransmission requests. Consider the following example:
Notice that the TCP ACK on the segment is set to 1069276099 implying that this is the sequence number of the next expected segment from the other side. However, the embedded SACK option lists the data from 1069277089 through 1069277090 that was successfully received. Hence, the sender only needs to retransmit the data from 1069276099 through 1069277089. On large data transfers with occasional packet loss, this mechanism provides significant advantages.
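To make the SACK logic concrete, here is a minimal Python sketch of how a sender could work out which ranges still need to be retransmitted from the cumulative ACK and the SACK blocks. The function name and the `highest_sent` parameter are hypothetical, and the sketch ignores 32-bit sequence number wraparound for simplicity; it is not meant to mirror any particular TCP stack.

```python
def ranges_to_retransmit(ack, sack_blocks, highest_sent):
    """Return the [start, end) sequence ranges not yet acknowledged.

    ack          -- cumulative ACK (next sequence number the receiver expects)
    sack_blocks  -- list of (left_edge, right_edge) pairs already received
    highest_sent -- sequence number just past the last byte sent so far
    """
    holes = []
    cursor = ack
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left))  # gap in front of this SACK block
        cursor = max(cursor, right)
    if cursor < highest_sent:
        holes.append((cursor, highest_sent))  # nothing acknowledged past here
    return holes


if __name__ == "__main__":
    # Values taken from the example discussed in the text.
    print(ranges_to_retransmit(
        ack=1069276099,
        sack_blocks=[(1069277089, 1069277090)],
        highest_sent=1069277090,
    ))
```

With the values from the text, the only hole is the range from 1069276099 up to 1069277089, which matches the description above.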
FWSM Impact on Single TCP Flow Performance
FWSM employs a distributed processing architecture that involves several low-level Network Processors (NPs) as well as the general purpose Control Point. The majority of the traffic is handled by the NPs, which have the highest forwarding capacity (hence sometimes referred to as “Fastpath”). Only certain traffic (such as that subject to application inspection) is sent to the Control Point. Since the Control Point may impose additional limitations on the throughput as well as the properties of the TCP traffic, this discussion will only consider the connections flowing exclusively through the NPs. As a general rule, avoid enabling application inspection on any traffic unnecessarily as it will significantly impact the throughput of these flows.
FWSM communicates with the network through the 6Gbps data plane in the form of an Etherchannel with the local switch. The Etherchannel consists of 6 individual GigabitEthernet ports. As with any other Etherchannel, all packets in one direction of a flow (for instance, a TCP connection from host A to host B) always land on the same port. Consequently, any single TCP flow going through the FWSM cannot transmit data at a rate of more than 1Gbps. Furthermore, several flows sharing the same port will reduce the maximum throughput of each individual flow even further.
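The per-flow port pinning can be illustrated with a small Python sketch. The hash used here (CRC32 over the address and port tuple) is purely an assumption for illustration; the actual Etherchannel load-balancing algorithm depends on the switch configuration and is not described in this document. The point is only that a deterministic hash sends every packet of one flow direction to the same member port.

```python
import zlib

MEMBER_PORTS = 6                  # GigabitEthernet ports in the Etherchannel
PORT_SPEED_BPS = 1_000_000_000    # 1Gbps per member port


def member_port(src_ip, dst_ip, src_port, dst_port):
    """Pick a member port for one direction of a flow (illustrative hash)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % MEMBER_PORTS


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 40000, 5001)  # hypothetical endpoints
    # Every packet of the flow hashes to the same port, so the flow can
    # never exceed the speed of that single port.
    ports = {member_port(*flow) for _ in range(1000)}
    print(ports, PORT_SPEED_BPS)
```

Because the set of chosen ports always contains exactly one element, adding more member ports raises aggregate capacity but not the ceiling of an individual flow.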
As mentioned earlier, the FWSM architecture is optimized to handle a large number of relatively low-bandwidth flows. Due to the lock structure of the hardware Network Processors (NPs), packets belonging to a single flow cannot be processed in a truly parallel fashion. As a result, every single TCP flow is capped by a certain maximum packet rate. Consequently, the more TCP payload is sent per packet, the higher the achievable throughput. During the three-way handshake, each endpoint advertises its TCP Maximum Segment Size (MSS) value, which indicates the maximum amount of data it can process per TCP segment. With the default MTU of 1500 bytes, this typically leaves 1460 bytes for the payload. However, the default FWSM setting is to adjust the value of TCP MSS advertised by the endpoints to 1380 bytes. While this approach may be justified in certain cases, this value can be increased or the adjustment turned off altogether with the per-context sysopt connection tcpmss command:
FWSM(config)# sysopt connection tcpmss ?
configure mode commands/options:
<0-65535> TCP MSS limit in bytes, minimum default is 0,
maximum default is 1380 bytes
minimum Set minimum limit of TCP MSS
When going from 1380 to 1460 bytes of payload per packet, the typical performance increase is about 6%. To increase the amount of data transmitted in every packet even further, Jumbo Frames can be used as well. FWSM supports Jumbo Frames of up to 8500 bytes in size, so this setting can be used end-to-end (including the switch and the respective endpoint ports) to achieve much higher firewalled throughput. To enable Jumbo Frame support on the FWSM itself, use the mtu <nameif> 8500 command for every associated interface:
FWSM(config)# mtu inside ?
configure mode commands/options:
<300-8500> MTU bytes
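Since a single flow is capped by packet rate, the relative gain from a larger payload can be estimated directly from the payload sizes. The Python sketch below assumes 40 bytes of IP and TCP headers without options, and it assumes that the packet rate cap stays constant regardless of packet size; both are simplifications, and real results also remain bounded by the 1Gbps member port.

```python
IP_TCP_HEADER_BYTES = 40  # 20-byte IP header + 20-byte TCP header, no options


def payload_for_mtu(mtu_bytes):
    """TCP payload per packet for a given MTU (headers without options)."""
    return mtu_bytes - IP_TCP_HEADER_BYTES


def relative_gain(old_payload, new_payload):
    """Fractional throughput gain at a fixed packet rate."""
    return new_payload / old_payload - 1


if __name__ == "__main__":
    default_mss = 1380                    # FWSM default MSS adjustment
    full_mss = payload_for_mtu(1500)      # no MSS adjustment
    jumbo_mss = payload_for_mtu(8500)     # Jumbo Frames end to end
    print(f"{relative_gain(default_mss, full_mss):.1%}")
    print(f"{jumbo_mss / default_mss:.2f}x payload per packet with Jumbo Frames")
```

The first figure is consistent with the increase of about 6% quoted above; the second is only an upper bound on what Jumbo Frames can add under these assumptions.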
Since we have established that the TCP Window Scale and SACK options can significantly improve the performance of TCP flows, it is advisable not to clear them on the FWSM. By default, each FWSM context permits these options. You can use the show run sysopt command to ensure that the following lines are present:
FWSM#show run sysopt
sysopt connection tcp window-scale
sysopt connection tcp sack-permitted
Even when TCP SACK is permitted through the FWSM, there is a problem introduced by the TCP Sequence Number Randomization feature, which is enabled by default. The feature hides the sequence numbers generated by the endpoints behind the higher-security interface by shifting them by a certain value (determined in a random fashion for each TCP connection). However, the feature does not rewrite the right and left edge values embedded in the TCP SACK option. As a result, a TCP ACK requesting selective retransmission that traverses from a lower- to a higher-security interface makes no sense to the inside endpoint (since the TCP sequence numbers embedded in the SACK option represent the “randomized” values known only on the outside of the FWSM). Consider the following example:
Notice that the TCP ACK is requesting retransmission of the TCP segment with the sequence number of 3973898807. This number actually makes sense to the inside host since it was “de-randomized” by the FWSM on the way in. However, the embedded TCP SACK option confirms receipt of the segments from 1069277089 through 1069277090. These sequence numbers represent the “randomized” values and hence make no sense to the inside host. As a result, the inside host ignores TCP SACK and retransmits the entire stream of data, thus wasting the bandwidth. Since TCP Sequence Number Randomization is a legacy feature that was supposed to protect hosts that use predictable algorithms for initial TCP sequence number generation, it does not provide much additional security on modern TCP stacks. Hence, the feature can be selectively disabled to take full advantage of TCP SACK and achieve the maximum throughput on a single TCP flow. The best way to disable the randomization is to use Modular Policy Framework (MPF); you can also narrow the class down just to those trusted hosts that do the high-speed transfers. In the configuration below, the class-map name high_bw_tcp is only an illustrative placeholder:
class-map high_bw_tcp
 match port tcp range 1 65535
policy-map global_policy
 class high_bw_tcp
  set connection random-sequence-number disable
service-policy global_policy global
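The mismatch between the rewritten ACK and the untouched SACK edges can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not of the FWSM internals: the function names and the `fix_sack` switch are hypothetical, the per-connection offset is derived from the two ACK values quoted in the text on the assumption that they belong to the same segment seen on either side of the FWSM, and the 256 Kbyte plausibility window is an arbitrary choice.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def derandomize(ack, sack_blocks, offset, fix_sack=False):
    """Translate an outside ACK for the inside host.

    Only the ACK field is shifted back by the per-connection offset; the
    SACK edges are left as they are unless fix_sack is set (hypothetical).
    """
    inner_ack = (ack - offset) % SEQ_MOD
    if fix_sack:
        blocks = [((l - offset) % SEQ_MOD, (r - offset) % SEQ_MOD)
                  for l, r in sack_blocks]
    else:
        blocks = list(sack_blocks)
    return inner_ack, blocks


def sack_is_plausible(ack, sack_blocks, window_bytes):
    """A usable SACK block starts ahead of the ACK, within the window."""
    return all(0 < (left - ack) % SEQ_MOD <= window_bytes
               for left, _ in sack_blocks)


if __name__ == "__main__":
    offset = (1069276099 - 3973898807) % SEQ_MOD  # assumed, see lead-in
    outside_sack = [(1069277089, 1069277090)]
    ack, blocks = derandomize(1069276099, outside_sack, offset)
    print(ack, blocks, sack_is_plausible(ack, blocks, 256 * 1024))
    ack, blocks = derandomize(1069276099, outside_sack, offset, fix_sack=True)
    print(ack, blocks, sack_is_plausible(ack, blocks, 256 * 1024))
```

In the first case the SACK edges lie far outside the inside host's window and are useless to it; in the second case, where the edges are shifted by the same offset, they land just ahead of the ACK. Disabling randomization removes the offset altogether, so the issue does not arise.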
Yet another factor that can negatively impact TCP flow performance is packet reordering. When multiple paths between the endpoints are used and load-balancing is deployed, it is possible for the receiver to get TCP segments out of order. Sometimes, such a condition can be mistakenly recognized as packet loss, resulting in unnecessary retransmissions and a reduction in throughput. Due to the parallel processing architecture, FWSM itself may put certain TCP segments out of order. This is especially true for those flows that involve smaller packets within a batch of larger ones. To combat this undesirable behavior, FWSM contains a module called the NP Completion Unit that ensures that the packets leave the NPs in the same order that they came in. It should be noted that it will only preserve the ingress order and not correct the out-of-order conditions introduced before the FWSM. Furthermore, it will not be able to preserve the order of TCP segments flowing through the Control Point or of traffic processed by the FWSM capture feature. While the Completion Unit may introduce minor latency into the packet processing path, the typical performance improvements significantly outweigh this side effect. The Completion Unit is disabled by default but can be enabled globally (from within the admin context if running in multiple-context mode) with the sysopt np completion-unit command:
FWSM(config)# sysopt np ?
configure mode commands/options:
completion-unit Set Completion-unit on FP NPs
Additionally, ensure that the FWSM packet capture functionality is disabled on the high-bandwidth flows as it negates the effect of the Completion Unit. The Switched Port Analyzer (SPAN) feature on the switch should be leveraged for any performance-related FWSM troubleshooting tasks instead.
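The idea behind releasing packets in ingress order can be sketched as a generic reorder buffer in Python. This is not the FWSM implementation, which is not described in this document; the class and method names are hypothetical and only illustrate the general technique of holding finished packets until all earlier ones have completed.

```python
import heapq


class InOrderRelease:
    """Release packets in ingress order even if processing finishes out of order."""

    def __init__(self):
        self._next_index = 0   # ingress index of the next packet to release
        self._finished = []    # min-heap of (ingress_index, packet)

    def complete(self, ingress_index, packet):
        """Record a processed packet and return every packet now releasable."""
        heapq.heappush(self._finished, (ingress_index, packet))
        released = []
        while self._finished and self._finished[0][0] == self._next_index:
            released.append(heapq.heappop(self._finished)[1])
            self._next_index += 1
        return released


if __name__ == "__main__":
    unit = InOrderRelease()
    # A small packet (index 1) finishes before the large one ahead of it.
    print(unit.complete(1, "small"))   # held back until index 0 is done
    print(unit.complete(0, "large"))   # both leave, in ingress order
```

The sketch also shows where the minor added latency comes from: a packet that finishes early has to wait for its predecessors before it can leave.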
To achieve the maximum single TCP flow performance when going through an FWSM, one should implement the following:
- Use the optimal TCP window size as well as TCP Window Scale and SACK mechanisms on the endpoints.
- Ensure TCP Window Scale and SACK options are not cleared by the FWSM.
- Increase the default limit or disable TCP MSS adjustment on the FWSM.
- Disable TCP Sequence Number Randomization for the high-bandwidth flows on the FWSM.
- Enable NP Completion Unit on the FWSM.
- Ensure that the traffic is not being captured on the FWSM itself.
- Use Jumbo Frames end to end.
All tests were done with iPerf using a 256 Kbyte TCP window size between two test hosts connected to 1Gbps ports on a single Cisco 6509 switch. The FWSM was running 4.0(12) software. Bear in mind that individual results may vary depending on the specific hardware and software levels used as well as the traffic patterns and the amount of other load on the FWSM.
|Test Case Description||Transfer Size (Gbytes)||Bandwidth (Mbits/sec)|
Default FWSM Configuration
Optimized FWSM Configuration
Optimized FWSM Configuration With Jumbo Frames