Enable PBR on Cat3750 Stops Packet Flow

Unanswered Question
Aug 20th, 2009

I am trying to set up a simple policy-based routing scenario to force traffic from two internet proxies to use an alternate internet connection. See attached diagram for details.


I have the scenario set up in my work area (a very similar, but not identical, topology), and PBR works there as I would like.


However, when I apply the ip policy to the incoming interfaces on the real network devices, all communication to/from the c3750 switches stops flowing.


C3750s are running 12.2(46)SE. There are two C3750s in a stacked configuration.


The essential (and simple) pbr configuration is:


ACL to match traffic from two proxies:

ip access-list extended www_inet_policy
 permit ip host 10.200.aaa.bbb any
 permit ip host 10.200.ccc.ddd any


Route Map for PBR:

route-map www_policy permit 10
 match ip address extended www_inet_policy
 set ip next-hop 10.200.48.241


PBR Applied to C3750 Interfaces (see diag)

interface GigabitEthernet1/0/49
 description Core1 G3/5
 no switchport
 ip address 10.200.96.30 255.255.255.252
 ip pim sparse-mode
 ip route-cache policy
 ip policy route-map www_policy
 mls qos trust dscp

interface GigabitEthernet2/0/49
 description Core2 G3/5
 no switchport
 ip address 10.200.96.34 255.255.255.252
 ip pim sparse-mode
 ip route-cache policy
 ip policy route-map www_policy
 mls qos trust dscp
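
As a sketch of how the applied policy could be verified once it is in place, the standard IOS show commands below report the route map, the interfaces it is bound to, and the ACL. The names are the ones used in the configuration above. On this platform the match counters may not increment for packets forwarded in hardware, so a zero count is not by itself proof that PBR is inactive.

```
C3750Sw1#show route-map www_policy
C3750Sw1#show ip policy
C3750Sw1#show access-lists www_inet_policy
```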



The ironic thing is that even though all communication ceased between the C3750 and the Core switch/routers, EIGRP peering between the C3750 and the Core switch/routers remained up!


Could this be a bug? Could the C3750 be running out of TCAM resources? I checked the TCAM via 'show platform tcam usage' and 'show platform tcam utilization' (not when PBR was applied) and the resources look to be well within limits.
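
A minimal sketch of how the TCAM question could be settled: capture the same two commands once before and once while the policy is applied, then compare the policy-routing lines. No output is shown here because none was captured with PBR applied.

```
! baseline, before applying 'ip policy route-map'
C3750Sw1#show platform tcam utilization
C3750Sw1#show platform tcam usage

! repeated from the console while the policy is applied
C3750Sw1#show platform tcam utilization
C3750Sw1#show platform tcam usage
```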


Thank you in advance for your help.


Ron Buchalski




Peter Paluch Thu, 08/20/2009 - 11:21
User Badges:
  • Cisco Employee,

Hello Ron,


There indeed is a possibility that the TCAM does not have enough space for the policy routing entries.


Have you checked the current state of the Switch Database Management (SDM)? Perhaps you need to change to a different SDM template to allow more space for the policy-based routing entries.


Check these links for details:


http://www.cisco.com/en/US/docs/switches/lan/catalyst3750/software/release/12.2_46_se/configuration/guide/swsdm.html


http://www.cisco.com/en/US/docs/switches/lan/catalyst3750/software/release/12.2_46_se/command/reference/cli2.html#wp9598134


Best regards,

Peter


Jon Marshall Thu, 08/20/2009 - 11:34
User Badges:
  • Super Blue, 32500 points or more
  • Hall of Fame,

    Founding Member

  • Cisco Designated VIP,

    2017 LAN, WAN

Ron


In addition to Peter's post, which SDM template are you actually running?


3750# sh sdm prefer


It needs to be the SDM routing template.


In addition, are you running IP Services on your switches?
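
For reference, a sketch of checking and changing the template, assuming the switch is not already on the routing template. The new template only takes effect after a reload, and 'show version' is one place where the feature set of the running image can be read (an IP Services image name contains "ipservices").

```
3750# show sdm prefer
3750# show version | include IOS
3750# configure terminal
3750(config)# sdm prefer routing
3750(config)# end
3750# reload
```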


Jon

ronbuchalski Thu, 08/20/2009 - 11:40

Peter and Jon,


I've already confirmed the correct operation of SDM. I have the ip services image (which supports pbr) and I had to set sdm prefer to 'desktop routing' in order to enable pbr. This required a reset of the switch.


Ron Buchalski



Jon Marshall Thu, 08/20/2009 - 11:49

Ron


Are you seeing any hits on the PBR ACL?


Is this the exact config you are using on the production switch?


Jon

ronbuchalski Thu, 08/20/2009 - 12:18

Jon,


Other than the IP addresses of the proxies being blanked, this is the exact configuration that I used.


There are no hits showing on the pbr acl. But, then again, there are no hits showing on the pbr acl on the test network where pbr works exactly like I want it to work.


Ron Buchalski



Jon Marshall Thu, 08/20/2009 - 12:49

Ron


Apologies for asking all these questions.


Are both switches in the stack using the routing template?


In your lab, were you working with a stack as well, or was it a single switch?


Jon

ronbuchalski Thu, 08/20/2009 - 12:58

Jon,


Ask all the questions you need. I'm looking for resolution.


The switches are both using the routing template. I explicitly configured only the master, not both switches, but when I check, both show that they are running the routing template:


C3750Sw1#show sdm prefer
 The current template is "desktop routing" template.
 The selected template optimizes the resources in
 the switch to support this level of features for
 8 routed interfaces and 1024 VLANs.

  number of unicast mac addresses:                  3K
  number of IPv4 IGMP groups + multicast routes:    1K
  number of IPv4 unicast routes:                    11K
    number of directly-connected IPv4 hosts:        3K
    number of indirect IPv4 routes:                 8K
  number of IPv4 policy based routing aces:         0.5K
  number of IPv4/MAC qos aces:                      0.5K
  number of IPv4/MAC security aces:                 1K

C3750Sw1#session 2

C3750Sw1-2#show sdm prefer
 The current template is "desktop routing" template.
 The selected template optimizes the resources in
 the switch to support this level of features for
 8 routed interfaces and 1024 VLANs.

  number of unicast mac addresses:                  3K
  number of IPv4 IGMP groups + multicast routes:    1K
  number of IPv4 unicast routes:                    11K
    number of directly-connected IPv4 hosts:        3K
    number of indirect IPv4 routes:                 8K
  number of IPv4 policy based routing aces:         0.5K
  number of IPv4/MAC qos aces:                      0.5K
  number of IPv4/MAC security aces:                 1K

C3750Sw1-2#exit

C3750Sw1#



In the lab, I only had one 3750, not a stack. One of my co-workers removed a stack of two 3750s from service this morning, so I may be able to grab them and use them to test.


Ron Buchalski


Jon Marshall Thu, 08/20/2009 - 13:18

Ron


Might be worth testing with a stack.


The only mention I could find of a bug was one fixed in your release of code, and it had to do with QoS and PBR on the same interface.


But even that bug would not stop all traffic between the stack and the internal core switches.


A couple more questions:


Are you using static routes for the next hops towards the Internet in the routing table? I just thought I'd check, as this is related to the bug.


In your lab, did you enable QoS on the interfaces as well as multicast?


When you say all traffic stops between the core switches and the 3750 stack, how did you verify this? Did you try telnetting/pinging to the 3750 stack from the core when you applied PBR?


Jon



ronbuchalski Thu, 08/20/2009 - 13:52

Jon,


In answer to your questions:


A static route (0.0.0.0/0 pointing to 10.200.48.254) is the default route to the Internet. The 3750 takes this route and injects it into EIGRP. There is no explicit route to the alternate Internet gateway (10.200.48.241) since it is on a directly connected subnet of the 3750.
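
The routing described above can be sketched as follows. The static default route and both next-hop addresses come from this post; the two exec commands are simply one way to confirm that the PBR next hop is directly connected and has an ARP entry, and no output is shown because none was posted.

```
! default route toward the normal Internet gateway (from the post)
ip route 0.0.0.0 0.0.0.0 10.200.48.254

! exec-mode checks for the alternate gateway used as the PBR next hop
C3750Sw1#show ip route 10.200.48.241
C3750Sw1#show ip arp 10.200.48.241
```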


In the lab I do have QoS enabled. I did not have multicast enabled initially, but I enabled it later and tested again, and PBR still works.


I verified that traffic stopped in several ways:


1) I was initially logged into the C3750 switch via telnet, and the telnet session disconnected. I could not reconnect to the C3750.


2) I connected to the console and logged into the C3750, but had to use local credentials for authentication because it could not reach the AAA server (located on an inside network)


3) I tried pinging from the inside to the C3750 as well as other devices beyond the C3750 (toward the internet), and received no response.


As soon as 'ip policy route-map' statement was removed from the Gi1/0/49 and Gi2/0/49 interfaces of the C3750, communication between the inside nets and the C3750 was enabled.
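
The rollback described here corresponds to the commands below, using the interface and route-map names from the original configuration. This is a sketch, not a capture from the production stack.

```
C3750Sw1#configure terminal
C3750Sw1(config)#interface GigabitEthernet1/0/49
C3750Sw1(config-if)#no ip policy route-map www_policy
C3750Sw1(config-if)#interface GigabitEthernet2/0/49
C3750Sw1(config-if)#no ip policy route-map www_policy
C3750Sw1(config-if)#end
```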


I did not check to see if there was communication between devices on the other side of the C3750 (10.200.48.254 and 10.200.48.241) and the C3750.


And, as I mentioned, the EIGRP peering between the C3750 and the Core switch/routers never came down. That is very odd. Given that the communication block causes disruptions to service, I could not leave pbr in place for a long time to try and troubleshoot the issue. I did leave it in place for about 20 minutes, during which time EIGRP remained peered.


What bugs did you find related to static routes?



Ron Buchalski


yagnesh_tel Thu, 08/20/2009 - 13:53
User Badges:
  • Silver, 250 points or more

Although this does not explain a complete stop in packet forwarding, I would like to point something out, since there is no obvious reason behind this issue.


I always rely on the default method, CEF, when using PBR, so I was surprised to see 'ip route-cache policy' in your config. I guess you need to have fast switching enabled for this on switches, right?


I tried to look for a related bug for your code release but couldn't find any that stops all traffic.
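
To illustrate the point about the default switching path, a sketch of the interface configuration without the optional fast-switching command is shown below. It assumes the suggestion is simply to drop 'ip route-cache policy' and leave 'ip policy route-map' in place; whether this changes the behavior reported in this thread was not tested.

```
interface GigabitEthernet1/0/49
 no ip route-cache policy
 ip policy route-map www_policy
!
interface GigabitEthernet2/0/49
 no ip route-cache policy
 ip policy route-map www_policy
```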

ronbuchalski Mon, 08/31/2009 - 13:39

When I issue the 'show ip cache policy' command, the output shows nothing cached.


I tried it with 'ip route-cache policy' configured on the input interfaces.


I'm not sure what you mean when you say that you rely on the default method- cef while using PBR. Could you please expand on this?


Thank you,


Ron Buchalski


ronbuchalski Mon, 08/31/2009 - 14:15

UPDATES:


Well, I was able to test with a 3750 stack, using as much of the same configuration as possible (99% +) and was not able to get it to replicate the problem.


In response to your questions:


I am using a default route (0.0.0.0/0) to point to the normal default next hop (toward the internet). Originally I did not have a similar 0.0.0.0/0 route for the alternate internet path (the one I am trying to policy route towards). In my test stack I did try adding a 0.0.0.0/0 route (with administrative distance 210) pointing toward the alternate internet path.


I ran into an interesting problem with my test stack, which may be related to the original problem I was experiencing. With my test stack, when I have policy routing enabled (and working), and I reboot the stack, policy routing no longer works. The stack sends traffic toward the default route, even though the ACL for the policy route (and the route-map) show matches for the traffic that should be policy routed. What I found was that if I issue a 'clear arp' command on the stack, within a few seconds the policy routing works again. If this is the reason why the production stack stops passing traffic I am concerned that this could be an issue if the stack takes a hit and reboots, thus cutting off the network until someone gets to the console to issue a 'clear arp' command. Sounds like a bug to me!
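
The workaround described above can be sketched as the sequence below. It assumes the affected ARP entry is the one for the policy next hop 10.200.48.241, which the post does not state; 'clear arp' is written exactly as it was issued on the test stack.

```
! check whether the policy next hop is resolved after the reboot
C3750Sw1#show ip arp 10.200.48.241
C3750Sw1#ping 10.200.48.241

! workaround from the post: flush the ARP cache, then re-check
C3750Sw1#clear arp
C3750Sw1#show ip arp 10.200.48.241
C3750Sw1#show route-map www_policy
```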


My lab stack had QoS and multicast enabled, just as the production network does. If this is an issue caused by multicast, I believe that I can turn off multicast routing because it doesn't look like we're routing any multicast traffic through this switch stack out to the internet (or any other attached subnets).


Regarding the verification of traffic stopping between the stack and the core switches, I was able to see this because I originally made the change via telnet (from an inside location one L2 hop from the core) and was immediately cut off from the stack. I had to run into the Data Center with my laptop and console into the stack. When I logged into the console, AAA could not authenticate to the TACACS+ server (which is also on an inside subnet). Also, all production traffic from inside subnets which was accessing the internet and other subnets beyond the stack lost connectivity. As soon as I removed 'ip policy route-map xxx' from the incoming interfaces on the stack, connectivity to the inside was immediately restored.


Two other things I discovered about the production switch stack:


1) The two switches were only connected via a single stack cable (between the Stack 1 ports on the two switches). I tried emulating this scenario on my test stack, but it did not change the results. I could not find anything which would indicate that this stack connectivity is invalid, although the production stack does indicate (via 'show switch stack-ring speed') that the speed is 16G and the ring configuration is Half. I plan to install the second stack cable during a maintenance window, and also terminate the cables differently (Sw1 Stack 1 to Sw2 Stack 2 and Sw1 Stack 2 to Sw2 Stack 1).


2) The stack was configured for cluster management (the 'cluster enable ExternalSys 0' command), which also adds some dynamic ACLs for Cluster-NAT and Cluster-HSRP. I don't know why this was in there, but I removed the cluster enable command, which removed the dynamic ACLs. I tried having this in my test stack, and still could not replicate the problem.
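
A sketch of the commands that could be used to check the stack cabling before and after adding the second cable. 'show switch stack-ring speed' is the command already quoted above; the other two are standard stack commands on this platform, and no output is shown because none was posted.

```
C3750Sw1#show switch
C3750Sw1#show switch stack-ports
C3750Sw1#show switch stack-ring speed
```

With both cables installed, the ring configuration would be expected to report Full rather than Half.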



If you could point me to the bugs that you've found regarding this code release I would appreciate it. I'm running out of options.


Thank you,

Ron Buchalski


ronbuchalski Mon, 09/21/2009 - 10:22

I implemented policy routing in the production network, and it is now working without the problems I described initially. While I never did isolate what was causing the problems, I did add the missing stack cable, and cabled the stack properly. So, it is now working as planned.


Thank you to all for your questions and insight.


-rb

