
Routers misbehaving for local traffic

ijdod
Level 1

We have two 7206/NPE-G1 routers operating together. Each is connected to its own 3550 via a two-link Gigabit EtherChannel, and the two switches are connected to each other as well as to the rest of the network. Both routers are connected to all vlans and run HSRP: router 1 is active for the odd vlans, router 2 for the even vlans. No L3 switching is configured. The routers exchange routing information through EIGRP; most vlan interfaces are passive.
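For context, the odd/even split is just a matter of HSRP priorities per vlan. A minimal sketch of what that looks like on router 1 (the vlan 171 subnet, group numbers and priorities here are illustrative, not our exact config):

! Router 1: higher priority -> active for odd vlans, standby for even vlans
interface Port-channel1.171
 encapsulation dot1Q 171
 ip address 10.171.2.251 255.255.0.0
 standby 171 ip 10.171.2.90
 standby 171 priority 110
 standby 171 preempt
!
interface Port-channel1.170
 encapsulation dot1Q 170
 ip address 10.170.2.251 255.255.0.0
 standby 170 ip 10.170.2.90
 standby 170 priority 90
 standby 170 preempt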

I'll stick to 10.170.0.0/16 as example, but we've seen it happen on all tested vlans.

After noticing an unusual amount of traffic on the vlans, we discovered that a lot of traffic was bouncing between the routers. More monitoring and sniffing revealed that these were packets with a valid (existing) destination MAC and IP address, bouncing between the sub-interfaces of the destination network. A unicast packet for 10.170.5.53 (with the correct MAC for that host) would bounce between the two router interfaces 10.170.2.251 and 10.170.2.252. These packets were also flooded by the L2 switches. The source IP address would usually be in a different IP network, but we have also sniffed packets sourced from 10.170.1.1, a server, that were apparently resent by the router (and the bouncing began...).

Another thing we noticed is that pretty much all of the sniffed traffic is headed for HP JetDirect external print servers. At one point we did notice an accidentally switched-off JetDirect that showed the bouncing when traced, but was fine after it was switched back on. Traceroutes typically show either one hop (expected) or a single bounce (unexpected...), so it would appear that a non-responding host may somehow play a role.

What we've tried:

- Enabling EIGRP on the vlan interfaces. Didn't work.

- Enabling proxy ARP. Didn't work.

Still to try, no particular order (service windows and all that):

- Upgrade JetDirect firmware

- Non-port-channel, trunked interface

- Non-trunked interface

- Upgrade IOS to either 12.2(15)T9 or 12.3(3a)

A typical interface configuration would be like this:

interface Port-channel1.170
 description LAN170_ATOO
 encapsulation dot1Q 170
 ip address 10.170.2.252 255.255.0.0
 ip helper-address 10.201.2.53
 ip helper-address 10.203.1.10
 no ip redirects
 no ip proxy-arp
 ip pim dense-mode
 standby 170 ip 10.170.2.90
 standby 170 priority 110
 standby 170 preempt
end

7200 software is JS (Ent. Plus) 12.2(15)T7

We're pretty much grasping at straws now. Any ideas, hints or solutions would be appreciated.

11 Replies

ruwhite
Level 7

Could you post a show ip route for one of the affected addresses, from both of the routers it is looping between?

Russ.W

R_NIJV01#sh ip route 10.170.5.23
Routing entry for 10.170.0.0/16
  Known via "connected", distance 0, metric 0 (connected, via interface)
  Redistributing via eigrp 90
  Routing Descriptor Blocks:
  * directly connected, via Port-channel1.170
      Route metric is 0, traffic share count is 1
R_NIJV01#

R_NIJV02#sh ip route 10.170.1.1
Routing entry for 10.170.0.0/16
  Known via "connected", distance 0, metric 0 (connected, via interface)
  Redistributing via eigrp 90
  Routing Descriptor Blocks:
  * directly connected, via Port-channel1.170
      Route metric is 0, traffic share count is 1
R_NIJV02#

Po1.170 is a passive interface as far as EIGRP is concerned.

ijdod
Level 1

Some additional information:

On the Po1.170 interface, it appears that the routers are essentially redirecting local traffic. Installing an outbound ACL (deny ip 10.170.0.0 0.0.255.255 10.170.0.0 0.0.255.255) actually kills off most of the suspect traffic.
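For reference, the workaround as config looks roughly like this (the access-list number 170 is just an example):

access-list 170 deny   ip 10.170.0.0 0.0.255.255 10.170.0.0 0.0.255.255
access-list 170 permit ip any any
!
interface Port-channel1.170
 ip access-group 170 out

It simply stops the router from forwarding intra-subnet (10.170.x.x to 10.170.x.x) packets back out the same subinterface, which is what was feeding the loop.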

On other subinterfaces we have a slightly different problem: the traffic sniffed there does originate from a different network, but still keeps bouncing around until the TTL reaches 0.

Best I can tell, what seems to happen is that a unicast packet is sent, but is flooded on the L2 network. Both routers pick up on this and resend the packet with a decremented TTL. They each pick up the packet the other sent out, and so on, and so forth. On a TCP packet we sniffed, the only things changing were the source MAC address and the TTL; sequence numbers were the same.

Show us a trace route from outside of the 10.170 net to a device inside that net. Then from inside the net to a device outside.

Don't have a copy/paste at hand, and the symptoms are eliminated.

What we saw tracing from the outside:

C:\>tracert 10.170.5.11

Tracing route to ps0085.infonet.remu [10.170.5.11]
over a maximum of 30 hops:

  1   <10 ms   <10 ms   <10 ms  10.203.2.90
  2   <10 ms   <10 ms   <10 ms  10.170.2.252
  3   <10 ms   <10 ms    16 ms  ps0085.infonet.remu [10.170.5.11]

Trace complete.

C:\>

10.203.2.90 is R_NIJV01, which is HSRP active for that vlan and HSRP standby for 10.170; 10.170.2.252 is the address of R_NIJV02. Hop 2 should not have been there, methinks.

Currently, with the workaround in place, the trace is as it should be:

C:\>tracert 10.170.5.11

Tracing route to ps0085.infonet.remu [10.170.5.11]
over a maximum of 30 hops:

  1   <10 ms   <10 ms   <10 ms  10.203.2.90
  2      *     <10 ms    16 ms  ps0085.infonet.remu [10.170.5.11]

Trace complete.

ijdod
Level 1

Port-channel was acting up.

After abusing the service window last night, we tried moving one of the vlans to a normal, non-etherchannel port on the same router. This eliminated the symptoms.

We then proceeded to shut one of the two interfaces in the channel. This also eliminated the symptoms (after some quite spectacular memory errors, on both routers...).

So, apparently, the two links of the port-channel were the catalyst for the problem.

That's interesting.... Is it working with the portchannels now, or do you have it down at the moment? If you can, open a TAC case on this one, I think. It certainly seems odd to me.

Russ.W

At the moment it is working in a portchannel config, but with only one link active. I fully agree with you on the odd part.

We can't open a TAC case ourselves. I think I can do so through our reseller, or through our Cisco rep.

Some updates, for those who may be interested:

The problem was not limited to port-channels. Moving the entire config to a 'normal' dot1q configuration didn't solve the problem, and neither did upgrading the IOS to 12.3.5a.

A TAC case has been opened; no solution as of yet.

The problem turned out to be the router interfaces going into promiscuous mode because of the number of MAC addresses configured on the interface (caused by HSRP). As a result, the router started processing all traffic received on its interface.
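For anyone hitting the same thing: each HSRP (version 1) group adds a virtual MAC of the form 0000.0c07.acXX (XX = group number in hex) that the interface's hardware address filter has to accept. With enough groups on one port or port-channel that filter can overflow, and the interface falls back to promiscuous mode and punts everything to the CPU. A rough illustration; the group numbers follow this thread, but the vlan 203 standby address is assumed:

! Each standby group adds one virtual MAC the interface must listen for:
!   standby group 170 -> 0000.0c07.acaa  (0xAA = 170)
!   standby group 203 -> 0000.0c07.accb  (0xCB = 203)
interface Port-channel1.170
 encapsulation dot1Q 170
 standby 170 ip 10.170.2.90
!
interface Port-channel1.203
 encapsulation dot1Q 203
 standby 203 ip 10.203.2.90
!
! "standby use-bia" would make HSRP answer with the burned-in address instead
! of a virtual MAC (no extra filter entry per group), at the cost of relying on
! gratuitous ARP when the active router changes.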

How many HSRP instances were there? Did TAC end up classifying this as a bug? Or is this documented behavior once a given HSRP threshold is crossed?