Re: Post mortem analysis on a multicasting issue?

rvandolson · 08-14-2008

Hello all; I'm looking to solicit some opinions on what might have caused an issue we experienced recently. A caveat: I am a systems administrator, so I don't know the exact models of all the hardware involved (but could potentially get this information if it would be helpful).

On one of our networks, we have a cluster of servers running an application that depends on multicast. After a recent issue that caused these servers and the networking equipment to cycle power at least once, we found that the multicast traffic used by these servers was not propagating properly to the other servers in the cluster.

These servers are all on the same subnet (a /24) and attached to the same Catalyst (albeit on different blades), which appears to be configured to do layer 2 only and not to block multicast in any way, shape, or form.

However, we observed that the initial packet in a multicast transmission would be received by the other members of the multicast group exactly once. No further packets were received. A tcpdump (snoop in this case, as these are Solaris machines) showed that the multicast packets were indeed leaving the multicast "transmitter", but a corresponding tcpdump on the receivers showed that after the first received packet, no more were entering the interface (at least to the point where they'd be processed by the sniffer).
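The same observation can be reproduced without a sniffer by using a small sender/receiver pair that numbers each datagram, so it is obvious which packets arrive. What follows is only a minimal Python sketch, not the application's own code: the group `227.1.2.3`, the port `5007`, and all function names are hypothetical placeholders.

```python
import socket
import struct
import time

GROUP = "227.1.2.3"  # hypothetical group inside 227.0.0.0/8
PORT = 5007          # hypothetical UDP port


def make_payload(seq: int) -> bytes:
    """Encode a sequence number as a 4-byte big-endian payload."""
    return struct.pack("!I", seq)


def read_seq(data: bytes) -> int:
    """Decode the sequence number from the first 4 bytes of a payload."""
    return struct.unpack("!I", data[:4])[0]


def membership_request(group: str, iface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq value: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def send(count: int = 10, interval: float = 1.0) -> None:
    """Send `count` numbered datagrams to the group, one per `interval` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the traffic on the local subnet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    try:
        for seq in range(count):
            sock.sendto(make_payload(seq), (GROUP, PORT))
            time.sleep(interval)
    finally:
        sock.close()


def receive() -> None:
    """Join the group and print the sequence number of every datagram received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Joining makes the host send an IGMP membership report for GROUP.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(GROUP))
    while True:
        data, sender = sock.recvfrom(1024)
        print(f"seq {read_seq(data)} from {sender[0]}")
```

Running `receive()` on one cluster member and `send()` on another should print a continuous run of sequence numbers when multicast is propagating; a run that stops after the first number would match the behaviour described above.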

When we placed these machines on a dedicated Catalyst switch, which was attached to the previous switch via an uplink crossover cable, the multicast immediately began working as expected. As soon as the machines were placed back on the original switch, the problem returned.

Any ideas on what might have been the culprit here? We initially suspected an IGMP routing device on the same subnet, but it would seem that the problem should have persisted even after we added the secondary switch if this were the case.

The MACs for the multicast groups, as reported by arp -an on the machines themselves, did correspond with an IGMP MAC address range, FWIW.

Just looking for suggestions / brainstorming on this issue as we will likely address it again later.

Thanks in advance, and happy to provide additional technical information if requested.

Ray

rvandolson · 08-14-2008

I should also mention that the multicast range used by this software was in the 227.0.0.0/8 subnet.
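For reference, IPv4 multicast groups map onto Ethernet addresses by placing the low 23 bits of the group address under the fixed prefix 01:00:5e, so the MACs shown by arp for groups in this range can be checked by hand. A minimal Python sketch of that mapping follows; the function name `multicast_mac` and the example group addresses are illustrative assumptions, not values taken from the application.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Return the Ethernet MAC that an IPv4 multicast group maps to."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Keep the low 23 bits of the group and place them under 01:00:5e.
    low23 = int(addr) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    # Format the 48-bit value as six colon-separated hex octets.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("227.1.2.3"))    # 01:00:5e:01:02:03
print(multicast_mac("227.129.2.3"))  # 01:00:5e:01:02:03 (top bit of 129 is dropped)
```

Because only 23 bits survive the mapping, 32 different group addresses share each MAC, which is worth keeping in mind when matching arp output against the groups in use.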

Joseph W. Doherty · 08-15-2008

What you're describing sounds like IGMP snooping is active on the original Catalyst without an active IGMP querier.

Since you note the problem appeared after a power cycle, perhaps the startup configuration didn't match the running configuration.

Determine whether IGMP snooping is active on the Catalyst. If it is, disable it and see if the problem disappears. If it does, reactivate it and determine whether there's an active IGMP querier, a role normally performed by the router. If not, activate one.
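If switch access is limited, one rough way to check for a querier from a host on the subnet is to listen for IGMP membership queries (message type 0x11). The Python sketch below is an illustration under several assumptions: it needs root privileges for the raw socket, it assumes the platform delivers the IP header along with the IGMP payload (as Linux does), and the names `parse_igmp` and `wait_for_query` are made up for this example. The 130-second default wait is chosen to be slightly longer than the 125-second default IGMPv2 query interval.

```python
import socket
import struct
import time

IGMP_TYPES = {
    0x11: "membership query",
    0x12: "v1 membership report",
    0x16: "v2 membership report",
    0x17: "leave group",
    0x22: "v3 membership report",
}


def parse_igmp(ip_packet: bytes) -> dict:
    """Parse the first 8 bytes of the IGMP message inside an IPv4 packet."""
    if not ip_packet:
        raise ValueError("empty packet")
    # The IP header length is the low nibble of byte 0, in 32-bit words.
    ihl = (ip_packet[0] & 0x0F) * 4
    igmp = ip_packet[ihl:]
    if len(igmp) < 8:
        raise ValueError("packet too short for an IGMP message")
    msg_type, _max_resp, _checksum, group = struct.unpack("!BBH4s", igmp[:8])
    return {
        "type": msg_type,
        "name": IGMP_TYPES.get(msg_type, "unknown"),
        "group": socket.inet_ntoa(group),
        "is_query": msg_type == 0x11,
    }


def wait_for_query(timeout: float = 130.0) -> bool:
    """Return True if a membership query is seen within `timeout` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_IGMP)
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            sock.settimeout(remaining)
            try:
                packet = sock.recv(65535)
            except socket.timeout:
                return False
            try:
                if parse_igmp(packet)["is_query"]:
                    return True
            except ValueError:
                continue  # ignore malformed or truncated packets
    finally:
        sock.close()
```

If `wait_for_query()` returns False on the affected subnet, that would be consistent with no querier being present, though it is not proof: a sniffer capture or the switch's own IGMP snooping status output is the more reliable check.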

PS:

In some very, very rare instances, I have seen Cisco network equipment stop working correctly after a power cycle. Either the configuration was partially corrupted or the equipment suffered other, more permanent damage. A configuration corruption can be easy to miss since Cisco equipment will revert to default options.