I am having a problem with certain multicast applications and I need some input. The problem is that I have a set of multicast lines coming into my server farm from an outside provider, and the server receiving the data reports random disconnects on a single line (not any particular line, just one at a time) for right around 2 minutes.
Last week I made a post in which I indicated that the data appeared to be arriving at my switch uninterrupted, and that none of the IGMP debugs or mroute timers indicated the data had actually stopped.
I was still seeing the problem though, so I took it a step further. First, I sniffed the traffic incoming from my provider to verify that they were sending the data when the server reported an outage. That checked out fine; in fact, data *was* being sent on the group in question during the alleged outage.
Next, I moved my sniffer over to the server's switchport and monitored the traffic being transmitted out that port to the server. Oddly enough, the switch was not sending data to the server during the outage.
So the switch is receiving the data, but it's randomly deciding not to forward it back out to the server.
As part of the debugging, I disabled the secondary link of the dual-homed NIC and put the server directly onto the 4507R that has multicast routing enabled.
I've been debugging IGMP and sniffing IGMP on that switchport (both directions) and I don't see anything odd.
Does anyone have any ideas what I can look at next?
I have two 4507R's, both with 12.2(20)EWA3.
Thanks in advance,
At the moment of the interruption, when your server is not receiving the multicast streams, can you use the following set of commands to check once more that your switch has the correct data?
1.) show ip igmp groups
Check whether the server is still subscribed via IGMP to the multicast group.
2.) show ip igmp snooping group
Check whether the port towards the server is still among the outgoing ports for that multicast group.
3.) show ip igmp snooping mrouter
Check whether the listing contains the ports towards your provider through which the multicast traffic enters your switch.
If you are also doing multicast routing for that multicast group, use the command "show ip mroute" to make sure that the proper routed interface appears among the outgoing interfaces for that multicast group.
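Because the outages are random, it may help to script these checks so the snooping state can be captured the moment a server reports a blackout. Below is a minimal parsing sketch; the sample output layout (Vlan / Group / Version / Port List columns) is an assumption based on common Catalyst formatting, and the group and port names are hypothetical, not taken from this thread. The exact output format varies by platform and IOS release, so adjust the field positions to match a real capture.

```python
import re

def ports_for_group(show_output: str, group: str) -> set:
    """Parse captured 'show ip igmp snooping group' output and return the
    set of ports listed for the given multicast group (empty if absent)."""
    ports = set()
    for line in show_output.splitlines():
        fields = line.split()
        # Expected row shape: <vlan> <group> <version> <port, port, ...>
        if len(fields) >= 4 and fields[1] == group:
            port_list = " ".join(fields[3:])
            ports.update(p for p in re.split(r"[,\s]+", port_list) if p)
    return ports

# Hypothetical sample capture -- layout assumed, values invented.
sample = """\
Vlan      Group            Version     Port List
-------------------------------------------------------
10        239.1.1.5        v2          Gi3/1, Gi3/2
10        239.1.1.9        v2          Gi3/7
"""
```

With this, a periodic job can log whether the server's port dropped out of the replication list for its group during the blackout window, e.g. `ports_for_group(sample, "239.1.1.5")` yields the set of ports on that row.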
Also, are the ports towards your server defined as STP edge ports (portfast)?
Thanks for helping me out with this.
I have multiple servers on the same subnet subscribing to these particular groups, yet I only see the disconnects on 1 server at a time. It's very odd.
I'm seeing this same thing on a lot of different groups. The most interesting part: sometimes the fix is to get on yet another box and subscribe to the group. That will usually cause the data to "come back" to the server that initially reported it had stopped receiving.
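Incidentally, that "subscribe from another box" workaround can be reproduced with a small receiver script, which also timestamps silent periods so the blackout window can be measured from the host side. This is only a sketch using the standard socket API for a plain IGMP join; the group address, UDP port, and gap threshold are placeholders, not values from this thread.

```python
import socket
import struct
import time

GROUP = "239.1.1.5"   # placeholder multicast group
PORT = 5000           # placeholder UDP port
GAP_THRESHOLD = 5.0   # seconds of silence we count as an outage

def find_gaps(arrival_times, threshold=GAP_THRESHOLD):
    """Return (start, duration) for every silent interval longer than
    `threshold` between consecutive packet arrival timestamps."""
    gaps = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur - prev))
    return gaps

def receive_loop():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Joining the group here is what triggers the IGMP membership
    # report that the switch should be snooping on.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    last = time.time()
    while True:
        sock.recv(65535)
        now = time.time()
        if now - last > GAP_THRESHOLD:
            print(f"gap of {now - last:.1f}s ended at {time.ctime(now)}")
        last = now

if __name__ == "__main__":
    receive_loop()
```

Running a copy of this next to the affected application would also show whether the join from a second process on another host is really what brings the feed back.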
I verified that the servers are on STP edge ports, and I also verified the mroute exists for the group in question.
I'm currently running PIM sparse-dense mode on the router interfaces. The groups that I have the most problems with are running as Dense mode groups, but I still have occasional problems with groups in Sparse mode.
I looked through the Bug list for the IOS version I'm running but I didn't find anything that matched. Do you think upgrading anyway might help?
You are welcome. I need to ask further, though, to better understand your network.
Do these servers have distinct and unique IP addresses, or are they in some form of a virtual box having just a single IP address? Also, you have indicated that in a single network (subnet), there are multiple servers subscribed to the same multicast address, and that randomly, one of these servers stops receiving that multicast while the others still receive it - is this correct?
Have you also verified the IGMP outputs, particularly from the snooping? It would perhaps be interesting to do "debug ip igmp A.B.C.D" on the group address to see if there is anything odd going on.
What exactly do you mean by "disconnects"? Is it just the multicast feed stopping, or is there really a disconnect state (like interface flapping or similar)?
I am afraid that upgrading the IOS without knowing for sure that this is really an IOS issue might be shooting in the dark. It might help, and again, it might not :-\
Looking forward to reading from you.
These servers are all distinct with unique IPs - no virtual machines. When I talk about disconnects I'm referring to the multicast feed stopping, or from the server's perspective, that it's not receiving data on a particular group. Your understanding of my problem is correct: one server stops receiving for group A.B.C.D while other servers are still receiving that same group.
I've been logging the debug ip igmp A.B.C.D output, but the problem is that I can't predict which group is going to experience the problem at any given moment. In all my IGMP debugging, though, I've never seen anything odd -- standard queries from the router, and replies from the host for the groups it wants to join. I've even seen the IGMP query/response process complete during a window when the server itself wasn't receiving data on a group it had just sent an IGMP response for.
Thank you for your response.
This is interesting. Are the servers connected directly to your 4500, or is there another intermediary switch between them and the 4500?
What I am thinking now is this: If the servers are all connected directly to your 4500 and are in the same VLAN, and just one of them suddenly stops receiving multicasts, then it is not a problem of multicast routing, as the remaining servers in that same VLAN still receive the multicast traffic. The problem must therefore lie somewhere in the Layer 2 replication of multicasts and their delivery onto individual switchports. I would again focus on IGMP snooping and try to check whether, even during the "blackout" period when a single server stops receiving multicasts, its switchport is still listed as subscribed to that group in the "show ip igmp snooping group" output. If it is, then by all means that port should be emitting the multicasts. If it is not, we've probably got a bug here. I presume you are not using any kind of storm control, policing, shaping or similar.
I have been troubleshooting the multicast issue on 4 servers -- 2 of them are HP blade servers and are located off of secondary switches. The other 2 servers are directly connected to the 4500's. I typically have all my servers dual-homed, but while troubleshooting this issue I've disabled the link to the 4500 that is not performing the multicast routing.
I'm not using any storm control, policing or shaping.
I know I am repeating myself - I am sorry for that - but can you please answer my previous questions regarding IGMP snooping? I need to determine whether the IGMP snooping on your switches is to blame.
Specifically, please confirm that either you do not use IGMP snooping on your switches, or that even during the blackout period, the "show ip igmp snooping group" command shows all your server switchports as being open for the respective multicast group.
Thank you in advance.
I agree with what Peter has written.
The issue is clearly related to IGMP snooping activity.
When IGMP snooping is enabled on a vlan, the following should happen:
The multicast router sends the periodic general query.
Without any IGMP snooping, one device answers for each group G using multicast destination G; the other devices that were waiting for their random timers to expire see that report and suppress their own.
With IGMP snooping, the switches have to do the following:
forward the general query out all ports in the vlan;
listen for and intercept all IGMP reports for all groups;
avoid propagating the IGMP reports between current and potential receivers, so that report suppression does not kick in.
The switch then forwards a single report per group to the multicast router to keep it happy.
By not forwarding IGMP reports to user ports, each host is "isolated" and should answer with its own IGMP reports when its random timer(s) expire.
(I don't recall if there is a single timer or one for each joined group.)
So an IGMP snooping switch could fail to recognize a receiver if one of the following happens:
it misses the IGMP report for group G;
the IGMP report from the host is not sent in time, or is not sent at all.
Two minutes of interruption is likely the time it takes to wrongly prune the server port at time x and then add it back to the L2 replication list for group G in the vlan at the next query interval.
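For what it's worth, the two-minute figure lines up with the IGMPv2 default timers from RFC 2236 (Cisco IOS uses the same defaults). A quick sketch of the arithmetic; the robustness value is the standard default, not something confirmed for this network:

```python
# IGMPv2 default timers (RFC 2236); Cisco IOS defaults match.
QUERY_INTERVAL = 125        # seconds between general queries
MAX_RESPONSE_TIME = 10      # seconds a host may delay its report
ROBUSTNESS = 2              # default lost-report tolerance

# If the port is wrongly pruned just after a general query, re-learning
# has to wait for the next query plus the host's random response delay:
worst_relearn = QUERY_INTERVAL + MAX_RESPONSE_TIME
print(worst_relearn, "seconds, i.e. roughly 2 minutes")

# For comparison, the interval after which a silent member is aged out:
group_membership_interval = ROBUSTNESS * QUERY_INTERVAL + MAX_RESPONSE_TIME
print(group_membership_interval, "seconds")
```

So up to about 135 seconds of blackout fits a prune-then-relearn cycle at the default query interval; the query interval itself is tunable on the router with `ip igmp query-interval` if you want to shrink the window while testing.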
Considering that the affected server and group appear to be randomly chosen, it is possible the issue is on the switch side.
Another question that arises is:
what makes this vlan / subnet different from the other ones that don't suffer from, or are less prone to, this issue?
Are most of the servers in the affected vlan?
Are the servers in the other vlans connected to a single switch?
Hope this helps
I apologize for not addressing that question. I examined IGMP snooping on each switch and observed that during the reported outage, all ports that should've been joined to the group in question were reported by IGMP snooping as joined. So nothing out of the ordinary there.
All of the affected servers are in the same VLAN.