%PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT

Unanswered Question
Aug 20th, 2007

I do not know much about storage networking. I have a 9509 that is set up to peer to a 9216 at our DR site.

I got a ton of these errors in the event log tonight and I am not sure what to look at.

VSAN 50 is the VSAN that holds the link from HQ to DR.

2007 Aug 20 17:34:48 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)
2007 Aug 20 17:34:48 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:35:04 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:35:04 Switch %PORT-5-IF_UP: %$VSAN 50%$ Interface fcip1 is up in mode E
2007 Aug 20 17:37:16 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)
2007 Aug 20 17:37:16 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:37:32 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:37:32 Switch %PORT-5-IF_UP: %$VSAN 50%$ Interface fcip1 is up in mode E
2007 Aug 20 17:38:33 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)
2007 Aug 20 17:38:33 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:38:49 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:38:49 Switch %PORT-5-IF_UP: %$V
2007 Aug 20 17:44:52 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:45:08 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:45:08 Switch %PORT-5-IF_UP: %$VSAN 50%$ Interface fcip1 is up in mode E
2007 Aug 20 17:45:20 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)
2007 Aug 20 17:45:20 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:45:37 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)
2007 Aug 20 17:45:37 Switch %PORT-5-IF_UP: %$VSAN 50%$ Interface fcip1 is up in mode E
2007 Aug 20 17:45:49 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)

There are logs dating back several months, and none of them show anywhere near the number of these errors that occurred tonight.
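For anyone who wants to compare nights rather than eyeball the log, here is a rough Python sketch that counts PORT-5 events per interface. It assumes each log entry sits on a single line (wrapped entries need rejoining first), and the regular expression is inferred only from the sample lines quoted above, not from any Cisco format specification. The names `LINE_RE` and `count_events` are made up for this example.

```python
import re
from collections import Counter

# Field layout inferred from the sample lines in this thread (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"%PORT-5-(?P<event>[A-Z_]+): .*Interface (?P<ifname>\S+) is"
)

def count_events(lines):
    """Count PORT-5 events per (interface, event mnemonic)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # truncated or unrelated lines are simply skipped
            counts[(m.group("ifname"), m.group("event"))] += 1
    return counts

sample = [
    "2007 Aug 20 17:34:48 Switch %PORT-5-IF_DOWN_NONE: %$VSAN 50%$ Interface fcip1 is down (None)",
    "2007 Aug 20 17:34:48 Switch %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT: %$VSAN 50%$ Interface fcip1 is down(TCP conn. closed - retransmit failure)",
    "2007 Aug 20 17:35:04 Switch %PORT-5-IF_UP: %$VSAN 50%$ Interface fcip1 is up in mode E",
    "2007 Aug 20 17:38:49 Switch %PORT-5-IF_UP: %$V",  # truncated entry, not counted
]

counts = count_events(sample)
for (ifname, event), n in sorted(counts.items()):
    print(ifname, event, n)
```

Running the same count over each day's log would show whether tonight's flapping is really out of line with the previous months.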

My questions are:

1. The log is quite full. Is there any problem (resource-wise) with deleting the log?

I do not want to lock the switch up by doing something like that.

2. Has anyone seen anything like this? Is this a pending disaster?

It looks like there was a physical problem on this end.

3. Does anyone have any input as to what could have caused this?

The backups should be running, but this is a new series of events; it doesn't look like this has happened to this degree in the last few months.

tblancha Tue, 08/21/2007 - 05:29

Is VSAN 50 the only VSAN on the FCIP link, or are there more? The problem is most likely not in the SAN but in the IP network. The first thing to check is whether the MTU on the SAN gig interfaces is larger than what the routers are set for. For example, is the MTU on the MDS interfaces set to 2300 while the routers are at 1500?

You can test this with the extended ping command, setting the DF bit.
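As a rough illustration of the arithmetic behind such a test, the Python sketch below works out the largest ICMP echo payload that fits in one unfragmented IPv4 packet and whether a packet of a given size would get through a path with the DF bit set. It assumes a 20-byte IPv4 header with no options and an 8-byte ICMP header. The hop MTUs are the hypothetical 2300/1500 values from the example above, and the function names are made up for this sketch. Note that some ping implementations take the payload size and others the full datagram size, so check which one your platform expects.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header

def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def fits_with_df(packet_size: int, hop_mtus) -> bool:
    """True if a packet of this total IP size crosses every hop unfragmented."""
    return packet_size <= min(hop_mtus)

# Hypothetical path: MDS interface at 2300, core switch and router at 1500.
hops = [2300, 1500, 1500]

print(max_ping_payload(1500))    # payload for a 1500-byte packet
print(max_ping_payload(2300))    # payload for a 2300-byte packet
print(fits_with_df(2300, hops))  # a full 2300-byte packet with DF set
print(fits_with_df(1500, hops))  # a 1500-byte packet with DF set
```

If the large packet fails with DF set while the 1500-byte one succeeds, some hop in the path has a smaller MTU than the MDS interface.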

wilson_1234_2 Tue, 08/21/2007 - 07:21

Thanks for the reply.

VSAN 50 is the only one on the fcip link.

I saw this error all last night, and it has stopped for now.

I looked at the switches for errors, and there do not seem to be any at the physical layer.

I have seen that retransmit error when there are MTU problems and also on bottlenecks when the remote end cannot handle the amount of data pumped into it, but I saw nothing like that here.

MTU does not seem to be what is causing this, or I would see it all the time, correct?

Also, what is the difference between the fc interfaces and the Gigabit interfaces?

I do not see an MTU on the fc interfaces.

tblancha Tue, 08/21/2007 - 08:47

The MTU is on the gigabitethernet interfaces. So, this only occurred for a while last night? Is it possible there was a WAN outage last night that would have affected your FCIP connectivity?

wilson_1234_2 Tue, 08/21/2007 - 10:37

No problem with the connectivity to the remote side.

We have a DS3 to the DR site and it was up.

I did some testing and the response time to the remote side was 3ms, everything looked good.

I wondered if there were some problems with the WAN.

The Mainframe file copy from the HQ side to the DR side was erroring with "Broken Pipe" after it bombed out.

There was a log from the SAN that VSAN 50 had been removed last night and re-added this morning.

I have no access to the SAN and am not familiar with storage networking, but is this something that can happen dynamically like that, or does someone have to initiate it (the addition and removal of VSAN 50)?

It seems that is why the VSAN link was flapping even though the connectivity was there, correct?

The data was leaving here with nowhere to put it on the remote end, so I was getting the retransmit errors due to no TCP ACKs.

Does that sound feasible to you?

suntzzu Tue, 08/21/2007 - 10:44

Not sure if this helps, but I see this alert a lot when our link is being flooded by traffic that is not originating from the FCIP interfaces. Unfortunately we have a small layer 2 pipe that has to be shared between replication traffic (FCIP) and MS clustering along with the heartbeats. It's not uncommon for traffic other than FCIP to flood our link and start to choke our FCIP traffic, which then triggers the %PORT-5-IF_DOWN_TCP_MAX_RETRANSMIT alerts. What traffic rides down your link? Replication, cluster traffic, heartbeat, layer 3, layer 2, etc.?

wilson_1234_2 Tue, 08/21/2007 - 11:43

Thanks for the reply.

The VSAN 50 is the only link to DR, from the MDS.

But, there is replication traffic, heartbeat, layer 3 and layer 2 (there is a bridge to that site also).

The mainframe is the bridge traffic to DR.

But it should not be that overwhelming, and it has not happened since I have been here (4 months), until last night.

I could see the fcip interface and the gigabit interface both going down/up, down/up.

But connectivity was no problem, and there were no errors at all on either MDS switch.

I am thinking the SAN lost the connection to VSAN 50 and there was nowhere for the HQ-sourced traffic to go.

tblancha Tue, 08/21/2007 - 11:47

The key here is that something was occurring in the IP network that caused TCP retransmissions to fail. This has nothing to do with the SAN. The SAN uses TCP as the transport mechanism. The FCIP link represents a TCP session, and if data is not acknowledged, the session will fail; that is what occurred here. The key is understanding what your IP network was doing last night that it stopped doing this morning. Is the FCIP link oversubscribed? Was there an outage? Was there maintenance? Etc.
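To make the "data is not acknowledged, then the session will fail" point concrete, here is a toy Python model of a retransmit limit. It is only a sketch and not how the MDS implements TCP: the limit of 4 retransmissions is an illustrative value rather than one taken from this thread, the function name is made up, and real TCP also backs off its timers between attempts, which is not modelled here.

```python
def simulate_session(ack_results, max_retransmits=4):
    """Walk through send attempts; close once the retransmit limit is exhausted.

    ack_results: one boolean per send attempt, True if that attempt was ACKed.
    max_retransmits: illustrative limit (an assumption, not a quoted default).
    Returns ("closed", attempt_number) or ("up", None).
    """
    retransmits = 0  # consecutive retransmissions of the current segment
    for attempt, acked in enumerate(ack_results, start=1):
        if acked:
            retransmits = 0  # progress was made, so the counter resets
            continue
        if retransmits == max_retransmits:
            # the original send and every allowed retransmit went unanswered
            return ("closed", attempt)
        retransmits += 1
    return ("up", None)

print(simulate_session([True, False, True, True]))      # occasional loss: stays up
print(simulate_session([False] * 5))                    # sustained loss: closes
print(simulate_session([False, False, True, False, False]))
```

The point of the model is that isolated drops are absorbed, while a stretch where nothing gets acknowledged (congestion, an outage, a black hole in the path) closes the TCP connection and takes the FCIP interface down with it.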

wilson_1234_2 Tue, 08/21/2007 - 12:52

Ok, forgive my not understanding,

But if the link was up and there were no physical errors and the routing was fine,

and both switches end to end are not showing any errors,

and if the SAN on the remote side is the endpoint and holding the Mainframe replication destination,

Why couldn't it have been something on the SAN, even if it was an issue with a TCP connection?

wilson_1234_2 Wed, 08/22/2007 - 04:57

Perhaps you could please help me to understand better?

Apparently this has happened before here.

So, then, the fcip link is only the IP link from MDS to MDS across the DS3, is this correct?

And the idea is that this link is an IP tunnel carrying the SAN traffic?

What does the fcip link look for to be in the "up" state?

I ask because the HQ end is the one that was going down and causing the errors.

The DS3 is 54Mbps and the Gigabit interfaces are configured for that speed, is there something I can set there?

Is it possible that the DS3 was overwhelmed from my end?

If this were the case, should I see physical errors somewhere?

I appreciate any reply.

suntzzu Wed, 08/22/2007 - 07:49

The fcip link is just your Fibre Channel traffic encapsulated in IP. So in essence it's traffic happily riding down the DS3 along with the rest of the traffic. For the fcip link to be up, both the physical port (in your case the GigE interface) and the fcip interface need to be "up". You can be in a situation where the physical port is "up" but the fcip interface is "down".

There are settings to help curb the bandwidth out of the fcip interfaces, like the tcp max and min transmission commands (I'm not quite sure of the exact syntax of the commands; if you go on Cisco's site and look at the config guide for the MDS switches, it will tell you).

We also have a small link, 155MB. It's like trying to figure out how to fit a semi-truck into a mouse hole. Some things that helped us out: jumbo frames (we set the MTU size to 9K on every interface the FCIP traffic travels, since fcip doesn't like to be split up) and compression set to 'auto' on the fcip interfaces. Although our other location is only a few miles away, we also turned on write acceleration, which is supposed to help more over longer distances. Although these may not help with your initial issue/question, they may help you in some other way.

Some things to look into across your link are what's being allowed and what shouldn't be, and whether you can restrict any VSANs, VLANs, etc. If you get a sniffer on there and mirror the traffic to it, this would give you some insight into what's going across. Something else you may want to look into is implementing QoS across the link.

wilson_1234_2 Fri, 08/31/2007 - 06:37

Thanks for the reply.

At the moment we have the MTU set to 1500 on the interface of the MDS for this link.

This interface then goes through a core switch that also has MTU of 1500, then to a router interface of 1500 that connects the main site to the DR site.

Do you have to set all of these ethernet interfaces to 9000 as well?

If so, how would that affect the existing ethernet traffic on those interfaces that is local LAN traffic?

How do you have your topology set up for the MDS-to-router connection to DR?

Also, we have some bridged connections to DR; not sure if this is a problem either.

suntzzu Fri, 08/31/2007 - 16:49

That is correct.

All interfaces that carry the traffic will need their MTU size changed to allow the jumbo frames, or whatever size you want to set it at. If any of the interfaces are left at the default of 1500, the frames get split up, which is not good for FCIP. It's as if every frame it tries to send needs to be split in two. As long as your MTU is over 3000, I believe, FCIP should be OK.
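A back-of-the-envelope Python sketch of the "split in two" effect follows. The sizes are assumptions rather than figures from this thread: a full-size Fibre Channel frame of 2148 bytes, a 28-byte FCIP encapsulation header, and 40 bytes of IPv4 plus TCP headers with no options. It also takes a simplified per-frame view; TCP is a byte stream, so real traffic can pack data differently.

```python
import math

FC_FRAME_MAX = 2148     # assumed maximum FC frame size in bytes
FCIP_ENCAP_HEADER = 28  # assumed FCIP encapsulation header in bytes
TCP_IP_HEADERS = 40     # 20-byte IPv4 + 20-byte TCP header, no options

def segments_per_frame(mtu: int) -> int:
    """TCP segments needed to carry one full-size encapsulated FC frame."""
    mss = mtu - TCP_IP_HEADERS  # payload bytes available per packet
    return math.ceil((FC_FRAME_MAX + FCIP_ENCAP_HEADER) / mss)

for mtu in (1500, 2300, 3000, 9000):
    print(mtu, segments_per_frame(mtu))
```

Under these assumptions a 1500-byte MTU needs two packets per full frame, while 2300 (the value mentioned earlier in the thread), 3000, and 9000 each carry it in one.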

As far as existing ethernet traffic goes, it shouldn't affect it at all, but I am not a network guru by any means. I know enough to get around. Currently we have regular ethernet traffic (MSCS, heartbeat, MS-AD, etc.) running in parallel with our FCIP traffic down our Layer 2 link, which has jumbo frames turned on for all interfaces the traffic travels. The only issue we tend to see is when something besides the FCIP traffic decides to choke the link, and then we start to see our FCIP interface "gracefully shutdown", but it usually comes back up. In some cases it will suspend our async replication.

We have 2 independent fabrics. One of our GigE ports from the 14/2 blade in each 9509 is plugged into a Gig port on a blade in separate 6513 Cat switches. Those are then set into a VLAN that is trunked over the EOS ring to our DR location. DR is pretty much mirrored the same way, only instead of 9509s we have 9216i's.

The best way to induce change in your environment is obviously to contact Cisco first or pay to have someone do it. I would start opening some TAC cases asking your specific questions. At least then you can get an official answer from Cisco. Don't take the word of a forum as a basis for a change control. Make sure you research what it is you're about to change and know how to roll it back. Only you know the little ins and outs of your environment. What worked for us could be a catastrophe in another environment.
