Solved: Re: DLSW peers dropping mysteriously

lhengineer · ‎10-19-2006

Simple DLSW peering setup between two Cisco 2801 routers running 12.3(14)T4 code. DLSW peering seems to drop once or twice a day without apparent cause. We have checked the entire network path and are not seeing any CRC/link level/layer 2 errors that would explain the DLSW peers dropping.

We have also checked all recent bug reports and found nothing related to DLSW that would explain this behavior. We are almost confident the network is clean; however, we are still testing that, since a network problem seems like the most logical explanation.

Further history is that this DLSW peering was fine for years before we replaced the 1721 router at one of the sites with a 2801. The intermittent peer drops have occurred since this router was upgraded.

An example "show dlsw circuits history detail" shows the history of one of these drops. This drop corresponds with a loss of SNA traffic for the end users until the peers reestablish.

Config is quite simple...

dlsw local-peer peer-id x.x.x.x

dlsw remote-peer 0 tcp y.y.y.y

dlsw bridge-group 1

!

interface FastEthernet0/0

bridge-group 1

and the reverse on the remote router

show dlsw circuits history detail

Circuit history kept for last 32 circuits, using 4096 bytes

Index local addr(lsap) remote addr(dsap) remote peer

2667578346 4000.1088.1088(04) 0008.5a84.521d(20) 10.100.1.253

Created at : *20:49:49.953 CDT Wed Oct 18 2006

Connected at : *20:49:50.101 CDT Wed Oct 18 2006

Destroyed at : *13:30:15.457 CDT Thu Oct 19 2006

Local Corr : 2667578346 Remote Corr: 1929379862

Bytes: 902224/124936 Info-frames: 2108/2694

XID-frames: 3/4 UInfo-frames: 0/0

Flags: Remote created, Local connected

Last events:

Current State Event Add. Info Next State

-------------------------------------------------------------------

CONNECTED DLC DataInd 0x0 CONNECTED

CONNECTED WAN infoframe 0x0 CONNECTED

CONNECTED DLC DataInd 0x0 CONNECTED

CONNECTED WAN infoframe 0x0 CONNECTED

CONNECTED DLC DataInd 0x0 CONNECTED

CONNECTED WAN infoframe 0x0 CONNECTED

CONNECTED DLC DataInd 0x0 CONNECTED

CONNECTED ADM WanFailure 0x0 HALT_NOACK_PEND

HALT_NOACK_PEND DLC DiscCnf 0x0 CLOSE_PEND

CLOSE_PEND DLC CloseStnCnf 0x0 DISCONNECTED

lhengineer · ‎10-19-2006

Further DLSW peer debugs during the peering drop with IP addresses changed to protect the innocent :)

2006-10-19 13:32:41 Local7.Debug x.x.x.x 4858:

2006-10-19 13:32:42 Local7.Debug x.x.x.x 4859: 003903:

*Oct 19 13:34:33.483 CDT: DLSw: START-TPFSM (peer y.y.y.y(2065)):

event:CORE-DELETE CIRCUIT state:CONNECT

2006-10-19 13:32:42 Local7.Debug x.x.x.x 4860: 003904:

*Oct 19 13:34:33.483 CDT: DLSw: dtp_action_v(), peer delete circuit for peer y.y.y.y(2065)

2006-10-19 13:32:42 Local7.Debug x.x.x.x 4861: 003905:

*Oct 19 13:34:33.483 CDT: DLSw: END-TPFSM (peer y.y.y.y(2065)):

state:CONNECT->CONNECT

2006-10-19 13:32:42 Local7.Debug x.x.x.x 4862:

2006-10-19 13:32:47 Local7.Debug x.x.x.x 4863: 003906:

*Oct 19 13:34:38.079 CDT: DLSw: START-TPFSM (peer y.y.y.y(2065)):

event:CORE-ADD CIRCUIT state:CONNECT

2006-10-19 13:32:47 Local7.Debug x.x.x.x 4864: 003907:

*Oct 19 13:34:38.079 CDT: DLSw: dtp_action_u(), peer add circuit for peer y.y.y.y(2065)

2006-10-19 13:32:47 Local7.Debug x.x.x.x 4865: 003908:

*Oct 19 13:34:38.079 CDT: DLSw: END-TPFSM (peer y.y.y.y(2065)):

state:CONNECT->CONNECT

2006-10-19 13:32:47 Local7.Debug x.x.x.x 4866:

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4867: 003909:

*Oct 19 13:34:43.179 CDT: DLSw: START-TPFSM (peer y.y.y.y(2065)):

event:CORE-ADD CIRCUIT state:CONNECT

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4868: 003910:

*Oct 19 13:34:43.179 CDT: DLSw: dtp_action_u(), peer add circuit for peer y.y.y.y(2065)

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4869: 003911:

*Oct 19 13:34:43.179 CDT: DLSw: END-TPFSM (peer y.y.y.y(2065)):

state:CONNECT->CONNECT

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4870:

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4871: 003912:

*Oct 19 13:34:43.179 CDT: DLSw: START-TPFSM (peer y.y.y.y(2065)):

event:CORE-ADD CIRCUIT state:CONNECT

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4872: 003913:

*Oct 19 13:34:43.179 CDT: DLSw: dtp_action_u(), peer add circuit for peer y.y.y.y(2065)

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4873: 003914:

*Oct 19 13:34:43.179 CDT: DLSw: END-TPFSM (peer y.y.y.y(2065)):

state:CONNECT->CONNECT

2006-10-19 13:32:52 Local7.Debug x.x.x.x 4874:

2006-10-19 13:35:52 Local7.Debug x.x.x.x 4875: 003915:

*Oct 19 13:37:43.455 CDT: DLSw: START-TPFSM (peer y.y.y.y(2065)):

event:CORE-ADD CIRCUIT state:CONNECT

2006-10-19 13:35:52 Local7.Debug x.x.x.x 4876: 003916:

*Oct 19 13:37:43.455 CDT: DLSw: dtp_action_u(), peer add circuit for peer y.y.y.y(2065)

2006-10-19 13:35:52 Local7.Debug x.x.x.x 4877: 003917:

*Oct 19 13:37:43.455 CDT: DLSw: END-TPFSM (peer y.y.y.y(2065)):

state:CONNECT->CONNECT

2006-10-19 13:35:52 Local7.Debug x.x.x.x 4878:

dmcloon · ‎10-19-2006

A "debug ip tcp trans" can help troubleshoot dlsw peer failure. If you collect this perhaps you could open a Service Request to TAC and we can look into the debug.

lhengineer · ‎10-19-2006

Thanks for the quick response, we'll try that and post the results when the peering drops again.

lhengineer · ‎10-23-2006

The "debug ip tcp trans" output showed the following useful information just prior to a DLSW peer drop...

mbinzer · ‎10-23-2006

Hi,

From the debugging we see this message:

TCP0: Packet-too-big message received on interface Serial0/0/0:0.16 (MTU 1500

Is this the WAN interface that your DLSw peer traffic goes out on?

From the moment the trouble starts until the TCP giveup timer kills the TCP connection, this much time passes:

First message about a congestion window change:

Oct 23 10:48:17.932 CDT:

Giveup timer pops:

Oct 23 10:49:47.712 CDT: TCP0: GIVEUP timeout timer expired

If I am not mistaken, that is almost exactly 90 seconds. DLSw deliberately sets the TCP giveup time. By default it is 3 * the peer keepalive time, which comes out to 90 seconds if it is not configured.

That means a packet needed a retransmission, the retransmission did not succeed within the TCP giveup time (90 seconds in this case), and then the giveup timer killed the connection. The giveup timer is a safeguard against having problems on the TCP session and sitting there waiting on timeouts for up to 10 minutes. It makes sure we close the TCP connection after a reasonable amount of time and start over.
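As a quick check of the timing above, here is a minimal Python sketch that subtracts the two quoted timestamps and compares the result with 3 * keepalive. The 30-second keepalive is an assumption, taken only from the statement that the default works out to 90 seconds; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the debug lines quoted above (same day, same time zone).
FMT = "%H:%M:%S.%f"
first_cwnd_change = datetime.strptime("10:48:17.932", FMT)
giveup_expired = datetime.strptime("10:49:47.712", FMT)

# Time between the first sign of trouble and the giveup timer expiring.
elapsed = (giveup_expired - first_cwnd_change).total_seconds()

# Assumption: peer keepalive of 30 s, the value implied by "3 * keepalive = 90".
KEEPALIVE_SECONDS = 30
giveup_seconds = 3 * KEEPALIVE_SECONDS

print(f"observed: {elapsed:.2f} s, expected giveup time: {giveup_seconds} s")
```

The observed gap comes out just under 90 seconds, which is consistent with the giveup timer (and not a link flap or a peer keepalive miss on its own) tearing down the session.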

TCP0: MSS changes from 1460 to 1460

That looks as if you are doing TCP path-mtu-discovery on the router. You could try turning it off as a test.

Depending on your IOS version, you can change the TCP MSS value with this command:

ip tcp mss

You can set it to, e.g., 1400 and see what happens (you need to bounce the DLSw peer, since this only takes effect at TCP negotiation).

"show tcp" for a particular peer will show you the actual MSS value used on that peer. The maximum segment size is shown at the end of the output.

If you still have problems, please open a service request with TAC so we can get all the details and the full picture for further investigation.
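To make the MSS suggestion concrete, here is a rough Python sketch of the arithmetic. It assumes plain 20-byte IPv4 and 20-byte TCP headers with no options, and the function names are hypothetical. It only shows how much room a full-sized segment leaves under the 1500-byte MTU reported in the debug line; it does not model any extra encapsulation on the WAN path.

```python
IPV4_HEADER_BYTES = 20  # assumption: no IP options
TCP_HEADER_BYTES = 20   # assumption: no TCP options


def ip_packet_size(mss: int) -> int:
    """Size of the IP packet that carries one full-sized TCP segment."""
    return mss + IPV4_HEADER_BYTES + TCP_HEADER_BYTES


def headroom(mss: int, mtu: int = 1500) -> int:
    """Bytes left below the MTU when a full-sized segment is sent."""
    return mtu - ip_packet_size(mss)


for mss in (1460, 1400):
    print(f"MSS {mss}: packet {ip_packet_size(mss)} bytes, headroom {headroom(mss)} bytes")
```

With an MSS of 1460 a full segment exactly fills a 1500-byte packet, so anything on the path that adds even a few bytes can push it over the MTU. An MSS of 1400 leaves 60 bytes of slack, which is why it is a reasonable value to test with.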

thanks...

Matthias

lhengineer · ‎10-30-2006

Matthias,

Thanks, your suggestions were dead on and led us down the right path. It seems that the replacement router had path-mtu-discovery enabled while the original router did not. The far end router has always had path-mtu-discovery enabled. Once BOTH routers had it enabled our MTU for the DLSW peers jumped from 576 to 1460.

Apparently every once in a while a packet larger than 1500 bytes came through the link (due no doubt to SNA's additional 16-byte header).

Thanks for the help!

rcolorado · ‎11-08-2006

Hi, did you resolve the issue? I have the same problem, but with a 3845 router running IOS 12.3(12).