DLSw+ reverse media translation and SDLC-attached FEP

Unanswered Question
Apr 13th, 2008

I have an SDLC-attached FEP, and I am doing reverse-media translation to the secondary which is on a remote token-ring LAN.

Whenever they bounce the FEP port, it takes 15 minutes or so to recover. Is there a timer or other setting I need to configure on the router? (DLSw+, SDLC, or LLC2 timers?)

jihicks Mon, 04/14/2008 - 04:23

Hi Derick,

The connection is really initiated by the endpoints, the token ring PU sending TEST, then XID and the FEP sending SNRM. I don't see anything obvious to change. It may take tracing this at both ends to see what is going on, check the serial interface and dlsw circuit at the same time. You can cause your console commands to be time stamped with "terminal exec prompt timestamp". Make sure the circuit drops when the FEP port is bounced. Now see if one side is slow at starting the connection.

Jim

ringwyrm Mon, 04/14/2008 - 06:46

I'd like to add that this was a STUN connection and, of course, it took something like a few seconds or a minute to recover when it was stun. Then they moved the secondary PU to token ring (because the serial cards are way... way... way... out of support).

##############
interface Serial0/0/0
 no ip address
 encapsulation sdlc
 no keepalive
 clock rate 9600
 sdlc role secondary
 sdlc vmac 4000.6660.3400
 sdlc address C1
 sdlc partner 4000.3745.0103 C1
 sdlc saps C1 08 08
 sdlc dlsw C1
##############

So I was looking at this, and I'm curious whether I should configure the "sdlc line-speed" command. Also, since the attached device is a FEP, should I use "sdlc address c1 seconly" instead of "sdlc role secondary"?

jihicks Mon, 04/14/2008 - 06:57

What is the PU type genned in the NCP and what is the device on token ring? Don't be fooled by PU2.0 and PU2.1 as the difference in the NCP gen is XID=YES for PU2.1. The "seconly" keyword is for PU4 to PU4 connections ( FEP to FEP ).

Configuring "sdlc line-speed" is OK, but not a fix for this problem. You really need more information about what is going on.

ringwyrm Mon, 04/14/2008 - 09:30

PU2.0

The secondary is a Tandem. Is there a timer on there that could be set? Obviously with the STUN connection if the serial line goes down... then the serial line goes down. On a token ring, things are different. So I wonder if there is a session timeout or a link timeout type setting on the Tandem...

ringwyrm Tue, 04/15/2008 - 08:45

So, we are about to do some testing on this during a window provided to us by the customer, but I would like to give you some additional details...

When they bounce the serial port, the SDLC-attached FEP starts sending SNRMs and just receives DMs back from the router. This goes on for a totally random amount of time, somewhere between 10 minutes (the shortest they've seen) and an hour (the longest they've seen).

Then, for whatever reason, the DMs stop and the PUs come active.

I'm curious. Since the default direction for the SDLC partner is inbound... when they bounce the FEP port, does the router disconnect the session, and just wait indefinitely for the secondary to send an XID?

Could this be fixed by changing the direction to outbound?

Also, would configuring the "sdlc xid" command have any effect when the sdlc role is secondary?

jihicks Tue, 04/15/2008 - 09:11

Yes, the router waits for the secondary to connect. Since the FEP is sending SNRM immediately, this is a PU2.0 in the NCP. You SHOULD have XID configured on the SDLC interface. However, I don't think it will get used in this case. You can try outbound also, it might help. However, normally the secondary would be retrying immediately, so I wonder what is going on with it. I suggest you open a case and get traces of the secondary ( on the token ring ).

Jim

ringwyrm Tue, 04/15/2008 - 11:54

Thanks for your help..

So strangely, we are receiving XID0s (TEST_STN.Ind), but the DLSw circuit is not establishing. It's as if the router is ignoring them.

When they change to XID3 (ID.Ind), the router immediately recognizes this and tries to establish the circuit.

It's strange. If we wait long enough, the router eventually seems to just accept the TEST_STN.Ind (XID0) and it does recover, but it takes a very long time for this to magically clear and work. Nothing is different; it just happens by itself.

So the secondary is sending it every five minutes, and I see it, but it seems like the router is ignoring it. When we switch to XID3, the router immediately tries to establish the circuit, every time.

Almost sounds like a bug. Is there something about XID0s that needs to be true for the router to acknowledge it and try to establish the circuit?

jihicks Wed, 04/16/2008 - 06:07

You are very welcome.

When this is going on, is the circuit between 4000.6660.34C1 and 4000.3745.0103 still up? If so, they took down the PU but not the line. If the circuit is still up, this might explain the problem, the XID3 might cause the DLSw circuit to go down due to circuit collision and come back up.
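For anyone following along, the first MAC address in that question comes from the interface config posted earlier: the SDLC address (C1) replaces the trailing 00 byte of the "sdlc vmac" value. A minimal Python sketch of that substitution, assuming dotted Cisco notation; the function name is made up for illustration and is not an IOS or library call:

```python
def sdlc_station_mac(vmac: str, sdlc_address: str) -> str:
    """Combine an 'sdlc vmac' value with an SDLC address (hex strings)."""
    digits = vmac.replace(".", "").upper()
    # The vmac is expected to be 6 bytes with a 00 placeholder as the last byte.
    if len(digits) != 12 or not digits.endswith("00"):
        raise ValueError("expected a 12-digit vmac whose last byte is 00")
    address = int(sdlc_address, 16)
    if not 0 < address <= 0xFF:
        raise ValueError("SDLC address must be a single non-zero byte")
    combined = digits[:10] + f"{address:02X}"
    # Re-insert the dots every four hex digits.
    return ".".join(combined[i:i + 4] for i in range(0, 12, 4))


print(sdlc_station_mac("4000.6660.3400", "C1"))
```

With the values from this thread it prints 4000.6660.34C1, which is the address to look for in the circuit and reachability output.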

ringwyrm Wed, 04/16/2008 - 07:50

Alright, so completely disregard my previous message. That was a point-in-time evaluation based on the information I had and it is totally bogus (wrong).

So eventually the XID3s stopped coming too. I opened a case with Cisco, which led me to believe that there could be a timer/race issue. So I started packet capturing after a failure, ran all kinds of SDLC/DLSw debugs on the other end, and waited for it to recover.

What I am seeing is this: When in a failed state, I am seeing periodic UDP CANUREACH coming in from the remote secondary-attached router. Then the local router replies to that with a TCP ICANREACH (is this normal?). This occurs repeatedly. Nothing comes of it, the circuit stays down.

Then for some reason the CANUREACH comes in as *TCP* instead of the numerous previous *UDP* ones. When this occurs, the local router replies with a TCP ICANREACH and the link comes up immediately. It seems that TCP is triggering the right behavior and UDP is not.

Any thoughts on this? If I do "dlsw udp-disable" should the UDP stuff stop?

jihicks Wed, 04/16/2008 - 07:55

Absolutely. Sounds like the CANUREACH (UDP) is getting dropped or blocked. If you configure "dlsw udp-disable", then the CANUREACH and ICANREACH are sent over the TCP session and not via UDP unicast. Sounds like that will fix it.

ringwyrm Wed, 04/16/2008 - 08:03

Well, they are definitely not being blocked. I see them coming in at the other end.

So the fact it recovered the first time it tried TCP may be a coincidence.

Still... I will disable UDP to see what happens. Do I have to disable this on both ends (not preferred) or just on one end?

jihicks Wed, 04/16/2008 - 08:26

Just one end. The fact that you disabled it gets sent to the peer via capabilities as soon as you enter it. You can see this in the field "UDP Unicast Support" in the output from "show dlsw cap" (remote peers) and "show dlsw cap local" for the local peer.

CANUREACH UDP --->

<---ICANREACH UDP

A couple of things. First, if you don't see dlsw udp-disable on either side, I am surprised that the CANUREACH would ever go over TCP. Second, you see the CANUREACH making it across, but do you see the ICANREACH making it back to the router at the other side? It has to make it into the router, past any ACLs on the interface. I suspect it isn't; otherwise the problem would not exist and reachability would be established. Anyway, since it appears to be a problem, the "dlsw udp-disable" should fix that.

ringwyrm Wed, 04/16/2008 - 09:12

Ahh.. well there is the weirdness maybe.

There are no ACLs or firewalls blocking anything in the path.

We are seeing this:

CANUREACH (UDP) --->

<--- ICANREACH (TCP)

We are seeing a TCP response to a UDP request. Eventually a TCP request comes in, a TCP response occurs, and the circuit recovers.

mbinzer Wed, 04/16/2008 - 09:15

Hi,

just to be sure.

We are talking about cisco to cisco, right?

What versions of IOS are you running on both routers?

thanks...

Matthias

ringwyrm Wed, 04/16/2008 - 09:24

Yes it is Cisco to Cisco.

On the FEP/SDLC end:

2811 w/ 12.4(8d) Advanced Enterprise

On the secondary/ethernet end:

7200 NPE-G2 w/12.4(15)T1 Advanced Ent w/SNASw

jihicks Wed, 04/16/2008 - 08:43

I think what you were seeing is a CANUREACH explorer ( CUR_ex ) over UDP and a CANUREACH CIRCUIT START ( CUR_cs ) over TCP. The explorer comes first and then the circuit start.
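A rough way to picture that, combined with the earlier note about "dlsw udp-disable": explorers go out over UDP unicast unless UDP is disabled, and circuit start always uses the TCP peer session. The Python below is only an illustrative model of what is described in this thread, not router code; the function name and message labels are assumptions.

```python
def transport_for(message: str, udp_disabled: bool = False) -> str:
    """Return the transport this thread describes for a DLSw message."""
    # Per the discussion here, only the explorer (CUR_ex) is a UDP candidate.
    explorers = {"CUR_ex"}
    if message in explorers and not udp_disabled:
        return "UDP"
    # Circuit start, and everything else once UDP is disabled, uses TCP.
    return "TCP"


for msg in ("CUR_ex", "CUR_cs"):
    print(msg, transport_for(msg), transport_for(msg, udp_disabled=True))
```

So a capture showing UDP CANUREACH frames followed by one TCP CANUREACH does not necessarily mean the transport changed; it may just be the next step of the exchange.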

ringwyrm Wed, 04/16/2008 - 09:17

Yep... you're right.

So what triggers the circuit start suddenly? This is where I am lost. Is there a timer?

Do you want the case number?

ringwyrm Wed, 04/16/2008 - 09:26

Ok..

Well, I have a trace which will show that the secondary unrelentingly sends XIDs every five seconds or so when it is in a failed state. That trace is in the case.

jihicks Wed, 04/16/2008 - 09:40

Yes, and in the IP trace I see the UDP CANUREACH being sent out and no response received. I suspect that at the time we had the MAC of the SDLC side in the cache on the router. We can answer TEST locally. We cannot answer XID locally. Seems like we are back to the same place: disable UDP and see if that fixes the problem.

ringwyrm Wed, 04/16/2008 - 10:48

It did not fix the problem.

I am sending some new captures shortly. In this case, we see 1-for-1:

CANUREACH (TCP) --->

<--- ICANREACH (TCP)

Both ends show that the MACs/SAPs are found in the DLSw reachability cache. I see XIDs coming in from the secondary every second, and I do not see corresponding ID_STN.Ind going out over DLSw.

One thing I notice in the debugs is that the CANUREACH/ICANREACH packet debugs show the SAPs as (00 08). However, the configuration on the Tandem and on the router is (08 08). Does this difference matter?

jihicks Wed, 04/16/2008 - 19:15

No, the test frame going from SAP 08 to SAP 00 is OK.

In the last sniffer trace that you attached, I see two CUR_ex frames that match the correct MAC addresses: frame 12 (no response) and frame 25. The ICANREACH response to frame 25 is over TCP. Then I see CUR_cs (frame 49), ICR_cs (frame 51) and REACH_ACK (frame 53), followed by XIDFRAME (see RFC 1795) from 253.1 to 1.73, and then I see the XID Format 0 Type 2 from the Tandem. The circuit keeps progressing to connected. You said you had another with the udp-disable that you were going to attach?

Be careful, most of the CUR frames in this trace are for other MAC pairs than the ones we are working with.
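Since the trace mixes several MAC pairs, it can help to filter on one pair and check how far the RFC 1795 circuit setup got. Here is a small Python sketch of that idea. It assumes the capture has already been reduced to (message, source MAC, destination MAC) tuples; the sample trace, the second MAC pair, and all names are made up for illustration.

```python
# Message order for circuit setup as walked through above (RFC 1795 names).
EXPECTED_FLOW = [
    "CANUREACH_ex", "ICANREACH_ex", "CANUREACH_cs",
    "ICANREACH_cs", "REACH_ACK", "XIDFRAME",
]


def circuit_progress(frames, mac_pair):
    """Return the expected messages seen, in order, for one MAC pair."""
    wanted = frozenset(mac_pair)
    seen = []
    for message, src_mac, dst_mac in frames:
        # Skip frames that belong to other MAC pairs in the same capture.
        if frozenset((src_mac, dst_mac)) != wanted:
            continue
        # Only advance when the next expected message shows up.
        if len(seen) < len(EXPECTED_FLOW) and message == EXPECTED_FLOW[len(seen)]:
            seen.append(message)
    return seen


MAC_A, MAC_B = "4000.6660.34C1", "4000.3745.0103"    # pair from this thread
OTHER_A, OTHER_B = "4000.1111.0001", "4000.2222.0002"  # hypothetical pair

sample_trace = [                       # fabricated example, not the real capture
    ("CANUREACH_ex", MAC_A, MAC_B),    # explorer with no response
    ("CANUREACH_ex", OTHER_A, OTHER_B),
    ("CANUREACH_ex", MAC_A, MAC_B),
    ("ICANREACH_ex", MAC_B, MAC_A),
    ("CANUREACH_cs", MAC_A, MAC_B),
    ("ICANREACH_cs", MAC_B, MAC_A),
    ("REACH_ACK", MAC_A, MAC_B),
    ("XIDFRAME", MAC_A, MAC_B),
]

print(circuit_progress(sample_trace, (MAC_A, MAC_B)))
```

If the returned list stops short of XIDFRAME, the last entry shows where the exchange stalled for that pair.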

ringwyrm Mon, 04/21/2008 - 11:02

This issue has been resolved. The cause was a bridging loop formed by a backside link between two different customers of ours and a third party.

This was a surprise to me, because we were seeing the TEST frame P/F just fine...

I have a question, what MAC address does DLSw use for its multicast CUR and ICR messages? I'd like to filter on those on all of our customer-facing interfaces.

jihicks Tue, 04/22/2008 - 04:23

You are not using multicast, as you do not have this configured. You are using UDP unicast. All explorers are sourced from the MAC initiating the connection to the target MAC of the connection. Multicast and unicast refer to the IP address and not the MAC address. If you want to filter, this can be done at the interface (if bridging) or via DLSw icanreach and icanreach mac-exclusive.

http://www.cisco.com/en/US/tech/tk331/tk336/technologies_configuration_example09186a0080094135.shtml

ringwyrm Tue, 04/22/2008 - 04:42

Right...

I think we are going to filter on source MAC address on the opposite "sides" of the loop, if you know what I mean by that.

Why did the Test P/F seem to work? I am curious about that. Is that translated to a simple sort of ping/oam between DLSw peers? That would make sense.

jihicks Tue, 04/22/2008 - 04:50

Once reachability has been established and the remote MAC address is in the reachability cache, the test frame is answered locally by DLSw (nothing is sent to the peer) if the cache entry is not stale, i.e. age < 4 minutes. I think that is why it was being answered. If the cache entry is older than 4 minutes but less than 16 minutes (defaults), then reachability is verified by sending a CUR back to the peer where it is found in cache. If it is more than 16 minutes old, it ages out of the cache.
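Put as a decision rule, that looks roughly like the Python sketch below. It only models the default 4-minute and 16-minute thresholds described here; the function name and return labels are invented for illustration, and how an entry is treated at exactly 4 or 16 minutes is an assumption.

```python
FRESH_LIMIT_MIN = 4    # default: entries younger than this are answered locally
AGE_OUT_MIN = 16       # default: entries older than this leave the cache


def cache_action(age_minutes):
    """Return how a TEST/explorer is handled for a cache entry of this age.

    age_minutes is None when the MAC is not in the reachability cache.
    """
    if age_minutes is None:
        return "explore"            # nothing cached, so search for the MAC
    if age_minutes < FRESH_LIMIT_MIN:
        return "answer_locally"     # nothing is sent to the peer
    if age_minutes < AGE_OUT_MIN:   # boundary handling here is assumed
        return "verify_with_peer"   # CUR sent to the peer that owns the entry
    return "aged_out"


for age in (1, 10, 20, None):
    print(age, cache_action(age))
```

This is why a TEST poll/final exchange can look healthy for a few minutes after the real path has broken: a fresh entry is answered without the peer being consulted.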

mbinzer Tue, 04/22/2008 - 04:57

Hi,

A canureach_ex is a test or a contextless XID. If you have reachability over a specific peer, it is sent to this peer only. If there is no reachability, it is sent to all connected peers.

If the other end's status is FOUND local and the entry is less than 4 minutes old (by default), it will answer this directly with a positive response, without even bothering the real end system.

If the cache status is FOUND and the entry is between 4 minutes and 16 minutes old, it will send a verify to this reachability. If that is answered positively, it answers with an icanreach_ex.

If there is no reachability it will search on all local interfaces to find the requested resource.

thanks...

Matthias
