DLSw drops sporadically

g.schaarup
Level 1

Setup

6509 (IOS 12.1(13)E6) running standard DLSw (i.e., no Ethernet redundancy) over FastEthernet to a 7513 (IOS 12.1(14)) connected to the mainframe over a CIP2/IBM channel.

1.

Our PCs running LU6.2 over Ethernet connected to the 6509 sporadically lose their connection because the DLSw peer drops. (The DLSw peer carries at most 100 circuits.)

2.

The 6509 has other DLSw peers to other routers that are unaffected by the dropping peer.

3.

Likewise, the 7513 that is the peer for the dropping DLSw connection has other peers that are unaffected.

4.

In a DLSw debug I see that the drop is caused by failed DLSw keepalives.

5. Since I found bug CSCdw58350, I would appreciate some details on how to determine whether it is CSCdw58350 that is causing my problems, or any other suggestions as to the cause. Thanks.

Regards

Gert Schaarup

6 Replies

dixho
Level 6

Just before the DLSw peer drops, do a show tcp on the 7513 and look at the KRTT value. If the value is larger than 4.8 seconds, you have run into CSCdw58350. Based on my experience, you only have about 4 minutes in which to do the show tcp.
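
If catching the router in that window is difficult, saving the show tcp output to a file and scanning it offline helps. A minimal sketch in Python, assuming KRTT is printed as "KRTT: <n> ms" (the field layout varies by IOS release, so verify against your own output):

import re

# Threshold from the discussion above: KRTT beyond 4.8 s suggests CSCdw58350.
KRTT_LIMIT_MS = 4800

def check_krtt(show_tcp_output: str) -> None:
    """Scan saved 'show tcp' output for KRTT values above the threshold."""
    for match in re.finditer(r"KRTT:\s*(\d+)\s*ms", show_tcp_output):
        krtt_ms = int(match.group(1))
        if krtt_ms > KRTT_LIMIT_MS:
            print(f"KRTT {krtt_ms} ms exceeds {KRTT_LIMIT_MS} ms: consistent with CSCdw58350")
        else:
            print(f"KRTT {krtt_ms} ms looks normal")

# Hypothetical capture taken just before a peer drop:
check_krtt("SRTT: 300 ms, RTTO: 607 ms, RTV: 3 ms, KRTT: 5200 ms")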

Another way is to take a sniffer trace between the 7513 and the 6509 (i.e., anywhere on the IP path of the DLSw connection). If you hit CSCdw58350, you will find that the time between two retransmissions increases beyond 4.8 seconds.
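
The sniffer-trace check can also be automated. A sketch using scapy, assuming the capture is saved as dlsw_peer.pcap (a hypothetical file name) and that the peers use the DLSw TCP port 2065 seen in the debug output below:

from scapy.all import rdpcap, IP, TCP  # pip install scapy

# Gap threshold from the bug signature described above.
GAP_LIMIT_S = 4.8

def find_slow_retransmits(pcap_path: str, dlsw_port: int = 2065) -> None:
    """Flag repeated TCP sequence numbers (retransmissions) on the DLSw
    peer connection whose spacing exceeds GAP_LIMIT_S seconds."""
    last_seen = {}  # (src, dst, sport, dport, seq) -> last timestamp
    for pkt in rdpcap(pcap_path):
        if not (IP in pkt and TCP in pkt):
            continue
        tcp = pkt[TCP]
        if dlsw_port not in (tcp.sport, tcp.dport):
            continue
        key = (pkt[IP].src, pkt[IP].dst, tcp.sport, tcp.dport, tcp.seq)
        now = float(pkt.time)
        if key in last_seen and now - last_seen[key] > GAP_LIMIT_S:
            print(f"{key}: retransmit gap {now - last_seen[key]:.1f} s")
        last_seen[key] = now

find_slow_retransmits("dlsw_peer.pcap")

Any flagged gap beyond 4.8 seconds between two copies of the same segment is the signature described above.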

Thanks for the "how to tell".

But:

1. I haven't had the luck to catch a "just before" show tcp, but in a sniffer trace I do see retransmits just before the DLSw drops. But I also see retransmits for other peers that seem to be unaffected (or maybe just survive this time). FYI, they also drop sporadically.

2. Just before the DLSw drop I see a lot of events going on. On the LU6.2 client-side router, a dlsw event debug shows the following events; could you please give me your comments on these? Thanks:

20:20:30: DLSw: Keepalive Request sent to peer 172.31.0.2(2065)

20:20:30: DLSw: dlsw_tcpd_fini() for peer 172.31.0.2(2065)

20:20:30: DLSw: tcp fini closing connection for peer 172.31.0.2(2065)

20:20:30: DLSw: START-TPFSM (peer 172.31.0.2(2065)): event:ADMIN-CLOSE CONNECTION

And on the CIP router:

-----------------------------------------------------------------------------------------------------

20:16:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:TCP-BUSY XPORT state:CONNECT

20:16:42: DLSw: dtp_action_r(), flow_ctrl (TCP_CONG_HIGH) for peer 172.16.0.1(2065)

20:16:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:CONNECT->WAN_BUSY

20:17:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY

20:17:42: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)

20:17:42: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)

20:17:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY

20:18:12: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY

20:18:12: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)

20:18:12: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)

20:18:12: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY

20:18:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY

20:18:42: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)

20:18:42: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)

20:18:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY

20:19:27: DLSw: Keepalive Request send failed to peer 172.16.0.1(2065)

20:19:30: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY

20:19:30: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)

20:19:30: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)

20:19:30: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY

20:20:29: DLSw: dlsw_tcpd_fini() for peer 172.16.0.1(2065)

20:20:29: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:ADMIN-CLOSE CONNECTION state:WAN_BUSY

20:20:29: DLSw: tcp fini closing connection for peer 172.16.0.1(2065)

20:20:29: DLSw: dtp_action_b() close connection for peer 172.16.0.1(2065)

20:20:29: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->DISCONN

The debug shows that the 7513 fails to send DLSw keepalives to the LU6.2 client peer because the TCP queue on the 7513 is too long. I cannot tell whether you hit the bug or there is a problem on the IP path from the 7513 to the LU6.2 client site.

As I do not know the network topology, the information that the other peers stay up does not help me. Do the peers that stay up during the outage share the same IP network infrastructure?

As per my previous posting, you need to get a show tcp just before the outage, or a sniffer trace of the DLSw peer connection, in order to confirm whether you hit the bug. Even though I think it is very likely that the bug is the root cause, you need the information I mentioned before to confirm it.

Thanks

The other peers that stay up during an outage on another peer share the same IP infrastructure (pure LAN-to-LAN / FastEthernet).

But I have a suspicion that the TCP queue on the LU6.2 client side is getting too long because of LU6.2 data that can't get through the peer to the mainframe at the CIP router end (at the other end of the DLSw peer).

Does this make sense? Is there any way to check this?

As I do not know the network topology, it is difficult to comment on this. What is the minimum bandwidth along the path of the DLSw peer connection?

DLSw has a built-in flow-control mechanism. If the LU6.2 sends more data than the TCP pipe can handle, DLSw sends half-window and reset-window indicators, which cause the router to send RNR (receiver not ready) to the LU6.2.
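
For readers less familiar with the mechanism: DLSw circuit flow control is window based (RFC 1795), and the half-window and reset-window operators are how the receiving peer tells the sender to slow down or stop. A toy model of the window arithmetic only, not the IOS implementation:

class DlswCircuitWindow:
    """Toy model of DLSw circuit flow control.

    The receiver grants a window of frames; under congestion it can send
    a half-window operator (window is halved) or a reset-window operator
    (window drops to zero). With no window left, the router has nothing
    to grant the end station, hence the RNR toward the LU6.2."""

    def __init__(self, granted: int = 20):
        self.window = granted

    def half_window(self) -> None:
        self.window = max(1, self.window // 2)

    def reset_window(self) -> None:
        self.window = 0  # sender must stop until new units are granted

    def can_send(self) -> bool:
        return self.window > 0

w = DlswCircuitWindow()
w.half_window()       # congestion building: window 20 -> 10
w.reset_window()      # congestion severe: window -> 0
print(w.can_send())   # False: RNR goes out to the end station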

From what I have seen in the debug messages, the 7513 is unable to send any TCP segments to the MSFC. Please be aware that the trigger point of the bug is a significant amount of segments dropped on the network (if the MSS is 536, the minimum data dropped on the network is about 6000 bytes). I have encountered a situation where only one DLSw peer drops while the other DLSw peer connections stay up, and the root cause was still the TCP timer bug. Upgrading the IOS to 12.1(15) or above resolves 99% of related problems.
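
The 6000-byte figure is just MSS arithmetic: with a 536-byte MSS, roughly a dozen consecutive segments have to be lost before the trigger point is reached. A quick check:

MSS = 536                   # value quoted above
TRIGGER_BYTES = 6000
print(TRIGGER_BYTES / MSS)  # ~11.2, i.e. about 11-12 lost segments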

To verify your theory, do a show dlsw circuits detail and look at the half-window and reset-window indicators. Also look at the DLSw circuit send/receive window.

Go to the CIP console and do an llc stat all.
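
When comparing several captures, a small helper can pull the flow-control counters out of saved show dlsw circuits detail output. A sketch that assumes the counters appear as "Half: <sent>/<rcvd> Reset <sent>/<rcvd>", the layout quoted later in this thread; verify against your IOS release:

import re

def circuit_flow_counters(detail_output: str) -> None:
    """Extract half-window and reset-window counters from saved
    'show dlsw circuits detail' output (assumed layout)."""
    pattern = r"Half:\s*(\d+)/(\d+)\s+Reset\s*(\d+)/(\d+)"
    for m in re.finditer(pattern, detail_output):
        half_tx, half_rx, reset_tx, reset_rx = map(int, m.groups())
        print(f"half-window sent/rcvd {half_tx}/{half_rx}, reset sent/rcvd {reset_tx}/{reset_rx}")
        if half_tx or reset_tx:
            print("  -> this side has asked the peer to slow down")

circuit_flow_counters("... Half: 6/2 Reset 1/0 ...")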

Thanks

1. I will go for the IOS upgrade, but do I need to upgrade both ends of the DLSw peer (i.e., the 7513 with the CIP and the 6500 with the LU6.2 workstations)?

2. After we moved some of the LU6.2 clients to Token Ring (i.e., direct SNA to the host), the DLSw peer seems to stay up, but we still see DLSw circuits get disconnected and then reconnect shortly afterwards, without direct loss of LU6.2 data.

3. I do see a few half-window and reset operations (like Half: 6/2 Reset 1/0), but could you please comment on the following output (taken 4 and 9 July while the peer was up, as the peer hasn't gone down since we moved some of the load) to see whether it confirms that the bug is the problem. Thanks.

07:54:01.231 PDT Fri Jul 4 2003

----------------------------------------

At the 7513 end:

CIP-Slot4#llc stat all

Maximum connections = 1200

Highest connections = 169

Current connections = 59

Connections rejected = 0

Connect requests = 5615

Connect responses = 0

Connect confirms = 5615

Connect indications = 0

Disconnect confirms = 207

Disconnect indications = 5128

Flow off terminated conn = 0

Session timeouts = 99

FRMRs sent = 1

FRMRs received = 1

SABMEs sent = 5632

SABMEs received = 0

DISCs sent = 542

DISCs received = 5027

UAs sent = 5027

UAs received = 5811

DMs sent = 0

DMs received = 10

IDLE connections = 0

Connection thresheld = 1200

Frames transmitted = 130903182

Transmit completes = 130903182

Frames received = 124105764

Flow offs sent = 43844

Flow ons sent = 43810

Flows (on/off) cancel = 180

Flow offs received = 0

Flow ons received = 0

CCB indications sent = 92883

CCB indications freed = 92883

Unexpected events = 36

Ack Timer Missed = 0

Frames ignored = 4

Data requests = 101195500

Data requests freed = 101193980

Aggregate data req qlen = 0

Maximum data req qlen = 0

Data indications = 87183015

Data indications freed = 87183015

Aggregate data ind qlen = -1457917659

Maximum data ind qlen = 87183015

Polls sent = 152722

Finals sent = 18591962

Polls rcvd = 18625036

Finals rcvd = 158403

RR polls sent = 152722

RNR polls sent = 0

REJ_polls sent = 0

RRs sent = 29607348

RRs received = 36878480

RNRs sent = 80011

RNR received = 27452

REJs sent = 1487

REJs received = 4178

Bad PDUs received = 0

Bad N(r)s received = 5

Bad N(s)s received = 1747

Out-of-seq N(s)s received = 38

Acknowledgements delayed = 81673676

Iframes transmitted = 101203134

Iframes retransmitted = 9115

Retries = 7507

Local busy entered = 96226

Local busy exited = 96226

Remote busy entered = 9737

Remote busy exited = 9429

Unexpec Ack = 0

Adapter registered = 4

SAPs registered = 1

CEPs registered = 59

Msgs to entity mgr = 31

Msgs to PSAP = 58712

Msgs to PCEP = 101206483

Msgs to DLU = 87346909

Invalid handles = 0

SAPs activated = 8

Deactivate SAP requests = 5

Deactivate SAP confirms = 5

SAPs reset by user = 2

SAPs reset by network = 0

SAP deact/reset completed = 7

SAPs deact/reset w/ CCBs = 2

SAPs frame drops = 0

Hash buckets used = 3439

Hash table collisions = 2181

Hash table lookups = 124174704

Hash table caparisons = 198849566

Hash table lookups failed = 58305

Longest collistion chain = 19

Receive erroneous frame = 0

DMA write operations = 130999946

DMA write completions = 130999946

DMA write errors = 0

DMA write waits = 0

DMA read notifications = 124534436

DMA read operations = 124534436

DMA read completions = 124534435

DMA read errors = 0

DMA reads discarded = 0

DMA reads discarded small = 0

DMA read chains freed = 0

Frames exceed MAX frameSZ = 0

Timer ticks (100ms) = 222075154

Timeouts = 6502680

Timers started = 275098246

Timers stopped = 268595506

T1 timers expired = 7607

Idle timers expired = 91521

T2 timers expired = 6403552

Longest timeout chain = 58

P/F timer started = 163700

P/F timer stopped = 158403

Ack timer started = 66234646

Ack timer stopped = 58411329

Busy timer started = 13354

Idle timer started = 145383718

T2 timer started = 63301341

T2 timer stopped = 63301340

Reject timer started = 1487

Reject timer stopped = 1589

LLC Stack locked = 363336715

LLC Stack delayed events = 101

LLC stack pending events = 0

LLC stack most pndng evnts= 1

LLC stack not locked.

--------------------------------------------------------------------------------------------------------

15:17:44.199 PDT Wed Jul 9 2003

-------------------------------------------

At the 7513 end:

CIP-Slot4#llc stat all

Maximum connections = 1200

Highest connections = 169

Current connections = 69

Connections rejected = 0

Connect requests = 10778

Connect responses = 0

Connect confirms = 10778

Connect indications = 0

Disconnect confirms = 301

Disconnect indications = 10216

Flow off terminated conn = 0

Session timeouts = 118

FRMRs sent = 1

FRMRs received = 1

SABMEs sent = 10815

SABMEs received = 0

DISCs sent = 628

DISCs received = 10096

UAs sent = 10096

UAs received = 11037

DMs sent = 0

DMs received = 12

IDLE connections = 0

Connection thresheld = 1200

Frames transmitted = 153147701

Transmit completes = 153147701

Frames received = 146574711

Flow offs sent = 52506

Flow ons sent = 52472

Flows (on/off) cancel = 198

Flow offs received = 0

Flow ons received = 0

CCB indications sent = 115314

CCB indications freed = 115314

Unexpected events = 103

Ack Timer Missed = 0

Frames ignored = 4

Data requests = 119159228

Data requests freed = 119157594

Aggregate data req qlen = 0

Maximum data req qlen = 0

Data indications = 102290655

Data indications freed = 102290655

Aggregate data ind qlen = -2074303823

Maximum data ind qlen = 102290655

Polls sent = 182235

Finals sent = 21206449

Polls rcvd = 21242583

Finals rcvd = 193195

RR polls sent = 182235

RNR polls sent = 0

REJ_polls sent = 0

RRs sent = 33862716

RRs received = 44225534

RNRs sent = 93378

RNR received = 30175

REJs sent = 2075

REJs received = 4686

Bad PDUs received = 0

Bad N(r)s received = 5

Bad N(s)s received = 2441

Out-of-seq N(s)s received = 69

Acknowledgements delayed = 95797452

Iframes transmitted = 119167992

Iframes retransmitted = 10275

Retries = 7696

Local busy entered = 108966

Local busy exited = 108966

Remote busy entered = 11023

Remote busy exited = 10715

Unexpec Ack = 0

Adapter registered = 4

SAPs registered = 1

CEPs registered = 69

Msgs to entity mgr = 35

Msgs to PSAP = 85729

Msgs to PCEP = 119180620

Msgs to DLU = 102514497

Invalid handles = 0

SAPs activated = 9

Deactivate SAP requests = 6

Deactivate SAP confirms = 6

SAPs reset by user = 2

SAPs reset by network = 0

SAP deact/reset completed = 8

SAPs deact/reset w/ CCBs = 2

SAPs frame drops = 0

Hash buckets used = 8529

Hash table collisions = 2283

Hash table lookups = 146676148

Hash table caparisons = 224807587

Hash table lookups failed = 80517

Longest collistion chain = 19

Receive erroneous frame = 0

DMA write operations = 153281642

DMA write completions = 153281642

DMA write errors = 0

DMA write waits = 0

DMA read notifications = 147050284

DMA read operations = 147050284

DMA read completions = 147050284

DMA read errors = 0

DMA reads discarded = 0

DMA reads discarded small = 0

DMA read chains freed = 0

Frames exceed MAX frameSZ = 0

Timer ticks (100ms) = 226558498

Timeouts = 7361429

Timers started = 324359028

Timers stopped = 316997530

T1 timers expired = 7815

Idle timers expired = 109334

T2 timers expired = 7244280

Longest timeout chain = 58

P/F timer started = 203551

P/F timer stopped = 193195

Ack timer started = 78970086

Ack timer stopped = 68892536

Busy timer started = 14376

Idle timer started = 170688653

T2 timer started = 74480287

T2 timer stopped = 74480287

Reject timer started = 2075

Reject timer stopped = 2229

LLC Stack locked = 426985053

LLC Stack delayed events = 120

LLC stack pending events = 0

LLC stack most pndng evnts= 1

LLC stack not locked.