I am experiencing random calls dropping 4 seconds from answer.
The originating gateway always drops the call back to the A-party.
The SS7 REL cause is 41, and the Q.931 RELEASE cause from the gateway to the PGW is 102 (recovery on timer expiry).
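For reference, both release values come from the Q.850 cause-code space. A minimal lookup for the causes seen in this trace (names per Q.850; only a handful of the defined causes are listed) might look like:

```python
# Minimal Q.850 cause-code lookup for the values seen in this trace.
# Only a few of the many defined causes are included here.
Q850_CAUSES = {
    16: "Normal call clearing",
    41: "Temporary failure",
    102: "Recovery on timer expiry",
}

def cause_name(value: int) -> str:
    """Return the Q.850 cause name, or a placeholder for unlisted values."""
    return Q850_CAUSES.get(value, f"Unlisted cause {value}")

print(cause_name(41))   # cause in the SS7 REL
print(cause_name(102))  # cause in the Q.931 RELEASE toward the PGW
```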
The connection between PGW and Gateway is IUA/SCTP.
I have attached two screenshots of an Ethereal trace showing a good call and a bad call.
The originating gateway in these examples is 10.64.4.30, and the PGW is 10.64.3.2
The only difference I can see is that in the bad call there is a different message (SACK CONNECT - packet 10). As this message is never acknowledged, the subsequent 'disconnect' occurs 4 seconds later on timer expiry.
On the good call, there is a connect and an acknowledge, and the call continues normally.
It appears that the PGW may not be acknowledging the connect from the orig gateway.
This does not happen on every call - it is quite random - and it only occurs on this orig gateway.
Are there any changes I can make to either the orig gateway or the PGW that will fix this problem?
Alternatively, are there any traces you can suggest which will show the differences between the good and bad calls? Bear in mind this is random, and we may go 20 - 30 minutes between instances.
The PGW parameters are all default.
The current gateway parameters are listed below.
WRE_NAS01#sh ip sctp association list
** SCTP Association List **
AssocID: 3578654444, Instance ID: 0
Current state: ESTABLISHED
Local port: 9900, Addrs: 10.64.4.30
Remote port: 9900, Addrs: 10.64.3.3 10.64.6.3
AssocID: 4230620528, Instance ID: 0
Current state: ESTABLISHED
Local port: 9900, Addrs: 10.64.4.30
Remote port: 9900, Addrs: 10.64.3.2 10.64.6.2
WRE_NAS01#sh ip sctp association parameters 4230620528
Usually, after sending a "SETUP" message, the calling party waits 4 seconds (a timer value that can be changed) for any kind of reply from the remote switch - SETUP ACK, PROGRESS, ALERTING, CONNECT, or RELEASE COMPLETE. If no reply arrives within these 4 seconds, the connection may be torn down.
This can cause a problem when the called number is on an ancient analog rotary-dial PSTN switch, or is a mobile number (such as a roaming one). There is some delay until the called end-user "equipment" is contacted, and during this time the PSTN switch does not send any indication of progress (PROGRESS, or at least SETUP ACK).
Two solutions here:
1. Increase the timer on your side (to approximately 10 seconds) and see if it helps.
2. Ask the remote side to send SETUP ACK or PROGRESS messages.
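The timer behaviour described above can be sketched as a tiny state machine. This is a simplified model, not actual gateway code; the 4-second default and the release cause 102 mirror what the trace shows:

```python
import threading

class SetupTimer:
    """Simplified model of the post-SETUP supervision timer:
    start the timer when SETUP is sent, cancel it on the first
    backward message, and release with cause 102 if it expires."""

    FIRST_RESPONSES = {"SETUP_ACK", "PROGRESS", "ALERTING",
                       "CONNECT", "RELEASE_COMPLETE"}

    def __init__(self, timeout=4.0):
        self.timeout = timeout
        self._timer = None
        self.released_cause = None   # set to 102 on expiry

    def send_setup(self):
        # Arm the supervision timer when SETUP goes out.
        self._timer = threading.Timer(self.timeout, self._expire)
        self._timer.start()

    def receive(self, message):
        # Any first backward message stops the timer.
        if message in self.FIRST_RESPONSES and self._timer is not None:
            self._timer.cancel()
            self._timer = None

    def _expire(self):
        self.released_cause = 102    # recovery on timer expiry
```

With a reply (ALERTING, CONNECT, ...) the call survives; with silence for the full timeout, the model releases with cause 102, matching the behaviour described in the original post.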
The issue is more that the PGW does not send a connect acknowledge back to the gateway it received the connect from. This is why the gateway drops the call. A snoop on the PGW interface shows the connect come in from the gateway, followed 4 seconds later by the disconnect.
It is interesting that an ISUP trace shows no ANM being sent to the calling party either.
I am assuming that the connect acknowledge must be sent to the gateway before the ISUP side sends the answer message.
In all cases, the messages (both Q.931 and ISUP) between the PGW and the terminating gateway are correct. It is only between the originating gateway and the PGW that I see the problem.
In some cases, a duplicate setup (same call reference) comes from the PGW to the orig gateway, 4 seconds after the 1st setup. Prior to this, the snoop trace shows that both the orig and term gateways have reached the alerting state, waiting for connect. 4 seconds after the duplicate setup, the orig gateway drops the call with cv=102.
It is as though the PGW is missing messages, or is unable to keep track of what it has sent??? The strange thing is, that of the 20 gateways on the network, I only see this issue at one of them.
I have just opened a TAC case for this problem.
As an aside to this, do you know of a way to get Ethereal to decode RUDP/SM/MTP3/ISUP on ports other than 7000? In this case, the session set between the orig gateway and the PGW is on 7010, and Ethereal only sees it as a UDP malformed packet.
If I were you, I would troubleshoot the IP connectivity between the originating gateway and the PGW.
It may sound funny, but I had a problem with a VoIP network which partially stopped working due to a simple IOS upgrade. The reason: the newer IOS used H.323v4 (the previous was on H.323v3), where the H.323 packet size grows a bit and sometimes goes beyond the MTU, and the packets were not properly fragmented by the firewall.
So, troubleshoot the IP connectivity. Look at the UDP packet sizes. Try pinging between the units with large packets to find the maximum size of an *unfragmented* packet that is allowed to pass between them.
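One way to script that probe is a binary search over payload sizes. In this sketch the actual probe function is injected so the search logic itself can be tested; on Linux the probe could, for example, shell out to `ping -M do -s <size> <host>`, which sets the don't-fragment bit (the helper name and defaults here are illustrative, not from any tool):

```python
def max_unfragmented_payload(probe, lo=0, hi=1472):
    """Binary-search the largest payload size for which probe(size)
    succeeds. probe(size) should return True when an unfragmented
    packet of that payload size gets through (e.g. a DF-bit ping).
    Returns None if even the smallest size fails."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid
            lo = mid + 1     # probe passed: try a larger payload
        else:
            hi = mid - 1     # probe failed: try a smaller payload
    return best

# Example with a fake link that drops anything over 1400 bytes:
print(max_unfragmented_payload(lambda size: size <= 1400))  # -> 1400
```

1472 is the usual upper bound for the ICMP payload on a 1500-byte Ethernet MTU (1500 minus 20 bytes IP header and 8 bytes ICMP header).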
For the Ethereal:
* Go to Edit --> Preferences --> Protocols --> MTP3.
Make the changes there according to the SS7 variant you are using (ANSI / ITU)
* Start capturing
* Mark one of the packets in the list, right-click on it and choose "Decode As" ;)
Thanks for the Ethereal tip - I'll give that a try.
The initial problem has been found - it looks like an IOS bug.
Sometimes when the gateway responds to messages from the PGW, the message class in the Q.921 UA (IUA) part of the message is set to 11 (unknown) when it should be 5. Because the PGW does not recognise this message class, it ignores these messages and does not respond, causing the gateway to time out. All messages from the gateway to the PGW for the particular call reference are affected.
It only affects originating calls from the gateway, and only when IUA over SCTP is used.
I am waiting to hear back from the TAC about when a new release will be available which fixes this problem.
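For anyone filtering captures for this symptom: the message class sits at a fixed offset in the IUA common header (per RFC 4233: version, reserved, message class, message type, message length), and class 5 is QPTM, the Q.921/Q.931 boundary primitives transport class. A small checker along those lines (the function itself is just an illustration, not a Cisco or Wireshark tool) might look like:

```python
import struct

# IUA common header (RFC 4233):
#   version(1) reserved(1) message class(1) message type(1) length(4)
IUA_QPTM_CLASS = 5  # Q.921/Q.931 boundary primitives transport messages

def check_iua_header(data: bytes) -> dict:
    """Parse the 8-byte IUA common header and flag an unexpected
    message class (the bug above produced class 11 instead of 5)."""
    version, _reserved, msg_class, msg_type, length = \
        struct.unpack("!BBBBI", data[:8])
    return {
        "version": version,
        "class": msg_class,
        "type": msg_type,
        "length": length,
        "class_ok": msg_class == IUA_QPTM_CLASS,
    }

# A well-formed QPTM header vs. the corrupted class seen in the bug:
good = struct.pack("!BBBBI", 1, 0, 5, 1, 8)
bad = struct.pack("!BBBBI", 1, 0, 11, 1, 8)
print(check_iua_header(good)["class_ok"])  # True
print(check_iua_header(bad)["class_ok"])   # False
```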
I am using 12.4(6)T3; the problem was also present in 12.4(4)T2. I changed IOS versions because of a problem where the circuits on the PGW would remain in MATE_UNAVAIL if the SCTP association was lost in any way.