We have the following problem appearing in our environment:
All connections from remote sites to the AS/400 disappear unexpectedly. After 2-3 minutes, remote users get normal connectivity again.
The environment is the following:
A Frame Relay/ATM WAN (Frame Relay at the remote sites and ATM in the datacenter) terminates on a Cisco 7204VXR router, which is connected to a LAN switch with 10/100/1000 ports. The AS/400, with a Fast Ethernet interface, is also connected to a similar LAN switch. CEF is enabled on the router; the IOS version is 12.4.3a. We have the ETHSTD *ALL parameter set on the AS/400. Remote users use TCP/IP and Telnet, but there are a number of 5494 controllers which use SNA.
We got the problem for the first time after we enabled CEF on the router.
I suspect that the cause of the problem is Ethernet frames of different standards being accepted due to ETHSTD *ALL, but I can neither reproduce the problem nor understand why it behaves this way.
Has anybody had a similar problem?
I would be very thankful if somebody could help me solve it.
Thank you in advance.
There can be plenty of reasons for this behaviour. So the first task is to locate the problem area; the second is to fix the problem.
Which messages are produced on the Cisco router, the LAN switch and the AS/400?
First, the problem could be within the FR provider network.
Are you sure that during the time the users have problems there is connectivity to the main site?
You could check this (ping -t, SAA, etc.) to be sure.
Second, your LAN switching environment could have problems (STP loops, interface up/down events, root change, etc.). Are you sure everything is stable there?
Third, it could be the router with CEF. What error messages are produced? Can the client machines ping the AS/400 or the router during the time they are disconnected?
Hope this helps
The biggest problem is that there are absolutely no error messages on the routers or switches. On the router everything looks absolutely normal. All other network services on other servers (mail, SQL, storage, etc.) are up all the time and reachable from all remote sites. I've tried to check whether the problem appears because of Layer 2 (STP) or Layer 3 (OSPF) events, but there is no correlation.
About pinging the AS/400 during the outage: I set up an IP SLA monitor which attempts to connect to TCP port 23 and to ping the AS/400 once a minute from some of the remote sites. During an outage both the TCP connection and the ping fail, but there is no problem pinging the AS/400 or connecting to it via Telnet from the same LAN.
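For comparison with the IP SLA results, the same once-a-minute TCP check can be run from a host on the AS/400's own LAN. The following is only a rough Python sketch of such a probe, not part of the setup described above; the function names `tcp_probe` and `monitor` and the address in the usage comment are made up for illustration.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def monitor(host: str, port: int = 23, interval: float = 60.0, rounds=None) -> None:
    """Print one timestamped result per probe; rounds=None means run forever."""
    count = 0
    while rounds is None or count < rounds:
        state = "up" if tcp_probe(host, port) else "DOWN"
        print(time.strftime("%Y-%m-%d %H:%M:%S"), host, port, state)
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)


# Usage (hypothetical address of the AS/400):
# monitor("192.0.2.10", port=23, interval=60.0)
```

Timestamps from a probe like this, logged on both sides of the router, make it easier to line up an outage with ARP or CEF state captured at the same moment.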
So you are chasing a problem where all active components indicate normal behaviour.
Where does a traceroute to the AS/400 get stuck when there is no connectivity? Can the Cisco 7200 access the AS/400 all the time? Can the AS/400 send packets to a client IP during this time?
It could be a CEF problem indeed. If this is the case then you might end up either turning off CEF or upgrading the software.
Hope this helps
The Cisco 7200 itself can access the AS/400, as can all other hosts on the same LAN as the AS/400. There is no outage in communication that goes to another LAN segment via the Netscreen firewall while communication via the Cisco is down. Only hosts connected via the Cisco experience the problem. I cannot say where a traceroute from the AS/400 gets stuck (other people maintain the machine and they have never been able to catch the exact moment of an outage).
I treat turning off CEF as the last option, and unfortunately there is no guarantee that a software upgrade will help (we have 12.4.3a now).
So first of all I'm trying to understand WHY it happens, because I cannot see any pattern in when the problem appears. It can happen 3 times within 2 days and then disappear for 1-2 months...
I am also pretty much reaching the end of my wisdom ;-)
But you mentioned Netscreen Firewall ... this one could also kill all the connections. Are there any indications that the FW could cause the problem?
The firewall is no problem at all. We have two network segments: one is the WAN behind the 7204 and the other is a kind of DMZ behind the Netscreen. There was never any problem with traffic to the DMZ. The only problem is IP traffic to the WAN segment.
Is only the SNA traffic interrupted, or just the TCP/IP and Telnet traffic, or both? I would like to see a show tech and log from the 7200 router.
Also, have you opened an SR with Cisco TAC? That is usually the fastest way to reach resolution of a problem.
It seems that only TCP/IP traffic is interrupted.
I haven't opened a TAC ticket yet. I would just like to check one more time that this is not "a well known bug". Right now we have a ticket open with IBM, so we will see whether it helps... :-)
The log actually doesn't show any useful information. As I mentioned above, the router's behaviour during the traffic interruption is absolutely normal, and traffic from all other hosts can be sent without problems. Only traffic to/from the AS/400 (probably even only TCP/IP traffic) fails.
The output from "show tech" is too large. Which parts would you like to see? I will try to cut them out and then post them.
Thank you in advance.
Actually, I wanted to see the show tech in case it was SNA traffic that was failing (that's my area of expertise), so I don't need it. However, if you need to open a TAC case (which is looking more and more likely), that will be the first thing they'll ask for.
There is a very old issue with AS/400s and IP. The subnet first learns the MAC address of the AS/400 via one subnetwork protocol (encapsulation); the default on the AS/400 is ARPA. Then another device uses a different subnetwork architecture, say SNAP, to send and receive ARP for the AS/400. The AS/400 then sends out a gratuitous ARP for SNAP and will never answer an ARPA ARP request again. IBM has a fix for it. The workaround is to define a static ARP entry in the router and specify arpa. This usually fixes it if that is the problem, which your case sounds symptomatic of.
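To make the two encapsulations concrete, here is a minimal Python sketch (not taken from any IBM or Cisco tool) that builds the same payload as an ARPA (Ethernet II) frame and as a SNAP (IEEE 802.3 with LLC/SNAP header) frame, and classifies a raw frame by looking at the type/length field. The source MAC address, the all-zero placeholder payload and the function names are assumptions for illustration only.

```python
import struct
from typing import Optional, Tuple

ETHERTYPE_ARP = 0x0806
LLC_SNAP = b"\xaa\xaa\x03"      # DSAP, SSAP, control byte of an LLC/SNAP header
SNAP_OUI_ZERO = b"\x00\x00\x00"  # OUI 00-00-00: the next 2 bytes are an EtherType

BROADCAST = b"\xff" * 6
EXAMPLE_SRC = bytes.fromhex("0004ac000001")  # hypothetical MAC address


def build_arpa(payload: bytes, ethertype: int = ETHERTYPE_ARP) -> bytes:
    """Ethernet II: bytes 12-13 carry the EtherType."""
    return BROADCAST + EXAMPLE_SRC + struct.pack("!H", ethertype) + payload


def build_snap(payload: bytes, ethertype: int = ETHERTYPE_ARP) -> bytes:
    """IEEE 802.3: bytes 12-13 carry a length, followed by LLC/SNAP and EtherType."""
    body = LLC_SNAP + SNAP_OUI_ZERO + struct.pack("!H", ethertype) + payload
    return BROADCAST + EXAMPLE_SRC + struct.pack("!H", len(body)) + body


def classify_frame(frame: bytes) -> Tuple[str, Optional[int]]:
    """Return (encapsulation, ethertype) for a raw frame without FCS."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len >= 0x0600:
        # Values of 0x0600 and above are EtherTypes, so this is Ethernet II.
        return "arpa", type_or_len
    if frame[14:17] == LLC_SNAP and len(frame) >= 22:
        if frame[17:20] == SNAP_OUI_ZERO:
            (ethertype,) = struct.unpack("!H", frame[20:22])
            return "snap", ethertype
        return "snap", None
    # Any other 802.3 frame with a plain 802.2 LLC header.
    return "sap", None


arp_payload = bytes(28)  # placeholder with the size of an IPv4 ARP packet
print(classify_frame(build_arpa(arp_payload)))
print(classify_frame(build_snap(arp_payload)))
```

The point of the sketch is only that one ARP payload can arrive in two different framings; an ARP entry learned with one encapsulation is of no use if the peer answers only in the other.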
Thank you for your answer.
The behaviour seems to be quite similar to what we have.
I would just like to ask one more thing: when you write that the "as400 sends out a gratuitous ARP for SNAP and will never answer a ARPA ARP request again", does it mean that it will never answer ANY ARPA ARP request, or just ARPA ARP requests from the host which used the other subnet protocol?
Thank you in advance.
It will never answer any other ARPA ARP request from anyone else. So, while it will still talk using ARPA IP frames, the other hosts will age out their ARP cache entries and re-ARP. Without the refresh, IP connectivity dies. I am not sure this is your issue, but it sounds symptomatic of it. Plus the fix is easy: make a static ARP entry in the router for the AS/400 with the encapsulation that was learned dynamically.
Well, it seems that this is not exactly our case, because even when the hosts behind the router cannot reach the AS/400, there is no problem reaching it from hosts on the same LAN as the AS/400 or from a host behind another router.
But it looks more and more likely that it can be fixed with static ARP entries on both sides. I will try to check this. Unfortunately, there is no clear criterion for whether the problem is solved, so verification will be quite difficult.
Thank you for your help.