switch stack often crashes with error message

pokemon · ‎06-16-2016

Hello,

I would appreciate some direction on a switch crash.

We are using an SG500 switch stack, but it often crashes with the error below, as logged on our syslog server.

abcdefg01.log:Jun 13 19:14:15 abcdefg01 1 2016-06-13T23:14:14Z abcdefg01 STCK SYSL - UNITMSG - %STCK SYSL-A-UNITMSG: UNIT ID 2,Msg:%SYSLOG-F-OSFATAL: UDPP_inet_dns_transmit_helper - transmit_handle was not initialized ***** FATAL ERROR ***** Reporting Task: AAAT. Software Version: 1.4.2.4 (date 21-Dec-2015 time 16:47:05) 0x16bd2c 0x16c290 0x167e50 0x7f5ce4 0x4581ac 0x45ca04 0x678a64 0x67959c 0x679a58 0x7203f4 0x1223f0 ***** END OF FATAL ERROR *****

Has anyone seen the same issue?

And is there a solution for this?

Looking at the uptimes, the master and backup units crash repeatedly, at random.

Units 1 and 2 have short uptimes, but units 3 and higher have much longer uptimes.

Thanks in advance,

Miyoshi

Mark Malone · ‎06-16-2016

Hi

I would move off that software version and go to 1.4.5.02, released on the 20th of last month. It looks like you have hit some form of bug.

pokemon · ‎07-09-2016

Thanks Jonathan and Mark,

I think I found what caused this crash...

At first I thought this was related to stacking, because all of our other SG500s, in other countries, are stacked.

But I see the same issue on a standalone switch.

This is the firmware information from the crashed switch.

All the others use the same boot code and software.

atviesw01#sh ver

Unit                SW version          Boot version        HW version
------------------- ------------------- ------------------- -------------------
1                   1.4.5.02            1.4.0.02            V03

atviesw01#

I have extracted only the suspicious part of the configuration.

This switch is running in L2 mode with 4 queues.

vlan database
default-vlan vlan 988
exit

dot1x system-auth-control

radius-server host dkradius.mydomain.com priority 10 usage dot1.x
radius-server host jpradius.mydomain.com priority 1 usage dot1.x
radius-server host source-interface vlan 989

ip domain name mydomain.com
ip name-server 10.100.104.103 10.100.32.110 10.100.32.109
ip domain polling-interval 8

interface vlan 106
name Guest
dot1x guest-vlan
!

interface vlan 989
name Data
ip address 10.100.28.21 255.255.252.0
no ip address dhcp
!

When I enable 802.1x on an interface as below, nothing happens at first.

interface gigabitethernet1/1/1

dot1x guest-vlan enable
dot1x reauthentication
dot1x authentication 802.1x mac
dot1x radius-attributes vlan static
dot1x port-control auto

But when a client computer connects to this interface and the switch starts 802.1x authentication, the switch crashes with the error message below.

Jul 8 13:26:01 atviesw01 1 2016-07-08T11:26:00Z atviesw01 SYSLOG - OSFATAL - %SYSLOG-F-OSFATAL: UDPP_inet_dns_transmit_helper - transmit_handle was not initialized ***** FATAL ERROR ***** Reporting Task: AAAT. Software Version: 1.4.5.02 (date 20-Apr-2016 time 12:24:28) 0x16bd2c 0x16c290 0x167e50 0x7f2848 0x454d10 0x459568 0x6755c8 0x676100 0x6765bc 0x71cf58 0x1223f0

This error seems to be related to DNS.
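To compare crash reports across units or firmware versions, the OSFATAL line can be picked apart with a short script. This is only a minimal sketch in Python, assuming the message layout seen in the two log lines quoted in this thread; the function names `parse_osfatal` and `shared_frames` are made up for illustration and are not part of any Cisco tool.

```python
import re

# Layout assumed from the two log lines quoted in this thread.
FATAL_RE = re.compile(
    r"%SYSLOG-F-OSFATAL:\s*(?P<reason>.*?)\s*\*{5} FATAL ERROR \*{5}\s*"
    r"Reporting Task:\s*(?P<task>\w+)\.\s*"
    r"Software Version:\s*(?P<version>[\d.]+)"
)
ADDR_RE = re.compile(r"0x[0-9a-fA-F]+")


def parse_osfatal(line):
    """Return reason, task, version and trace from one OSFATAL line, or None."""
    m = FATAL_RE.search(line)
    if m is None:
        return None
    return {
        "reason": m.group("reason"),
        "task": m.group("task"),
        "version": m.group("version"),
        # Every hex address that follows the version string.
        "trace": ADDR_RE.findall(line[m.end():]),
    }


def shared_frames(trace_a, trace_b):
    """Addresses that appear in both traces, kept in trace_a order."""
    seen = set(trace_b)
    return [addr for addr in trace_a if addr in seen]
```

Run on the 1.4.2.4 and 1.4.5.02 log lines in this thread, both report task AAAT and the same reason string, and `shared_frames` returns the first three addresses plus the last one. That is consistent with the same code path failing on both versions, but it does not prove it.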

I also use host names for the TACACS and SNTP servers, and those work fine.

However, a host name for the RADIUS server seems to be a problem and causes the system crash.

If I change the host name to an IP address, 802.1x works just fine.
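The workaround above (an IP address instead of a host name) can be applied to a saved copy of the configuration with a short script. This is a sketch in Python, assuming `radius-server host` lines shaped like the ones earlier in this post; `pin_radius_hosts` is a hypothetical helper name, and the resolver defaults to the standard library's `socket.gethostbyname`.

```python
import re
import socket

HOST_LINE = re.compile(r"^(radius-server host )(\S+)(.*)$")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def pin_radius_hosts(config_text, resolve=socket.gethostbyname):
    """Replace DNS names in 'radius-server host' lines with resolved IPs."""
    out = []
    for line in config_text.splitlines():
        m = HOST_LINE.match(line)
        # Leave the source-interface line and already-numeric hosts alone.
        if m and m.group(2) != "source-interface" and not IPV4.match(m.group(2)):
            line = m.group(1) + resolve(m.group(2)) + m.group(3)
        out.append(line)
    return "\n".join(out)
```

The address is pinned at the moment the script runs, so a later DNS change for the RADIUS server will not be picked up by the switch.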

What do you think? Is this a bug?

I would appreciate your opinions very much!

Miyoshi

Dmitry Chernyavsky · ‎05-26-2019

I can confirm that the bug still exists on 1.4.10.6 on my SG300.

Steps to reproduce:

1. Add two RADIUS servers by DNS name with different priorities.
2. Make the first server in the list unavailable somehow (stop the service, disable the RADIUS client, or whatever works for you).
3. Try to connect several 802.1x-enabled workstations to the switch at the same time, within 1-3 seconds (more than one, because with a single workstation everything is fine).
4. CABOOOM! FATAL ERROR, the switch reboots.

I tried the same config with only one RADIUS server as a DNS name: no problem (except that you don't have a backup RADIUS connection).
I tried connecting only one workstation at a time: no problem, the switch uses the backup RADIUS server.
But a lot of workstations at the same time plus more than one RADIUS server as a DNS name = FATAL ERROR.

Can we get a fix for that, or is the SG300 close to EOL?

Saro Gurunlu · ‎06-28-2019

I was also just hit by this bug with the same exact configuration as outlined by you. This is on the lower-end SG200 series running firmware 1.4.9.4.

DesmoPR · ‎05-18-2020

Hi guys,

I have SF500, SF300 and SG500 (4 units in a stack) switches in my network. ALL devices have the same problem with the latest firmware (the January firmware).

Steps to reproduce:

1) Configure RADIUS authentication for web or SSH with 2 RADIUS servers. In my case the first one is set by IP and the backup one by FQDN (just in case I give my RADIUS server a new IP).

2) Switch off the RADIUS server.

3) Try to log in to the web UI or via SSH and... CABOOOM! FATAL ERROR, the switch reboots.

Steps to fix:

Set the backup RADIUS server to a different priority! Yes, in my case it was enough to set a different priority.

Before, both servers had priority 1; after, they had priorities 1 and 2.
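The conditions reported in this thread (RADIUS servers given by DNS name, and two servers sharing one priority) can be checked in a saved configuration before a switch goes into service. This is a rough sketch in Python, assuming `radius-server host` lines shaped like the ones quoted earlier in the thread; `radius_warnings` is a hypothetical helper name, and the checks only mirror what posters here observed, not a confirmed root cause.

```python
import re
from collections import Counter

HOST_LINE = re.compile(r"^radius-server host (\S+)(.*)$")
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
PRIORITY = re.compile(r"\bpriority (\d+)")


def radius_warnings(config_text):
    """List the RADIUS settings that posters in this thread tied to the crash."""
    servers = []  # (host, priority), priority is None when the line sets none
    for raw in config_text.splitlines():
        m = HOST_LINE.match(raw.strip())
        if m is None or m.group(1) == "source-interface":
            continue
        p = PRIORITY.search(m.group(2))
        servers.append((m.group(1), int(p.group(1)) if p else None))

    warnings = []
    for host, _ in servers:
        if not IPV4.match(host):
            warnings.append(f"{host}: configured by DNS name")
    counts = Counter(prio for _, prio in servers)
    for prio, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
        if n > 1:
            label = "unset" if prio is None else prio
            warnings.append(f"{n} servers share priority {label}")
    return warnings
```

An empty list means neither condition was found. It does not mean the switch is safe, since an earlier report in this thread hit the crash with different priorities as well.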

Dear Cisco, can you check your firmware?

I have 15 units, all with the same problem.

Regards, Cristian

The Sx550 family is not affected.

jonrodr2 · ‎06-30-2016

Hello, my name is Jonathan and I am one of the engineers here on the Cisco SBSC team. I apologize for the inconvenience with these units.

I would suggest first checking whether all of the units in the stack are on the same firmware version and boot code. If they do not match across all of the switches, the first step would be to upgrade them so that they match; the latest version is 1.4.5.2, as Mark mentioned. It is then best practice to do a factory reset after the upgrade. You can save a backup configuration file beforehand in case you would like to proceed with it.

If you continue to have issues after these steps, feel free to contact us and open a ticket at 1-866-606-1866. Thanks, and have a great day.

Lance L · ‎08-25-2016

I am having the exact same problem with SG300 switches.

The following nmap command will cause any switch running 1.4.5.02 to crash.

nmap -sS -T3 -A -F

We have some switches running 1.3.0.62 which appear unaffected.

EDIT: I meant to post this here => https://supportforums.cisco.com/discussion/13071686/system-crash-again

>_<
