URGENT: Sensors hang after update to 4.1.1S47(48)!

DSmirnov
Level 1

My sensors hang every one to two days. This started happening after the upgrade to 4.1.1S37. The only workaround is to power-cycle/reset the sensor!

Is anyone else observing the same?

17 Replies

DSmirnov
Level 1

Sorry to escalate the issue, but I now have two sensors in different locations running 4.1.1 that hang every second day until I power-cycle them. There are no messages in /var/log/messages and no login prompt on the console.

The problem occurs on sensors with 40-50 Mbit/s on the sensing interface. Cisco, maybe you have a problem with the new Ethernet drivers in 4.1.1 under high load?

Unfortunately, I don't see much sense in going back to 4.0.2, since Cisco has stopped releasing updates for the old version.

I had a possibly similar failure after upgrading to 4.1: the sensor (a 6500 IDSM2) hung after running for a couple of days. SH VER showed AnalysisEngine as NotRunning. It seems to reboot OK. If it dies again, I'll open a TAC case.

/Chris Thomas, UCLA

I have a 4235 that is having the same problem, and I have opened a case. This is the 2nd instance in 8 days. The AnalysisEngine is intermittently Not Running after the Version 4.1(1)S47 upgrade.

rlew
Level 1

Same problem with our 4235 units. After upgrading to 4.1.1, both units hang whenever I try to view the EVENTS. I have opened two TAC cases and have not yet received any information on how to tackle the issue.

How were these systems upgraded (CD or minor update package)?

Any issues with stability prior to this upgrade?

Minor update package in my case.

No stability issues prior to upgrade.

It happened only on sensors with a high load: 30-40 Mbit/s in my case.

Since 30-40Mbps is considered a high load, I assume your sensors are either IDS-4220-E or IDS-4210 appliances. If so, make sure you have installed the available memory upgrade, which is required to run 4.1 code. The upgrade is available free to customers with an active SmartNet contract.

marcabal
Cisco Employee

When you say "hang whenever I try to view the EVENTS", do you mean that the CLI just sits there when you execute "show events"?

If so then this may just be a misunderstanding, or a side effect of a different problem.

When "show events" is executed (with out additional time parameters) it will start querying the EventStore for new alarms. As it receives new alarms it will display them to the user and continue querying. If there are no new alarms then it will wait and continue querying until new alarms show up.

"show events" is designed to continue this constant querying until the user types "Ctrl-C"

Ctrl-C will stop the querying and return the user to the sensor prompt.

For sensors that are finely tuned, or that are monitoring traffic with no attacks, it may take some time before any alarms are generated. In these scenarios it may look like the sensor is hung, when in fact it may be operating just fine, monitoring normal traffic without creating alarms because no attacks are being seen.

Some users have mistakenly thought the sensor was hung, and did not realize that Ctrl-C is the normal way to break out of "show events".

NOTE: Even when parameters are passed to "show events" it may display older alarms on the system, but once old events have been displayed it will continue querying for new events until the user types Ctrl-C.

So not seeing any events with "show events" and having to type Ctrl-C does not necessarily mean that there is a problem.

However, if sensorApp (AnalysisEngine) is dying then you will not see any new events when you execute "show events".

So if you come across a situation where "show events" is not showing new alarms then try doing the following:

1) Execute "show version" to see if Analysis Engine is still running. (Analysis Engine should ALWAYS be running. If it is not, make sure you are running the latest software version with the latest bug fixes. If the problem persists, you can look for open DDTS issues that may have workarounds. If you can't find an associated DDTS, contact the TAC. The TAC will need to contact engineering for help in determining what is crashing AnalysisEngine.)

2) Check the output of "show interface". Check to see if the proper sensing interfaces are being monitored. Check if their link status is up.

3) Wait a little bit and run "show interface" again. Check whether the packet counts have increased since the previous "show interface". Under the group statistics you can also check whether the alarm count is increasing. (A rough script that automates all three checks is sketched below.)
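A possible way to automate these three checks from a management host is sketched below in Python with paramiko. The hostname, credentials, 60-second sampling interval, and the loose output parsing are all assumptions, and the sketch presumes the sensor CLI will answer "show version" and "show interface" over a non-interactive SSH exec channel; a sensor that only offers an interactive shell would need invoke_shell() instead.

import re
import time
import paramiko

def run(client, command):
    # Run a CLI command on the sensor and return its output as text.
    _stdin, stdout, _stderr = client.exec_command(command)
    return stdout.read().decode(errors="replace")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sensor1.example.com", username="cisco", password="secret")  # placeholders

# Step 1: AnalysisEngine should always be running.
for line in run(client, "show version").splitlines():
    if "AnalysisEngine" in line:
        print(line.strip())
        if "NotRunning" in line:
            print("WARNING: AnalysisEngine is not running.")

# Steps 2 and 3: sample "show interface" twice and compare the numeric counters.
first = run(client, "show interface")
time.sleep(60)
second = run(client, "show interface")
client.close()

def counters(text):
    # Extract every integer in the output; crude, but enough to tell "changing" from "frozen".
    return [int(n) for n in re.findall(r"\d+", text)]

if counters(first) == counters(second):
    print("WARNING: no counters changed in 60 seconds -- check link status and traffic.")
else:
    print("Counters are changing; the sensing interfaces appear to be seeing traffic.")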

I am having this problem as well. Here are the symptoms. I'm running a 4230 with the latest signature pack: Cisco Systems Intrusion Detection Sensor, Version 4.1(1)S49

I leave it running, without even touching it, and after a number of hours, it hangs. It is usually accessible, but no longer registering signatures. My network is large enough that my test is whether a signature has been registered in the past ten minutes. To avoid touching the sensor itself and contaminating my results, I am querying the MySQL database on the IDS Event Viewer to determine whether signatures have been registered. On my network, a signature is detected at least every 30 seconds, so 10 minutes is giving it a LOT of leeway.

When it hangs, it is generally after it has been running for a while, anywhere from two to eight hours; the interval is completely unpredictable. When it's in that state, I can generally ssh to the sensor and issue a reset command. Also, I've been checking memory using "show vers", and when it's in this state memory usage is generally 89% or higher. During normal usage it runs at 50%-65%.

I opened a TAC case on this once before, and Walter recommended re-imaging the sensor. We have only one sensor, so that's a big deal for us, but this week I finally got around to doing it. After reimaging and re-importing my configs, it is back to the same behavior. I would like to monitor memory usage to be able to correlate memory usage with the hanging behavior, but IDS software doesn't seem to support SNMP and it's too difficult to do an SSH script to extract the memory usage. It's quite simple to nail down what time it hangs by just querying the MySQL db for the last receive_date and receive_time of received signatures from event_realtime_table.
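Since the post above mentions checking the IDS Event Viewer's MySQL database for the last receive_date/receive_time in event_realtime_table, a minimal watchdog along those lines might look like the sketch below. The connection details, database name, and the assumption that receive_date and receive_time concatenate into a 'YYYY-MM-DD HH:MM:SS' string are placeholders; only the table and column names come from the post above.

import datetime
import pymysql  # any MySQL client library would do; pymysql is just an example

THRESHOLD = datetime.timedelta(minutes=10)  # the 10-minute leeway described above

# Placeholder connection details for the IDS Event Viewer database host.
conn = pymysql.connect(host="iev-host", user="iev", password="secret", database="iev")
try:
    with conn.cursor() as cur:
        # Most recent event received by IEV.
        cur.execute(
            "SELECT CONCAT(receive_date, ' ', receive_time) "
            "FROM event_realtime_table "
            "ORDER BY receive_date DESC, receive_time DESC LIMIT 1"
        )
        row = cur.fetchone()
finally:
    conn.close()

if row is None:
    print("No events in event_realtime_table at all -- check the sensor.")
else:
    last_seen = datetime.datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
    age = datetime.datetime.now() - last_seen
    if age > THRESHOLD:
        print(f"Possible hang: last event was {age} ago ({last_seen}).")
    else:
        print(f"Sensor looks alive: last event was {age} ago.")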

I don't think this is a symptom of an overloaded unit as far as signature processing, because I have disabled a fairly large number of frequent yet bogus or unimportant signatures.

I just caught the last part of rlew's troubleshooting steps, and can verify that Analysis Engine appeared to be running at the last hang. I will compare "show interface" to a known-good status next time it fails and report back here.

There are a couple of potential problems that we are working on that could cause sensorApp to exit unexpectedly. There are two workarounds, one for each problem.

The first thing that you should do is ensure that signature 1200 is NOT disabled. This signature should be enabled at all times for the moment. If it is causing you a lot of noise, please filter it with the alarm filters for all IP addresses. This equates to it being disabled from the alert standpoint.

If signature 1200 is already enabled and you are still seeing sensorApp exit unexpectedly, then I would request that you disable the SMB engine. We believe that there may be a problem in this engine and are working on a fix.

If, after you have applied these two workarounds, you are still experiencing problems, please let me (klwiley@cisco.com) or the engineer who is watching this thread (kasper@cisco.com) know and we will contact you directly.

There will be patches available shortly for both of these defects.

Ok,

I had sig 1200 enabled on all sensors.

I disabled the SMB engine on all sensors with high load.

Now I have to wait a few days to see if they hang.

Just a reply to the previous answer:

I have a 4230, so 30-40 Mbit/s shouldn't be a problem, and I think I have enough memory.

BTW, I think in our environment I could set up any kind of monitoring using Nagios/SSH: CPU, memory, interface counts, etc.

Please let me know if I could measure something to locate the problem.
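For what it's worth, a Nagios-style check along the lines described above could be as small as the sketch below: ssh in, pull the memory figure from "show version", and alert above a threshold. The hostname, credentials, thresholds, and the "(NN% usage)" pattern are assumptions, and the 'show version' figure is only a rough indicator of real memory usage.

import re
import sys
import paramiko

WARN, CRIT = 80, 90  # percent memory usage thresholds (assumptions)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("sensor1.example.com", username="cisco", password="secret")  # placeholders
_stdin, stdout, _stderr = client.exec_command("show version")
output = stdout.read().decode(errors="replace")
client.close()

# Look for a "(NN% usage)" figure in the version output (assumed format).
match = re.search(r"\((\d+)%\s*usage\)", output)
if not match:
    print("UNKNOWN: no memory usage figure found in 'show version' output")
    sys.exit(3)

usage = int(match.group(1))
if usage >= CRIT:
    print(f"CRITICAL: memory usage {usage}%")
    sys.exit(2)
if usage >= WARN:
    print(f"WARNING: memory usage {usage}%")
    sys.exit(1)
print(f"OK: memory usage {usage}%")
sys.exit(0)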

Not to throw things in a different direction, but I found something that's somewhat telling. When my sensor hung last, as stated before, AnalysisEngine did not crash, but most of the memory was used up. In the "show int" stats, it showed over 835,000 TCP Packets currently queued for reassembly. Since that seemed abnormally high to me, I started watching that counter after reset. As I suspected, it continues to rise, even in a low-traffic environment. Without knowing the internals, might I suggest two possibilities: (a) the packets are not being properly processed, or (b) they are not being properly cleared after being processed. Oh, and this is still occurring (that is, the number rising steadily, now at 196,000 after being up for 3 hours 18 minutes on a low-load day) AFTER adjusting the TCP Stream Reassembly settings to "loose", a 30-second open established timeout (default is 900), a 15-second embryonic timeout, and a 32 max queue size.

Should that number continue to rise steadily? If so, the high number of queued packets could explain the high memory usage that causes my sensor to hang.
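One way to gather the data described above: poll the "TCP Packets currently queued for reassembly" value from "show interface" every few minutes and log it with a timestamp, so its growth can be lined up against memory usage and the eventual hang. The hostname, credentials, five-minute interval, and the exact counter label/format are assumptions taken from the wording of the post.

import re
import time
import paramiko

def queued_for_reassembly(host, user, password):
    # Pull "show interface" output and return the queued-for-reassembly count, if found.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    _stdin, stdout, _stderr = client.exec_command("show interface")
    output = stdout.read().decode(errors="replace")
    client.close()
    match = re.search(r"TCP Packets currently queued for reassembly\D*(\d+)",
                      output, re.IGNORECASE)
    return int(match.group(1)) if match else None

while True:
    value = queued_for_reassembly("sensor1.example.com", "cisco", "secret")  # placeholders
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "queued-for-reassembly =", value)
    time.sleep(300)  # sample every five minutes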

Thank you Dmitri and Robert.

For Dmitri's sensor, let us watch his memory usage and the 'sh in gr' stat of TCP queued to see if the TCP queue growth problem happens there.

Robert, your memory situation does sound serious, and the TCP queued packets should NOT be staying around for more than a couple of minutes. Just a quick note: the memory status reported by 'sh ver' is not very accurate. It does not show the main/cached/swap values that are important for determining the real memory usage. We use the service account tool 'top' for this.

Robert, please email me directly at kasper@cisco.com and include your TAC contact CC'ed on the mail.

We will keep pressing on looking into these possible bug areas.

-JK