CM 5.1.1: utm.log constantly growing

Martin Ermel
VIP Alumni

windows platform

As far as I know, UTManager is the process writing to utm.log. What issue can cause these messages, along with some Java stack traces, to be written to the file constantly:

class com.cisco.nm.cm.ut.utm.es.snmp.SnmpDot1xAuthES: CB-Error-genEx: Could not retreive information: java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)

class com.cisco.nm.cm.ut.utm.es.snmp.SnmpDot1xAuthES: CB-Error-genEx: Could not retreive information: java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)

class com.cisco.nm.cm.ut.utm.es.snmp.SnmpDot1xAuthES: CB-Error-genEx: Could not retreive information: java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)

A snippet of the log file is attached.

I assume invalid data in the User Tracking database could be the cause, but right now it would be too risky to simply reinitialize the UT database and hope it works afterwards. UT is currently not working at all: it stops immediately after starting. UTManager is running and UTMajorAcquisition is idle.

17 Replies

Joe Clarke
Cisco Employee

This error most likely has nothing to do with data in the ANI database; it is probably indicative of a problem retrieving dot1x information from switches. Enabling TRACE-level debugging for UTManager would help narrow down the problem.
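As a rough illustration of that failure mode (plain Java only, not actual LMS code, and the helper name here is made up): if a device returns an empty or non-numeric value for a dot1x object and UTM parses it as an integer, the result is exactly this kind of stack trace:

// Plain-Java illustration (not LMS code): parsing an empty or non-numeric
// SNMP value as an integer fails exactly like the utm.log entries above.
public class Dot1xParseDemo {

    // hypothetical stand-in for whatever UTM does with a dot1x varbind value
    static int parseAuthState(String varbindValue) {
        return Integer.parseInt(varbindValue);   // throws NumberFormatException for "" or "noSuchInstance"
    }

    public static void main(String[] args) {
        try {
            parseAuthState("");                  // empty value returned by the agent
        } catch (NumberFormatException e) {
            // the trace ends in java.lang.NumberFormatException.forInputString(...), as in utm.log
            e.printStackTrace();
        }
    }
}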

However, if UT acquisition is not running, you have deeper problems. The ut.log and/or UTMajorAcquisition.log should have more details.

Attached is the NetPro_utm.log that I forgot to add to my first post...

I also attached UTMajor.info, which contains some snippets of UTMajorAcquisition.log and ut.log.

Without debugging enabled, UTMajorAcquisition.log contains just a few simple messages about sending ICMP requests and receiving replies, but all of them without timestamps... only the modification time of the file shows it was last touched on the 2nd of June.

The snippet of ut.log is not a current one, but the customer says the current log looks the same.

I know this is not the best information because it is not current, but at the moment I have no way to get the current ut.log. The snippet is from an issue in April where I found these entries in ut.log:

[...]

Configured NonRogueOUI's:

2009/04/17 14:57:06 main ERROR VmpsMajorDBUtil: Error Marking Inactive entriescom.sybase.jdbc2.jdbc.SybSQLException: SQL Anywhere Error -157: Cannot convert '' to a timestamp

com.sybase.jdbc2.jdbc.SybSQLException: SQL Anywhere Error -157: Cannot convert '' to a timestamp

at com.sybase.jdbc2.tds.Tds.processEed(Tds.java:3281)

at com.sybase.jdbc2.tds.Tds.nextResult(Tds.java:2548)

at com.sybase.jdbc2.jdbc.ResultGetter.nextResult(ResultGetter.java:69)

at com.sybase.jdbc2.jdbc.SybStatement.nextResult(SybStatement.java:220)

at com.sybase.jdbc2.jdbc.SybStatement.nextResult(SybStatement.java:203)

at com.sybase.jdbc2.jdbc.ParamManager.nextResult(ParamManager.java:722)

at com.sybase.jdbc2.jdbc.ParamManager.getOutValueAt(ParamManager.java:504)

at com.sybase.jdbc2.jdbc.SybCallableStatement.getInt(SybCallableStatement.java:304)

at com.cisco.nm.ani.server.vmpsadmin.VmpsMajorDBUtil.updateInactiveEntriesInDB(VmpsMajorDBUtil.java:1608)

at com.cisco.nm.ani.server.vmpsadmin.VmpsAdminSMFGenerateTable.markInactiveEntries(VmpsAdminSMFGenerateTable.java:3180)

at com.cisco.nm.ani.server.vmpsadmin.VmpsAdminSMFGenerateTable.apply(VmpsAdminSMFGenerateTable.java:274)

at com.cisco.nm.ani.server.framework.EvalSMFApply.apply(EvalSMFApply.java:93)

at com.cisco.nm.ani.server.frontend.EvalGroup.executeMajorNode(EvalGroup.java:578)

at com.cisco.nm.ani.server.frontend.EvalGroup.apply(EvalGroup.java:656)

at com.cisco.nm.ani.server.frontend.EvalGroup.apply(EvalGroup.java:612)

at com.cisco.nm.ani.server.frontend.AniTimeBaseImpl.performScheduledFunction(AniTimeBaseImpl.java:118)

at com.cisco.nm.ani.server.vmpsadmin.UTMajorDiscovery.performScheduledFunction(UTMajorDiscovery.java:286)

at com.cisco.nm.ani.server.vmpsadmin.UTMajorDiscovery.startUp(UTMajorDiscovery.java:168)

at com.cisco.nm.ani.server.vmpsadmin.UTMajorDiscovery.main(UTMajorDiscovery.java:37)

Rogue MAC properties loaded from file: d:\PROGRA~1\CSCOpx\campus\etc\cwsi\MACDetection.properties

EnableRogueMAC property value: false

RogueMAC Notification Email-ID:

EnableNewMAC property value: false

UnQualifiedMAC property value: false

[...]

In April I checked the database and found some rows with an empty timestamp field. I deleted them and the above message was gone, but that is why I thought there could still be an issue with the ANI database.
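To give an idea of the kind of check I ran back then, here is a sketch; the table name, timestamp column, JDBC host/port and credentials are all placeholders and not the real ANI schema:

import java.sql.*;

// Rough sketch only: find rows whose timestamp column is NULL (or, if the
// column is character-typed, an empty string). Table name, column name,
// JDBC URL, port and credentials are placeholders -- substitute the table
// the VmpsMajorDBUtil error actually refers to.
public class EmptyTimestampCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("com.sybase.jdbc2.jdbc.SybDriver");   // jConnect driver seen in the stack trace
        try (Connection c = DriverManager.getConnection(
                 "jdbc:sybase:Tds:localhost:43443/ani", "dba", "password");   // placeholders
             Statement s = c.createStatement();
             // add "OR lastSeen = ''" only if the column is stored as a character type
             ResultSet r = s.executeQuery("SELECT * FROM UT_HOST_TABLE WHERE lastSeen IS NULL")) {
            while (r.next()) {
                System.out.println("suspect row, first column = " + r.getString(1));
            }
        }
    }
}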

I am a little worried about enabling UTManager debugging. The last time I told the customer to do so, utm.log grew immensely: several MB per minute, reaching around 1 GB within a short period...

Without current data, it's impossible to say what the problem is. From these logs, there appear to be issues with wrong SNMP credentials or other device access problems. I see nothing here that points to database corruption except the message above: if the last-seen time is empty for a row, then this error will occur.

You should probably have your customer open a TAC service request so live troubleshooting can be done.

On Monday I will be on site for a short period of time - I will try and get the current data.

I am also thinking about simply updating to CM 5.1.3 to see if the problem disappears. Is there any reason I should not update in this situation? To work around the known impact on discovery caused by CSCsz99839, I could save the config files for ad-hoc discovery in case I need to restore them after applying CM 5.1.3, and the additional discovery jobs can simply be deleted (yesterday I checked a Solaris system that got 4 discovery jobs added by CM 5.1.3; at least the job IDs were in a row and carried the timestamp of when CM 5.1.3 was installed).

No, upgrading to CM 5.1.3 should be fine. Of course, without knowing the true root cause, I can't be certain something else won't break.

You might also try deleting NMSROOT/campus/etc/cwsi/DeviceDiscovery.properties to see if that prevents the migration.

Installing CM 5.1.3 helped insofar as I could start a User Tracking acquisition manually once, by clicking 'Start User Tracking Acquisition' on the Campus Manager homepage. However, the _scheduled_ UT acquisition job did not run; I suspected problems with jrm and deleted and recreated the job, but still no success. Now starting it from the Campus Manager homepage does not work either.

When I enable UTManager logging at TRACE level, it promptly fills up utm.log; I get a 6 GB utm.log within a couple of minutes (and without even starting a UT process...).

What can prevent UT from starting? Are there any database checks done before the UT process starts? I can't rule out issues with data in the UT tables, but I hesitate to flush these tables because the customer currently needs them to plan some network changes.

By the way, I did rename DeviceDiscovery.properties, but this did not have an impact; the default jobs were still created.

ANIServer is responsible for starting UT acquisition. Jrm should not come into play here. The relevant logs are UTMajorAcquisition.log, ut.log, and ani.log. As for the UTM problem, the errors I've seen thus far point to an SNMP issue polling dot1x information. However, without a sniffer trace of the SNMP traffic I cannot say for certain what part of the SNMP response is bad.

I did not find anything obvious in UTMajorAcquisition.log and ut.log, but I have not yet had a look at ani.log. I will do so when I can get access to the server!

I remember having seen some information about polling dot1x data but cannot recall where; I know I stumbled across these messages somewhere. If it turns out to be an issue with dot1x information, does that point towards an IOS problem or an LMS problem?

Hard to say at this point. That's why a sniffer trace would be helpful. If the device is replying with correct data, then the problem is with UTM.

with "sa_conn_info" I found out that 8 rows of EmbuUserTracking had a deadlock caused by an AniMain process. Is there any know issue with this?

I believe that a minor acquisition caused this issue.

No, there are no known database deadlock issues with UT. Are you sure there was really a deadlock, and not just normal row locks? If the database had really deadlocked, there should be some hanging processes.

I am not familiar with this kind of database state and used the term "deadlock" based on the explanation at the following link:

http://www.ianywhere.com/developer/product_manuals/sqlanywhere/1000/en/html/dbugen10/ug-trlcon.html

Perhaps it was simple row locking, but after these rows got locked, no further updates to any data in EmbuUserTracking occurred. The ANI database itself was still working, as Data Collection is not affected by the current problem.
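For reference, here is a sketch of how the connection list can be dumped through JDBC; host, port, database name and credentials are placeholders, and since the columns returned by sa_conn_info vary between SQL Anywhere versions, everything is printed generically:

import java.sql.*;

// Sketch: list the ANI database connections via SQL Anywhere's sa_conn_info()
// system procedure to see which connection is blocked on which. Host, port,
// database name and credentials are placeholders.
public class ConnInfoDump {
    public static void main(String[] args) throws Exception {
        Class.forName("com.sybase.jdbc2.jdbc.SybDriver");
        try (Connection c = DriverManager.getConnection(
                 "jdbc:sybase:Tds:localhost:43443/ani", "dba", "password");   // placeholders
             Statement s = c.createStatement();
             ResultSet r = s.executeQuery("SELECT * FROM sa_conn_info()")) {
            ResultSetMetaData md = r.getMetaData();
            while (r.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.append(md.getColumnName(i)).append('=').append(r.getString(i)).append("  ");
                }
                System.out.println(row);   // the blocked-on style columns show who is waiting on whom
            }
        }
    }
}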

Perhaps I will go ahead and flush the User Tracking data.

ANIServer doesn't write to the EmbuUserTracking table; UTMajorAcquisition and UTManager do that. It is possible you have found some kind of deadlock, but I would expect to see either UTMajorAcquisition hanging or one of UTManager's threads stuck waiting for a query to finish. You might want to open a TAC service request with the data you've collected so further analysis can be done.

A short update on this: I have not flushed EmbuUserTracking yet, but I did get UTMajorAcquisition running (for now...).

This is what I found:

- I found 8 connections to the ANI database (all from June 14th), each blocking a row in EmbuUserTracking; 7 of these seem to be waiting for one to finish or proceed

- each of these connections could be mapped to MACUHIC

- stopping MACUHIC with pdterm released the row locks

- immediately after stopping MACUHIC, UTMajorAcquisition started to run without any manual action (i.e. without clicking any item or button in LMS that would start User Tracking); I believe the scheduled UTMajorAcquisition had simply been waiting in the background for the EmbuUserTracking table to be released

- 5 other connections seem to have been idle since June 14th; they could be mapped to a Java process like UTCtrlMgr or something similar that had utlite.properties as its config file (...I currently have no access to the exact data...)

- 4 or 5 rows had a negative dbId, which I hope is not a problem but is strange; however, the LastSeen field did get updated today

- I also found a few entries with dbId > 4,500,000,000,000,000,000; dbId is defined as 'bigint', so half the available range is already used up. dbId does not seem to increment by 1 but rather by about 30,000. I hope a bigint of that size is not a problem when handled by Java (see the quick check right below this list)... anyway, it is strange in a network with around 7,000 end hosts
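Regarding the bigint worry, a quick sanity check in plain Java (nothing LMS-specific, just the number ranges) shows that a Java long covers the full bigint range, so the size of these dbId values alone should not be a problem:

// Quick sanity check: a dbId around 4,500,000,000,000,000,000 still fits
// comfortably in a signed 64-bit value, which is what SQL 'bigint' and
// Java's long both are.
public class DbIdRangeCheck {
    public static void main(String[] args) {
        long observedDbId = 4500000000000000000L;                            // roughly the largest dbId seen
        System.out.println("long max value : " + Long.MAX_VALUE);            // 9223372036854775807
        System.out.println("observed dbId  : " + observedDbId);
        System.out.println("remaining room : " + (Long.MAX_VALUE - observedDbId));
    }
}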

I think I could provide some more details tomorrow....
