Cisco Support Community

LMS 4.1 messing with fault counters

I'm wondering whether other people are seeing this and perhaps have had an adequate response from Cisco.

I regularly find that the numbers displayed don't match reality, and now I have the most stunning example so far.

I have a customer who doesn't use fault management, so the 'module' is disabled. I can't prevent LMS from starting the services and the processes, but since this is a pretty beefy server I don't mind that much.

After a week, the customer asked me why there was a positive number in the alarm counters and whether I would be so kind as to disable fault management properly this time. So I checked the management functions again:

[Screenshot: fault management off]

Then I wondered whether there were still devices in the RPS files that somehow had not been deleted.

[Screenshot: really off]

The counter then showed 13 informational alarms. I deleted the .rps files and did a dbrestoreorig on the DFM databases. Problem solved... well, for about 2 weeks.

Today I checked again and found 2 critical alarms:

[Screenshot: fault counter]

How does LMS do this? Should I try to disable fault management services to prevent this?

Cheers,

Michel

6 REPLIES

LMS 4.1 messing with fault counters

I haven't seen this before, but I was wondering whether "Fault Event Forensics" is enabled. If so, disable it:

Admin > Collection Settings > Fault > Fault Event Forensics Configuration

LMS 4.1 messing with fault counters

Good thinking, Martin.

That is on by default.

I never saw this either when I tried LMS in my lab :-). But now I see bad counters all over the place.

And I have customers that tell me it doesn't match up with reality.

I suspect DFM is still listening for traps and converting them into errors/alerts.

Not sure what to do about it.

Cheers,

Michel


LMS 4.1 messing with fault counters

Add me to the list of those confused by the counters in the tray not matching the Fault window.

I had checked several things to no avail. I had a TAC case regarding a related issue once and the TAC engineer was unable to resolve it satisfactorily when I brought it up with him. Under the covers there's still a lot of the SMARTS technology that's generally not well understood even among the engineers within Cisco.

I've written it off as a broken feature.

LMS 4.1 messing with fault counters

To completely starve Fault Management (or DFM) of traps, you could change port 162 in

    NMSROOT\objects\smarts\conf\trapd\trapd.conf

to any free port, such as 30162, and restart dmgtd.
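If you want to confirm that a candidate port really is free before putting it in trapd.conf, a quick check can help. The snippet below is only an illustrative Python sketch, not part of LMS or SMARTS; the function name `udp_port_is_free` is hypothetical. It simply tries to bind a throwaway UDP socket to the port, so it is a point-in-time check: something could still grab the port later.

```python
import socket


def udp_port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if we can currently bind a UDP socket to (host, port).

    If the bind fails (address already in use, or no permission to bind a
    privileged port such as 162), the port is reported as not free.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    # The probe socket is closed on leaving the with-block, releasing the port.
    return True


if __name__ == "__main__":
    # 30162 is just the example port from this thread; try any candidate.
    candidate = 30162
    print(candidate, "free" if udp_port_is_free(candidate) else "in use")
```

Run it on the LMS server itself, with the same privileges the daemons use, so the result reflects what trapd will see.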

trapd.conf forwards to localhost:9020, but since trapd will no longer receive any traps on the new port, it won't have anything to forward.
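To illustrate why moving the listen port starves DFM, here is a toy relay in Python. It is a hypothetical sketch of the listen-and-forward idea described above, not the actual SMARTS trapd, and the name `relay` is made up. It forwards whatever arrives on its listening socket to a target such as localhost:9020, so if no device ever sends to that socket, nothing reaches the downstream port.

```python
from __future__ import annotations

import socket


def relay(rx: socket.socket, target: tuple[str, int], max_packets: int = 1) -> int:
    """Forward up to `max_packets` datagrams received on `rx` to `target`.

    `rx` must already be bound (e.g. to the trap port) and should have a
    timeout set; if nothing arrives before it expires, the loop stops.
    Returns how many datagrams were forwarded.
    """
    forwarded = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        while forwarded < max_packets:
            try:
                data, _sender = rx.recvfrom(65535)
            except socket.timeout:
                break  # nothing arrived on the listening port: nothing to forward
            tx.sendto(data, target)
            forwarded += 1
    return forwarded
```

For example, binding `rx` to port 30162 with `rx.settimeout(5)` and calling `relay(rx, ("127.0.0.1", 9020))` returns 0 unless something actually sends to 30162, and port 9020 never sees a packet.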

(As you probably know, they doubled all DFM processes to share the load, so there is also a trapd1.conf.)

If alarms still show up after that, things are starting to get mysterious...

LMS 4.1 messing with fault counters

Thanks Martin,

I changed the port it listens on. We will see what happens.

Actually, trapd1.conf is the second listener, on port 9020.

That part is not, and cannot be, doubled: there is only one trap listener.

But it won't receive anything anymore.

Thanks Marvin,

It's good to know that I'm not the only one facing this sh!t.

TAC and engineering don't have a clue about DFM. They were quoting from manuals on the WebEx. I got the same manuals from my EMC-minded colleagues, but it is usually the connection between LMS and SMARTS that I have problems with and need to debug.

They also don't understand why I can't just delete all those devices, reconfigure them perfectly, and then rediscover the lot. ;-)

Yesterday I learned that if a device is already configured with Identity (dot1x), it will not be picked up as an Identity-capable device. I'm not sure I believe that, though. But I told the guy that if a Cisco VLAN configuration tool required all VLANs to be removed first, nobody would buy or use it.

LMS 4.1 messing with fault counters

There is hope on the horizon

This might be CSCtq07224, although Cisco thinks the problem only started in version 4.2.

TAC has a SQL procedure to fix this. Now I hope it applies to all the cases I'm seeing.

Cheers,

Michel
