LMS 3.2 DFM Alerts don't match up with real life

Answered Question
Nov 24th, 2009

We get a lot of DFM environmental alerts like this:



EventID: 000H769

Property                      Value
----------------------------  -----------------------------------------------
Component                     TEMP-switch1/7030 [Te2/1/2 Module Temperature Sensor-TenGigabitEthernet2/1/2 Module Temperature Sensor]
ComponentClass                TemperatureSensor
ComponentEventCode            1079
Status                        OK
entSensorValue                280
CurrentValue                  280.0 DEGC
RelativeTemperatureThreshold  10.0 %
HighThreshold                 45.0 DEGC



But in real life the values are like this:


            Temperature  Voltage  Current   Tx Power  Rx Power
Port        (Celsius)    (Volts)  (mA)      (dBm)     (dBm)
----------  -----------  -------  --------  --------  --------
Te2/1/2     28.0         0.00     7.9 --    -2.2      -2.8




The thresholds for the interface are these:



                         High Alarm  High Warn  Low Warn   Low Alarm
            Temperature  Threshold   Threshold  Threshold  Threshold
Port        (Celsius)    (Celsius)   (Celsius)  (Celsius)  (Celsius)
----------  -----------  ----------  ---------  ---------  ---------
Te2/1/2     28.0         74.0        70.0       0.0        -4.0



So nothing DFM tells us about the TenGig interfaces is actually true, and as a result we get a bunch of false alerts.
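One thing the numbers above do show: the DFM value (280) is exactly ten times the value on the switch (28.0). A minimal Python sketch of that arithmetic follows. It assumes, and this is only an assumption that nothing in this thread confirms, that 280 is a raw reading in tenths of a degree being compared to the threshold without scaling. The function names are made up for illustration and are not part of DFM or any Cisco tool.

```python
from decimal import Decimal


def scale_reading(raw: int, precision: int) -> Decimal:
    """Shift a raw integer reading right by `precision` decimal places."""
    return Decimal(raw).scaleb(-precision)


def is_out_of_range(value: Decimal, high_threshold: Decimal) -> bool:
    """True when the value is above the high threshold."""
    return value > high_threshold


raw_value = 280                    # entSensorValue from the DFM alert
high_threshold = Decimal("45.0")   # HighThreshold from the DFM alert

# Assumption: one implied decimal place (tenths of a degree).
scaled = scale_reading(raw_value, precision=1)

print("scaled reading:", scaled)
print("raw value trips threshold:", is_out_of_range(Decimal(raw_value), high_threshold))
print("scaled value trips threshold:", is_out_of_range(scaled, high_threshold))
```

Under that assumption the scaled reading matches what the switch reports and stays well under the 45.0 DEGC threshold, while the unscaled 280 does not.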


Is this a bug or a settings mismatch somewhere? Please help.

Correct Answer
Joe Clarke Tue, 11/24/2009 - 09:29
User Badges: Cisco Employee, Hall of Fame, Founding Member

It looks like this is a VSS.  In that case, I think you're seeing CSCta08882, which will require you to exclude the problematic entities from your SNMP view.


orsonjoon Tue, 11/24/2009 - 22:58

Yes, you're right, it's a VSS. So will this bug be fixed in an update soon?

Joe Clarke Wed, 11/25/2009 - 10:02

It's still waiting on a fix from EMC.  An ETA is currently not available.

orsonjoon Wed, 11/25/2009 - 22:54

EMC? Do you mean the storage supplier, or something else? And if you do, what do they have to do with this?

Joe Clarke Wed, 11/25/2009 - 23:59

Yes, EMC the storage company.  They acquired Smarts, which writes the backend device management and fault engine for DFM.  The problem is with their engine, and we are awaiting a fix from them.  As of now, a fix is slated to be in DFM 4.0, due out next summer.

Joe Clarke Thu, 11/26/2009 - 00:03

And I pasted the wrong bug before.  There are actually two very similar VSS bugs.  The one concerning temperature problems is CSCta18610.  The fix is the same in that EMC will need to provide it, but there is a slightly different workaround.  The easiest solution is to unmanage the problematic sensor in DFM.  However, in some cases, the temperature is high, but not a problem for the device.  In that case, there is a more tactical workaround which can be done in DFM.  I don't think this applies to you, though, because DFM is seeing a value of 280 C.

orsonjoon Thu, 11/26/2009 - 01:00

Hi Joe, thanks for clearing this up ;). So what you are actually saying is that I have to disable the temperature sensor element for each port?

That is way too time-consuming to do manually, because we are talking about thousands of sensors.

Of course we can use the bulk manage/unmanage method for this, but on the other hand we would like to receive real environmental messages about the VSS hardware.


DFM.jpg

Another thing is the quality of the email messages we receive from DFM: it's almost impossible to link each port to the sensor element in DFM.


For example this is the email notification:


EVENT ID = 000H760

ALERT ID = 00005GO

TIME = Tue 24-Nov-2009 13:33:55 CET

STATUS = Active

SEVERITY = Critical

MANAGED OBJECT = switch1

MANAGED OBJECT TYPE = Switches and Hubs

EVENT DESCRIPTION = OutOfRange::Component=TEMP-switch1/6051 [Te2/5/4 Module Temperature Sensor-TenGigabitEthernet2/5/4 Module Temperature Sensor];ComponentClass=TemperatureSensor;ComponentEventCode=1079;Status=OK;entSensorValue=280;CurrentValue=280.0

CUSTOMER IDENTIFICATION = All devices

CUSTOMER REVISION = 1


Here 6051 is the element name linked to the specific port TenGigabitEthernet2/5/4. In my view it's just not logical to use different names for basically the same thing.
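For what it's worth, the port and the element number can be pulled out of the EVENT DESCRIPTION line mechanically. Below is a minimal Python sketch, assuming every notification follows the layout of the sample above (an event name, then `::`, then `key=value` pairs separated by `;`, with the port name as the first word inside the square brackets). The function name and the extra `SensorId` and `Port` keys are made up for illustration; this is not a DFM feature.

```python
import re

SAMPLE = (
    "OutOfRange::Component=TEMP-switch1/6051 "
    "[Te2/5/4 Module Temperature Sensor-TenGigabitEthernet2/5/4 "
    "Module Temperature Sensor];ComponentClass=TemperatureSensor;"
    "ComponentEventCode=1079;Status=OK;entSensorValue=280;CurrentValue=280.0"
)


def parse_event_description(description: str) -> dict:
    """Split an EVENT DESCRIPTION string into a dict of its fields."""
    event_name, _, body = description.partition("::")
    fields = {"Event": event_name}

    # Each remaining item looks like "key=value".
    for item in body.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key.strip()] = value.strip()

    component = fields.get("Component", "")

    # Element number after the slash, e.g. "TEMP-switch1/6051 [" -> "6051".
    sensor = re.search(r"^\S+/(\d+)\s", component)
    if sensor:
        fields["SensorId"] = sensor.group(1)

    # First word inside the brackets is the short port name, e.g. "Te2/5/4".
    port = re.search(r"\[(\S+) ", component)
    if port:
        fields["Port"] = port.group(1)

    return fields


parsed = parse_event_description(SAMPLE)
print(parsed["Event"], parsed["SensorId"], parsed["Port"], parsed["CurrentValue"])
```

Run against saved notification mails, something like this would at least give a sensor-to-port list to feed into a bulk unmanage.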


The email message does not make it crystal clear at a glance what exactly the problem is, and that goes not only for this specific issue but for all the email alerts we get from DFM.

You always have to put a lot of time and effort into working out what the problem is and what could have caused it.

I wish we could actually save time using LMS, not put a lot of needless time into it.


Is there another way to clear up this problem and the millions of false email notification messages from DFM (patch or update)?

Joe Clarke Thu, 11/26/2009 - 08:50

Unfortunately not.  This issue is not yet resolved, and the only workaround in your case is to unmanage each bogus sensor.
