According to the CWLMS Deployment Guide 3.0:
"Device Fault Manager provides the ability to monitor device faults in real-time and determine the root cause by correlating device-level fault conditions."
Can anyone explain in a little more detail exactly how this works, and how the "root cause" event and its "symptom events" are displayed in DFM?
Does it use the same "Codebook Correlation Technology" that is used in EMC Smarts?
No, DFM 2.0+ does not use the EMC Smarts codebook. The EMC Smarts layer provides the raw events, and the Cisco layer rolls those up into an alert. The alert provides a high-level summary of what is wrong with the device (e.g. a Utilization problem). The events show you the atomic details (e.g. the CPU is 99% utilized). However, it really doesn't provide you with root cause data. That is, you won't be able to see that it was Virtual Exec taking up the CPU.
Can DFM work out that a number of devices (e.g. other switches) are not available because a port in a switch higher up the chain has failed?
Is it possible for DFM (or another Cisco product) to learn about this type of dependency?...
Or could, say, a "service" be defined, with dependent component relationships, so that DFM could tell the observer which services were affected by the failure of a component?
DFM is a device fault management tool, not a network fault management tool. It will not be able to correlate a device failure into a network segment failure.
Nothing in the CiscoWorks space does this kind of network fault correlation. I do not know of an existing Cisco product that does it for general networking.
Interesting reply ...
So does this mean that, as far as network analysis is concerned, another product (like EMC Smarts) would be required if we needed to automatically identify the root cause when multiple devices are affected?
On the device level, if multiple events are received, how can DFM identify which one is the "cause" of the problem? (e.g. when multiple thresholds are exceeded).
2. DFM doesn't really. DFM simply shows you an alert which is composed of one or more events. The most recent event which triggered the alert is used to identify the type of alert (e.g. Utilization, Interface, Reachability, etc.). The best way to understand this is with an example. See the attached screenshots.
I have a switch, 220.127.116.11, which has an Environment alert. The events in that alert are a fan problem, plus two unresponsive IP addresses. Because the fan problem is the most recent, the alert is categorized as an Environment problem.
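To make the rule concrete, here is a minimal Python sketch of "the most recent event decides the alert type". It only models the behavior described above and is not DFM code: the `Event` class, the `alert_category` function, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    category: str        # e.g. "Environment", "Reachability" (hypothetical model)
    description: str
    timestamp: datetime  # when the event was raised


def alert_category(events):
    """Return the category of the most recent event in the alert."""
    if not events:
        raise ValueError("an alert needs at least one event")
    return max(events, key=lambda e: e.timestamp).category


# Made-up timestamps mirroring the example: two unresponsive IP
# addresses first, then the fan problem as the latest event.
events = [
    Event("Reachability", "Unresponsive IP address (first)", datetime(2009, 1, 1, 10, 0)),
    Event("Reachability", "Unresponsive IP address (second)", datetime(2009, 1, 1, 10, 5)),
    Event("Environment", "Fan problem", datetime(2009, 1, 1, 10, 20)),
]
print(alert_category(events))  # prints "Environment"
```

Note that nothing here ranks one event as the cause of another; the label simply follows the newest event.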
I found your very interesting discussion about DFM and root cause, and I hope that continuing it will help me with a customer who is suffering from DFM notification spam.
First of all, the notification spam is not caused by badly configured polling or thresholds. At the filter and threshold level there is absolutely no problem; it is the best-tuned DFM I have ever seen. The notification spam is caused by two important features that DFM 2 and later do not provide:
1. missing RootCause at Network Level:
a) Are there any known solutions or setups that can be placed in the notification chain after DFM, e.g. SEC (Simple Event Correlator) or Nagios, so that DFM alerts can also be used for network outages without flooding BlackBerrys with even more spam in these serious situations?
b) Did Smarts DFM v1 really do this network-level root cause with the codebook only, without knowing the topology? There was no interface from CM to DFM to get dependency information from the topology?
2. Deduplication: by this I mean suppressing identical, subsequent alerts within a defined time window when alerts arrive at high frequency, for instance in flapping situations. I haven't found deduplication in DFM so far. Can DFM do deduplication, and if so, where do I find it? If not, as with 1), are there solutions that can be placed in the notification chain after DFM?
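As a rough illustration of what such a deduplication step after DFM could look like, here is a small Python sketch. It assumes each notification can be reduced to a device name, an alert type, and a timestamp in seconds; the `Deduplicator` class and its fields are hypothetical and not part of DFM, SEC, or Nagios.

```python
class Deduplicator:
    """Suppress repeats of the same (device, alert type) inside a time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        # (device, alert_type) -> time of the last notification that was forwarded
        self.last_forwarded = {}

    def should_forward(self, device, alert_type, now):
        """Return True if the notification should be passed on, False if suppressed."""
        key = (device, alert_type)
        last = self.last_forwarded.get(key)
        if last is not None and now - last < self.window:
            return False  # same alert seen recently: suppress
        self.last_forwarded[key] = now
        return True


# A flapping interface raising the same alert four times (times in seconds).
dedup = Deduplicator(window_seconds=300)
for t in (0, 60, 200, 400):
    print(t, dedup.should_forward("switch-a", "Interface", t))
```

With a 300-second window, the notifications at 0 and 400 seconds are forwarded and the ones at 60 and 200 seconds are suppressed. The window is measured from the last forwarded notification; measuring from the last one seen would keep a continuously flapping device silent, which may or may not be what you want.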
just a comment...
DFM ... it has its own story .....
DFM 1.2 was really good (at least compared to DFM 2.0+). It had a GUI where you could see correlated alarms in different colors in a split window, along with a good classification of each alarm. DFM 2.0+ is an awful, castrated version of the original. If you look at the code, DFM has its own mechanism to discover the network and learn the topology. It does not need any other application for this, but it is not used in LMS. It has mechanisms to correlate events, but they are not implemented.
I would like to know who made the decision to code something like this (and I would really like to discuss it...):
1) try to reach EVERY IP address a device has with ICMP (BY DEFAULT !!), so that you need quite some time to implement a script to remove this again
2) REMOVE the feature to start a script based on an event and just allow email and trap forwarding
[if anybody has succeeded in registering the script notifier again, it would be great to share this information...]
3) show up these very special alarmIDs in the GUI and provide them on the device center GUI (Device Alert Identifier) without making it a link to the "Alerts and Activities Detail" Page
4) do not provide the possibility to add a specific trap as a customized event
5) do not show detailed information in "Detailed Device View" if a device is in an unknown state - this could be interesting for troubleshooting..
... and there are a few more ...
I also know both Smarts DFM v1.x and Cisco DFM 2 and 3.x, and I share your opinion that v2/3 is anything but an improvement. Too bad about the development expenses :)
But I'm looking for solutions and workarounds to live with the DFM that actually exists, currently v3.2. Honestly, I don't count on a usable improvement in DFM functionality before LMS-4.
I was thinking about two options for extending DFM notifications:
1. SEC: you can achieve deduplication, but not root cause
2. Nagios: you could achieve deduplication with Nagios flap detection, and root cause if you can build the topology into Nagios as strictly hierarchical parent-child relationships between devices. Host names must match those in LMS => DFM/CM exactly, including case. To get the topology you will need an export from CM (format: device, plus only those neighbors that are higher in the hierarchy than the device itself, e.g. based on hop count from the central network device) and a Perl script that converts this into basic Nagios configs containing the parent-child relations and a passive service to catch the DFM notifications via SNMP trap or syslog. If you suppress u (for Unknown state) in the Nagios notifications, you should get only the network outages as Nagios sees them, which would be similar to a network-level root cause.
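The conversion step could look roughly like the sketch below. It is written in Python rather than Perl, and it makes several assumptions: the CM export is already available as plain (device, neighbor) pairs, which is not a real CM export format; the hierarchy is derived from hop count to one central device; and a host template named `generic-host` exists and supplies the remaining host directives. The function names are hypothetical, and the passive service for DFM notifications is left out.

```python
from collections import deque


def hop_counts(links, root):
    """Breadth-first search from the central device; returns (hops, neighbors)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors.get(node, ())):
            if peer not in hops:
                hops[peer] = hops[node] + 1
                queue.append(peer)
    return hops, neighbors


def nagios_hosts(links, root, template="generic-host"):
    """Emit Nagios host definitions whose parents are the neighbors closer to root."""
    hops, neighbors = hop_counts(links, root)
    blocks = []
    for host in sorted(hops, key=lambda h: (hops[h], h)):
        parents = sorted(p for p in neighbors.get(host, ()) if hops[p] < hops[host])
        lines = [
            "define host {",
            f"    use        {template}",  # assumed template name
            f"    host_name  {host}",      # case is preserved as given
        ]
        if parents:
            lines.append(f"    parents    {','.join(parents)}")
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical topology: one core, two distribution switches, one access switch.
links = [("core1", "dist1"), ("core1", "dist2"), ("dist1", "acc1"), ("dist2", "acc1")]
print(nagios_hosts(links, "core1"))
```

Devices that cannot be reached from the central device get no host definition in this sketch, and neighbors at the same hop count are not treated as parents, which keeps the hierarchy strict.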
So what we need from LMS is a script-based method for exporting the topology from CM. What possibilities are there, and which fields are most useful for the topology?