The Device Discovery Process doesn't end properly anymore. When I try to stop the process manually (Device Discovery Summary or CLI), the following error message is displayed:
"Exception in thread "main" com.cisco.nm.csdiscovery.CSDiscoveryException: Excpetion while unpublishing discovery urn. CTMRegistryClient::deleteURNEntry() : "CSDiscovery" not present
I can only stop the process with the Job Browser but since the process was canceled, new devices are not discovered.
We are using LMS 3.1.0, CS 3.2.0 on Solaris.
Run the command:
That should stop the current Discovery, and allow you to start another one from the GUI.
The command stopped the process. But when I start Discovery again from the GUI, the process hangs after discovering a quarter of our devices, without any error message in the logs. After pushing the stop button, the message "Unable to stop the running the discovery instance" is displayed. Details from CSDiscovery.log are in the attached file. Only pdterm stops the process.
It looks like Discovery may be crashing at some point. If you enable Discovery Framework debugging under Common Services > Device and Credentials > Device Discovery > Discovery Logging Configuration, then re-run Discovery and reproduce the problem, ngdiscovery.log should contain some additional errors.
Good hint. Previously I had only enabled debugging under Server -> Admin, and I was wondering why there was so little output.
Ok, I did as you said, and I also ran a device update. After that, the Discovery process no longer hung, but it took 4 1/2 hours to discover 538 reachable and 162 unreachable devices, and more than 530 devices were updated in the DCR. I've started the process twice and got the same behaviour.
I'm still wondering, because before the problem appeared the process took less than 1 hour and, if I remember correctly, it didn't update each device in the DCR.
Well, I'll have a closer look at the logs.
What are your configured SNMP timeouts and retries? It is a common mistake to set these to high values without realizing how the code works. The time taken grows exponentially when a failing device is encountered. Take, for example, a configuration of 2 retries with a 10 second timeout. When an unreachable device is encountered, the first attempt will wait 10 seconds before timing out. The second attempt will wait 20 seconds, and the third attempt will wait 40 seconds. So, for one unreachable device, Discovery has waited 70 seconds. If you multiply this by 162, you get an extra 3.15 hours spent doing nothing.
Because of that, I recommend configuring no more than 1 retry with a 6 second timeout. For 162 unreachable devices, this would still add 48 extra minutes. This also demonstrates how important it is to fix these unreachable devices (or filter them out).
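The arithmetic above can be sketched as a quick back-of-the-envelope calculation (a minimal sketch; the timeout doubling on each retry follows the description above, and the function name is mine, not part of LMS):

```python
def discovery_wait_seconds(timeout_s: int, retries: int, unreachable: int) -> int:
    """Seconds Discovery spends waiting on unreachable devices,
    assuming each attempt doubles the previous wait and there are
    retries + 1 attempts per device in total."""
    per_device = sum(timeout_s * 2 ** attempt for attempt in range(retries + 1))
    return per_device * unreachable

# 2 retries, 10 s timeout: 10 + 20 + 40 = 70 s per unreachable device
print(discovery_wait_seconds(10, 2, 162) / 3600)  # ~3.15 hours for 162 devices
# 1 retry, 6 s timeout: 6 + 12 = 18 s per unreachable device
print(discovery_wait_seconds(6, 1, 162) / 60)     # ~48.6 minutes
```

This makes it easy to see why trimming the timeout and retry count (or filtering out dead devices) shortens a Discovery run so dramatically.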
Also consider the debugging itself.
I know that, with debugging enabled, Discovery can stretch from hours to an eternity. I do not know whether this happens when you enable a certain module for debugging, or only when you enable all (or at least many) modules.
If the SNMP settings are as jclarke mentioned, then just disable debugging and give it a new try.
Hello! Thanks for your comments, jclarke and mermel.
SNMP timeouts and retries had default values. More than 100 of the unreachable devices are small non-Cisco switches with CDP capability. After filtering them out, the process took 3 hours instead of 4 1/2.
I disabled debugging and started the process again and, here we go again: the process hung and I was unable to stop it from the GUI.
And again: with debugging enabled, the process finished after several hours; after disabling debugging, the process hung. Funny ...
As the process always hangs after the same number of discovered devices, I'll try filtering out several devices. Maybe one of them is misconfigured.
When the process hangs, it is possible to get a full thread dump which should reveal where it is blocked. This procedure is not straightforward, so if you can't track down a bad device, you should open a TAC service request and have them walk you through the steps.
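For a HotSpot JVM on Solaris/Unix, one common way to trigger such a thread dump is to send the Java process a SIGQUIT: the JVM catches the signal, prints the stack of every thread to its stdout (often redirected into the daemon's log file), and keeps running. A minimal sketch, assuming you have already located the Discovery JVM's pid (e.g. with ps); the helper name is mine:

```python
import os
import signal

def request_thread_dump(jvm_pid: int) -> None:
    """Ask a HotSpot JVM for a full thread dump.

    A JVM does not die on SIGQUIT: it dumps all thread stacks to
    stdout and continues running. The dump typically ends up in
    whatever log file the daemon's stdout is redirected to.
    """
    os.kill(jvm_pid, signal.SIGQUIT)
```

Be careful to verify the pid first: a non-Java process receiving SIGQUIT is simply terminated.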
We experienced the same problem but fixed it by moving the seed devices from the global options to the CDP module in the discovery settings. The CDP module was empty (no seed devices), but the checkbox for using the DCR as the seed list doesn't seem to work.
After some tests, we found that with Discovery Framework debugging enabled, Device Discovery runs successfully (545 elements found). But when debugging is disabled (as it should be), Discovery stops at 23 devices and "hangs": it still runs and cannot be stopped.
Have you opened a TAC service request yet? As I said, the full thread dump would be extremely useful in narrowing down why Discovery is locking up.
Indeed, I haven't done it yet. In November I tuned the duration of the discovery process by excluding several IP ranges without Cisco devices. Then I was too busy to open a service request and postponed it. Finally I forgot, and debugging is still enabled.
Next week I'll be at the Cisco Networkers. After that I'll try to work it out with the TAC.
We were able to locate an IP range which seems to cause the problem. With these IPs excluded, the discovery process runs fine, with or without debugging; when the filter is not set, the process hangs. The IP range contains exclusively small non-Cisco devices (Nexans).
While we have seen cases where Discovery will stall on devices which do not support SNMP, I find it odd that devices can cause Discovery to completely lock up. Do these devices support SNMP, or are they purely IP nodes?
That may be part of the problem. If they reply with an unknown response, a Discovery thread might crash leading to the hang.
While filtering is a good workaround, it would be helpful to try and fix this. Providing the thread dump and ngdiscovery.log to TAC will help get to the root cause.
OK, I'll open a TAC case on Monday. Weekend has just begun here in Germany and I can't open a case directly.
Have a nice weekend.
That's great for you, but would you mind sharing the solution? :-)
I still have one customer with the problem of a hanging CSDiscovery. TAC provided a patch for CSCsv42110 - while it solved the problem for one customer it did not for the other one...
The patch for CSCsv42110 didn't solve our problem either. The issue was resolved after several class files were updated. Since I don't know what was really changed, or whether it's applicable for everybody, I don't think it would be a good idea to share the files with you.
I just wanted to let you know that it works for us now after TAC intervention. It would be great if the analysis of our issue helped you too, but it's not up to me to decide. Good luck anyway!
Silvia, thank you for this information! Would it be OK for you to publish the SR number, so that I can give this info to the case owner and let him have a look into your case? He can then decide whether your solution could be of any help for my issue (in case there are any parallels).
Again, thanks a lot, Silvia!
The case owner decided that I should install these class files, and it worked! Today, Discovery finished for the first time in weeks, or rather months. While there are still some points we have to troubleshoot further, it is a big step forward!
Unfortunately, the class files fixed our problem only temporarily. The first ad-hoc discovery cycle finished without a problem, as did the first cycle of all 3 scheduled discovery jobs. But the next day, the discovery job hung again. So we are back to troubleshooting...
BTW, if it could be of interest for anybody, the SR is 609765723.
... if Cisco staff is reading this... ;-)
I provided some java thread dumps to the SR 609765723 which were collected every 10 mins over a period of 6 hours along with some log files (debugging enabled for the following modules: