DCR internal error in communication channel

I have reported this error before...and have a TAC case open for it...but have found a workaround that I wanted to share that might shed some light on the issue.

The URL of Common Services > Device Management when I get the error contains the FQDN.

If I modify this URL by removing the domain suffix and attempt the same change in the DCR, the change is successful.

Any ideas?

Joe Clarke Mon, 10/12/2009 - 12:31

Hostname problems commonly cause errors due to certificate mismatches. One should always access LMS using the same hostname as configured in the certificate.

OK...but the applications registered for remote servers show a FQ hostname in application registration status.

I assume this makes all links to that application point to the FQDN url as well. Which would cause the users trouble when going to CS on the remote server and thus getting this internal comm error because it used the FQDN url for the link.

I just realized that the application registration uses the FQDN when you register using the import from other servers option. If I were to unregister everything yet again and register them with the 'from template' option, they would no longer call the FQDN in the links. Is this accurate?

Why is the remote servers option there if the system doesn't mesh well with FQDN?

Correct Answer
Joe Clarke Tue, 10/13/2009 - 09:18

No, the register from remote server works just fine with short hostnames. I'm using it that way in my lab. Everything comes down to the SSL cert. If you create the certs with short hostnames, then you register servers with each other using short hostnames, and you access servers using the short hostnames, then everything will work.

Alternatively, if all of this is done with FQDN, then FQDN should work.

You need to decide how you want this all to work, then start from scratch. Regenerate the SSL certs on each server using the proper hostname. Remove all accepted peer server certs from each server, then reimport the new certs using the proper hostnames. Finally, re-register the applications from remote servers using the proper hostnames, re-setup DCR integration using the proper hostnames, etc. and everything should just work.
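The matching rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the name in the certificate is compared with the host in the URL as an exact, case-insensitive string; the function name and the hostnames are hypothetical, and the check LMS actually performs may differ.

```python
from urllib.parse import urlsplit


def url_host_matches_cert(url, cert_names):
    """Return True if the host in `url` equals one of the names in the cert.

    Assumption: comparison is an exact, case-insensitive string match, so a
    short hostname and its FQDN count as two different names.
    """
    host = urlsplit(url).hostname or ""
    return host.lower() in {name.lower() for name in cert_names}


# Hypothetical hostnames, for illustration only.
cert_names = ["lms1"]  # certificate generated with the short hostname

print(url_host_matches_cert("https://lms1/", cert_names))              # True
print(url_host_matches_cert("https://lms1.example.com/", cert_names))  # False
```

Under that assumption, a certificate generated with the short hostname only matches URLs that also use the short hostname, which is why the same choice has to be made everywhere.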

Ok so I tried 'starting from scratch' tonight...

Reverted both servers back to StandAlone mode for both DCR and SSO.

Deleted all peer server certificates.

Unregistered all applications.

Restarted the Daemon Managers.

Regenerated the certificates with the FQDN.

Modified the Homepage settings Server Name to reflect the FQDN.

Restarted the Daemon Managers.

Imported peer certificates using FQDN.

Changed DCR modes appropriately.

Changed SSO mode appropriately.

Restarted Daemon Managers.

Preparing to Register Applications:

On each server I chose import from remote server and wrote down what was detected as already registered on the remote servers.

DCR Master w/RME,DFM: Common Services, Setup Center, CiscoWorks Assistant, and Dev Diag Tools

DCR Slave w/CM,CV,IPM: Common Services, Setup Center, CiscoWorks Assistant, Dev Diag Tools, CM Setup Center

With that in mind I find it strange that CM Setup Center is showing up while IPM Setup Center does not. I have come to assume that you have to register CM, IPM, RME, and DFM but not Common Services. Is this accurate?

In any case...I proceeded to register the main components on their respective servers. The big question here is, after choosing Register From Templates, I am prompted to enter a server name, for which I assume we would stick to the 'plan' and input the FQDN. Is this accurate?

Having opted for the implied answer and inputting the FQDN, I successfully registered all the local applications and proceeded to import the apps from the remote servers using the FQDN.

Having followed your advice I still seem to have missed something for I have some strange things that occur now.

Most prominent is the Device Allocation Summary. On the DCR master it reports "Error In getting Installed Applications in DCR domain", and the slave only shows DFM and RME with all devices managed.

To me this suggests that one server has discrepancies somewhere. I just don't know where to start looking.

First thing I have done is unregister CM since it is one of the applications that isn't showing up correctly. I then re-registered it with the shortname and it shows up accurately in the device allocation summary.

Hence my utter confusion. Help!

Joe Clarke Tue, 10/13/2009 - 20:05

You shouldn't be using templates to register apps from remote servers. You should select the Remote Server option, enter the FQDN, and select the apps from the list. You are right that you should only import the main apps. These include RME, CM, CS, DFM, IPM, and CiscoView.

Where are you seeing this discrepancy in the auto allocation summary? A screenshot would be helpful. It still sounds like your application registration approach is wrong.

I tried doing some of the things I believe you want to do, and so far, I haven't encountered any problems.

I have no doubt that my application registration approach is incorrect. LOL...

What I haven't seen you mention is the process of registering local applications. Maybe this is where the problem lies. Do you have to do this?

For instance: DCR master houses CM and IPM and those applications aren't available to a remote server for import unless I first register them on the local server. Hence the confusion about do I register with the template option and the FQDN or what?

Correct Answer
Joe Clarke Wed, 10/14/2009 - 07:52

Local applications should never need to be registered. This happens at install time. If, however, you lose local applications (like Common Services) there is a command-line only procedure to get them all back. TAC needs to walk you through this, though.

I don't see how it's possible for applications which are installed on a server not to be available for remote registration. I certainly cannot reproduce that. You may very well have a problem with the CMIC registration database, and it might be a good idea to have TAC walk you through the procedure to dump and re-register all local applications on both servers.

Joe Clarke Wed, 10/14/2009 - 08:35

The hostnamechange script does modify the CMIC records, but I'm not sure it will fix all of your problems. It really sounds like some things are missing which should not be. It's certainly easier to give it a try first, though. However, the procedure to which I refer involves deleting the existing CMIC database, then re-registering the local templates from the command line.

Joe Clarke Wed, 10/14/2009 - 17:19

You can escalate the case as a severity 1 if you are available and can work on it now. That will queue it to the next engineer. Else, if you have your engineer contact me tomorrow, I can help them find the necessary procedure.

I need assistance with the command I was provided for re-registering the applications. There must be a typo...Can't get my engineer to return an e-mail or phone call. Can you advise?

I'm not sure if you want me to paste the command on here or not since it was so hard to come by, but here is the error I get:

Exception in thread "main" java.lang.NoClassDefFoundError: administration/1/0

The filename used was administration.1.0.xml and I was advised to remove the .xml for this command.

Joe Clarke Thu, 10/15/2009 - 12:33

Yes, and you're missing a piece. You forgot the actual class name to execute. The class name, which comes after the end of the classpath argument, and before the filename is:

com.cisco.nm.cmf.registry.CMICApplicationRegistry
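The error quoted above is consistent with how the `java` launcher reads its command line: the first argument that is not an option (or an option's value) is taken as the main class, and the error message shows that name with `/` in place of `.`. The Python below is a toy model of that behavior, not the launcher's real parsing; `CLASSPATH` is a placeholder because the actual classpath is not given in this thread, and any other arguments the real command may need are not shown.

```python
def main_class_token(args):
    """Toy model of how the `java` launcher picks the main class.

    Simplification: only -cp/-classpath are treated as options that take a
    value; any other token starting with '-' is a flag with no value.
    """
    i = 0
    while i < len(args):
        if args[i] in ("-cp", "-classpath"):
            i += 2  # skip the option and its value
        elif args[i].startswith("-"):
            i += 1  # skip a plain flag
        else:
            return args[i]  # first non-option token is the main class
    return None


def as_reported(class_name):
    """Show a class name the way the error message does ('/' for '.')."""
    return class_name.replace(".", "/")


# "CLASSPATH" is a placeholder; the real value is not given in this thread.
without_class = ["-cp", "CLASSPATH", "administration.1.0"]
with_class = ["-cp", "CLASSPATH",
              "com.cisco.nm.cmf.registry.CMICApplicationRegistry",
              "administration.1.0"]

print(as_reported(main_class_token(without_class)))  # administration/1/0
print(main_class_token(with_class))
```

With the class name missing, the file name itself lands in the main-class position, which matches the `administration/1/0` in the exception.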

I don't think we have resolved this issue.

It still occurs from time to time.

I am still not 100% sure about the dos and don'ts of hostname vs. FQDN with a multiserver setup.

I know you recommended using the shortname, but there are times when that doesn't make sense.

For example, let's say I have 2 servers (ciscoworks-cm.domain.net and ciscoworks-rme.domain.net).

I was advised to generate the certificates using the hostname only (ciscoworks-cm and ciscoworks-rme).

In the Homepage settings then I would have to put the short hostnames as well.

When I register applications from a remote server it asks for a server name and display name both of which I assume should be the short hostname.

Then when the apps are registered you see a hostname column for each app that is registered and it apparently reads the FQDN from somewhere and that is what is shown as the hostname for the remote apps. (I'd imagine this could be the md.properties file.)

You also have to provide the servername when you setup SSO and the DCR Master/Slave settings both of which rely on the imported certificates and therefore must match with the short hostname.

Somewhere though the server is told to use the FQDN for URLs and this throws things off when you have your certificate generated with the short hostname.

Out of the box several weeks ago this was the behavior, and it was apparently what caused the "internal error in communications channel" issues.

I then began a mission to get everything to reference the FQDN since I couldn't successfully get everything to use the shortname.

Plus it just doesn't seem acceptable to expect users to address the site by its short hostname only. This requires tedious finagling with each user's HOSTS file or DNS settings to make certain that it won't append an alternate domain suffix.

I have chased this goose entirely too long.

I even upgraded to LMS 3.2 hoping that would work out some of these kinks but the "internal communication channel" error still seems to rear its head as it pleases.

Joe Clarke Sun, 10/25/2009 - 13:55

The display name of the registered application can be anything you want. The hostname should be the short hostname.

I did some testing with FQDN vs. short hostname internally on my LMS 3.2 servers, and found things to work generally pretty well when using FQDN except when it comes to application/device mapping (PIDM) and Device Center. I have two machines registered with each other by FQDN, and so far I have not had any communications problem with DCR (though Device Center links use the short hostname of the peer server).

On top of that, the logs I have seen thus far don't point to any real root cause of these issues. There also doesn't appear to be any debugging which can be enabled to give more information. At the very least some code changes would be required to get more clues as to what is going on.

For this reason, you will need to work with TAC so patches can be provided to try and isolate what is going on when this error occurs.

Just to be certain that the issue wasn't due to a server domain suffix change after installation, I fully formatted and reinstalled LMS 3.2.

I have 2 licensed servers still, but have added a third HUM trial to the mix. So 2 slaves.

I am determined to get this working with FQDN, which may prove to be more trouble than it is worth. :)

All configurable references to the remote servers use the FQDN. The certs were generated with the FQDN and the Homepage Settings reflect the FQDN.

Still getting "internal error in communication channel". It actually seems more apparent that the slaves initially attach to the master but drop off soon after.

Browsing the attached logs brought me to this theory.

I also noticed that the DCR mode settings report:

Current DCR Settings

Mode: Slave

Master Hostname: [masterhostname.domain.com]

Port: 443

Master Certificate: Valid

Master Server is unreachable.

So I changed the DCR settings to call the short hostname only and that wasn't sufficient.

I still get "Certificate HostName [masterhostname.domain.com] and the URL Host Name [masterhostname] do not match"

Before Calling the astandalone to slave

--------------------

I obviously have to generate the certs with the shortname until this issue is addressed further...

Again, the problem I have with having the cert use the shortname is the browser complaints of the URLs not matching when our end users access the server by the FQDN. It doesn't seem plausible to expect users to open their browser and go to https://masterhostname instead of https://masterhostname.domain.com.

Any thoughts or additions?

Joe Clarke Tue, 10/27/2009 - 10:42

There is not enough information in these logs to determine why your DCRs are unable to sync up.

I recommend you open a TAC service request, and keep it open until this is working. I know it works as I'm currently running in such a configuration. I can only guess that something still has not been done right (or hostname resolution is not working correctly for FQDN).
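The hostname-resolution possibility mentioned above can be checked from each server with a short script. This is a minimal sketch using Python's standard `socket` module; the function name and the sample hostnames are hypothetical, and the resolver is passed in as a parameter so the demonstration below runs without touching DNS.

```python
import socket


def resolves_to_same_address(short_name, fqdn, resolve=socket.gethostbyname):
    """Return (same, short_ip, fqdn_ip); an IP is None if its lookup fails."""
    def lookup(name):
        try:
            return resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            return None

    short_ip, fqdn_ip = lookup(short_name), lookup(fqdn)
    same = short_ip is not None and short_ip == fqdn_ip
    return same, short_ip, fqdn_ip


# Offline demonstration with a fake resolver and made-up addresses.
FAKE_DNS = {"lms1": "10.0.0.5"}  # the FQDN is deliberately missing


def fake_resolve(name):
    if name not in FAKE_DNS:
        raise OSError("name not found")
    return FAKE_DNS[name]


print(resolves_to_same_address("lms1", "lms1.example.com", fake_resolve))
```

On a real server, calling `resolves_to_same_address("masterhostname", "masterhostname.domain.com")` with the default resolver would show whether both forms of the name resolve, and to the same address.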

As to your last point, the fact that Device Center will still use short hostnames even if everything else is using FQDN means that your users will still get prompted to accept the cert hostname mismatch (and authenticate again if using SSO).

Still seeing the internal error in communication channel error...ugh! ;)

I was hoping you could explain the purpose of the Home Page Server Name in the Home Page Settings under CS > Server > Home Page Admin.

In a multiserver environment should this be your appointed web server and thus match on all servers, or just each server's own local hostname or FQDN?

What is the Provider Group Name function...I assume it is one and the same, but does it affect anything I am seeing?

Right now all of my certs are configured with the shortname as advised, however this provider group name or home page server name is the FQDN of each server itself. Should I modify that?

Today when I received the internal communication error I was trying to update the device credentials on a few devices none of which were successful. All returned the error.

I then went to the browser address and modified it to just the shortname. Still saw the error.

I then decided to clear my browser cookies, history, and temp files and tried again. Same error with the FQDN url, but when I tried the shortname I was able to modify the credentials of every device in the db. ;)

Any insight here? Does this help? LOL

Joe Clarke Fri, 10/30/2009 - 17:31

The homepage name can be anything you want. You could call it "Cowboy Server" if you wanted. It's just a logical name to present to users (though there are some internal uses as well). There used to be some issues with making this something other than the hostname, but those should be fixed now.

No, this still doesn't explain why this error is occurring. Given the transient nature, and the fact that I cannot reproduce on two clusters, perhaps there is something wrong with the server itself (e.g. bad memory). Or, maybe there is some conflict with something else installed on this server. What services are currently running on the master?

I assume you mean just LMS services. So I have attached the pdshow.

The servers are all brand new servers with no other "obvious" apps on them. But you never know, I know.

I am starting to wonder if I ever see the error when accessing the DCR directly from the server. I will test that some.

I guess I didn't mention that I usually don't access the server directly when making changes to the devices in DCR. In fact that may be why I couldn't reproduce the exact error while I was with TAC. Hmmm...I'll begin testing that immediately. We definitely have some tight firewalls here that we could be battling with...

However, it is important to note that there are no firewalls between the master and its slaves. They are all in the same subnet. So the master unreachable should be different.

Joe Clarke Fri, 10/30/2009 - 18:06

No, I meant non-LMS services. LMS will not conflict with itself. But other services could be hindering it.

The client shouldn't have a bearing on how DCR works. All of the communication happens either internally or between servers.

Ok.

But I have an idea...

This error may only be occurring from a remote PC that initially logs into SSO with the FQDN without having been prompted for SSO relogin at the URLs that are generated with the short hostname.

This is why it is so flaky and hard to pin down. It is also why it seems to not occur with Firefox, while indeed it does.

Maybe you can reproduce it with this info.

All certs are shortname, but we access the site with the FQDN. (clear all browser cache) Go straight to the DCR and try to modify. Internal Error !!

Modify the URL to just the shortname and it will prompt for SSO again, but you can modify the DCR (because you are already authenticated)!!

yay...I can reproduce consistently.

Joe Clarke Sun, 11/01/2009 - 10:09

No, I can't reproduce. I've done everything exactly how you said. Server certs are short hostname, SSO master/slave, DCR master/slave, HTTPS mode enabled, cleared browser cache, restarted browser, connected to master using FQDN, went to DCR, modified credentials, added a new credential set, modified identity, added a new device. Everything works without error.

Just in case your error has changed, post the new dcr.log after reproducing this problem.

Joe Clarke Sun, 11/01/2009 - 10:52

Yes. I've tried using both Firefox and IE from remote servers. I rarely ever use the LMS server itself as a client.

Joe Clarke Sun, 11/01/2009 - 15:00

I was more interested in seeing a screenshot of the services control panel showing all running services.

Joe Clarke Mon, 11/02/2009 - 09:22

This looks fine. The only major difference is that you have the SNMP service installed whereas my servers do not.

The logs are as useless as before. The methods that fail are different, and there doesn't appear to be any real pattern. I will need to provide you some debugging patches through your service request. Have your engineer contact me, and I will send the patches. They should tell us definitively why this is failing (though maybe not the true root cause).

My engineer is out of the office still.

Is this normal?

---------- stdout.log ---------------------

[Sun Nov 01 01:29:30 CST 2009] CCN::initializeSSO Connecting to https://CISCOWORKS:443/CSCOnm/servlet/com.cisco.nm.cmf.servlet.ProcessSSOServlet

[Sun Nov 01 01:29:30 CST 2009] CCN - SSO mode = 1

[Sun Nov 01 01:29:30 CST 2009] CoreContextNexus ValidateConnection:true

[Sun Nov 01 01:29:30 CST 2009] Invoke doLicenseCheck!

[Sun Nov 01 01:29:30 CST 2009] DoLicenseCheck Returned:true

Local Server URL :https://ciscoworks.net.okstate.edu:443

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

log4j:ERROR Attempted to append to closed appender named [A-client1].

log4j:WARN Not allowed to write to a closed appender.

Is it possible that this is all due to a TCP session timing out?

How does the DCR allow changes to occur? Let's say I log in and make a change to the device credentials...I have thus done so via a particular TCP port on the server. Is there a specific thread involved as well? Is my TCP port specific to my username or client IP or anything?

If I login via the FQDN vs. the short hostname am I offered another thread or TCP port or user session that might eliminate the effect of the TCP timeout via the alternate login?

Just brainstorming...lol...did all that come out in a sensible fashion?

Joe Clarke Mon, 11/02/2009 - 10:27

As I said before, there is no way this is client-related. All of this happens on the backend. The way it works is that you submit an HTTP request saying you want something to happen. A servlet (like a CGI) takes this request, and calls a remote method via our proprietary CSTM RPC system. CSTM then tries to resolve the desired request using the given Java class and method names. THIS is failing from time to time. Why that failure occurs is not clear. For some reason, the DCR JVM running the CSTM client code is unable to resolve the method name from the given class.
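The idea of resolving a request by class and method name can be illustrated generically. CSTM is proprietary and its internals are not described in this thread, so the Python below is only an analogy for name-based dispatch; every class, method, and registry name in it is hypothetical.

```python
class DeviceCredentialService:
    """Hypothetical backend class exposed to remote callers."""

    def update_credentials(self, device, username):
        return f"updated {device} for {username}"


# Hypothetical registry of classes that may be invoked by name.
REGISTRY = {"DeviceCredentialService": DeviceCredentialService}


def invoke(class_name, method_name, *args):
    """Resolve the class and method by name, then call the method."""
    cls = REGISTRY.get(class_name)
    if cls is None:
        raise LookupError(f"cannot resolve class {class_name!r}")
    method = getattr(cls(), method_name, None)
    if not callable(method):
        raise LookupError(
            f"cannot resolve method {method_name!r} on {class_name}")
    return method(*args)


print(invoke("DeviceCredentialService", "update_credentials",
             "router1", "admin"))
try:
    invoke("DeviceCredentialService", "no_such_method")
except LookupError as exc:
    # The caller only sees a generic failure, whatever browser or URL it used.
    print("internal error:", exc)
```

In this picture the lookup happens entirely on the server side, which is why the browser or the URL the client used would not be expected to change the outcome.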

Correct Answer
Joe Clarke Wed, 11/04/2009 - 09:48

I found the problem. As I predicted, it has nothing to do with browser, FQDN, or anything. It is a transient issue that only affects Windows SMP systems. It tends to occur mostly on faster machines. A patch is on its way.
