12-27-2011 09:45 PM - edited 03-01-2019 10:12 AM
Hi,
We have configured the Call Home option in UCSM and have been receiving the error below from Call Home since last Saturday. We have opened a TAC case with Cisco to troubleshoot it, but per TAC, "The error is a transient error from which the fabric interconnects can automatically recover."
Below are the error messages we are getting:
E-mail-1:
Subject:
System Notification from System-A - diagnostic:GOLD-major - 2011-12-27 17:54:09 GMT-00:00 Fabric Interconnect B, management services are unresponsive
Body Message:
System Name:System-A
Time of Event:2011-12-27 17:54:09 GMT-00:00
Event Description:Fabric Interconnect B, management services are unresponsive
Severity Level:6
E-mail-2:
Subject:
System Notification from System-A - diagnostic:GOLD-major - 2011-12-27 17:54:09 GMT-00:00 Fabric Interconnect B, management services are unresponsive
Body Message:
<?xml version="1.0" encoding="UTF-8" ?>
<soap-env:Envelope xmlns:soap-env="http://www.w3.org/2003/05/soap-envelope">
<soap-env:Header>
<aml-session:Session xmlns:aml-session="http://www.cisco.com/2004/01/aml-session" soap-env:mustUnderstand="true" soap-env:role="http://www.w3.org/2003/05/soap-envelope/role/next">
<aml-session:To>http://tools.cisco.com/neddce/services/DDCEService</aml-session:To>
<aml-session:Path>
<aml-session:Via>http://www.cisco.com/appliance/uri</aml-session:Via>
</aml-session:Path>
<aml-session:From>http://www.cisco.com/appliance/uri</aml-session:From>
<aml-session:MessageId>1058:SSI1442BFRC:4EFA0641</aml-session:MessageId>
</aml-session:Session>
</soap-env:Header>
<soap-env:Body>
<aml-block:Block xmlns:aml-block="http://www.cisco.com/2004/01/aml-block">
<aml-block:Header>
<aml-block:Type>http://www.cisco.com/2005/05/callhome/diagnostic</aml-block:Type>
<aml-block:CreationDate>2011-12-27 17:54:09 GMT-00:00</aml-block:CreationDate>
<aml-block:Builder>
<aml-block:Name>UCS 6100 Series Fabric Interconnect</aml-block:Name>
<aml-block:Version>4.2(1)N1(1.43q)</aml-block:Version>
</aml-block:Builder>
<aml-block:BlockGroup>
<aml-block:GroupId>1059:Serial Number:4EFA0641</aml-block:GroupId>
<aml-block:Number>0</aml-block:Number>
<aml-block:IsLast>true</aml-block:IsLast>
<aml-block:IsPrimary>true</aml-block:IsPrimary>
<aml-block:WaitForPrimary>false</aml-block:WaitForPrimary>
</aml-block:BlockGroup>
<aml-block:Severity>6</aml-block:Severity>
</aml-block:Header>
<aml-block:Content>
<ch:CallHome xmlns:ch="http://www.cisco.com/2005/05/callhome" version="1.0">
<ch:EventTime>2011-12-27 17:54:09 GMT-00:00</ch:EventTime>
<ch:MessageDescription>Fabric Interconnect B, management services are unresponsive</ch:MessageDescription>
<ch:Event>
<ch:Type>diagnostic</ch:Type>
<ch:SubType>GOLD-major</ch:SubType>
<ch:Brand>Cisco</ch:Brand>
<ch:Series>UCS 6100 Series Fabric Interconnect</ch:Series>
</ch:Event>
<ch:CustomerData>
<ch:UserData>
<ch:Email>xyz@xyz.com</ch:Email>
</ch:UserData>
<ch:ContractData>
<ch:CustomerId>abc@abc.com</ch:CustomerId>
<ch:ContractId>ContractID</ch:ContractId>
<ch:DeviceId>N10-S6100@C@SSI1442BFRC</ch:DeviceId>
</ch:ContractData>
<ch:SystemInfo>
<ch:Name>System-A</ch:Name>
<ch:Contact>Name</ch:Contact>
<ch:ContactEmail>xyz@xyz.com</ch:ContactEmail>
<ch:ContactPhoneNumber>+00-0000000000</ch:ContactPhoneNumber>
<ch:StreetAddress>Office Address</ch:StreetAddress>
</ch:SystemInfo>
</ch:CustomerData>
<ch:Device>
<rme:Chassis xmlns:rme="http://www.cisco.com/rme/4.0">
<rme:Model>N10-S6100</rme:Model>
<rme:HardwareVersion>0.0</rme:HardwareVersion>
<rme:SerialNumber>SerialNumber</rme:SerialNumber>
</rme:Chassis>
</ch:Device>
</ch:CallHome>
</aml-block:Content>
<aml-block:Attachments>
<aml-block:Attachment type="inline">
<aml-block:Name>sam_content_file</aml-block:Name>
<aml-block:Data encoding="plain">
<![CDATA[
<faultInst
ack="no"
cause="management-services-unresponsive"
changeSet=""
code="F0452"
created="2011-12-27T23:24:09.681"
descr="Fabric Interconnect B, management services are unresponsive"
dn="sys/mgmt-entity-B/fault-F0452"
highestSeverity="critical"
id="2036245"
lastTransition="2011-12-27T23:24:09.681"
lc=""
occur="1"
origSeverity="critical"
prevSeverity="critical"
rule="mgmt-entity-management-services-unresponsive"
severity="critical"
status="created"
tags=""
type="management"/>]]>
</aml-block:Data>
</aml-block:Attachment>
</aml-block:Attachments>
</aml-block:Block>
</soap-env:Body>
</soap-env:Envelope>
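(Aside for anyone who wants to script triage of these alerts: the useful detail lives in the CDATA faultInst attachment, not the e-mail subject. Below is a minimal Python sketch using only the standard library and the namespaces from the message above; the function name and return shape are illustrative, not part of any Cisco tooling.)

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the Call Home envelope above.
NS = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "aml": "http://www.cisco.com/2004/01/aml-block",
}

def extract_fault(envelope_xml: str) -> dict:
    """Pull the embedded faultInst attributes out of a Call Home message.

    The fault XML is carried as CDATA text inside aml-block:Data, so we
    parse the envelope first, then parse the inner payload separately.
    """
    root = ET.fromstring(envelope_xml)
    data = root.find(".//aml:Attachments/aml:Attachment/aml:Data", NS)
    fault = ET.fromstring(data.text.strip())
    return {
        "code": fault.get("code"),
        "severity": fault.get("severity"),
        "cause": fault.get("cause"),
        "dn": fault.get("dn"),
    }
```

Feeding it the envelope above should yield code F0452 with severity critical, which is the fault raised for unresponsive management services.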
We want to understand the impact of this error and whether there is anything we can do to prevent it. We would also like to know what might be causing it.
Let me know if anything else is needed from my side.
show-tech file uploaded.
12-28-2011 02:03 AM
Amit,
Since you already have a TAC SR for this issue, please get in touch with the TAC engineer with an update about the recurring alerts.
We would need logs to better understand the behavior.
Providing additional information would help, such as:
- Is the alert generated only for FI B, or for both FIs?
- Any change in cluster state corresponding to the alert timestamp
- Cluster physical link status
- Does either FI have any core dumps?
Padma
12-28-2011 04:58 AM
Padma,
The TAC engineer sent the mail below:
Hi Amit,
I've checked through the show tech you've uploaded and have not found any indicators of errors related to the message you are seeing.
As I mentioned on the call, the error is a transient one from which the fabric interconnects can automatically recover. The recommended action is to wait a few minutes (10-15) to see if the error clears automatically. If it does not clear, we will need to do further troubleshooting. This error on its own is not a cause for worry. As you have HA in your system, the management services would have failed over to the other fabric interconnect, and your system's performance would not be affected.
We can leave the system under observation for a few days to see if other errors occur concurrently with this one.
I will upload the show-tech logs here; find my replies below.
Is the alert generated only for FI B or both FIs ->> Amit: The alert is generated for FI-B only.
Any change in cluster state corresponding to alert time stamp ->> Amit: Unfortunately, when this error is generated we are unable to check the cluster state in time. If you can suggest any other location where I can find the state, that would be helpful.
Cluster physical link status ->> Amit: The cluster link is OK.
Does FI have any core dumps ->> Amit: I'm not sure about this. How can I check it?
Regards,
Amit Vyas
12-28-2011 05:33 AM
Amit,
I have reached out to the TAC engineer and will get back to you. Also, please upload the latest UCSM show tech to the SR.
"show cluster extended-state" would show the cluster state.
For core dumps, you can check from the Admin tab of UCSM.
Padma
12-28-2011 06:09 AM
Padma,
Below is a screenshot of the "show cluster extended-state" command. Something strange here: we have a total of 4 chassis, but I can see HA READY for only 3 of them.
I will upload the latest "show-tech" to the SR. There are no core dumps available under "UCSM -> Admin -> Core Files".
-Amit
12-29-2011 08:03 AM
Amit,
It is normal for only 3 chassis to be displayed in "show cluster extended-state". UCS uses up to 3 chassis for quorum when determining primary/subordinate roles. The screenshot above shows a stable system.
Although the system is on 1.4(3q), we are observing PSU I2C errors (CSCtq10987), likely carried forward from an upgrade. Additional details and resolution steps for your particular case are provided in the TAC SR.
For reference, customers can review the "error_pca9541_per_device" sections of the Chassis -> IOM -> I2C.log file for EBUSY errors to identify which device is causing the I2C bus noise.
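That log review can be scripted. Here is a rough Python sketch; note that the I2C.log line format is an assumption (a device token following the error_pca9541_per_device marker), so check a real log and adjust the regex accordingly:

```python
import re
from collections import Counter

# Assumed pattern: lines mentioning error_pca9541_per_device carry a
# device identifier and an EBUSY result. Adjust this regex to match
# the actual IOM I2C.log format on your release.
EBUSY_RE = re.compile(r"error_pca9541_per_device.*?(dev\S+).*?EBUSY")

def ebusy_counts(log_lines):
    """Count EBUSY errors per device to spot the noisy I2C participant."""
    counts = Counter()
    for line in log_lines:
        m = EBUSY_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

The device with a disproportionately high count is the likely source of the bus noise.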
Thanks,
Matthew
12-28-2011 05:18 AM
Please send me a private message with the TAC SR. I will follow up when I return to the office tomorrow.
Sent from Cisco Technical Support iPhone App
12-29-2011 12:27 AM
Hi Matthew,
I have sent you private message.
-Amit
01-03-2012 01:52 PM
Any resolution on this? We are seeing the same issue running 2.0(1s) and have an open SR. Very annoying: over 30 Call Home e-mail alerts (in pairs) in less than 30 days. Some days we get none; other days we get multiple...
01-03-2012 09:24 PM
Hi Robert,
I'm not sure whether the workaround will work on 2.0(1s), because we are on 1.4(3q), where we are facing this issue.
We have received the workaround below:
As suspected, I2C communication is causing the SEEPROM errors, which in turn cause the Call Home alerts.
To move forward, we need to identify the noisy PSU that is causing the I2C issues.
-- To be on the safe side, make sure that you are not running any critical apps on chassis 1
-- Remove PSU X and gather the output of the following commands every 60 seconds for a period of 3-5 minutes
connect local-mgmt a
show tech chassis 1 iom 1 brief | no-more
show tech chassis 1 iom 2 brief | no-more
show tech chassis 1 iom 1 brief | egrep 'fixup|lostar'
show tech chassis 1 iom 2 brief | egrep 'fixup|lostar'
If the values stop incrementing for these two counters, then we have removed the defective PSU from the system.
-- If they still increment, repeat the above steps, removing one PSU at a time.
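To judge whether the counters have stopped incrementing, you can diff successive snapshots. A small Python sketch follows; the counter names fixup/lostar are taken from the egrep above, but the "name: value" output format is an assumption, so adapt the regex to the real "show tech ... brief" output:

```python
import re

# Assumed snapshot format: each line contains a counter name followed
# by an integer value, e.g. "fixup: 1042". Adjust to the real output.
COUNTER_RE = re.compile(r"(fixup|lostar)\D*(\d+)")

def parse_counters(snapshot: str) -> dict:
    """Extract the tracked counters from one command snapshot."""
    return {m.group(1): int(m.group(2)) for m in COUNTER_RE.finditer(snapshot)}

def still_incrementing(before: str, after: str) -> bool:
    """True if any tracked counter grew between two snapshots,
    i.e. the defective PSU is still in the system."""
    b, a = parse_counters(before), parse_counters(after)
    return any(a.get(name, 0) > b.get(name, 0) for name in b)
```

If still_incrementing() keeps returning True after pulling a PSU, put it back and repeat with the next one, as the workaround describes.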
I guess your TAC engineer will give you more clarity on this error for 2.0(1s).
Regards,
Amit
01-05-2012 07:36 AM
Excellent! It seems resetting the PSUs addressed this for us, as well as addressing some "device CHASSIS_SN, error accessing shared-storage" warning faults we were seeing.