12-27-2011 09:45 PM - edited 03-01-2019 10:12 AM
Hi,
We have configured the Call Home option in UCSM and have been receiving the error below from Call Home since last Saturday. We have opened a TAC case with Cisco to troubleshoot it, but per TAC, "The error is a transient error from which the fabric interconnects can automatically recover."
Below are the error messages we are getting:
E-mail-1:
Subject:
System Notification from System-A - diagnostic:GOLD-major - 2011-12-27 17:54:09 GMT-00:00 Fabric Interconnect B, management services are unresponsive
Body Message:
System Name:System-A
Time of Event:2011-12-27 17:54:09 GMT-00:00
Event Description:Fabric Interconnect B, management services are unresponsive
Severity Level:6
E-mail-2:
Subject:
System Notification from System-A - diagnostic:GOLD-major - 2011-12-27 17:54:09 GMT-00:00 Fabric Interconnect B, management services are unresponsive
Body Message:
<?xml version="1.0" encoding="UTF-8" ?>
<soap-env:Envelope xmlns:soap-env="http://www.w3.org/2003/05/soap-envelope">
<soap-env:Header>
<aml-session:Session xmlns:aml-session="http://www.cisco.com/2004/01/aml-session" soap-env:mustUnderstand="true" soap-env:role="http://www.w3.org/2003/05/soap-envelope/role/next">
<aml-session:To>http://tools.cisco.com/neddce/services/DDCEService</aml-session:To>
<aml-session:Path>
<aml-session:Via>http://www.cisco.com/appliance/uri</aml-session:Via>
</aml-session:Path>
<aml-session:From>http://www.cisco.com/appliance/uri</aml-session:From>
<aml-session:MessageId>1058:SSI1442BFRC:4EFA0641</aml-session:MessageId>
</aml-session:Session>
</soap-env:Header>
<soap-env:Body>
<aml-block:Block xmlns:aml-block="http://www.cisco.com/2004/01/aml-block">
<aml-block:Header>
<aml-block:Type>http://www.cisco.com/2005/05/callhome/diagnostic</aml-block:Type>
<aml-block:CreationDate>2011-12-27 17:54:09 GMT-00:00</aml-block:CreationDate>
<aml-block:Builder>
<aml-block:Name>UCS 6100 Series Fabric Interconnect</aml-block:Name>
<aml-block:Version>4.2(1)N1(1.43q)</aml-block:Version>
</aml-block:Builder>
<aml-block:BlockGroup>
<aml-block:GroupId>1059:Serial Number:4EFA0641</aml-block:GroupId>
<aml-block:Number>0</aml-block:Number>
<aml-block:IsLast>true</aml-block:IsLast>
<aml-block:IsPrimary>true</aml-block:IsPrimary>
<aml-block:WaitForPrimary>false</aml-block:WaitForPrimary>
</aml-block:BlockGroup>
<aml-block:Severity>6</aml-block:Severity>
</aml-block:Header>
<aml-block:Content>
<ch:CallHome xmlns:ch="http://www.cisco.com/2005/05/callhome" version="1.0">
<ch:EventTime>2011-12-27 17:54:09 GMT-00:00</ch:EventTime>
<ch:MessageDescription>Fabric Interconnect B, management services are unresponsive</ch:MessageDescription>
<ch:Event>
<ch:Type>diagnostic</ch:Type>
<ch:SubType>GOLD-major</ch:SubType>
<ch:Brand>Cisco</ch:Brand>
<ch:Series>UCS 6100 Series Fabric Interconnect</ch:Series>
</ch:Event>
<ch:CustomerData>
<ch:UserData>
<ch:Email>xyz@xyz.com</ch:Email>
</ch:UserData>
<ch:ContractData>
<ch:CustomerId>abc@abc.com</ch:CustomerId>
<ch:ContractId>ContractID</ch:ContractId>
<ch:DeviceId>N10-S6100@C@SSI1442BFRC</ch:DeviceId>
</ch:ContractData>
<ch:SystemInfo>
<ch:Name>System-A</ch:Name>
<ch:Contact>Name</ch:Contact>
<ch:ContactEmail>xyz@xyz.com</ch:ContactEmail>
<ch:ContactPhoneNumber>+00-0000000000</ch:ContactPhoneNumber>
<ch:StreetAddress>Office Address</ch:StreetAddress>
</ch:SystemInfo>
</ch:CustomerData>
<ch:Device>
<rme:Chassis xmlns:rme="http://www.cisco.com/rme/4.0">
<rme:Model>N10-S6100</rme:Model>
<rme:HardwareVersion>0.0</rme:HardwareVersion>
<rme:SerialNumber>SerialNumber</rme:SerialNumber>
</rme:Chassis>
</ch:Device>
</ch:CallHome>
</aml-block:Content>
<aml-block:Attachments>
<aml-block:Attachment type="inline">
<aml-block:Name>sam_content_file</aml-block:Name>
<aml-block:Data encoding="plain">
<![CDATA[
<faultInst
ack="no"
cause="management-services-unresponsive"
changeSet=""
code="F0452"
created="2011-12-27T23:24:09.681"
descr="Fabric Interconnect B, management services are unresponsive"
dn="sys/mgmt-entity-B/fault-F0452"
highestSeverity="critical"
id="2036245"
lastTransition="2011-12-27T23:24:09.681"
lc=""
occur="1"
origSeverity="critical"
prevSeverity="critical"
rule="mgmt-entity-management-services-unresponsive"
severity="critical"
status="created"
tags=""
type="management"/>]]>
</aml-block:Data>
</aml-block:Attachment>
</aml-block:Attachments>
</aml-block:Block>
</soap-env:Body>
</soap-env:Envelope>
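(Aside for anyone who wants to script triage of these alerts: the useful detail lives in the CDATA faultInst attachment, not the e-mail subject. Below is a minimal Python sketch using only the standard library and the namespaces from the message above; the function name and return shape are illustrative, not part of any Cisco tooling.)

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the Call Home envelope above.
NS = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "aml": "http://www.cisco.com/2004/01/aml-block",
}

def extract_fault(envelope_xml: str) -> dict:
    """Pull the embedded faultInst attributes out of a Call Home message.

    The fault XML is carried as CDATA text inside aml-block:Data, so we
    parse the envelope first, then parse the inner payload separately.
    """
    root = ET.fromstring(envelope_xml)
    data = root.find(".//aml:Attachments/aml:Attachment/aml:Data", NS)
    fault = ET.fromstring(data.text.strip())
    return {
        "code": fault.get("code"),
        "severity": fault.get("severity"),
        "cause": fault.get("cause"),
        "dn": fault.get("dn"),
    }
```

Feeding it the envelope above should yield code F0452 with severity critical, which is the fault raised for unresponsive management services.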
We want to understand the impact of this error and whether there is anything we can do to prevent it. We would also like to know what might be causing it.
Let me know if anything else is needed from my side.
show-tech file uploaded.
12-28-2011 02:03 AM
Amit,
Since you already have a TAC SR for this issue, please get in touch with the TAC engineer with an update about the recurring alerts.
We would need logs to better understand the behavior.
Providing additional information would help, such as:
- Is the alert generated only for FI B, or for both FIs?
- Any change in cluster state corresponding to the alert timestamp
- Cluster physical link status
- Does either FI have any core dumps?
Padma
12-28-2011 04:58 AM
Padma,
The TAC engineer sent the mail below:
Hi Amit,
I've checked through the show tech you've uploaded and have not found any indicators of errors related to the message you are seeing.
As I mentioned on the call, the error is a transient one from which the fabric interconnects can automatically recover. The recommended action is to wait a few minutes (10-15) to see if the error clears automatically. If it does not clear, we will need to do further troubleshooting. This error on its own is not a cause for worry. As you have HA in your system, the management services would have failed over to the other fabric interconnect, and your system's performance would not be affected.
We can leave the system under observation for a few days to see if other errors occur concurrently with this one.
I will upload the show-tech logs here; find my replies below.
Is the alert generated only for FI B or both FIs ->> Amit: The alert is generated for FI-B only.
Any change in cluster state corresponding to alert time stamp ->> Amit: Unfortunately, when this error is generated we are unable to check the cluster state in time. If you can suggest any other location where I can find the state, that would be helpful.
Cluster physical link status ->> Amit: The cluster link is OK.
Does FI have any core dumps ->> Amit: I'm not sure about this. How can I check it?
Regards,
Amit Vyas
12-28-2011 05:33 AM
Amit,
I have reached out to the TAC engineer and will get back to you. Also, please upload the latest UCSM show tech to the SR.
"show cluster extended-state" would show the cluster state.
For core dumps, you can check from the Admin tab of UCSM.
Padma
12-28-2011 06:09 AM
Padma,
Below is a screenshot of the "show cluster extended-state" command. Something strange here: we have a total of 4 chassis, but I can see HA READY for only 3 of them.
I will upload the latest "show-tech" to the SR. There are no core dumps available under "UCSM -> Admin -> Core Files".
-Amit
12-29-2011 08:03 AM
Amit,
It is normal for only 3 chassis to be displayed in "show cluster extended-state". UCS uses up to 3 chassis for quorum when determining primary/subordinate roles. The screenshot above shows a stable system.
Although the system is on 1.4(3q), we are observing PSU I2C errors (CSCtq10987), likely carried forward from an upgrade. Additional details and resolution steps for your particular case are provided in the TAC SR.
For reference, customers can review the "error_pca9541_per_device" sections of the Chassis -> IOM -> I2C.log file for EBUSY errors to identify which device is causing the I2C bus noise.
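That log review can be scripted. Here is a rough Python sketch; note that the I2C.log line format is an assumption (a device token following the error_pca9541_per_device marker), so check a real log and adjust the regex accordingly:

```python
import re
from collections import Counter

# Assumed pattern: lines mentioning error_pca9541_per_device carry a
# device identifier and an EBUSY result. Adjust this regex to match
# the actual IOM I2C.log format on your release.
EBUSY_RE = re.compile(r"error_pca9541_per_device.*?(dev\S+).*?EBUSY")

def ebusy_counts(log_lines):
    """Count EBUSY errors per device to spot the noisy I2C participant."""
    counts = Counter()
    for line in log_lines:
        m = EBUSY_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

The device with a disproportionately high count is the likely source of the bus noise.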
Thanks,
Matthew
12-28-2011 05:18 AM
Please send me a private message with the TAC SR. I will follow up when I return to the office tomorrow.
Sent from Cisco Technical Support iPhone App
12-29-2011 12:27 AM
Hi Matthew,
I have sent you private message.
-Amit
01-03-2012 01:52 PM
Any resolution on this? We are seeing the same issue running 2.0(1s) and have an open SR. Very annoying: over 30 Call Home e-mail alerts (in pairs) in less than 30 days. Some days we get none; other days we get multiple...
01-03-2012 09:24 PM
Hi Robert,
I'm not sure whether the workaround will work on 2.0(1s), because we are on 1.4(3q), where we are facing this issue.
We have received the workaround below:
As suspected, I2C communication is causing the SEEPROM errors, which in turn cause the Call Home alerts.
To move forward, we need to identify the noisy PSU that is causing the I2C issues.
-- To be on the safe side, make sure that you are not running any critical apps on chassis 1
-- Remove PSU X and gather the output of the following commands every 60 seconds for a period of 3-5 minutes
connect local-mgmt a
show tech chassis 1 iom 1 brief | no-more
show tech chassis 1 iom 2 brief | no-more
show tech chassis 1 iom 1 brief | egrep 'fixup|lostar'
show tech chassis 1 iom 2 brief | egrep 'fixup|lostar'
If the values stop incrementing for these two counters, then we have removed the defective PSU from the system.
-- If they still increment, repeat the above steps, removing one PSU at a time.
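To judge whether the counters have stopped incrementing, you can diff successive snapshots. A small Python sketch follows; the counter names fixup/lostar are taken from the egrep above, but the "name: value" output format is an assumption, so adapt the regex to the real "show tech ... brief" output:

```python
import re

# Assumed snapshot format: each line contains a counter name followed
# by an integer value, e.g. "fixup: 1042". Adjust to the real output.
COUNTER_RE = re.compile(r"(fixup|lostar)\D*(\d+)")

def parse_counters(snapshot: str) -> dict:
    """Extract the tracked counters from one command snapshot."""
    return {m.group(1): int(m.group(2)) for m in COUNTER_RE.finditer(snapshot)}

def still_incrementing(before: str, after: str) -> bool:
    """True if any tracked counter grew between two snapshots,
    i.e. the defective PSU is still in the system."""
    b, a = parse_counters(before), parse_counters(after)
    return any(a.get(name, 0) > b.get(name, 0) for name in b)
```

If still_incrementing() keeps returning True after pulling a PSU, put it back and repeat with the next one, as the workaround describes.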
I guess your TAC engineer will give you more clarity on this error for 2.0(1s).
Regards,
Amit
01-05-2012 07:36 AM
Excellent! It seems resetting the PSUs addressed this for us, as well as addressing some "device CHASSIS_SN, error accessing shared-storage" warning faults we were seeing.