
Chassis lost, UCS 2.1.2a

Walter Dey
VIP Alumni

I have a customer with a simple configuration of one chassis, running UCS 2.1.2a. Suddenly the fans ran at full speed, and he then realized that communication between FI and IOM had been lost on both fabrics. See the attached file for the error message.

 

Power cycling the chassis resolved the issue.

 

Has this been seen in the field?


padramas
Cisco Employee

Hello Walter,

I assume the chassis did not lose power.

Before power cycling the chassis:

   --  Did they check the LED status of the IOMs and blades?

   --  Did they try reseating the IOMs?

Please ask them to open a TAC service request with the UCSM and chassis techsupport log bundles.

Padma

Hi Padma

The customer just power cycled the chassis; no other information is available, which is why I posted the show tech files in the original message above. I would not be surprised if this was the result of a total power failure, with the FI taking much longer to boot than the IOM!

Walter.

The FI definitely did not go down:

Hardware
  cisco UCS 6248 Series Fabric Interconnect ("O2 32X10GE/Modular Universal Platform Supervisor")
  Intel(R) Xeon(R) CPU         with 16622556 kB of memory.
  Processor Board ID FOC17101ST9

  Device name: FI-BAL16-1-B
  bootflash:   29535848 kB

Kernel uptime is 15 day(s), 23 hour(s), 40 minute(s), 51 second(s)

Last reset
  Reason: Unknown
  System version: 5.0(3)N2(2.11.2a)
  Service:

We had the exact same issue with one of our chassis. After several TAC cases it turned out there was a recall on the PSUs in that chassis. These PSUs corrupted the I2C bus, which caused these symptoms.

Thanks! Would you mind sharing the TAC case number with us?

Even better ( I guess)

Gold AC PSUs (N20-PAC5-2500W) below revision 341-0293-10 are missing fixes implemented via ECO E106290. This fix was applied to SN QCI1534A2YR and later. One of the useful things to know about the PSUs is the manufacturing date. To figure out when a PSU was manufactured, take the first two digits of the serial number and add them to 1996 to get the year. The next two digits are the manufacturing week. So SN QCI1534A2YR was manufactured in week 34 of 2011 (Aug 22-28).

Platinum PSUs have the fix; however, when they were first released they had issues that look similar to I2C problems. Check the hot issues on dcn-wiki for more info (CSCtz59519 / CSCtx90410).

So the manufacturing date is key to seeing whether you might be affected.
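For anyone who wants to check a batch of serial numbers, the date rule above can be sketched in a few lines of Python. This is only an illustration, not a Cisco tool: the function names `decode_psu_serial` and `week_range` are made up here, the serial layout (three letters, two year digits, two week digits, then the rest) is inferred from the single example QCI1534A2YR, and the week is assumed to follow ISO week numbering, which matches the Aug 22-28 range quoted for that example.

```python
import re
from datetime import date, timedelta


def decode_psu_serial(serial):
    """Return (year, week) of manufacture from a PSU serial number.

    Assumed layout, inferred from the example QCI1534A2YR:
    3 letters, 2 digits (year offset from 1996), 2 digits (week), remainder.
    """
    match = re.fullmatch(r"[A-Z]{3}(\d{2})(\d{2})[A-Z0-9]+", serial.strip().upper())
    if match is None:
        raise ValueError(f"unexpected serial number format: {serial!r}")
    year = 1996 + int(match.group(1))
    week = int(match.group(2))
    return year, week


def week_range(year, week):
    """Return (monday, sunday) of the given week, assuming ISO week numbering."""
    monday = date.fromisocalendar(year, week, 1)  # requires Python 3.8+
    return monday, monday + timedelta(days=6)


if __name__ == "__main__":
    year, week = decode_psu_serial("QCI1534A2YR")
    start, end = week_range(year, week)
    print(f"week {week} of {year}: {start} to {end}")
```

The decoded (year, week) pair only tells you when a unit was built; whether it carries the ECO fix should still be confirmed against the revision number or with TAC.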

Thanks! I think this might be Field Notice http://www.cisco.com/en/US/ts/fn/636/fn63628.html

Cisco UCS 5100 Series Blade Server Chassis

Field Notice: FN - 63628 - UCSB-PSU-2500ACPL Power Redundancy Failure - Hardware Replacement Required

Revised August 7, 2013

July 16, 2013

Unfortunately, the above FN was not applicable to our case, so the customer opened a TAC case, SR: 627085171

wdey

{Disclaimer: I have not checked the logs yet}

Since there is a case open already, be sure to check whether there is a memory leak per CSCuf61116. That issue should be fixed in the version the customer is running, but it is always worth making sure. I never rule something out until I can confirm it is definitely not the cause.

-Kenny

Thanks Kenny! The customer actually has a second UCS domain with the exact same configuration (hardware and software), which didn't show this problem. One thing I noticed, however, is that the two datacenters run at different temperatures. Could it be a temperature issue? The FI out-temp shows 55 degrees C.

Walter.

Walter,

If that is the case, do you know whether your customer has call home (SCH) set up for such events? That might help track temperature as a possible factor. Has this happened more than once?

The TAC engineer apparently suspects a power failure, and SCH can help with that too. Were any other devices in the same rack/site/power circuit affected at the same time, or was the issue isolated to this chassis?

See below for how to set up the SCH policy to track this, just in case you need it. I am sure you know how, but it may help others.

Good luck.

*RCA=  ROOT CAUSE ANALYSIS

-Kenny

Message was edited by: Keny Perez

For your information:

The issue seems to affect particular IOM modules with the version numbers below, and it is tracked under the following bug:

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCuf18380

PART NUM : 73-13196-04
PN REVISION : C0
FAB REVISION : 4

An RMA has been initiated.

Thanks to all who contributed!

Walter.

Thanks for updating the thread, Walter.

-Kenny

The customer replaced all the IOMs according to the above:

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCuf18380

After 2 1/2 months, the same thing happened again: chassis isolated, fans running at full speed. A new TAC case has been opened.

I cannot believe that we are the only ones having this issue.
