1632 Views · 0 Helpful · 18 Replies

UCS C220M3 + 9266 massive failures, help

joel.sersol
Level 1

Hi all,

We recently purchased 70 Cisco UCS C220 servers with the LSI 9266 8-port card.  These are specced as follows:

* 2x E5-2680 processors

* 8x SFF 1TB HDDs

* Basic networking (2 ethernet, 1 CIMC)

* 128GB of RAM

* LSI 9266 - RAID5 across 7 disks, 1 hotspare

* Firmware: 1.5(3d)

* Operating System: RHEL 6.4

* megaraid_sas driver: stock module from kernel 2.6.32-358.el6.x86_64

We have recently run into a lot of issues with these systems: under full load a system will stop responding because the OS thinks the RAID card has gone away, and the CIMC even reports this.

What I would like to know is whether anyone else has had issues with these, and if so, what revision your RAID card is.

Thanks,

-Joel

18 Replies

Walter Dey
VIP Alumni

I see the notice for the 9266-8i controllers with the CacheVault kit attached:

http://www.cisco.com/en/US/ts/fn/636/fn63601.html

I've checked a few of our cards and they are not listed there, but I really appreciate the information! I had not seen the Field Notices page before; thank you.

-Joel

Keny Perez
Level 8

Joel,

We have this bug, CSCuh86924, BUT TAC will need to confirm it first. This is NOT a Field Notice, so it cannot be diagnosed just by looking at the controller the way the issues in our Field Notices can.

If you are running ESXi, the first symptom you might see, is a PSOD with the "Exception 14" message.

For you and other users seeing the same failure I described above:

The issue observed is rare, and is application-centric. The symptom has been primarily observed in VMware environments, and the probability of hitting the failure is low (compared to the number of servers sold that do NOT face this issue). The issue is due to a marginal signal level in the current LSI 9266 design. Cisco has confirmed there is no data loss due to this issue, but the fix is at the hardware level, meaning an RMA is needed to resolve it.

As I mentioned before, TAC needs to diagnose this first, please open a TAC case if the above describes the issue you see.

-Kenny

We are running Red Hat Enterprise Linux 6.4 on all of these systems; sorry I didn't include that in the first go-round. Is it possible that it's the hardware/driver combination? If so, I'd like to know what LSI driver revision is being used in your version of ESXi.

I have:

username@host:~$ strings /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko  |grep -i vermagic

vermagic=2.6.32-358.el6.x86_64 SMP mod_unload modversions
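One caveat with the command above: vermagic only records which kernel the module was built against, not the driver's own release. A hedged alternative that reads the driver's version field, assuming kmod's modinfo is available on the host:

```shell
# The driver's own version lives in modinfo's "version" field, while
# vermagic only echoes the kernel the module was built for.
if command -v modinfo >/dev/null 2>&1 && modinfo megaraid_sas >/dev/null 2>&1; then
    modinfo --field version megaraid_sas
else
    echo "megaraid_sas module info unavailable on this host"
fi
```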

Joel,

I would still suggest having TAC check whether you are hitting this bug, because we have seen it primarily on ESXi; but this is not a problem on VMware's side, so it is not exclusive to that OS.

-Kenny

Certainly! See my post below as well: we do have these systems stacked without spacing, and from the notes below it would seem these systems have some heat issues. It would have been nice for Cisco to let us know this before we had them installed; they should probably list this as a warning in the packaging.

I have been working with our rep at Cisco, and she has helped us by sending out all new RAID cards for these systems, so I'm going to assume that a number of issues like this are being reported.

Thanks Kenny!

-Joel

Joel,

If the replacement works, please don't forget to come back and mark this as solved, so other users will know you resolved the situation. We see a good number of posts like this one that users find later and never learn how it ended up.

-Kenny

joel.sersol
Level 1

I dug around some more and found this page:

It mentions:

LSI

Symptom: An LSI 9266-8i RAID controller may overheat in a C220 M3 server with a full complement of HDDs. An overheating LSI controller may behave unpredictably, losing VDs and rebooting.

Workaround: To minimize the risk of overheating the LSI controller, make sure the top vents are not obstructed. Do not stack anything on top of the server. If racking, allow several centimeters of space between servers. Avoid covering front or rear vents with labels. Use servers in a well-ventilated area with lower ambient temperature. Make sure there is no obstruction to airflow at the front, rear, or top of the server. (CSCue16903)


We have no obstruction, and we have worked with our datacenter provider to get the cooling down to 66°F (about 19°C) or less, and this is still happening. I will start gathering temperature readings for the chip on the RAID card and graphing them.
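A minimal sketch of the kind of polling described above, assuming LSI's MegaCli64 is installed at the (hypothetical) path shown and that this card reports a "ROC temperature" line in its adapter info:

```python
import re
import subprocess
import time

# Hypothetical install path for LSI's CLI tool; adjust for your system.
MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"

def parse_roc_temperature(output):
    """Pull the ROC (RAID-on-Chip) temperature, in Celsius, out of
    MegaCli adapter-info output; returns None if the line is absent."""
    match = re.search(r"ROC temperature\s*:\s*(\d+)", output, re.IGNORECASE)
    return int(match.group(1)) if match else None

def poll_temperature():
    """Run MegaCli and return the current ROC temperature (or None)."""
    result = subprocess.run([MEGACLI, "-AdpAllInfo", "-aALL"],
                            capture_output=True, text=True)
    return parse_roc_temperature(result.stdout)

if __name__ == "__main__":
    # Append one timestamped reading per minute to a CSV for graphing.
    with open("roc_temp.csv", "a") as log:
        while True:
            temp = poll_temperature()
            if temp is not None:
                log.write("%d,%d\n" % (time.time(), temp))
                log.flush()
            time.sleep(60)
```

The CSV can then be graphed with any plotting tool; only the regex needs adjusting if your card's output differs.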

Joel,

You do not mention your firmware level either, but our latest 1.5 firmware includes a new fan algorithm that helps with the bug you reference, and that issue is no longer occurring.

https://tools.cisco.com/bugsearch/bug/CSCue16903    "Fixed Releases: 1.5(1b)  "

-Kenny

We are running 1.5(3d) on all of these systems unfortunately.

Joel,

That means you are not hitting that bug (CSCue16903)

-Kenny

Yeah, I had figured that, which is a shame, because I would welcome any "this will fix it and stop wasting my weekends and nights" type of solution ;-)

Thank you for your responses and help; really, I'm glad to have someone's advice and assistance. Just grumpy.

For context: we bought these as a replacement for some systems (HP DL165s) that were no longer cutting it, and so far the new hardware has actually had more issues. We went with Cisco because we were after stability and best-in-class integration (something the sales guys promised), which was pretty much the only thing that swayed us away from HP. To be frank, we are regretting this and losing customers.

Our use of these systems: we run a large database (read-heavy 90% of the time, hence RAID5 for redundancy; we would have done RAID6 if we had two more drives/space), and it is expected to churn out data every day to populate systems for customers. Unfortunately, we have missed that deadline by quite a large margin lately.

Would it still be advantageous for us to move all these systems so that they are 1U apart, just as a precaution?

Joel,

That is always a good idea, but I would not define it as a best practice, because not all users have the luxury of giving up 1RU just for that; our servers are supposed to work fine in one of our R-series racks. There was actually a customer the other day who was having problems because they were using a glass-door rack that made the fan speed increase when the door was closed. This was the thread: https://supportforums.cisco.com/message/4153481#4153481

Bottom line, it's your choice; if you can, do it. That will at least let us rule that out, and we can note it in a possible TAC case.

-Kenny
