We recently purchased 70 Cisco UCS C220 servers with the LSI 9266 8-port card. These are specced as follows:
* 2x E5-2680 processors
* 8x SFF 1TB HDDs
* Basic networking (2 ethernet, 1 CIMC)
* 128GB of RAM
* LSI 9266 - RAID5 across 7 disks, 1 hotspare
* Firmware: 1.5(3d)
* Operating System: RHEL 6.4
* Kernel (megaraid_sas vermagic): 2.6.32-358.el6.x86_64
We have recently run into a lot of issues with these systems: under full load a system will stop responding because the OS thinks the RAID card has gone away, and the CIMC reports the same thing.
What I would like to know is whether anyone else has had issues with these. If so, what revision is your RAID card?
I see the notice for the 9266-8i controllers with the CacheVault kit attached:
I've checked a few of our cards and they are not listed there, but I really appreciate the information! I had not seen the field notices page before, thank you.
We have this bug, CSCuh86924, BUT TAC will need to confirm it first. This is NOT a Field Notice, so it cannot be diagnosed just by looking at the controller, as it can with the Field Notices we release.
If you are running ESXi, the first symptom you might see is a PSOD with the "Exception 14" message.
For you and other users seeing the same failure I described above:
The issue observed is rare and application-centric. The symptom has been observed primarily in VMware environments, and the probability of failure is low (compared to the number of servers sold that do NOT face this issue). The issue is due to a marginal signal level in the current LSI 9266 design. Cisco has confirmed there is no data loss due to this issue, but the fix is at the hardware level, meaning that an RMA is needed to resolve it.
As I mentioned before, TAC needs to diagnose this first, so please open a TAC case if the above describes the issue you see.
We are running Red Hat Enterprise Linux 6.4 on all of these systems; sorry I didn't include that in the first go-round. Is it possible that it's the hardware/driver combination? If so, I'd like to know what LSI driver revision is being used in your version of ESXi.
username@host:~$ strings /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko |grep -i vermagic
vermagic=2.6.32-358.el6.x86_64 SMP mod_unload modversions
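Note that `vermagic` records the kernel the module was built against, not the driver's own revision; `modinfo megaraid_sas` prints a separate `version:` field for that. A minimal Python sketch that pulls both values out of `modinfo`-style text is below. The sample text is a placeholder, its driver version string is made up, and `parse_modinfo` is a hypothetical helper name.

```python
def parse_modinfo(text):
    """Return a dict with the first value seen for each 'field: value' line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


# Placeholder text shaped like `modinfo megaraid_sas` output.
# The version value is hypothetical; the vermagic value is the one posted above.
SAMPLE = """\
filename:       /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko
version:        00.00.00.00-example
vermagic:       2.6.32-358.el6.x86_64 SMP mod_unload modversions
"""

info = parse_modinfo(SAMPLE)
print("driver version:", info["version"])
print("built for kernel:", info["vermagic"].split()[0])
```

On a live system, replace `SAMPLE` with the captured output of `modinfo megaraid_sas` and compare the `version` field against whatever driver revision TAC asks about.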
I would still suggest having TAC check whether you are hitting this bug. We have seen it primarily in ESXi, but this is not a problem on VMware's side, so it is not exclusive to that OS.
Certainly! See my post below as well. We do have these systems stacked without spacing, and from the notes below it would seem these systems have some heat issues. It would have been nice for Cisco to let us know this before we had them installed; they should probably list this as a warning in the packaging.
I have been working with our rep at Cisco and she has helped us by sending out all new RAID cards for these systems, so I'm going to assume that a number of issues like this are being reported.
If you see the replacement works, please don't forget to come back and mark this as solved so other users know you resolved the situation. We see a good number of posts like this one that users find later and never learn how they ended.
I dug around some more and found this page:
Symptom An LSI 9266-8i Raid Controller may overheat in a C220 M3 server with a full compliment of HDD. An overheating LSI controller may behave unpredictably, losing VD's, and rebooting.
Workaround To minimize the risk of overheating the LSI controller, please make sure top vents are not obstructed. Do not stack anything on top of the server. If racking, please allow for several centimeters of space between servers. Avoid covering front or rear vents with labels. Use servers in a well ventilated area with lower ambient temperature. Make sure there is no obstruction to airflow in the front, rear or top of the server. (CSCue16903)
We have no obstruction, and we have worked with our datacenter provider to get the cooling down to 66°F or less, and this is still happening. I will start gathering temperature readings for the chip on the RAID card and graphing them.
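For the logging side of that, a minimal Python sketch is below. It assumes the controller's information dump (for example from LSI's MegaCli utility) contains a line of the form `ROC temperature : NN degree Celsius`. That line format, the sample text, and the helper names `parse_roc_temp` and `log_reading` are assumptions to check against your own tool's output, not something confirmed in this thread.

```python
import csv
import io
import re
from datetime import datetime

# Assumed line format; adjust the pattern to match your tool's real output.
ROC_TEMP = re.compile(r"ROC temperature\s*:\s*(\d+)", re.IGNORECASE)


def parse_roc_temp(text):
    """Return the controller chip temperature in Celsius, or None if absent."""
    match = ROC_TEMP.search(text)
    return int(match.group(1)) if match else None


def f_to_c(deg_f):
    """Convert Fahrenheit to Celsius, for comparing with the room setpoint."""
    return (deg_f - 32) * 5.0 / 9.0


def log_reading(out, host, text, now=None):
    """Append one 'timestamp,host,temp_c' CSV row to a file-like object."""
    temp_c = parse_roc_temp(text)
    if temp_c is None:
        return False
    stamp = (now or datetime.now()).isoformat()
    csv.writer(out).writerow([stamp, host, temp_c])
    return True


# Illustrative sample only, not a reading from these servers.
SAMPLE = "Adapter #0\nROC temperature : 71  degree Celsius\n"

buf = io.StringIO()
log_reading(buf, "host01", SAMPLE, now=datetime(2000, 1, 1, 12, 0, 0))
print(buf.getvalue().strip())
print("66F ambient = %.1fC" % f_to_c(66))
```

Run from cron against each host, the resulting CSV can be fed to any graphing tool, and plotting chip temperature next to the times the controller dropped out would show whether the two correlate.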
Yeah, I had figured that, which is a shame because I would welcome any "This will fix it and stop wasting my weekends and nights" type of solution ;-)
Thank you for your responses and help. Really, I'm glad to have someone's advice and assistance. Just grumpy.
For context: we bought these as a replacement for some systems (HP DL165s) that were no longer cutting it, and so far the new hardware has actually had more issues. We went with Cisco because we were after stability and best-in-class integration (something the sales guys promised); that was pretty much the only thing that swayed us from HP. To be frank, we are regretting this and losing customers.
Use on these systems: we run a large database (read-heavy 90% of the time, hence RAID5 for redundancy; we would have done RAID6 if we had two more drives/space) and it is expected to churn out data every day to populate systems for customers. Unfortunately, we have gone well over that deadline lately.
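To put numbers on the RAID5 versus RAID6 trade-off: RAID5 gives up one disk's worth of capacity to parity and RAID6 gives up two. A small Python sketch of that arithmetic for the 7-disk array, assuming nominal 1 TB disks and ignoring formatting overhead (`usable_tb` is a hypothetical helper name):

```python
def usable_tb(disks, disk_tb, parity_disks):
    """Usable capacity of a single parity-RAID group, in TB."""
    if disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return (disks - parity_disks) * disk_tb


raid5 = usable_tb(7, 1, 1)  # 7 disks, single parity
raid6 = usable_tb(7, 1, 2)  # same 7 disks, double parity
print("RAID5: %d TB usable, survives 1 disk failure" % raid5)
print("RAID6: %d TB usable, survives 2 disk failures" % raid6)
```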
That is always a good idea, but I would not define it as a best practice because not all users have the luxury of wasting 1RU just for that. Our servers are supposed to work fine in one of our R Series racks. There was actually a customer the other day who was having some problems because they were using a glass-door rack that made the fan speed increase when the door was closed. This was the thread: https://supportforums.cisco.com/message/4153481#4153481.
Bottom line, it's your choice. If you can, do it; that will at least let us rule that out, and it is something we can add to a possible TAC case.
OK, we will space them; we have the rack space for these since they are a bit power hungry.
Glass door on a rack, huh? Must be for showing off to execs? I wouldn't have done that personally or professionally since it would surely impede airflow. I guess everyone has different needs for these.
Thanks again Kenny:
That was my post :-) There was no deliberate choice of glass for the door; the cabinet simply came that way, and it hadn't been a problem before. Having the C220 installed in an easily accessible engineering/test rack like that has been very useful in the short term for testing, training, etc.