Cisco Support Community
Cisco Employee

Decoding UCS N20-B6625-1 hardware error

I understand these blades are end of life, but budgeting to replace all of them will be slow. Until then, I am trying to decode the alarms below, which are being reported through ESXi. Any advice or reference documentation would be appreciated:

 

      |----Product Name.............................................N20-B6625-1
      |----Vendor Name..............................................Cisco Systems Inc

 

Memory Critical Fault:

        <CIM_NumericSensor key="755548592">
          <CurrentReading>6375000</CurrentReading>
          <Name>DDR3_P1_B1_ECC(43.0.32.99)</Name>
          <ElementName>Memory Module 2 DDR3_P1_B1_ECC</ElementName>
          <HealthState>30</HealthState>
          <CurrentState>Upper Fatal</CurrentState>
          <BaseUnits>1</BaseUnits>
          <RateUnits>0</RateUnits>
          <UnitModifier>-2</UnitModifier>
          <UpperThresholdFatal>1600000</UpperThresholdFatal>
        </CIM_NumericSensor>

 

System Board:

 

        <CIM_NumericSensor key="-880602477">
          <CurrentReading>10000</CurrentReading>
          <Name>SEL_FULLNESS(84.0.32.99)</Name>
          <ElementName>System Board 0 SEL_FULLNESS</ElementName>
          <HealthState>25</HealthState>
          <CurrentState>Upper Critical</CurrentState>
          <BaseUnits>0</BaseUnits>
          <RateUnits>0</RateUnits>
          <UnitModifier>-2</UnitModifier>
          <UpperThresholdCritical>8000</UpperThresholdCritical>
        </CIM_NumericSensor>

1 REPLY
New Member

The SEL_FULLNESS error simply means the System Event Log (SEL) is getting full on that particular server. It is not a big deal unless something else happens that should have been logged and there is no room for it. You can set an SEL policy to automatically back up the SEL and then clear it once it reaches a certain fill level, which prevents this.
The first error, though, looks like the counter for memory ECC errors has exceeded a threshold number of errors within a certain period of time. This should also show up in UCS Manager as a fault on a specific blade. It appears to be DIMM B1 on whichever blade generated the error.
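
If it helps to read the raw numbers: in the CIM schema, `UnitModifier` is a power-of-ten exponent applied to the sensor values, and `HealthState` is a fixed value map (25 is critical failure, 30 is non-recoverable error). Here is a minimal Python sketch that decodes the two sensors from your post on that basis. The `<Sensors>` wrapper element and the `decode_sensor` helper are my own assumptions for illustration, not part of the ESXi output or any Cisco tool.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# HealthState value map from the DMTF CIM schema.
HEALTH_STATES = {
    0: "Unknown", 5: "OK", 10: "Degraded/Warning", 15: "Minor failure",
    20: "Major failure", 25: "Critical failure", 30: "Non-recoverable error",
}

# The two sensors from the post, trimmed to the fields used here.
# The <Sensors> root is a hypothetical wrapper so the snippet parses as one document.
SENSOR_XML = """
<Sensors>
  <CIM_NumericSensor key="755548592">
    <CurrentReading>6375000</CurrentReading>
    <ElementName>Memory Module 2 DDR3_P1_B1_ECC</ElementName>
    <HealthState>30</HealthState>
    <UnitModifier>-2</UnitModifier>
    <UpperThresholdFatal>1600000</UpperThresholdFatal>
  </CIM_NumericSensor>
  <CIM_NumericSensor key="-880602477">
    <CurrentReading>10000</CurrentReading>
    <ElementName>System Board 0 SEL_FULLNESS</ElementName>
    <HealthState>25</HealthState>
    <UnitModifier>-2</UnitModifier>
    <UpperThresholdCritical>8000</UpperThresholdCritical>
  </CIM_NumericSensor>
</Sensors>
"""

def decode_sensor(elem):
    """Scale one sensor's reading and upper threshold by 10**UnitModifier."""
    modifier = int(elem.findtext("UnitModifier", default="0"))
    reading = Decimal(elem.findtext("CurrentReading")).scaleb(modifier)
    threshold = None
    # Use whichever upper threshold the sensor actually reports.
    for tag in ("UpperThresholdFatal", "UpperThresholdCritical"):
        raw = elem.findtext(tag)
        if raw is not None:
            threshold = Decimal(raw).scaleb(modifier)
            break
    health = int(elem.findtext("HealthState"))
    return {
        "name": elem.findtext("ElementName"),
        "reading": reading,
        "threshold": threshold,
        "health": HEALTH_STATES.get(health, f"Unrecognized ({health})"),
    }

sensors = [decode_sensor(e)
           for e in ET.fromstring(SENSOR_XML).iter("CIM_NumericSensor")]
for s in sensors:
    print(f"{s['name']}: {s['reading']} (threshold {s['threshold']}), {s['health']}")
```

Scaled this way, the ECC sensor reads 63750 against a fatal threshold of 16000, and SEL_FULLNESS reads 100 against a critical threshold of 80. The second pair would be consistent with a percentage, although the sensor reports `BaseUnits` of 0 (unknown), so the unit itself is not confirmed by the data.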

Before troubleshooting the memory errors, get logs. Specifically, generate one UCS technical support file with the UCSM option, and a second tech support file with the chassis option selected (specifying the chassis that contains the blade with the DIMM error). Instructions for generating these files are here. Save those logs for a week in case the errors come back. The files can be in the tens of MB.

After you have the logs, you can find DIMM troubleshooting steps here. If you only have single-bit errors, you can acknowledge the faults, reset the DIMM errors, clear the SEL, reset the CIMC, and then see whether the errors come back. If the errors come back in the same slot, you likely need to have the memory replaced.
