3750 12.2(52) mem crash

Keith Abbott
Level 1

Hi,

I have 4 3750's, all at the same IOS rev, in a stack.

Eleven days ago, member 1 crashed due to a very slow memory leak. That part is straightforward: I can go back and review SNMP charts that show free processor memory disappearing right up to the point of the crash. That is not the primary concern.

The next day, the switch crashed again, even though the charts showed over 50% of processor memory free.

Suspecting a hardware problem, I replaced member 1 with a backup switch that weekend. After the new switch was in, I rebooted the entire stack to ensure a clean start. About 12 hours later that night, member 1 crashed again, and it crashed several more times over the next few days.

Each time, the logs claimed it was a memory issue (see below).

2011-05-08 03:34:01 Local7.Critical aa.bb.cc.d 206: 000046: May  8 08:33:57: %SYS-2-MALLOCFAIL: Memory allocation of 19824 bytes failed from 0x1772C10, alignment 0  (BackBone3-1)
2011-05-08 03:34:01 Local7.Critical aa.bb.cc.d 207: Pool: Processor  Free: 150280  Cause: Memory fragmentation  (BackBone3-1)
2011-05-08 03:34:01 Local7.Critical aa.bb.cc.d 208: Alternate Pool: None  Free: 0  Cause: No Alternate pool  (BackBone3-1)
2011-05-08 03:34:01 Local7.Critical aa.bb.cc.d 209:  -Process= "HRPC qos request handler", ipl= 0, pid= 146 (BackBone3-1)
2011-05-08 03:34:01 Local7.Critical aa.bb.cc.d 210: -Traceback= 2726D08 2729228 272948C 27FE890 1772C14 1772F58 1B2B7A8 1B22278 (BackBone3-1)
2011-05-08 03:34:32 Local7.Debug aa.bb.cc.d 211: 000154: .May  8 08:34:29: %SPANTREE-7-RECV_1Q_NON_TRUNK: Received 802.1Q BPDU on non trunk GigabitEthernet3/0/28 VLAN1.
2011-05-08 03:34:32 Local7.Debug aa.bb.cc.d 212: 000155: .May  8 08:34:29: %SPANTREE-7-BLOCK_PORT_TYPE: Blocking GigabitEthernet3/0/28 on VLAN0001. Inconsistent port type.
2011-05-08 03:34:32 Local7.Debug aa.bb.cc.d 213: 000156: .May  8 08:34:29: %SPANTREE-7-RECV_1Q_NON_TRUNK: Received 802.1Q BPDU on non trunk GigabitEthernet3/0/27 VLAN1.
2011-05-08 03:34:32 Local7.Debug aa.bb.cc.d 214: 000157: .May  8 08:34:29: %SPANTREE-7-BLOCK_PORT_TYPE: Blocking GigabitEthernet3/0/27 on VLAN0001. Inconsistent port type.
2011-05-08 03:34:47 Local7.Critical aa.bb.cc.d 215: 000158: .May  8 08:34:44: %SPANTREE-2-UNBLOCK_CONSIST_PORT: Unblocking GigabitEthernet3/0/28 on VLAN0001. Port consistency restored.
2011-05-08 03:34:47 Local7.Critical aa.bb.cc.d 216: 000159: .May  8 08:34:44: %SPANTREE-2-UNBLOCK_CONSIST_PORT: Unblocking GigabitEthernet3/0/27 on VLAN0001. Port consistency restored.
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 322: 000015: May  8 08:38:16: %PLATFORM-1-CRASHED: System previously crashed with the following message: (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 323: 000016: May  8 08:38:16: %PLATFORM-1-CRASHED: Cisco IOS Software, C3750 Software (C3750-IPBASE-M), Version 12.2(52)SE, RELEASE SOFTWARE (fc3) (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 324: 000017: May  8 08:38:16: %PLATFORM-1-CRASHED: Copyright (c) 1986-2009 by Cisco Systems, Inc. (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 325: 000018: May  8 08:38:16: %PLATFORM-1-CRASHED: Compiled Fri 25-Sep-09 02:53 by sasyamal (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 326: 000019: May  8 08:38:16: %PLATFORM-1-CRASHED:   (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 327: 000020: May  8 08:38:16: %PLATFORM-1-CRASHED: Debug Exception (Could be NULL pointer dereference) Exception (0x2000)! (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 328: 000021: May  8 08:38:16: %PLATFORM-1-CRASHED:   (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 329: 000022: May  8 08:38:16: %PLATFORM-1-CRASHED: SRR0 = 0x0274148C  SRR1 = 0x00029230  SRR2 = 0x017628EC  SRR3 = 0x00021000 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 330: 000023: May  8 08:38:16: %PLATFORM-1-CRASHED: ESR = 0x00000000  DEAR = 0x00000000  TSR = 0x8C000000  DBSR = 0x10000000 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 331: 000024: May  8 08:38:16: %PLATFORM-1-CRASHED:   (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 332: 000025: May  8 08:38:16: %PLATFORM-1-CRASHED: CPU Register Context: (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 333: 000026: May  8 08:38:16: %PLATFORM-1-CRASHED: Vector = 0x00002000  PC = 0x01B2981C  MSR = 0x00029230  CR = 0x30000003 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 334: 000027: May  8 08:38:16: %PLATFORM-1-CRASHED: LR = 0x01B297E0  CTR = 0x024833DC  XER = 0xC0000000 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 335: 000028: May  8 08:38:16: %PLATFORM-1-CRASHED: R0 = 0x01B297E0  R1 = 0x04493DC0  R2 = 0x00000000  R3 = 0x045DD698 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 336: 000029: May  8 08:38:16: %PLATFORM-1-CRASHED: R4 = 0xFFFFFFFE  R5 = 0x00000000  R6 = 0x04493DA8  R7 = 0x00000000 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 337: 000030: May  8 08:38:16: %PLATFORM-1-CRASHED: R8 = 0x00000000  R9 = 0x00000000  R10 = 0x031F0000  R11 = 0x045DD694 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 338: 000031: May  8 08:38:16: %PLATFORM-1-CRASHED: R12 = 0x00000000  R13 = 0x00110000  R14 = 0x01772EA4  R15 = 0x00000000 (BackBone3-1)
2011-05-08 03:38:18 Local7.Alert aa.bb.cc.d 339: 000032: May  8 08:38:16: %PLATFORM-1-CRASHED: R16 = 0x00000000  R17 = 0x00000000  R18 = 0x00000000  R19 = 0x00000000 (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 340: 000033: May  8 08:38:16: %PLATFORM-1-CRASHED: R20 = 0x00000000  R21 = 0x00000000  R22 = 0x00000000  R23 = 0x00000000 (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 341: 000034: May  8 08:38:16: %PLATFORM-1-CRASHED: R24 = 0x01974A04  R25 = 0x00000004  R26 = 0x00000025  R27 = 0x03F7BEB0 (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 342: 000035: May  8 08:38:16: %PLATFORM-1-CRASHED: R28 = 0x00000000  R29 = 0x00004D70  R30 = 0x00000000  R31 = 0x00000000 (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 343: 000036: May  8 08:38:16: %PLATFORM-1-CRASHED:   (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 344: 000037: May  8 08:38:16: %PLATFORM-1-CRASHED: Stack trace: (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 345: 000038: May  8 08:38:16: %PLATFORM-1-CRASHED: PC = 0x01B2981C, SP = 0x04493DC0 (BackBone3-1)
2011-05-08 03:38:19 Local7.Alert aa.bb.cc.d 346: 000039: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 00: SP = 0x04493DD0    PC = 0x01B297E0 (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 347: 000040: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 01: SP = 0x04493DD8    PC = 0x0176E670 (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 348: 000041: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 02: SP = 0x04493E10    PC = 0x01772C5C (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 349: 000042: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 03: SP = 0x04493E30    PC = 0x01772F58 (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 350: 000043: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 04: SP = 0x04493E38    PC = 0x01B2B7A8 (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 351: 000044: May  8 08:38:16: %PLATFORM-1-CRASHED: Frame 05: SP = 0x00000000    PC = 0x01B22278 (BackBone3-1)
2011-05-08 03:38:20 Local7.Alert aa.bb.cc.d 352: 000045: May  8 08:38:16: %PLATFORM-1-CRASHED:   (BackBone3-1)

Each time, the logs claimed the issue was memory fragmentation. I could see that I had plenty of free memory, and one would expect a complete stack reboot to resolve any fragmentation, yet I was still getting crashes.

Finally, I shut down a SPAN session that was mirroring a very busy VLAN to port 12 on member 1, and the crashes stopped.

However, I had been running that SPAN for over a week before the first crash. Once the switch seemed stable (3+ days with no crashes), I turned the SPAN back on, trying to force a crash and prove it was the source of the trouble. The problem is that it has now been running for more than three days and everything seems fine.

I added a couple of counters to my charts to watch the largest free block for processor memory and for I/O memory, and neither shows any significant impact from restarting the SPAN session.
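
(For reference, the chart counters come from a small poller along these lines - a minimal sketch that assumes the net-snmp command-line tools and the standard CISCO-MEMORY-POOL-MIB objects ciscoMemoryPoolFree and ciscoMemoryPoolLargestFree. The host, community string, and pool indexes are placeholders; 1 = Processor and 2 = I/O is the usual indexing, but confirm it with an snmpwalk first.)

#!/usr/bin/env python3
# Sketch of an SNMP poller for total free memory and largest free block.
# Assumes the net-snmp 'snmpget' utility is installed; HOST/COMMUNITY are
# placeholders, and the pool indexes (1 = Processor, 2 = I/O) should be
# confirmed with an snmpwalk of 1.3.6.1.4.1.9.9.48 first.
import subprocess

HOST = "aa.bb.cc.d"       # stack management IP (placeholder)
COMMUNITY = "public"      # read-only community string (placeholder)

# CISCO-MEMORY-POOL-MIB columns: .6 = ciscoMemoryPoolFree, .7 = ciscoMemoryPoolLargestFree
BASE = "1.3.6.1.4.1.9.9.48.1.1.1"
OIDS = {
    "processor_free":    BASE + ".6.1",
    "processor_largest": BASE + ".7.1",
    "io_free":           BASE + ".6.2",
    "io_largest":        BASE + ".7.2",
}

def snmp_get(oid):
    """Fetch a single integer value with 'snmpget -Oqv'."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, oid],
        text=True,
    )
    return int(out.strip())

if __name__ == "__main__":
    for name, oid in OIDS.items():
        print("%s: %d bytes" % (name, snmp_get(oid)))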

I've pasted 'show mem' and 'show buffers' output into the Output Interpreter before restarting the SPAN session, right after restarting it, and again after it had been running a while, and it says the same things in every case (even immediately after the stack reboot):

(Show Mem results)

ERROR: This device may be experiencing a processor memory fragmentation issue.
This problem occurs when memory is fragmented into very small blocks, to such an
extent that processes which need a larger block of memory are unable to get the
memory they require. This may happen when the router gets very low on memory, because
there is no de-fragmentation routine in the IOS.

(show buffers results)
SHOW BUFFERS ANALYSIS
ERROR: Since it's last reload, this router has created or maintained a relatively
large number of 'VeryBig buffers' yet still has very few free buffers.

ERROR: Since it's last reload, this router has created or maintained a relatively
large number of 'Syslog ED Pool buffers' yet still has very few free buffers.

The above symptoms suggest that a buffer leak has occurred.

Can anyone explain what is going on here?

The first crash, as previously mentioned, was just a case of the switch running out of memory. For the subsequent crashes, it seems that rebooting the stack while a very busy SPAN session was active must have somehow caused severe memory fragmentation or mis-allocation. But if that were the case, I would expect the switch to crash again when I reactivated the SPAN session (which, by the way, had been deactivated by shutting down the destination interface), since the stack had not been rebooted after the SPAN session was stopped, and only a reboot would have cleared up any fragmentation the SPAN session created.

I've attached a file with various mem usage commands to help.

The problem, at this point, seems to have gone away. But I would like to make sure we know how to prevent a repeat, understand what is going on, and have something sensible to tell the bosses.

Thanks for any help you can give.

BTW, I'll keep an eye on the slow leak and track it down when it starts to appear again, but my main worry is its aftermath.

Accepted Solution

Do the memory values match now, in a known state? Does the NMS reflect the same values as the CLI?

Also, are the memory values you are polling only reflecting the stack master switch? Each switch has its own independent memory. It may be that the values you were monitoring weren't for the switch that crashed.


10 Replies

Leo Laohoo
Hall of Fame

If the stack crashes, then chances are there are crashinfo files. Can you post the most recent ones, or the ones with the largest file sizes?

If you believe it's a memory leak, then why continue using the same IOS? Why not try an upgrade? Chances are TAC will recommend the same.

Hi and thanks for the response.

I am attaching the most recent crashinfo_ext file. All of the crash files had a much older date.

As to the question about upgrading the IOS, there are several reasons. A couple are:

First: the stack was up for a year with no problems, and the original single crash (which can be strongly associated with a memory leak) probably doesn't yet justify the change.

Second: after going through about 1,100 bug reports while looking for information on this problem, the version we're on seems the most stable.

Thanks for your help!

"the version we're on seems the most stable."

I'm no expert on the 12.2(52)SE train (I've used it once and we've never had anything like this), but the initial SE release is a "version 0.0". There's a reason they issue rebuilds (like SE1 or SE2), and it's mostly due to bugs.

Your crashinfo doesn't contain much information, but I tend to agree with you that this could be due to a memory leak on a particular "family" of 3750. Can you please provide the output of "sh version"?

Gladly done. Here it is:

Hey Keith,

Thanks for the quick response. Nothing spectacular about the output.

How many switch stacks do you have? Can you try 12.2(53)SE1? I've noticed that there is no rebuild of 12.2(52)SE.

With the data presented, there is not enough to determine the root cause other than that the switch was out of memory. In the crashinfo we see:

000046: May  9 13:47:00: %SYS-2-MALLOCFAIL: Memory allocation of 38128 bytes failed from 0x199F1F4, alignment 0
Pool: Processor  Free: 152088  Cause: Memory fragmentation
Alternate Pool: None  Free: 0  Cause: No Alternate pool
-Process= "HQM Stack Process", ipl= 0, pid= 145
-Traceback= 2726D08 2729228 272948C 27FE890 199F1F8 1974834 1B2B7A8 1B22278

000047: May  9 13:47:18: %PLATFORM_RPC-0-RESOURCE_CRASH: System is unable to alloc memory for RPC                                                             

With memory allocation failures, a lot of the time the process reported is not the problem but a victim. The cause reported is almost always fragmentation, but normally the problem is really a lack of memory. For instance, in the above output a process requested roughly 37 KB (38,128 bytes) of memory, and the request failed. The total amount of free memory is about 152 KB, and we know the largest free block is smaller than the request, since the request failed. The problem is reported as fragmentation because there is enough free memory in total, just no contiguous block large enough. Really, though, the issue is that the switch has very little free memory at all.
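
To put that arithmetic another way, here is a toy sketch of the decision (not actual IOS code; the largest-block figure is made up, since the crashinfo does not report it):

# Toy illustration of why this failure is labeled "fragmentation" even though
# the real problem is low memory. The numbers come from the crashinfo above,
# except largest_free_block, which is a made-up value known only to be
# smaller than the 38128-byte request.

def classify_malloc_failure(request, total_free, largest_free_block):
    """Return the cause string a MALLOCFAIL-style message would report."""
    if request <= largest_free_block:
        return "allocation succeeds"
    if total_free >= request:
        # Enough memory in total, but no single contiguous block is big enough.
        return "Memory fragmentation"
    return "Not enough total memory"

print(classify_malloc_failure(request=38128,
                              total_free=152088,
                              largest_free_block=20000))
# -> "Memory fragmentation", even though only ~152 KB of the Processor pool
#    is left at all, which is the real problem.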

There are no other memory details in the crashinfo, so we do not know which process was holding all of the memory. It could be the process reported, but typically once free memory is that low, nearly every process that requests memory will have its request fail.

Good Morning,

Thank you all for the responses.

Leo (I think),

This is the only 3750 stack we have in operation - it is our core stack. We're not really in a position to update the IOS at the moment.

George,

The strange thing about the report - one of the things that confused me all along - is that the particular crash for which the crash file was posted occurred while I was running an SNMP monitor of both free processor memory and free I/O memory. According to those counters, free processor memory was nearly 40 MB. The update interval for that chart was about 20 seconds. What I was NOT charting at that point was the largest free block of processor and I/O memory. (I've since added them.)

So this suggests one of the following:

1. In a period of less than 20 seconds, we went from 40 MB available to less than 200 KB; or

2. The CiscoMemPoolFree counter is not displaying the same information as the memory pool referred to in the crash file; or

3. The CiscoMemPoolFree counter is just flat-out wrong; or

4. The CiscoMemPoolFree counter only shows the memory pool for the whole stack (which seems natural), AND

    the crash file shows the actual memory pool for the specific switch that crashed, AND

    by design or by bug, the crashed switch did not have access to the overall stack memory pool (in which case the SNMP counter is pretty useless, since it doesn't show a meaningful value).

Is this a correct synopsis?

Does anyone know which is true?

If it is 4, how can I monitor the state of the switch memory pools so I can see if problems are coming?

How does any of this tie in to the fact that:

   a: the first crash occurred with a very busy SPAN session active and visibly low memory (from SNMP);   BUT

   b: subsequent crashes occurred with the same SPAN session active, but SNMP memory looked OK;   HOWEVER

   c: once I disabled the SPAN (via a shut on the destination port), the crashes ceased;   AND

   d: even though I've reactivated the same SPAN session and parameters, no further crashes - which had been happening a couple of times a day - have occurred.

That's big question #2. (Question #1 is: are the SNMP counters reliable for watching for this problem on any/all switches in the stack?) How did SPAN play into this, since the evidence strongly suggests that it did?

I have to understand whether I am putting the stack's stability at risk by SPANning busy VLANs to my NMS.

Thanks for your help

Do the memory values match now, in a known state? Does the NMS reflect the same values as the CLI?

Also, are the memory values you are polling only reflecting the stack master switch? Each switch has its own independent memory. It may be that the values you were monitoring weren't for the switch that crashed.

Good thought - I didn't think to check the obvious!

I checked, and the values displayed by the NMS are well within 1% of the values shown from the CLI, so that looks OK.

However, when I issued a 'session 1' command and repeated 'sh mem sum', the I/O memory was still pretty much spot-on, but the processor memory was 10% different from what the NMS showed. Oddly, if I transposed the used vs. free values between the NMS and the CLI, the values were again within 1% (i.e., 1 - NMS ProcMemUsed/CLI ProcMemFree and 1 - NMS ProcMemFree/CLI ProcMemUsed). I am thinking this has to be a coincidence, and that the values displayed at the NMS only apply to the stack commander and not the individual switches.
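
(To make that "transposed" pattern concrete, here is the arithmetic with purely hypothetical numbers - these are not my actual readings:)

# Hypothetical numbers only, to illustrate the transposed used/free pattern;
# not the actual readings from this stack.
cli_proc_free = 58_000_000   # 'sh mem sum' free on the member I sessioned to
cli_proc_used = 64_000_000   # 'sh mem sum' used on that member
nms_proc_free = 63_500_000   # free as reported by the NMS poll
nms_proc_used = 58_400_000   # used as reported by the NMS poll

# Straight comparison of free vs. free is ~10% off...
print(abs(1 - nms_proc_free / cli_proc_free))   # ~0.095

# ...but comparing NMS used against CLI free (and vice versa) is within ~1%.
print(abs(1 - nms_proc_used / cli_proc_free))   # ~0.007
print(abs(1 - nms_proc_free / cli_proc_used))   # ~0.008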

After connecting a session to each switch via the CLI, this indeed seems to be the case. Each one had somewhat different values for processor free and used memory. I/O values were nearly identical from switch to switch, though.

Unfortunately, this stinks, because it appears there is no way to monitor individual switch memory except via the CLI. You can't poll the individual switches, because they lose their IP identity in the stack.

Does anyone know of a way - via SNMP - to monitor these values on a per-switch basis?
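
(For anyone who wants to check their own stack: this is roughly how I'd script the test. It is a sketch that assumes the net-snmp tools with the Cisco MIB files loaded, and it simply checks whether CISCO-ENHANCED-MEMPOOL-MIB's cempMemPoolTable - which is indexed by entPhysicalIndex and so could break pools out per member - is populated at all on the image. It may well come back empty here.)

#!/usr/bin/env python3
# Sketch: see whether per-member memory pools are exposed via SNMP.
# Assumes net-snmp 'snmpwalk' with the Cisco MIB files loaded; HOST and
# COMMUNITY are placeholders. If cempMemPoolFree returns nothing, only the
# single CISCO-MEMORY-POOL-MIB view (stack master) is available.
import subprocess

HOST = "aa.bb.cc.d"       # stack management IP (placeholder)
COMMUNITY = "public"      # read-only community string (placeholder)

def walk(obj):
    """Return snmpwalk output lines for one MIB object, or [] on failure."""
    try:
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, HOST, obj],
            text=True, stderr=subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError:
        return []
    return [line for line in out.splitlines() if "No Such" not in line]

if __name__ == "__main__":
    pools = walk("CISCO-ENHANCED-MEMPOOL-MIB::cempMemPoolFree")
    if not pools:
        print("cempMemPoolTable not populated; no per-member memory via SNMP.")
    else:
        # The index after cempMemPoolFree is an entPhysicalIndex; match it
        # against entPhysicalName to see which stack member each row is.
        for line in pools + walk("ENTITY-MIB::entPhysicalName"):
            print(line)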

thanks

I re-walked the entire 3750 stack MIB looking for any OIDs that break out memory counters for the individual stack members, and didn't have any luck.

So the next question is: what would be the effect of changing stack masters every hour, or every 15 minutes, so the counters could report the memory of each member at intervals?

Would this be detrimental to stack operation in any way?

Can this be done via SNMP (if so I can write a script to accomplish the task)?

Has anyone tried this before?

Thanks for your input...
