SYS-2-MALLOCFAIL: Memory allocation of %d bytes failed

hramtsov
Level 1

Greetings!

We started to receive SYS-2-MALLOCFAIL messages on our CAT3550-12T.

#sh ver | i IOS

IOS (tm) C3550 Software (C3550-I5K2L2Q3-M), Version 12.1(13)EA1c, RELEASE SOFTWARE (fc1)

Messages:

Dec 13 19:11:06: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0

Pool: I/O Free: 9868 Cause: Memory fragmentation

Alternate Pool: None Free: 0 Cause: No Alternate pool

-Process= "Pool Manager", ipl= 0, pid= 6

-Traceback= 1A647C 1A7AA0 1627E8 1B389C 1B3AE4 1C70E0

Dec 13 20:43:53: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0

Pool: I/O Free: 11492 Cause: Memory fragmentation

Alternate Pool: None Free: 0 Cause: No Alternate pool

-Process= "Pool Manager", ipl= 0, pid= 6

-Traceback= 1A647C 1A7AA0 1627E8 1B389C 1B3AE4 1C70E0

Is that software bug or something else?

Best regards,

Dmitry N. Hramtsov
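For anyone reading these messages later, a short Python sketch of how the fields can be pulled apart. The sample text is the first message from the post above; the function name `parse_mallocfail` and the dictionary keys are only illustrative, not anything IOS provides. The point it makes is that the pool still has more free bytes than were requested, which is consistent with the reported cause (fragmentation) rather than plain exhaustion.

```python
import re

# First message from the post above, copied as logged.
MSG = """\
Dec 13 19:11:06: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0
Pool: I/O Free: 9868 Cause: Memory fragmentation
Alternate Pool: None Free: 0 Cause: No Alternate pool
-Process= "Pool Manager", ipl= 0, pid= 6
"""

def parse_mallocfail(text):
    """Return the main fields of one MALLOCFAIL message, or None if it does not match."""
    req = re.search(r"Memory allocation of (\d+) bytes failed", text)
    # ^Pool: with re.M matches the primary pool line, not "Alternate Pool:".
    pool = re.search(r"^Pool: (\S+)\s+Free: (\d+)\s+Cause: (.+?)\s*$", text, re.M)
    proc = re.search(r'-Process= "([^"]+)"', text)
    if not (req and pool and proc):
        return None
    return {
        "requested": int(req.group(1)),
        "pool": pool.group(1),
        "free": int(pool.group(2)),
        "cause": pool.group(3),
        "process": proc.group(1),
    }

info = parse_mallocfail(MSG)
# Free bytes exceed the request, so no single contiguous block was large enough.
print(info, info["free"] > info["requested"])
```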

12 Replies

robho
Level 3

Usually not. You need to figure out what led up to this message - topology or config change, etc. I've seen this happen many times when there was a temporary loop. It is much better behavior than crashing the box.

It is difficult to say what led up to this message.

The nearest message in the log is dated about 30 minutes before the MALLOCFAIL.

I don't know whether this helps, but here is a more complete log for the period of interest:

Dec 13 18:31:39: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from 2WAY to DOWN, Neighbor Down: Dead timer expired

Dec 13 18:31:51: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from DOWN to INIT, Received Hello

Dec 13 18:31:51: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from INIT to 2WAY, 2-Way Received

Dec 13 19:11:06: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0

[...SYS-2-MALLOCFAIL details skipped...]

Dec 13 19:12:33: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from 2WAY to DOWN, Neighbor Down: Dead timer expired

Dec 13 19:13:10: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from DOWN to INIT, Received Hello

Dec 13 19:13:10: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from INIT to 2WAY, 2-Way Received

Dec 13 20:43:53: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0

[...SYS-2-MALLOCFAIL details skipped...]

Next messages dated Dec 14.
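To check how far each MALLOCFAIL sits from the previous event, something like the following Python sketch could be run over the log. The sample lines are copied from the excerpt above; the helper names (`parse_events`, `gaps_before`) and the placeholder year are assumptions for illustration, since the syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Lines copied from the log excerpt above (details lines omitted).
LOG = """\
Dec 13 18:31:39: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from 2WAY to DOWN, Neighbor Down: Dead timer expired
Dec 13 18:31:51: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from INIT to 2WAY, 2-Way Received
Dec 13 19:11:06: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0
Dec 13 19:12:33: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from 2WAY to DOWN, Neighbor Down: Dead timer expired
Dec 13 19:13:10: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.213.126 on Vlan54 from INIT to 2WAY, 2-Way Received
Dec 13 20:43:53: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x1627E4, alignment 0
"""

# Timestamp (optional milliseconds), then the %FACILITY-SEVERITY-MNEMONIC tag.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)(?:\.\d+)?: %([A-Z0-9_]+-\d-[A-Z0-9_]+):"
)

def parse_events(text, year=2000):
    """Return (datetime, mnemonic) pairs; `year` is a placeholder, not from the log."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2)))
    return events

def gaps_before(events, target="SYS-2-MALLOCFAIL"):
    """For each target event, report the most recent other event and the time gap."""
    gaps = []
    last_other = None
    for when, mnemonic in events:
        if mnemonic == target:
            if last_other is not None:
                gaps.append((when, last_other[1], when - last_other[0]))
        else:
            last_other = (when, mnemonic)
    return gaps

for when, prev, gap in gaps_before(parse_events(LOG)):
    print(f"{when:%b %d %H:%M:%S}  {gap} after {prev}")
```

A long quiet gap before each failure, as here, suggests the trigger did not leave its own syslog entry, so interface counters or spanning-tree topology-change history may say more than the log does.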

Could still be caused by a loop in the network. You will need to check that first. Here is a bug that had the same symptoms, CSCec13716. Your OSPF does not seem stable either, which could be due to the same cause. Try fixing that as well to see if it could be related.

ruwhite
Level 7

I don't think CSCec13716 is related, from looking at the bug (listed as unreproducible). It sounds like you just have a memory fragmentation problem in the input/output pool of memory, so that's where I would attack the problem first. Most likely, you've gotten a lot of receive or through traffic in small packets, and enough of those packets are still being held for some reason to fragment the I/O memory.

So, what I'd most likely do here is to start looking at show buffers old and show buffers input-interface to see if I could figure out what sorts of packets are being held in memory for a long time, then figuring out what I could do about it--are they all being received? Is there something throwing a lot of traffic at the router (a virus or an attacker)?

Beyond this, I would restart the box so the memory fragmentation problem goes away, and I could start with a clean slate. This may just be a residue of some earlier virus, attack, or network condition which no longer exists.

:-)

Russ.W

Hello Russ,

> It sounds like you just have a memory fragmentation problem

I agree with you, but AFAIK it should not happen. Is it possible that this problem appears because of incorrect software behavior?

I tried "show buffers" as you suggested.

As far as I can see, there are no "old" buffers. Maybe there were some at the moment the MALLOCFAIL occurred, who knows.

#show buffers old

Header DataArea Pool Rcnt Size Link Enc Flags Input Output

"show buffers failures" returns interesting results:

#show buffers failures

Caller Pool Size When

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

0x2AF13C Big 1524 1d01h

Best regards,

Dmitry N. Hramtsov

Then the network condition that caused this doesn't exist any longer.... It could have been any number of things, but it's going to be impossible to track down now that it's gone.

:-)

Russ.W

Like I said, I have seen many, many times that this was caused by a temporary loop. These malloc errors could be caused by anything. A detailed technical reason can be given but will not provide an answer as to why??? The bug I had mentioned was not a root cause. Did you check the U-comments? It turned out to be caused by a loop.

It could have been, or a Code Red attack, or some DDoS attack on the router itself, or a number of other things. At this point, I'd say reload the router to clear the fragmentation, and watch the router for a week or two to see what the fragmentation level looks like.

If you see a serious upswing in the fragmentation level in i/o memory over a very short period of time, then I would look back at any other logs and see what other events have occurred around the same time as the increase in fragmentation--routing changes, etc. If you see a small, but constantly climbing fragmentation over a longer period of time, I would be more likely to suspect a defect (but not necessarily).

So, what we really need here is more information--reload the router, and watch.

:-)

Russ.W
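The distinction Russ draws above, a sharp jump in fragmentation over a short period versus a slow, steady climb, can be sketched in Python. Everything here is an assumption for illustration: the fragmentation ratio (one minus largest free block over total free bytes, as might be read from the Free(b) and Largest(b) columns of `show memory` for the I/O pool), the thresholds, and the function names. It is a way to frame the watching, not a diagnostic rule.

```python
def fragmentation(free_bytes, largest_block_bytes):
    """Share of free memory that is NOT in the largest contiguous block."""
    if free_bytes <= 0:
        return 0.0
    return 1.0 - largest_block_bytes / free_bytes

def classify_trend(samples, spike_step=0.30, drift_total=0.15):
    """Classify samples of (hours_since_reload, fragmentation), given in time order.

    spike_step and drift_total are hypothetical thresholds; tune them to the box.
    """
    if len(samples) < 2:
        return "stable"
    values = [frag for _, frag in samples]
    steps = [b - a for a, b in zip(values, values[1:])]
    if max(steps) >= spike_step:
        # Large rise between two consecutive samples: correlate with other logs.
        return "spike"
    if values[-1] - values[0] >= drift_total:
        # Slow overall climb with no single large step: a defect becomes more likely.
        return "drift"
    return "stable"

# Made-up sample series, purely to exercise the function.
print(classify_trend([(0, 0.10), (24, 0.10), (25, 0.60), (48, 0.60)]))
print(classify_trend([(0, 0.10), (24, 0.15), (48, 0.20), (72, 0.25), (96, 0.30)]))
```

A "spike" result is the cue to look at routing changes and other events around that time; a "drift" result is the case where a software defect is more plausible, though not certain.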

Hello,

Interesting post.

I had the same phenomenon at a customer site one week ago.

Here are the traces found on four 3550s.

(version : IOS (tm) C3550 Software (C3550-I9K2L2Q3-M), Version 12.1(14)EA1a )

Dec 8 15:03:18.215: %SYS-2-MALLOCFAIL: Memory allocation of 1680 bytes failed from 0x164AD4, alignment 0

Pool: I/O Free: 2712 Cause: Memory fragmentation

Alternate Pool: None Free: 0 Cause: No Alternate pool

-Process= "Pool Manager", ipl= 0, pid= 5

-Traceback= 1ABD2C 1AD358 164AD8 1B933C 1B9584 1CCCD8

After investigation, it sounds like a temporary loop, as ruwhite says.

Are you using RSTP?

Regards.

I had a similar problem on an edge router, a c851. We found loops in the client's network.

%SYS-2-MALLOCFAIL: Memory allocation

I had this problem too!

 

Sep 2 14:30:55.323: %SYS-2-MALLOCFAIL: Memory allocation of 756 bytes failed from 0x18911CC, alignment 16 (GCPPC-C18-01-3)
Pool: I/O Free: 45920 Cause: Memory fragmentation 
Alternate Pool: None Free: 0 Cause: No Alternate pool 
-Process= "Pool Manager", ipl= 6, pid= 5 
-Traceback= 165F778z 16670C0z 18911CCz 167F014z 167F26Cz 
Sep 2 14:31:24.862: %SYS-2-MALLOCFAIL: Memory allocation of 756 bytes failed from 0x18911CC, alignment 16 
Pool: I/O Free: 38272 Cause: Memory fragmentation 
Alternate Pool: None Free: 0 Cause: No Alternate pool 
-Process= "Pool Manager", ipl= 6, pid= 5
-Traceback= 165F778z 16670C0z 18911CCz 167F014z 167F26Cz 
Sep 2 14:31:25.375: %SYS-2-MALLOCFAIL: Memory allocation of 756 bytes failed from 0x18911CC, alignment 16 
Pool: I/O Free: 45920 Cause: Memory fragmentation 
Alternate Pool: None Free: 0 Cause: No Alternate pool 
-Process= "Pool Manager", ipl= 6, pid= 5 
-Traceback= 165F778z 16670C0z 18911CCz 167F014z 167F26Cz 

 

 

It makes my AP go down T-T

 

 

uriel1211100318
Level 1

This thread is ages old, but I guess some people might still have this issue. On a Cat 4510 I had the %SYS-2-MALLOCFAIL log too. I was unable to read anything from the standby supervisor, which I verified as follows:

attach module <id_of_standby_sup>

dir nvram:

show flash:

Those commands either didn't work or returned errors when trying to read.

Someone in my department found this solution: reset the card with a power-cycle:

# hw-mod slot <id_of_standby_sup> reset power-cycle

 

To check which supervisor is active and which is standby, run this command:

# show module

Mod  Redundancy role     Operating mode      Redundancy status
----+-------------------+-------------------+----------------------------------
 5   Standby Supervisor  SSO                 Standby hot
 6   Active Supervisor   SSO                 Active

 

As you can see, module 5 is my standby, so for me the command was as follows:

# hw-mod slot 5 reset power-cycle
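If you collect the `show module` output with a script, picking out the standby slot can be sketched in Python as below. The sample text is the redundancy table from this post; the function names are only illustrative, and the command string is the one quoted in this post, printed for review rather than sent to the switch.

```python
import re

# Redundancy section of "show module", as quoted in this post.
SHOW_MODULE = """\
Mod Redundancy role Operating mode Redundancy status
----+-------------------+-------------------+----------------------------------
5 Standby Supervisor SSO Standby hot
6 Active Supervisor SSO Active
"""

ROLE_RE = re.compile(r"^\s*(\d+)\s+(Standby|Active) Supervisor\b", re.M)

def supervisor_roles(text):
    """Map 'standby' / 'active' to the slot number found in the table."""
    return {role.lower(): int(slot) for slot, role in ROLE_RE.findall(text)}

def reset_command(text):
    """Build the reset command for the standby slot, or None if no standby is listed."""
    standby = supervisor_roles(text).get("standby")
    if standby is None:
        return None
    # Command form taken from the post above; confirm it on your platform first.
    return f"hw-mod slot {standby} reset power-cycle"

print(supervisor_roles(SHOW_MODULE))
print(reset_command(SHOW_MODULE))
```

Returning None when no standby row is present is deliberate, so a script never resets a slot it could not positively identify as the standby.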

 

After this, the errors stopped.

Hope this is still useful for someone.