
Pedro Morais · ‎07-11-2014

Hello All,

We are doing some scalability tests with the new Typhoon-based hardware (24 x 10G - SE). When configuring sub-interfaces with an egress service-policy applied, we got the following error message:

% Failed to commit one or more configuration items during a pseudo-atomic operation. All changes made have been reverted. Please issue 'show configuration failed' from this session to view the errors

RP/0/RSP0/CPU0:A9K-LAB05(config-subif)#show configuration failed
Fri Jul 11 09:44:59.268 WEST
!! SEMANTIC ERRORS: This configuration was rejected by
!! the system due to semantic errors. The individual
!! errors with each failed configuration command can be
!! found below.

interface TenGigE0/2/0/5.43947
service-policy input SCH_IN_parent_L3_NG1_100M
!!% 'prm_server' detected the 'warning' condition 'An operation that was requested was aborted - data integrity may be compromised.'
!
end

Service policy configuration is the following:

policy-map SCH_OUT_parent_L3_NG1_100M
 class class-default
  service-policy SCH_OUT_child_L3_NG1_100M
  shape average 100032 kbps
 end-policy-map
!
policy-map SCH_OUT_child_L3_NG1_100M
 class EDGE-L3-VOICE
  priority level 1
  police rate 30 mbps
 class EDGE-L3-VIDEO
  priority level 2
  police rate percent 45
 class EDGE-L3-HIGH
  bandwidth remaining percent 40
  random-detect 128 kbytes 256 kbytes
 class EDGE-L3-MED
  bandwidth remaining percent 30
  random-detect 128 kbytes 256 kbytes
 class EDGE-L3-LOW
  bandwidth remaining percent 20
  random-detect 64 kbytes 128 kbytes
 class EDGE-L3-BE-NG1
  bandwidth remaining percent 10
  random-detect 32 kbytes 64 kbytes
 class class-default
 end-policy-map

If we configure more sub-interfaces without the service-policy, the configuration is accepted. Right now we have around 17K sub-interfaces configured:

RP/0/RSP0/CPU0:A9K-LAB05#sh int summary location 0/2/CPU0
Fri Jul 11 11:45:45.005 WEST
Interface Type          Total    UP       Down     Admin Down
--------------          -----    --       ----     ----------
ALL TYPES               17503    17485    0        18
--------------
IFT_TENGETHERNET        24       7        0        17
IFT_VLAN_SUBIF          17479    17478    0        1

We know that we are not hitting queue limits, but we don't know which limit, if any, we are reaching. Can anyone help us understand which limit this is?

Thanks,

Pedro

smilstea · ‎07-11-2014

Hi Pedro,

This error message indicates an error when trying to program the TCAM: the SW and HW values are not the same, hence the data integrity error.

A sub-int can be committed without a QoS policy because only things like ACLs and QoS policies take up entries in the TCAM.

Can you open an SR and ask for me?

Thanks,

Sam

smilstea · ‎07-24-2014

Thanks for the time Pedro.

For closure in the community, here is what we found:

There were 17476 sub-interfaces with the same ingress and egress policies applied. Because the ingress policy had 8 classes and the egress policy had 7 (including class-default), we had (7+8)*17476 = 262140 records allocated in the NP QOS_INTF data structure, out of 262144 available. Adding either service-policy to a new sub-int would push us over the limit, hence the error message.
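The arithmetic above can be checked with a short Python sketch. The numbers are the ones quoted in this post; the variable names are only illustrative and do not correspond to anything on the router.

```python
# Figures quoted in the post; names are illustrative only.
QOS_INTF_CAPACITY = 262144   # records available in the NP QOS_INTF struct
INGRESS_CLASSES = 8          # classes in the ingress policy
EGRESS_CLASSES = 7           # classes in the egress policy (incl. class-default)
SUBINTERFACES = 17476        # sub-interfaces carrying both policies

# One record per class, per applied copy of the policy.
records_per_subif = INGRESS_CLASSES + EGRESS_CLASSES
records_used = records_per_subif * SUBINTERFACES
records_free = QOS_INTF_CAPACITY - records_used

print(records_used)   # 262140
print(records_free)   # 4

# Neither policy fits into the remaining records on one more sub-interface.
print(INGRESS_CLASSES <= records_free)  # False
print(EGRESS_CLASSES <= records_free)   # False
```

With only 4 records free, even the smaller 7-class egress policy cannot be applied to one more sub-interface.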

Looking at the chunks for QoS-EA and WFQ: because the classes did not have large configurations, these resources were not yet exhausted. Different scale numbers apply depending on the LC.

On the TCAM, a service-policy is programmed only once (it takes up resources only once), which is why the TCAM was not exhausted either.

Because of the limit we are hitting, the only real ways to alleviate the issue are to use fewer sub-ints, move some to another NP, or use fewer classes.
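To get a rough feel for the trade-off between class count and sub-interface count, here is a minimal sketch. It assumes the QOS_INTF struct (262144 records, one record per class per applied policy copy, as described above) is the only constraint on a given NP. In practice other limits (queues, chunks, platform sub-interface scale) apply too, so treat the result as an upper bound only. The function name `max_subinterfaces` is hypothetical.

```python
def max_subinterfaces(classes_per_subif: int, capacity: int = 262144) -> int:
    """Upper bound on sub-interfaces per NP if QOS_INTF records were the
    only limit (an assumption; other resources may run out first)."""
    if classes_per_subif <= 0:
        raise ValueError("classes_per_subif must be positive")
    return capacity // classes_per_subif


# 8 ingress + 7 egress classes, as in this case
print(max_subinterfaces(15))  # 17476

# Hypothetical: trimming both policies so they total 8 classes per sub-int
print(max_subinterfaces(8))   # 32768
```

The first result matches the 17476 sub-interfaces at which the commit started to fail.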

*More details on QoS HW resource consumption*

The first QoS configuration in a class creates a TCAM and NP Struct record (including class-default).
Every copy of a QoS service-policy applied creates more NP Struct records, but the TCAM is programmed only once.

The NP Struct is essentially the aggregate of all the classes applied in the queuing ASICs; as long as the class has some QoS configuration, we must allocate the appropriate resources.

Priority (levels 1, 2, and 3)
  affects QoS-EA chunk

Police Rate
  affects QoS-EA internal policer

Queuing (shape, bandwidth guaranteed, and bandwidth remaining)
  affects QoS-EA chunk
  affects WFQ profile chunk

Random Detect
  affects WRED profile chunk
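As an illustration, the mapping above can be applied to the child policy from the original question (SCH_OUT_child_L3_NG1_100M). This is only a sketch: the dictionary keys and the helper `resources_for` are made up for the example, and it covers only the chunk and policer resources listed above, not the TCAM or NP Struct records.

```python
# Feature -> resources touched, taken from the list above.
RESOURCES_BY_FEATURE = {
    "priority": {"QoS-EA chunk"},
    "police": {"QoS-EA internal policer"},
    "shape": {"QoS-EA chunk", "WFQ profile chunk"},
    "bandwidth": {"QoS-EA chunk", "WFQ profile chunk"},
    "bandwidth remaining": {"QoS-EA chunk", "WFQ profile chunk"},
    "random-detect": {"WRED profile chunk"},
}

# Features configured per class in SCH_OUT_child_L3_NG1_100M.
CHILD_POLICY = {
    "EDGE-L3-VOICE": ["priority", "police"],
    "EDGE-L3-VIDEO": ["priority", "police"],
    "EDGE-L3-HIGH": ["bandwidth remaining", "random-detect"],
    "EDGE-L3-MED": ["bandwidth remaining", "random-detect"],
    "EDGE-L3-LOW": ["bandwidth remaining", "random-detect"],
    "EDGE-L3-BE-NG1": ["bandwidth remaining", "random-detect"],
    "class-default": [],  # no QoS actions configured in the child policy
}


def resources_for(features):
    """Union of the resources touched by a class's configured features."""
    used = set()
    for feature in features:
        used |= RESOURCES_BY_FEATURE[feature]
    return used


for class_name, features in CHILD_POLICY.items():
    print(class_name, sorted(resources_for(features)))
```

The two priority classes touch the QoS-EA chunk and the internal policer, while the four bandwidth-remaining classes touch the QoS-EA, WFQ profile, and WRED profile chunks.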

These resources can be checked with the following commands:

show prm server tcam summary 144 qos np0 loc 0/0/cpu0 (144 selects the IPv4 TCAM table; qos limits the output to the QoS application)
show qoshal resource summary np 0 loc 0/0/cpu0 (shows the chunk usage)
show controller np struct 2 np0 loc 0/0/cpu0 (struct 2 is the QOS_INTF struct)

Regards,

Sam

ASR9K - 4.3.4 - 24x10GE-SE