This document explains the punt fabric data path failure symptom occasionally seen on the ASR9000 platform.
What are DIAG messages
DIAG messages are sent from EACH of the RSPs down to all NPUs in the system, similar to a ping packet. Each NPU answers the DIAG message and sends the response back to the originating RSP.
If a number of DIAG packets are not received within a particular period of time, this error message is raised.
The underlying issue can be manifold: it can range from a faulty NPU or an overloaded NPU to other faulty FPGAs. In this document we'll highlight a few of the causes, how they can be identified, and what to do next.
This picture visualizes how the NP DIAG messages flow. These are just two examples, from the active and standby RSP to an individual NPU, but it needs to be understood that BOTH RSPs send DIAG messages to ALL NPUs.
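The SET/CLEAR behavior of this health check can be sketched as a simple monitor. This is an illustrative Python model only, not the platform's actual implementation; the three-miss threshold is a hypothetical value.

```python
# Illustrative model of the NP DIAG health check. The threshold is
# hypothetical, not the platform's actual value.

FAIL_THRESHOLD = 3  # consecutive missed DIAG replies before raising the fault


class NpuDiagMonitor:
    """Tracks DIAG replies per (RSP, NPU) pair and raises/clears the fault."""

    def __init__(self, rsps, npus):
        # Both RSPs probe all NPUs, so track every (RSP, NPU) combination.
        self.missed = {(rsp, npu): 0 for rsp in rsps for npu in npus}
        self.faulted = set()

    def record_reply(self, rsp, npu, replied):
        key = (rsp, npu)
        if replied:
            self.missed[key] = 0
            if key in self.faulted:
                self.faulted.discard(key)
                return "CLEAR"  # fault condition recovered
        else:
            self.missed[key] += 1
            if self.missed[key] >= FAIL_THRESHOLD and key not in self.faulted:
                self.faulted.add(key)
                return "SET"  # punt fabric data path fault raised
        return None


mon = NpuDiagMonitor(["RSP0", "RSP1"], ["NP0", "NP1", "NP2", "NP3"])
for _ in range(3):
    event = mon.record_reply("RSP0", "NP3", replied=False)
print(event)  # -> SET after the third consecutive miss
```

A single missed reply produces no event; only a run of misses raises SET, and a successful reply afterwards produces the matching CLEAR.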
Understanding the error message
An NP DIAG failure can be hardware, can be software, and can be transient. One of the most important things to correlate from this error is the SET or CLEAR operation in the error message.
This is indicative of an issue with a component common to this linecard, such as the linecard fabric. It could also be that all NPs have a lockup on this card; in that case they would likely need to carry the same feature sets and take the same amount of traffic.
If the issue appears on multiple linecards, it could be the RSP or the central fabric. If it is the central fabric, you'd also expect the standby RSP to report a similar failure. This is not strictly necessary, but it provides extra proof of the suspicion.
A simple command to verify the DIAG packets that are being sent:
RP/0/RSP0/CPU0:A9K-BNG#show controller np count np0 loc 0/0/CPU0 | i DIAG
Mon Apr 15 16:00:26.722 EDT
235 PARSE_RSP_INJ_DIAGS_CNT 3268 0
248 PARSE_LC_INJ_DIAGS_CNT 3260 0
788 DIAGS 3260 0
902 PUNT_DIAGS_RSP_ACT 3268 0
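As a rough sanity check, you can diff two snapshots of these counters: if the DIAG counters stop incrementing, the NPU is no longer seeing or answering the DIAG packets. A hedged Python sketch (the snapshot strings below are abbreviated sample output, not live data):

```python
# Sketch: parse the DIAG counter lines from "show controller np count" output
# and diff two snapshots to confirm the counters are still incrementing.
import re


def parse_diag_counters(cli_output):
    """Return {counter_name: value} for counter lines mentioning DIAG."""
    counters = {}
    for line in cli_output.splitlines():
        m = re.match(r"\s*\d+\s+(\S*DIAG\S*)\s+(\d+)", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


# Two abbreviated sample snapshots taken a few seconds apart (made-up values).
snap1 = """\
235  PARSE_RSP_INJ_DIAGS_CNT  3268  0
248  PARSE_LC_INJ_DIAGS_CNT   3260  0"""
snap2 = """\
235  PARSE_RSP_INJ_DIAGS_CNT  3290  0
248  PARSE_LC_INJ_DIAGS_CNT   3282  0"""

before, after = parse_diag_counters(snap1), parse_diag_counters(snap2)
stalled = [name for name in before if after.get(name, 0) <= before[name]]
print(stalled)  # -> [] : every DIAG counter incremented between snapshots
```

Any counter name appearing in `stalled` would point at a DIAG flow that has stopped.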
Set followed by a clear
If the message is SET but subsequently followed by a CLEAR, there is no direct need for concern. While not ideal, this can be ignored and no further action is required.
If it continues to happen, capture the outputs from the suggestions below and report them in a case.
It could mean that the NPU was temporarily overloaded.
In oversubscription scenarios it needs to be known that the DIAG packets are high priority and should get preferred service, but due to timing one could be missed, resulting in an erroneous DIAG failure.
In XR 4.1 we had a string of NP lockup software faults. This means that a thread inside the NPU enters an endless loop that it can't break out of. The NPU lockup eventually blocks all threads and stages inside the NPU and prevents it from forwarding traffic, including DIAG packets.
If you are running XR 4.1, make sure you are on at least 4.1.2 and run the latest SMU pack to take care of NPU lockups.
That release also has the right instrumentation in place to derive the reason for a lockup and bring it to closure.
Lockups have not been seen in XR 4.2.x or XR 4.3.x.
If an NPU fails for whatever reason, besides traffic impact you'll see a datapath failure message for this NPU only.
It is also important to check whether traffic on the ports served by this NPU is affected.
NPUs failing that belong to the same bridge might suggest a problem with the bridge FPGA.
This is uncommon, but it can happen; we usually see bridge CRC errors alongside.
If multiple NPUs that report to the same FIA are failing, it may very well be an FIA failure. We would usually see sync-loss messages on the SerDes (see the show commands below) or a fabric path failure on that port.
A reseat of the LC is recommended, or a move to a different slot if the reseat does not resolve the issue.
If it still fails, replace the LC.
If ALL NPUs are generating an alert, and especially when it is reported by BOTH RSPs, it is a clear indication of an LC failure.
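The triage logic above (single NPU vs. bridge vs. FIA vs. whole LC) can be summarized in a short sketch. The NPU-to-bridge/FIA topology below is purely hypothetical; use show controller np ports and the fabric show commands to get the real layout of your line card.

```python
# Hedged sketch of the failure-scope triage described above.
# The topology mapping is hypothetical, for illustration only.

NPU_TOPOLOGY = {  # npu: (bridge, fia)
    "NP0": ("bridge0", "fia0"),
    "NP1": ("bridge0", "fia0"),
    "NP2": ("bridge1", "fia0"),
    "NP3": ("bridge1", "fia0"),
    "NP4": ("bridge2", "fia1"),
    "NP5": ("bridge2", "fia1"),
    "NP6": ("bridge3", "fia1"),
    "NP7": ("bridge3", "fia1"),
}


def suspect_component(failing_npus):
    """Map the set of failing NPUs to the most likely faulty component."""
    failing = set(failing_npus)
    if failing == set(NPU_TOPOLOGY):
        return "line card"          # every NPU alerting: LC-wide failure
    if len(failing) == 1:
        return "single NPU"
    bridges = {NPU_TOPOLOGY[n][0] for n in failing}
    if len(bridges) == 1:
        return f"bridge FPGA ({bridges.pop()})"
    fias = {NPU_TOPOLOGY[n][1] for n in failing}
    if len(fias) == 1:
        return f"FIA ({fias.pop()})"
    return "RSP or central fabric path"


print(suspect_component(["NP0", "NP1", "NP2", "NP3"]))  # -> FIA (fia0)
```

The checks run from narrowest to widest scope: all NPUs, then a single NPU, then a shared bridge, then a shared FIA, and finally the RSP/central fabric.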
RSP or Fabric path failure
If there is a complete RSP or fabric failure, all linecards will (eventually) report the error.
It can mean an affected fabric, an affected backplane, or a bad RSP.
Though I am personally aware of one chassis swap, this is not a common scenario and a chassis swap is strongly recommended against.
A reseat or replacement of the RSP in question is the first course of action.
LC NPU loopback test has failed
This is a failure in the line card local punt path.
LC/0/7/CPU0:Aug 18 19:17:26.924 : pfm_node: %PLATFORM-PFM_DIAGS-2-LC_NP_LOOPBACK_FAILED : Set|online_diag_lc|Line card NPU loopback Test(0x2000006)|link failure mask is 0x8.
This means the test failed to get a loopback packet from NP3: "link failure mask is 0x8", i.e. bit 3 is set ==> NP3.
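The failure mask is a plain bitmap, so decoding which NPs failed is mechanical. A small illustrative helper:

```python
# Decode the "link failure mask" from the LC_NP_LOOPBACK_FAILED message:
# each set bit corresponds to one NP that failed the loopback test.


def failed_nps(mask):
    """Return the list of NP numbers whose bit is set in the failure mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


print(failed_nps(0x8))   # -> [3]     : bit 3 set ==> NP3
print(failed_nps(0x0A))  # -> [1, 3]  : NP1 and NP3
```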
If this issue ever happens on an ASR9000 router, it typically happens within the first month of a system being deployed and active. The possibility of a later occurrence is minimal.
If this issue is observed on your ASR9000, there is no need for a hardware replacement (i.e no need for RMA).
Cisco is working on the fix through CSCuj10837. Once the fix is verified, Cisco will make it available to customers through a SMU.
- Reseat the line card showing fabric re-train.
- Move the affected line card closer to the RSP.
ASR9922 SFC upgrade failure
The ASR9K A99-SFC2 Fabric Card (FC) can be down (IN-RESET) after an IOS XR Release 5.3.3 Field Programmable Device (FPD) upgrade. With the fabric down, punt diags can fail as the fabric is not properly forwarding traffic.
If an A99-SFC2 does not have enough disk1a: space available during an IOS XR Release 5.3.3 FPD upgrade, the FC will only be partially upgraded, will fail to boot up, and will remain in an "In-Reset" state after the upgrade.
This issue has been fixed in IOS XR Release 5.3.4 and later by CSCuz13904.
A disk utilization check is needed before starting an IOS XR Release 5.3.3 FPD upgrade to prevent partial upgrades on the A99-SFC2 Fabric Cards.
Ensure disk1a: has enough free space (for example, check with the dir disk1a: command).
If there is not enough space, remove unwanted files by running:
# rm -f ipv4*x86.txt
This is not a sign of a hardware failure but of a failed FPD update, which can be redone once there is enough disk space, as per the guidance above.
What to collect if there is still an issue
If after this explanation there is still an issue that can't be explained by this write-up, and a support case is in order, make sure you collect at minimum the following information from the router when filing that case:
- show install active sum
!gives an overview of the running software version and applied SMUs
- show log (or syslog output from the time of the problem)
!dumps the logging buffer for correlation with other events
- admin show platform
!provides details of the linecards and types installed in the system
- admin show hw fpd location all
!clarity on whether the right FPD versions are running on the FPGAs
- admin show diagnostic result location 0/0/CPU0 detail
!more details on the diagnostic results that are run inline on the linecard.
- admin sh diagnostic trace engineer location 0/0/cpu0
!extensive detail on the diag traces
- show controllers fabric fia ltrace location 0/0/cpu0
!Ltrace for the fabric interface asic
- show controllers fabric fia stats location 0/0/CPU0
!Fabric interface asic statistics (FIA)
- show controllers fabric fia stats location 0/RSP1/CPU0
!FIA stats from the RSP (inject side)
- sh controllers fabric fia errors egress location 0/0/cpu0
!Reported errors from the FIA
- sh controllers fabric fia link-status location 0/0/cpu0
!Link status details from the FIA
- show controllers fabric fia bridge sync-status location 0/0/cpu0
!Link status between the FIA and the Bridge FPGA
- show pfm location 0/0/cpu0
!Platform Fault Manager details of the linecard
- sh controller np ports all loc 0/0/cpu0
!Port mapping from the NPU to the physical port numbering.
- show pfm location all
!All faults reported in the system
- show tech fabric
!most relevant traces and captures from the fabric. Copy out the result file shown on the console when the command finishes.
- admin sh diagnostic trace error loc 0/rsp0/cpu0
- admin sh diagnostic trace error loc 0/rsp1/cpu0
!Diagnostic traces from both RSP's.
Replace the RSP keyword with 0 or 1 depending on which is the active.
Replace 0/0/CPU0 with the LC in question that is having the issue.
Note: A deliberate "!" sign was added to the explanation of each command to ease copy/pasting into a router console.