C9800-80 in HA cannot install APSP6

revision6420
Level 1

Hello everyone,

I'm trying to figure out a frustrating issue. 

Attempting an install of APSP6 keeps bombing out and I'm unsure what steps to take next. 

Software: Version 17.09.04a

ROM: 17.3(3r)

Error in question:

 

 

Dec 13 21:21:01 Eastern: %INSTALL-3-OPERATION_ERROR_MESSAGE: Chassis 1 R0/0: install_engine: Failed to install_add package bootflash:C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin, Error: FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin: Improper State./bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin not present. Please restore file for stability.

 

 

 

C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin (APSP1) is not on the controller at all when I try to install C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin (APSP6) 

I have tried from the webui and cli to install the APSP and it fails each time. 

C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin is mentioned in an error each time, so clearly there is something deeper in the controller that is wrong. 

I have attempted failovers and reloads and during bootup, I also get this error. 

 

 

RSA Signed RELEASE Image Signature Verification Successful.

Image validated

Dec 14 01:48:55.694: %BOOT-3-BOOTTIME_SMU_MISSING_DETECTED: R0/0: install_engine: SMU file /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin missing and system impact will be unknown
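For anyone who wants to pull the missing file names out of a saved console or syslog capture, here is a minimal Python sketch. It assumes only the two message wordings quoted in this post; the helper name `find_missing_smus` is made up for illustration, and other releases may word these messages differently.

```python
import re

# The two message shapes quoted in this thread (assumption: other
# IOS XE releases may phrase them differently).
MISSING_PATTERNS = [
    re.compile(r"SMU file (\S+) missing"),
    re.compile(r"(/bootflash/\S+\.bin) not present"),
]

def find_missing_smus(log_text):
    """Return the unique SMU/APSP paths reported as missing, in first-seen order."""
    found = []
    for line in log_text.splitlines():
        for pattern in MISSING_PATTERNS:
            for path in pattern.findall(line):
                if path not in found:
                    found.append(path)
    return found
```

Feeding it both the install error and the boot-time message above should yield a single entry, since both point at the same APSP1 file.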

 

 

 

This is my show install summary 

 

 

 

show install summary
[ Chassis 1/R0 2/R0 ] Installed Package(s) Information:
State (St): I - Inactive, U - Activated & Uncommitted,
            C - Activated & Committed, D - Deactivated & Uncommitted
--------------------------------------------------------------------------------
Type  St   Filename/Version
--------------------------------------------------------------------------------
IMG   C    17.09.04a.0.6

--------------------------------------------------------------------------------
Auto abort timer: inactive
--------------------------------------------------------------------------------
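A rough Python sketch of how that output could be parsed to confirm that only the base image is registered and no SMU or APSP shows up. The column layout is assumed from the output above, and `parse_install_summary` is a hypothetical helper, not part of any Cisco tool.

```python
import re

# One package row: type, state letter (I/U/C/D per the legend),
# then the filename or version.
ROW = re.compile(r"^([A-Z]+)\s+([IUCD])\s+(\S+)\s*$")

def parse_install_summary(output):
    """Return (type, state, name) tuples for each package row in the output."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groups())
    return rows

summary = """\
Type  St   Filename/Version
--------------------------------------------------------------------------------
IMG   C    17.09.04a.0.6
"""
print(parse_install_summary(summary))
```

With the summary above, the only row returned is the committed base image, which is consistent with APSP1 not being listed anywhere even though the installer still asks for it.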

 

 

 

I have even tried reinstalling APSP1. 

Below are the webui logs of that attempt. 

 

 

INSTALL ADD OPERATION:

--- Analyzing file C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin ---
Package Type is APSP
Initiating INSTALL_ADD operation for the package C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin
install_add: START Wed Dec 13 21:28:41 Eastern 2023
install_add: Adding SMU
install_add: Checking whether new add is allowed ....
install_add: install-add is allowed.

--- Starting initial file syncing ---
[1]: Copying bootflash:C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin from chassis 1/R0 to chassis 2/R0
[2]: Finished copying to chassis 2/R0
Info: Finished copying bootflash:C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin to the selected chassis
Finished initial file syncing

--- Starting SMU Add operation ---
Performing SMU_ADD on all members
[1] SMU_ADD package(s) on chassis 1/R0
FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin: Invalid installation package. Version 17.09.03 does not match with 17.09.04a.0.6.
[1] Finished SMU_ADD on chassis 1/R0
[2] SMU_ADD package(s) on chassis 2/R0

FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin: Invalid installation package. Version 17.09.03 does not match with 17.09.04a.0.6.
[2] Finished SMU_ADD on chassis 2/R0
Checking status of SMU_ADD on [1/R0 2/R0]
SMU_ADD: Passed on []. Failed on [1/R0 2/R0]
Finished SMU Add operation

FAILED: install_add exit(1) Wed Dec 13 21:29:53 Eastern 2023

 

 

 

Any help at this point would be greatly appreciated! 

 

 

 

 

 

 

28 Replies

Leo Laohoo
Hall of Fame

@revision6420 wrote: 
C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin not present

That file is not present on the WLC.

Leo Laohoo
Hall of Fame

@revision6420 wrote: 
FAILED: install_add /bootflash/C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin: Invalid installation package. Version 17.09.03 does not match with 17.09.04a.0.6.

SMU file is for 17.9.3 but the controller is on 17.9.4a.  
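To illustrate the mismatch, here is a small Python sketch that reads the release out of a package filename and compares it with the running version string from "show install summary". The filename pattern is inferred from the files named in this thread, the function names are invented for the example, and the installer's real check is not documented here, so treat this as an approximation of what the error message implies.

```python
import re

# Assumed filename shape, taken from the files named in this thread:
#   C9800-universalk9_wlc.<version>.<CSC bug id>.SPA.apsp.bin
NAME = re.compile(r"_wlc\.(\d+\.\d+\.\d+[a-z]?)\.CSC\w+\.SPA\.")

def package_version(filename):
    """Return the release string embedded in an SMU/APSP filename, or None."""
    match = NAME.search(filename)
    return match.group(1) if match else None

def matches_running(filename, running_version):
    """True when the package's release is the leading part of the running version."""
    version = package_version(filename)
    return version is not None and running_version.startswith(version + ".")
```

Under these assumptions the APSP1 file (17.09.03) does not match a controller running 17.09.04a.0.6, while the APSP6 file (17.09.04a) does.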

Rich R
VIP

- Have you tried to use ISSU install?
- Have you previously had that APSP installed with 17.9.3?

Use ISSU at your own risk - this is one of the many problems it can cause.  As Leo recommends - always have TAC available on WebEx before starting ISSU install.

My own advice, although officially this is not supposed to be necessary (others and I have found that it is, to avoid exactly this problem):
Before upgrading to a new release, for each installed SMU and APSP you must:
- Deactivate the APSP/SMU
- Commit
- "install remove inactive" (or the specific SMU/APSP if you prefer)

If you don't do that, these entries get "stuck" in the install database (and are damn near impossible to remove) and the install script fails, as you've discovered.
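The routine above could be scripted roughly as follows. This is only a sketch: `pre_upgrade_cleanup` is a hypothetical helper that builds the command lines as text, it commits after each deactivation and runs a single "install remove inactive" at the end (my own ordering choice), and the exact `install deactivate file` syntax should be checked against the command reference for your release before use.

```python
def pre_upgrade_cleanup(package_files):
    """Build the CLI lines for the deactivate / commit / remove routine."""
    commands = []
    for name in package_files:
        # Deactivate each SMU/APSP, then commit so the change persists.
        commands.append(f"install deactivate file bootflash:{name}")
        commands.append("install commit")
    if commands:
        # One sweep at the end removes everything that is now inactive.
        commands.append("install remove inactive")
    return commands

for line in pre_upgrade_cleanup(
    ["C9800-universalk9_wlc.17.09.03.CSCwe97460.SPA.apsp.bin"]
):
    print(line)
```

Printing the list first and pasting it by hand keeps a human in the loop, which fits the advice in this thread to watch every step on the CLI.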

How to resolve it now that you're in that state - requires the nuclear option:
https://community.cisco.com/t5/wireless/9800-issu-behavior/m-p/4905112/highlight/true#M259614
9800(config)# service internal
9800# clear install state
9800(config)# no service internal


@Rich R wrote:

Use ISSU at your own risk - this is one of the many problems it can cause.  As Leo recommends - always have TAC available on WebEx before starting ISSU install.


Correct.  

@revision6420

Do not, under any circumstances, perform any ISSU upgrades without a proactive TAC Case.  

TAC needs to be in a WebEx to witness what is going on.  If the ISSU process goes wrong, which it invariably can, a TAC engineer on WebEx with access to the platform can quickly intervene.

ISSU, FSU/xFSU/eFSU is not a one-stop-solution and does not apply to everyone.  In my humble opinion, ISSU, FSU/xFSU/eFSU only works in "corner cases", labs or demos.  If-and-when ISSU, FSU/xFSU/eFSU invariably/eventually fails, the only way to stop the primary/secondary from the-ping-pong-of-death is to physically power down one of the two units.  

Rich,

Thank you for this information. 

Yes, I did, and this is where the problem started. 

I upgraded the system software to 17.09.04a using ISSU, and the first attempt to install APSP6 did go through. 

I did both from the webui.

Everything broke down when I tried to activate APSP6 via the webui: the script failed, and shortly thereafter our HA stability went down. 

Currently, we have a TAC case open, and we are not any closer to recovering the HA. 

One of our HA nodes went into a boot loop, which required breaking HA and booting off a USB via ROMMON, but the unit is stuck in bundle mode. TAC was unable to convert it to install mode. We are unable to set the boot variable and I'm waiting on TAC to investigate now. 

I'm definitely bringing back what you and others have shared with me to my team, as this is not the first time a code upgrade has burned us. 

Sorry, but I couldn't help laughing out loud reading that. Like Leo says, we've all learned that the hard way.  I can't even get ISSU to work reliably in the lab, so we never risk it on production.  We've just given up on ISSU altogether: it causes more problems than it solves. We just plan for a short outage on reload, with an N+1 WLC for the APs to move to during the reload, and we manually deactivate and remove the SMU/APSP before upgrading to avoid this mess.

Good luck sorting it out.  We also ended up having to boot off USB (fortunately in lab) after a similar debacle previously.
I reckon that if you clear the install db on both, boot both in bundle mode, convert them to install mode, and then rejoin HA, that should work.

I also prefer to do it all on the CLI, but I know opinions differ on that.  I am just more confident that I will see any errors or warnings, and the CLI is less prone to timeouts etc.

If you and TAC can't get this to work in a day, I would request an RMA and be done with it.  I'm not a fan of SSO, but that is me and the issues I have had in the past.  It doesn't matter what others do or like; it's up to you to decide what works best for you in your environment.  I only use SSO in a lab, just so that I have an instance in case I want to test, but at home and in the other environments I have set up, it's all N+1.

-Scott
*** Please rate helpful posts ***


@revision6420 wrote:
One of our HA nodes went into a boot loop, which required breaking HA and booting off a USB via ROMMON, but the unit is stuck in bundle mode. TAC was unable to convert it to install mode. We are unable to set the boot variable and I'm waiting on TAC to investigate now. 

I'm definitely bringing back what you and others have shared with me to my team, as this is not the first time a code upgrade has burned us. 


That's not just an upgrade that failed.  That's an ISSU failing.  

What you had to do, particularly severing the HA pair, is the only way to fix an ISSU upgrade gone horribly wrong.  

The only way to fix this is to power down and disconnect the secondary from the network.  Upgrade the primary by hand (i.e. no ISSU, FSU/eFSU/xFSU and no automation) and bring it up normally.  

While offline, manually upgrade the secondary and then power it down.  Connect the network cable and power up the secondary and hope it joins the primary.

Important Question:  Has TAC invoked the command "clear install state" yet?

Scott Fella
Hall of Fame

Make sure you are downloading the right image.  Since you are on 17.9.4a, you need the APSP files published for 17.9.4a (linked below). You can't use an image that is from a different code train or release.

https://software.cisco.com/download/home/286321396/type/286325254/release/17.9.4a

C9800-universalk9_wlc.17.09.04a.CSCwi44524.SPA.apsp.bin
C9800-universalk9_wlc.17.09.04a.CSCwh93727.SPA.apsp.bin

-Scott

jasonm002
Level 1

I have a TAC case open on this and I'm invoking my account team to get this fixed. This appears to be a problem involving SMUs on other IOS XE platforms supporting HA SSO and ISSU and not just the 9800 - have seen this on 9500s with stackwise virtual too. The problem appears to be that certain installation routines and the install autoupgrade feature are incorrectly referencing inactive SMUs (i.e., that do not match the current running image) from previous install rollback points on the system (visible by doing "show install rollback" and "show install rollback id <x>"). If those old inactive SMUs don't exist then it puts the system into a bad state and all kinds of install or install autoupgrade functionality will fail in the future.

Furthermore, it's very easy to have this happen because "install remove inactive" will remove the inactive SMUs referenced in install rollback points. I'm hearing so far that this is intended behavior, so it appears the real problem is - everything else the installer subsystems are doing with the inactive SMUs in the rollback points.

 

The workaround at the moment, as others also mentioned, is to configure "service internal" in config mode and then do "clear install state" in exec mode _before_ installing the first SMU for your active image. So, for example, if you go to 17.9.5 and you want to put an SMU on 17.9.5, do "clear install state" after moving to the 17.9.5 base image, reload both boxes, then install however many SMUs you want on 17.9.5. Then, in the future, if you go to 17.12.4 for example, do "clear install state" again after moving to the 17.12.4 base image, then proceed to install however many SMUs you want on that base image. 

This is highly annoying and clearly broken, I am driving a case on it right now so hopefully it will be fixed in the next bugfix release for 17.9.x and 17.12.x.

 

Update: I was actually able to reproduce a flavor of this on a 9300 stack. Triggering stack splits and merges by disconnecting and reconnecting stack cables, with missing SMUs referenced in rollback points, puts the standby switch into a v-mismatch state because it tries to copy SMUs to the standby that aren't even active on the stack's active switch.

Try plugging in a console cable and rebooting the switch or 9800. 

Observe the entire boot-up process and it will list a number of SMUs that the platform is trying to load but cannot find. 

The command "install remove inactive" has a known bug where it does not really "remove" SMUs as intended.  We have to use the command "install remove file" to clean up.

In any case, they should either fix "install remove inactive" so that it doesn't remove SMUs referenced in rollback points, or change all the other installer routines so that they don't count SMUs missing on flash from rollback points as causing a v-mismatch condition, because they really aren't a mismatch if they're not currently active. The current implementation, in my opinion, is simply broken, which is why I'm seeing all these threads about SMUs causing all kinds of problems on IOS upgrades. Currently, if you deploy SMUs at all, you will potentially put the box into what can appear to be a randomly broken state. Not a good look for Cisco on this one. 

 

revision6420
Level 1

Thank you all for the above advice and tips. 

We are going to be looking at more WLC code upgrades this year, and we'll be looking through here to refresh our process. 

@jasonm002 , I will definitely be trying your recommendations and let you know how it goes. 

Do not upgrade to 17.9.4 or 17.9.5.

 
