NCS boot loop at ADE OS (VMDK Recovery)


Wed, 08/06/2014 - 07:58
Aug 28th, 2012
User Badges:
  • Cisco Employee,
 
The following procedure is not supported by TAC, the Wireless
Networking Business Unit, or any other entity at Cisco.
 
The issue seems to be related to a problem identified by VMware:
 
 
Linux based file systems become read-only
 
VMware has identified a problem where file systems may become read-only after encountering busy
I/O retries or SAN/iSCSI path failover errors.  NCS users have also encountered this issue after the
storage has been uncleanly removed, usually brought on by a power outage.
 
If you can get to the shell prior to a reboot, you can try issuing the following.  If you don't have access
to a CLI because the VM is in a boot loop, proceed to the next section:
 
mount -o remount /
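Before remounting, it can help to confirm that the root filesystem really has flipped to read-only. The sketch below is not from the original procedure: the sample line stands in for a real /proc/mounts entry, and the suggested `rw` flag is a common variant of the command above.

```shell
#!/bin/sh
# Hedged sketch: detect a read-only root filesystem from a
# /proc/mounts-style line. The sample entry is an assumption for
# illustration; on a live system, read /proc/mounts directly.
sample='/dev/mapper/smosvg-rootvol / ext3 ro,relatime 0 0'

if echo "$sample" | awk '$2 == "/" && $4 ~ /^ro(,|$)/ { found=1 } END { exit !found }'; then
  echo 'root is read-only; try: mount -o remount,rw /'
fi
```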
 
Recovering an NCS Virtual Machine stuck in a boot loop:
 
1. Download a live Linux distro ISO to your local machine.  Users have reported success with Fedora.
2. In vSphere, left-click the NCS VM -> Summary tab -> Storage -> right-click the storage -> Browse Datastore -> click the icon to upload a file -> browse to the ISO.
3. Exit out of the datastore browser.
4. Right-click the NCS VM -> Edit Settings -> CD/DVD drive -> enable 'Connected' and 'Connect at power on' -> select the radio button 'Datastore ISO File' -> browse to the ISO you just uploaded -> save.
5. Reload the VM and boot to the ISO.
6. Get to the CLI (the exact steps will depend entirely on the Linux distro).
7. Determine which designation has been given to the volumes we need to repair.  In the output below, this
particular Linux distro has given the volumes the 'sdb' designation.  This can vary (several commenters below saw 'sda' instead).  There will be three of them (sdb1, sdb2, sdb3).
 
# fdisk -l
 
Disk /dev/sdb: 209.7 GB, 209715200000 bytes
255 heads, 63 sectors/track, 25496 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          64      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sdb2              64          77      102400   83  Linux
Partition 2 does not end on cylinder boundary.
/dev/sdb3              77       25497   204184576   8e  Linux LVM
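Since the device letter varies from distro to distro, the LVM partition can also be picked out of the fdisk output automatically. This is a hedged sketch, not part of the original procedure; the here-string simply reproduces the sample output above so the logic can be run anywhere.

```shell
#!/bin/sh
# Hedged sketch: find the "Linux LVM" partition (type 8e) in fdisk
# output. On a live CD you would pipe `fdisk -l` in directly; the
# sample text below is a stand-in for that output.
fdisk_out='/dev/sdb1   *           1          64      512000   83  Linux
/dev/sdb2              64          77      102400   83  Linux
/dev/sdb3              77       25497   204184576   8e  Linux LVM'

echo "$fdisk_out" | awk '/Linux LVM/ { print $1 }'
```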
 
8. Scan for volume groups:
 
# lvm vgscan -v
 
  Wiping cache of LVM-capable devices
  Wiping internal VG cache
  Reading all physical volumes.  This may take a while...
  Finding all volume groups
  Finding volume group "smosvg"
  Found volume group "smosvg" using metadata type lvm2
  Archiving volume group "smosvg" metadata (seqno 12).
  Creating volume group backup "/etc/lvm/backup/smosvg" (seqno 12).
 
9. Activate all volume groups:
 
# lvm vgchange -a y
 
  11 logical volume(s) in volume group "smosvg" now active
 
10. List logical volumes:
 
# lvm lvs -a

  LV            VG     Attr   LSize
  altrootvol    smosvg -wi-a-  96.00M
  home          smosvg -wi-a-  96.00M
  localdiskvol  smosvg -wi-a-  29.28G
  optvol        smosvg -wi-a- 123.22G
  recvol        smosvg -wi-a-  96.00M
  rootvol       smosvg -wi-a-   3.91G
  storeddatavol smosvg -wi-a-   9.75G
  swapvol       smosvg -wi-a-  15.62G
  tmpvol        smosvg -wi-a-   1.94G
  usrvol        smosvg -wi-a-   6.81G
  varvol        smosvg -wi-a-   3.91G
 
11. Use fsck to check all three partitions on the drive.  It is OK if one of these returns an error; sdb3 is the LVM physical volume (type 8e in the fdisk output above), not an ext3 filesystem.
 
# fsck -t ext3 -y /dev/sdb1
# fsck -t ext3 -y /dev/sdb2
# fsck -t ext3 -y /dev/sdb3
 
12. Perform the same steps for all of the logical volumes in the group listed by the lvs command in step 10 (remember to use the -y flag in all cases).
 
# fsck -t ext3 -y /dev/smosvg/altrootvol
# fsck -t ext3 -y /dev/smosvg/home

# fsck (repeat for all the others from step 10)
 
You are looking for output similar to this:
 
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/home: clean, 34/128016 files, 33751/512000 blocks
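Rather than typing eleven fsck commands by hand, the per-volume commands can be generated from the lvs listing. This is a hedged sketch, not part of the original procedure; the here-string reproduces part of the step 10 output so the logic can be checked without a real LVM setup, and on the live CD you would pipe `lvm lvs` in directly (or run the printed commands through sh).

```shell
#!/bin/sh
# Hedged sketch: build the fsck command for every logical volume in
# an lvs listing. Column 1 is the LV name, column 2 the volume group.
lvs_out='  LV          VG     Attr   LSize
  altrootvol  smosvg -wi-a-  96.00M
  home        smosvg -wi-a-  96.00M
  rootvol     smosvg -wi-a-   3.91G'

echo "$lvs_out" | awk 'NR > 1 { print "fsck -t ext3 -y /dev/" $2 "/" $1 }'
```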
 
13. Cleanly shut down the VM, remove the ISO configuration, and restart the server.  It should now boot successfully.
 
With this information, and the volumes activated, you should be able to mount the partitions and volumes:
 
a. mount /dev/sdb1 /media/boot (not the Linux LVM)
b. mount /dev/sdb2 /media/storedconfig (not the Linux LVM)
c. mount /dev/smosvg/localdiskvol /media/NCSbackup
d. Move the most current backup file off the localdiskvol volume, along with the startup config from the
storedconfig volume, redeploy the VM using the OVA file, then restore from the
backup archive.
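Finding the most recent backup on the mounted localdiskvol can be done by modification time. The sketch below is an assumption for illustration: the temp directory and file names are stand-ins so the logic itself can be run, and on the real system you would point it at /media/NCSbackup (checking subdirectories for the actual backup files).

```shell
#!/bin/sh
# Hedged sketch: pick the most recent backup file in a directory.
# mktemp and touch -d create stand-in files with known timestamps;
# the .tar.gpg extension is an assumption about the backup naming.
dir=$(mktemp -d)
touch -d '2012-01-01' "$dir/old.tar.gpg"
touch -d '2012-06-01' "$dir/new.tar.gpg"

# ls -t sorts newest-first, so head -n 1 is the latest backup.
newest=$(ls -t "$dir"/*.tar.gpg | head -n 1)
echo "$newest"
```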
 
 

 

 

Thomas Boettcher Fri, 11/23/2012 - 01:32

Instead of deploying a parallel Linux VM, the NCS virtual system can be booted with a Linux live ISO image.


I tried this today with an Ubuntu Server 12.04.1 64-bit image, going to rescue mode to fsck our broken virtual filesystems, since our NCS did not shut down cleanly during a scheduled migration by our VM department.

patoberli Mon, 12/03/2012 - 05:26

Thanks @sschmidt and Thomas, this fixed the rebooting of my CPI 1.2. The issue happened when the NFS-connected storage of the ESX host had a CPU problem and CPI switched to a read-only filesystem. Logging in was no longer possible, so I had to reboot, which caused the reboot loop.

Booting with a Fedora 17 live ISO image and issuing the lvm and fsck commands fixed the server.

I find myself in the same boat with NCS.  I have attached the Fedora live ISO to the VM, but not having much Linux experience, I'm not sure how to run the commands. I was able to secure the backup file, but if I could fix the VM as you described, that would be ideal. Any help you could offer would be greatly appreciated.  Thanks in advance for your time.

Scott McKellar Mon, 10/14/2013 - 17:37

Thanks, I can confirm these steps worked using an Ubuntu 13.04 live boot image.


In my case the volumes were sda1-3, not sdb1-3.

richard.borgia Mon, 12/30/2013 - 15:10

Taking the time to thank Steve Schmidt for the excellent doc!


We used a CentOS 6.5 live CD ISO.

We had to enter the "su" command to get to the correct prompt, and our volumes were also "sda" 1-3.


Thanks Steve!!

richard.borgia Fri, 01/03/2014 - 05:43

Dzmitryj Jakavuk


We went through the entire process; it found a few issues and "fixed" them. I don't know the output (results) from that specific command for sure, but I will ask. Is the partition correct? "sdb"? My partitions were "sda" (with an "a").


Rich

aviwollman Tue, 05/13/2014 - 03:33

Worked fine (mine was also sda) using Fedora 15 live.

Steps #10 and #11 would otherwise have to be done for each of the 11 volumes, so here is a script:

lvnam=$(lvm lvs -a | awk '{print $1}' | grep -v LV)

for index in $lvnam; do fsck -t ext3 -y /dev/smosvg/$index; done

jb800113707 Mon, 05/19/2014 - 11:26

A Fedora 17 live disc worked for us as well; the latest release (20) was unable to see any of the disks.

Had to run the steps and restart three times before our ADE VM would boot correctly, and the first boot after the repair took a little over an hour.

Thank you for the Fix, I wish that Cisco would add this to their official documentation. We were referred to this post by Cisco support.

gfedergr Sat, 05/31/2014 - 06:29

Also happened to me with PI 2.2.

Here is my workaround:

1. After the VM entered a loop, I downloaded the PI ISO file and uploaded it to the datastore.

2. I chose "Connect to ISO Image on a Datastore...".

3. On the following loop, the VM booted from the ISO.

4. I started a fresh PI installation; the first step formats the existing OVA installation (including the Linux OS), and then the PI install starts over.

When the installation ended, the VM rebooted successfully.

regards,

Gadi.

hvangruijthuijsen Wed, 08/06/2014 - 07:58

After running out of disk space, we also hit this problem.

Thanks very much, your instructions fixed our problem.

We used the Ubuntu 13.04 Desktop live CD.

 
