05-19-2009 07:21 AM
My disk00 went defunct and there is no /local file system present. If disk00 is part of the array, why would this failure cause the waas to go offline or not find /local? TAC is suggesting that I need to rebuild the box from the rescue cd and I want to avoid that.
thanks,
06-01-2009 06:31 AM
Hi ,
You can obtain a detailed software RAID status, which includes the disk utilization, the number of disks on the WAE, and operational status, by using the EXEC command show disks details. Please send me the output of this command from your device. The following example shows a two-disk system in which the disks are operating normally:
WAE# show disks details
Physical disk information:
disk00: Normal (h00 c00 i00 l00 - DAS) 76324MB( 74.5GB)
disk01: Normal (h01 c00 i00 l00 - DAS) 76324MB( 74.5GB)
Mounted filesystems:
MOUNT POINT        TYPE        DEVICE     SIZE     INUSE  FREE     USE%
/                  root        /dev/root  34MB     28MB   6MB      82%
/swstore           internal    /dev/md1   495MB    212MB  283MB    42%
/state             internal    /dev/md2   4031MB   65MB   3966MB   1%
/disk00-04         WAFSFS      /dev/md4   63035MB  32MB   63003MB  0%
/local/local1      SYSFS       /dev/md5   3967MB   313MB  3654MB   7%
.../local1/spool   PRINTSPOOL  /dev/md6   991MB    16MB   975MB    1%
/sw                internal    /dev/md0   991MB    289MB  702MB    29%
Software RAID devices:
DEVICE NAME TYPE STATUS PHYSICAL DEVICES AND STATUS
/dev/md0 RAID-1 NORMAL OPERATION disk00/00[GOOD] disk01/00[GOOD]
/dev/md1 RAID-1 NORMAL OPERATION disk00/01[GOOD] disk01/01[GOOD]
/dev/md2 RAID-1 NORMAL OPERATION disk00/02[GOOD] disk01/02[GOOD]
/dev/md3 RAID-1 NORMAL OPERATION disk00/03[GOOD] disk01/03[GOOD]
/dev/md4 RAID-1 NORMAL OPERATION disk00/04[GOOD] disk01/04[GOOD]
/dev/md5 RAID-1 NORMAL OPERATION disk00/05[GOOD] disk01/05[GOOD]
/dev/md6 RAID-1 NORMAL OPERATION disk00/06[GOOD] disk01/06[GOOD]
Currently SW-RAID is not configured to change.
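If you save that output to a file, a short script can pick out any RAID member that is not marked GOOD. This is only a rough sketch in Python, not a Cisco tool: the function name degraded_members is made up for illustration, it assumes the "Software RAID devices" lines keep the layout shown above, and the second line of SAMPLE is hand-edited with a placeholder status and a placeholder [BAD] state, since the exact wording a WAE prints for a failed partition is not shown in this thread.

```python
import re

# One row of the "Software RAID devices" table, for example:
# /dev/md0 RAID-1 NORMAL OPERATION disk00/00[GOOD] disk01/00[GOOD]
RAID_LINE = re.compile(r"^(/dev/md\d+)\s+(RAID-\d+)\s+(.+?)\s+(disk\d+/\d+\[\w+\].*)$")
# One member entry such as disk00/05[GOOD]
MEMBER = re.compile(r"(disk\d+)/(\d+)\[(\w+)\]")


def degraded_members(show_disks_output):
    """Return (device, disk, partition, state) for every member not marked GOOD."""
    problems = []
    for line in show_disks_output.splitlines():
        match = RAID_LINE.match(line.strip())
        if match is None:
            continue  # header lines and other sections are ignored
        device, members = match.group(1), match.group(4)
        for disk, partition, state in MEMBER.findall(members):
            if state != "GOOD":
                problems.append((device, disk, partition, state))
    return problems


# The first line is copied from the output above. The second line is a
# hand-edited, hypothetical example: "PLACEHOLDER STATUS" and "[BAD]" are
# stand-ins, not strings taken from a real WAE.
SAMPLE = """\
/dev/md4 RAID-1 NORMAL OPERATION disk00/04[GOOD] disk01/04[GOOD]
/dev/md5 RAID-1 PLACEHOLDER STATUS disk00/05[BAD] disk01/05[GOOD]
"""

if __name__ == "__main__":
    for device, disk, partition, state in degraded_members(SAMPLE):
        print(f"{device}: {disk} partition {partition} is {state}")
```

Because the check is "anything other than GOOD", it does not depend on knowing the exact failure strings in advance.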
More common than a total disk failure is a partial disk failure. When errors occur in one or a few sectors of a disk, a partial failure has occurred. Because the RAID devices are configured on a partition by partition basis, some partitions may continue to operate using the respective disk partitions from both disk drives. Sector errors are typically detected when the software attempts to read an affected sector. After retrying the read operation internally a number of times, the disk drive eventually gives up, and returns an error to the operating system and the RAID driver code. The kernel RAID-1 code stops accessing the affected physical partition. The notification of these errors is visible using the show disks details EXEC command.
To attempt recovery from a partial disk failure, follow these steps:
--------------------------------------------------------------------------------
Step 1 Review the syslog.txt file or run the show alarms critical EXEC command to determine the name of the disk drive experiencing the errors.
Step 2 Run the disk delete-partitions EXEC command on the drive with the failures.
Step 3 Reboot the WAE using the reload EXEC command.
--------------------------------------------------------------------------------
Upon reboot, the standard RAID-1 resynchronization is performed. Resynchronization overwrites all of the failed drive's contents, giving the disk drive a chance to remap any bad sectors. If additional disk I/O errors subsequently occur in a short period of time, or if the disk drive cannot be detected by the software after a reboot, the disk drive has probably failed past the point of repair, and replacement is needed.
On systems with SCSI disk drives, another recovery option for partial failures is available. This option involves marking the disk as bad and reformatting the disk.
Contd 2....
06-01-2009 06:33 AM
page 2....
To attempt recovery from a partial disk failure on a system with SCSI drives, follow these steps:
--------------------------------------------------------------------------------
Step 1 Review the syslog.txt file or run the show alarms critical EXEC command to determine the name of the disk drive experiencing the errors.
Step 2 Run the disk mark diskname bad EXEC command on the drive with errors.
Step 3 Reboot the WAE using the reload EXEC command.
Step 4 Run the disk reformat diskname EXEC command.
Step 5 Reboot the WAE using the reload EXEC command.
--------------------------------------------------------------------------------
Note This process removes all data and the partition table on the specified disk. The standard RAID-1 resynchronization is performed after the second reboot.
Kindly refer to the following document for recovering from disk failures:
http://www.cisco.com/en/US/docs/app_ntwk_services/waas/wafs/v30/configuration/guide/sysparms.html
You must run a script (the WAAS disk check tool) that checks the file system for errors that can result from a RAID synchronization failure.
You can obtain the WAAS disk check tool from the following URL:
http://www.cisco.com/pcgi-bin/tablebuild.pl/waas40
When you run the WAAS disk check tool, you will be logged out of the device. The device automatically reboots after it has completed checking the file system. Because this operation results in a reboot, we recommend that you perform this operation after normal business hours.
Copy the script to your WAE device by using the copy ftp disk command.
WAE# copy ftp disk
Run the script from the CLI, as shown in the following example:
WAE# script execute disk_check.sh
This script will check if there is any file system issue on the attached disks
Activating the script will result in:
Stopping all services. This will log you out.
Perform file system check for few minutes.
and record the result in the following files:
/local1/disk_status.txt - result summary
/local1/disk_check_log.txt - detailed log
System reboot
If the system doesn't reboot in 10 minutes, please re-login and check the result files.
Continue?[yes/no] yes
Please disk_status.txt after reboot for result summary
umount: /state: device is busy
umount: /local/lPAM_unix[26162]: ### pam_unix: pam_sm_close_session (su) session closed
for user root
waitpid returns error: No child processes
No child alive.
After the device reboots and you log in, locate and open the following two files to view the file system status:
• disk_status.txt - Lists each file system and shows if it is "OK," or if it contains an error that requires attention.
• disk_check_log.txt - Contains a detailed log for each file system checked.
If no repair is needed, then each file system will be listed as "OK," as shown in the following example:
WAE# type disk_status.txt
Thu Feb 1 00:40:01 UTC 2007
device /dev/md1 (/swstore) is OK
device /dev/md0 (/sw) is OK
device /dev/md2 (/state) is OK
device /dev/md6 (/local/local1/spool) is OK
device /dev/md5 (/local/local1) is OK
device /dev/md4 (/disk00-04) is OK
If any file system contains errors, the disk_status.txt file instructs you to repair it.
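If you copy disk_status.txt off the device, the check can be scripted. The sketch below is in Python and rests on assumptions: the function name filesystems_needing_attention is invented here, and it assumes every entry starts with "device <name> (<mount point>)" as in the example above. The wording of an error entry is not shown in this thread, so the script simply flags any entry that does not end in "is OK".

```python
import re

# Matches entries such as: device /dev/md1 (/swstore) is OK
STATUS_LINE = re.compile(r"^device\s+(\S+)\s+\((\S+)\)\s+(.*)$")


def filesystems_needing_attention(status_text):
    """Return (device, mount_point, message) for every entry not reported as OK."""
    flagged = []
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match is None:
            continue  # skips the timestamp line and blank lines
        device, mount_point, message = match.groups()
        if message.strip() != "is OK":
            flagged.append((device, mount_point, message.strip()))
    return flagged


# Copied from the example output above.
HEALTHY = """\
Thu Feb 1 00:40:01 UTC 2007
device /dev/md1 (/swstore) is OK
device /dev/md0 (/sw) is OK
device /dev/md2 (/state) is OK
device /dev/md6 (/local/local1/spool) is OK
device /dev/md5 (/local/local1) is OK
device /dev/md4 (/disk00-04) is OK
"""

if __name__ == "__main__":
    flagged = filesystems_needing_attention(HEALTHY)
    if flagged:
        for device, mount_point, message in flagged:
            print(f"{device} ({mount_point}): {message}")
    else:
        print("All file systems are reported OK")
```

For anything the script flags, open disk_check_log.txt for the detailed log before deciding on a repair.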
If an upgrade cannot be performed immediately, the customer should reload the system after the RAID resync is complete. RAID resync status can be checked in the show disks details output.
Kindly let me know whether this was useful to you, or tell me what more I can do to help resolve it.
Sachin Garg
06-01-2009 06:47 AM
hi sachin, thanks for the very complete reply. TAC chose to replace the drive, the array rebuilt itself and I am up and running again.
06-01-2009 08:04 AM
Hi ,
Thanks for your quick response as your rating is very valuable to me.
Kind Regards,
Sachin garg