ES40 won't boot properly after system restore

by Community Manager on ‎09-13-2011 07:53 AM

Question

We had some disk failures on our ES40. We've put in replacement disks, partitioned the disk array, and restored from backups.

We did the restore while booted to a Tru64 CDROM.

Now, when we boot the actual array in single-user mode, several strange things happen.

1. During boot we get the message: vm_swap_init: Unexpected swapon for swap device /dev/disk/dsk0b

2. hwmgr -view device
gives no output (no disk, no tape, nothing)

3. disklabel -r dsk0
Error message: No such device or address

If I look into /dev/disk, I see the special device files for dsk0{a,b,c,d,e,f,g,h}

If I boot onto CDROM, the disk appears perfectly normal as dsk0 (and that's how it was before the failure).

Did I do something wrong during the root filesystem restore?

System details:
ES40, running Tru64 5.1B-4. The system has some patches (but obviously the CD does not). The SRM console version is 7.2-1, and the NHD-7 CD has been applied.

The "disk" is actually a 6 x 72GB RAID5 array on an HP SmartArray 5300A (v3.56). One disk failed, then a second disk failed before the first replacement arrived.

P00>>> show config
Slot Option Hose 0, Bus 0, PCI
3 HP Smart Array 5300A pya0.0.0.3.0
dya0.0.0.3.0
(and more)
P00>>> show device
dya0.0.0.3.0 DYA0 CPQCISS
pya0.0.0.3.0 PYA0
(and more)

I haven't attempted multi-user (I think the above is more than enough problems).

I have some hwmgr output saved from before the failure. Let me know if it is needed and I'll extract it. However, I do not have the disklabel -r output from before the failure, so I had to guesstimate the partition sizes during the restore.

Answer

Here is roughly how I solved the problem.

I've tagged the command inputs and outputs below like this:
now#
(when the system is booted on restored disk)

before#
(saved output from before the failure)

CD#
(when the system is booted on CD)

>>>
(at the console prompt: more precisely a P00>>> prompt)

###
(Explanatory comments, not part of the input or output)

Step 1
======
I had located a Compaq patch advisory, for i2o disks, that suggested this procedure. I wrote it down, but did not execute it at this stage.

>>> boot -fl s
# mountroot
# /sbin/hwmgr -view devices
HWID: Device Name
58: /dev/disk/dsk5c
### Note bogus name dsk5

# cat /cluster/members/(member)/etc/i2oNameData.log
25: iop-0-tid-514: dsk0
### Note former HWID = 25

# hwmgr -delete component -id 25
# /sbin/dsfmgr -R hwid 25
# /sbin/dsfmgr -m dsk5 dsk0
### -m (move) renames the disk from dsk5 to dsk0
# shutdown -h now

Note I did not execute this procedure, but simply noted it.


Step 2
======
### Execute as suggested by Martin
now# hwmgr -show scsi
HWID Scsi host type subtyp owner #path dev 1st_path
69: 0 (none) disk none 0 1 (null)
-1: 4 (none) disk none 2 1 (null) [1/0/0]

### Digging into the previous (saved) output, it so happened I did have the output from that same command ...
before# hwmgr -show scsi
69: 0 revan disk none 2 1 dsk0 [1/0/0]

### As an additional step I got -full output
now# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 (null)

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002

Bus Target Lun Path Status
1 0 0 Stale

-1: 4 (none) disk none 2 1 (null) [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003
### Note the last digit different on the "new" device

Bus Target Lun Path Status
1 0 0 Valid

### I also had that same command output saved away.
before# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 dsk0 [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002

Bus Target Lun Path Status
1 0 0 Valid

### For curiosity, on the CD
CD# hwmgr -show scsi
69: 0 revan disk none 2 1 dsk0 [1/0/0]

CD# hwmgr -show scsi -full
69: 0 (none) disk none 0 1 dsk0 [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003

Bus Target Lun Path Status
1 0 0 Valid

### So I see the HW database has the old WWID recorded against B/T/L 1/0/0 (Stale), and it sees a new WWID against the same B/T/L.
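The comparison above can also be done mechanically. The following is a minimal sketch in Python, not a Tru64 tool: it assumes the `hwmgr -show scsi -full` output has been saved to text in the flattened form quoted in this post (real output has extra header lines and column spacing, which the patterns here simply skip). The function names `parse_entries` and `find_replaced_devices` are hypothetical, chosen for illustration.

```python
import re

# Sample text copied from the "now#" output quoted in Step 2 of this post.
SAMPLE = """\
69: 0 (none) disk none 0 1 (null)

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0002

Bus Target Lun Path Status
1 0 0 Stale

-1: 4 (none) disk none 2 1 (null) [1/0/0]

WWID: 01000010:6005-08b1-0010-4344-4150-5331-474e-0003

Bus Target Lun Path Status
1 0 0 Valid
"""

# Assumed line shapes, based only on the output shown in the post.
ENTRY_RE = re.compile(r"^\s*(-?\d+):\s")          # "69: 0 (none) disk ..."
WWID_RE = re.compile(r"^\s*WWID:\s*(\S+)")        # "WWID: 01000010:6005-..."
PATH_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(Stale|Valid)\s*$")


def parse_entries(text):
    """Split the saved output into one dict per HWID entry."""
    entries = []
    current = None
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            current = {"hwid": int(m.group(1)), "wwid": None, "paths": []}
            entries.append(current)
            continue
        if current is None:
            continue  # header lines before the first entry
        m = WWID_RE.match(line)
        if m:
            current["wwid"] = m.group(1)
            continue
        m = PATH_RE.match(line)
        if m:
            btl = tuple(int(g) for g in m.group(1, 2, 3))
            current["paths"].append((btl, m.group(4)))
    return entries


def find_replaced_devices(entries):
    """Return (btl, stale_hwid, stale_wwid, valid_hwid, valid_wwid) tuples
    where a Stale path and a Valid path share a bus/target/LUN but the
    WWIDs differ."""
    valid_by_btl = {}
    for entry in entries:
        for btl, status in entry["paths"]:
            if status == "Valid":
                valid_by_btl[btl] = entry
    conflicts = []
    for entry in entries:
        for btl, status in entry["paths"]:
            other = valid_by_btl.get(btl)
            if status == "Stale" and other and other["wwid"] != entry["wwid"]:
                conflicts.append(
                    (btl, entry["hwid"], entry["wwid"], other["hwid"], other["wwid"])
                )
    return conflicts


if __name__ == "__main__":
    for btl, old_id, old_wwid, new_id, new_wwid in find_replaced_devices(
        parse_entries(SAMPLE)
    ):
        print(f"B/T/L {btl}: stale HWID {old_id} ({old_wwid})")
        print(f"            valid HWID {new_id} ({new_wwid})")
```

On the sample above this reports a single conflict at B/T/L 1/0/0 between HWID 69 (stale, WWID ending 0002) and the new entry (valid, WWID ending 0003), which is the same conclusion reached by eye.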

Step 3
======
### A variation of the procedure described in Step 1

>>> boot -fl s
### note swap error msg
now# mountroot
### Output shows extra device files being created for dsk1

now# hwmgr -show scsi
HWID Scsi host type subtyp owner #path dev 1st_path
69: 0 (none) disk none 0 1 (null)
91: 4 (none) disk none 2 1 (null) [1/0/0]
### OK, my disk now has HWID 91

now# hwmgr -delete component -id 69
now# dsfmgr -R hwid 69
now# dsfmgr -m dsk1 dsk0
### output for all the partition device names being renamed
now# shutdown -h now

>>> boot -fl s
### note no swap error this time
now# disklabel -r dsk0
### Good output, listing partitions a-h.

now# shutdown -r -s now
### wait for reboot
### See all other filesystems being mounted
### Problem solved, but see Step 4.
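For anyone repeating this with different numbers, the cleanup in Step 3 boils down to three values: the stale HWID, the temporary device name the system invented, and the name you want back. The sketch below only builds the command strings used above so they can be reviewed before typing them at the single-user prompt; it does not execute anything. The helper name `rename_plan` is hypothetical, and the commands are exactly those shown in this post, not a general Tru64 procedure.

```python
def rename_plan(stale_hwid, temp_name, wanted_name):
    """Return the Step 3 commands, in order, for the given values.

    stale_hwid  -- HWID of the stale entry (69 in this post)
    temp_name   -- device name assigned to the new WWID (dsk1 in this post)
    wanted_name -- original device name to restore (dsk0 in this post)
    """
    if stale_hwid < 0:
        raise ValueError("stale_hwid must be a real (non-negative) HWID")
    if temp_name == wanted_name:
        raise ValueError("temp_name and wanted_name must differ")
    return [
        f"hwmgr -delete component -id {stale_hwid}",
        f"dsfmgr -R hwid {stale_hwid}",
        f"dsfmgr -m {temp_name} {wanted_name}",
        "shutdown -h now",
    ]


if __name__ == "__main__":
    # Values taken from Step 3 of this post.
    for command in rename_plan(69, "dsk1", "dsk0"):
        print(command)
```

Printing the plan first makes it harder to delete the wrong HWID, since the stale entry and the new entry sit on the same bus/target/LUN.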

Step 4
======
I'm not entirely sure this is related, but after the reboot I was prompted to run the post-install "configuration", X did not come up, and most of my /var filesystem was missing. I rebooted to single-user mode, mounted that filesystem on a temporary mount point, and restored it again from backup.

After a reboot, the OS was fully functional.
