Regular Advisor
Geert Van Pamel
Posts: 126
Registered: ‎12-31-2002
Message 1 of 4 (423 Views)

Alphastation 500 with Tru64 V4.0f domain panic bfs_close: invalid bf set ptr

I had a failed disk on a RAID-5 Mylex domain on AlphaStation 500 with Tru64 V4.0f.

 

The engineer suggested running a swxcrmgr "make optimal" without replacing the failed disk...

 

But I believe this was not the right approach, and it may be how we corrupted the AdvFS filesets on the RAID-5 domain.

 

Since this is the file domain where /usr and /var are located, the situation is painful: we are no longer able to boot.

 

When I boot into single-user mode with boot -flags 0,1, I can mount the root filesystem, but the system crashes immediately when I attempt to mount the /usr or /var fileset.

 

boot -fl 0,1

 

/sbin/update

 

mount -u /

 

mount /usr

 

Panic (cpu 0): bfs_close: invalid bf set ptr

Syncing disks… done



I have searched in this forum and on Google for bfs_close but did not find any article.

 

Would this be an AdvFS error, or is it a pure Tru64 OS error?

 

For your information: we just replaced the power supply after the system crashed yesterday due to a power supply failure. The power supply went into safe mode, so the system lost power and crashed. Could this have caused the disk problem? Could there be something wrong with the power supply? Or with the intervention itself (cables not reconnected correctly)?

 

Can somebody advise me on a solution?

 

  • Booting from the installation CD?
  • and then what?

Thanks!

Honored Contributor
Martin Moore
Posts: 214
Registered: ‎03-19-2003
Message 2 of 4 (411 Views)

Re: Alphastation 500 with Tru64 V4.0f domain panic bfs_close: invalid bf set ptr

There's a good chance that the sudden crash induced some metadata corruption into the AdvFS domain.  You could try one of the AdvFS tools to correct the corruption: /sbin/advfs/verify or /sbin/advfs/fixfdmn.  Check the man pages on your system for specifics on how to run them.  This may solve the whole problem.  If neither tool is successful, you're probably looking at remaking the domain and restoring from backup.

 

Martin

Every complex problem has a solution that is simple, elegant--and utterly wrong.
Regular Advisor
Geert Van Pamel
Posts: 126
Registered: ‎12-31-2002
Message 3 of 4 (404 Views)

How I temporarily recovered from the system crash

I could still boot into single-user mode:

b -fl s

 

mount -u /

 

fixfdmn did not work; see the error in the attachment.

Removing the entries in /etc/fstab for all filesets on the failing domain did not help.

We still got the bfs_close panic in multiuser mode.

I had to remove the symbolic link from /etc/fdmns/re0c to disconnect the file domain completely from the OS.
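
Before removing such a link, it can help to record where it points, so that it can be recreated later if the domain ever needs to be reattached. The following is only a minimal sketch in a scratch directory: the domain directory name data_domain and the device path /dev/re0c are assumptions for illustration, and the real layout under /etc/fdmns on this system may differ.

```shell
#!/bin/sh
# Scratch stand-in for /etc/fdmns; the real directory is not touched.
fdmns=$(mktemp -d)
mkdir "$fdmns/data_domain"                   # hypothetical domain name
ln -s /dev/re0c "$fdmns/data_domain/re0c"    # hypothetical device path

# Record the link target before removing the link.
ls -l "$fdmns/data_domain/re0c" > "$fdmns/re0c.link.txt"
target=$(ls -l "$fdmns/data_domain/re0c" | sed 's/.* -> //')

# Disconnect the domain by removing the link.
rm "$fdmns/data_domain/re0c"

# Later, the link can be recreated from the recorded target.
ln -s "$target" "$fdmns/data_domain/re0c"
```

On the real system the same idea would apply to the link under /etc/fdmns, with the saved listing kept somewhere outside the failing domain.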

 

For the time being I am running production again (partially) without the failed domain.

I have recreated the /usr and /var filesets and the other file systems on another domain, and adapted /etc/fstab accordingly.

 

Note that there is no vi in single user mode, so I had to:

 

  • Make a safe copy of the current fstab.
  • Create a minimal fstab via cat > /etc/fstab, retyping the mount entries needed for a minimal working disk configuration.
  • Before booting, I also disabled all application startups (luckily I had gathered all application startups in /sbin/rc3.d/S99upandrunning).
  • To disable the application startups, I just renamed this file to disable-S99upandrunning.
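
The steps in the list above can be rehearsed in a scratch directory before touching the real files. This is only an illustrative sketch: the scratch tree under $work, the sample fstab lines, and the domain names (root_domain, usr_domain, new_domain) are assumptions, not the actual configuration of this system. On the real system the paths would be /etc/fstab and /sbin/rc3.d.

```shell
#!/bin/sh
# Rehearsal in a scratch tree; nothing under /etc or /sbin is touched.
work=$(mktemp -d)
mkdir -p "$work/etc" "$work/sbin/rc3.d"

# Stand-ins for the real files (contents are made up for the example).
printf '%s\n' 'root_domain#root / advfs rw 0 0' \
              'usr_domain#usr /usr advfs rw 0 0' > "$work/etc/fstab"
: > "$work/sbin/rc3.d/S99upandrunning"

# 1. Keep a safe copy of the current fstab.
cp -p "$work/etc/fstab" "$work/etc/fstab.save"

# 2. Retype a minimal fstab with cat and a here-document (no editor needed).
cat > "$work/etc/fstab" <<'EOF'
root_domain#root / advfs rw 0 0
new_domain#usr /usr advfs rw 0 0
EOF

# 3. Rename the startup script so its name no longer begins with S.
mv "$work/sbin/rc3.d/S99upandrunning" \
   "$work/sbin/rc3.d/disable-S99upandrunning"

ls "$work/etc" "$work/sbin/rc3.d"
```

Restoring the original state is then a matter of copying fstab.save back over fstab and renaming the startup script to its original name.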

 

cd /sbin/rc3.d

 

ls -ld disable*

lrwxr-xr-x   1 root     bin           14 Oct 23  2005 disable-S57cron -> ../init.d/cron

-rwxr-----   1 root     bin         1441 Feb 21  2010 disable-S99upandrunning

 

We have already lost too much time.

So we prefer to run without the corrupted file domain for the time being.

The only drawback is that we lose 26 GB of unused storage.

 

From a theoretical point of view, could I just run mkfdmn again to overwrite the current (corrupted) contents?

Would there not again be a risk of inducing another system crash when the RAID domain is recreated?

I would imagine that I cannot do a rmfdmn, because for that I would need the symbolic link to the corrupted domain, which I removed from /etc/fdmns/re0c to avoid a system crash at the end of the boot process.

As long as this symbolic link was there, I always got a bfs_close panic, even if I did not mount any fileset on that domain!

Could this be caused by the RAID controller program swxcrmgr accessing the raid domain?

Or any other file system utility accessing the re0c file domain?

Could I recreate a new raid domain, without removing the previous one?

Could this prevent the bfs_close system panic?

 

I have been searching in this ITRC conference and in Google for the bfs_close error. Could it be that this is an internal AdvFS error that is not documented in the standard User & Reference Manuals?



Honored Contributor
Martin Moore
Posts: 214
Registered: ‎03-19-2003
Message 4 of 4 (393 Views)

Re: How I temporarily recovered from the system crash

> From a theoretical point of view, could I just run mkfdmn again to overwrite the current (corrupted) contents?

 

Yes.

 

> Would there not again be a risk of inducing another system crash when the RAID domain is recreated?

 

You would be creating new metadata from scratch so any corruption would be removed.  However, if there was an underlying hardware problem, that could still be there.  Any errors would be logged in binary.errlog, but since that file is itself inside the problem domain, there would be no record of them.

 

> I would imagine that I cannot do a rmfdmn, because for that I would need the symbolic link to the corrupted domain, which I removed from /etc/fdmns/re0c to avoid a system crash at the end of the boot process.

 

Yes, rmfdmn uses the link in /etc/fdmns.

 

> As long as this symbolic link was there, I always got a bfs_close panic, even if I did not mount any fileset on that domain!  Could this be caused by the RAID controller program swxcrmgr accessing the raid domain?

 

Possibly.

 

> Or any other file system utility accessing the re0c file domain?

 

If you're running the AdvFS daemon, advfsd, that would do it.  There might be others as well.

 

> Could I recreate a new raid domain, without removing the previous one?

 

Yes.  You could have even done it with the domain link still there, with mkfdmn -o.

 

> Could this prevent the bfs_close system panic?

 

Assuming no underlying hardware problem, it should.

 

> I have been searching in this ITRC conference and in Google for the bfs_close error. Could it be that this is an internal AdvFS error that is not documented in the standard User & Reference Manuals?

 

That is correct.

 

Martin



 


 
