Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 1 of 14 (1,039 Views)
Accepted Solution

NO Netboot / reboot not working ------- CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

Hi There,

 

We have two problems at the moment and not enough information to solve them.

The specs:

CMU 7.2 - ProLiant SL4540 Gen8 + RHES 6.4

 

Case:

 

1) No reboot into PXE

We've just installed CMU and we are unable to reboot the nodes from the management console.

If we right-click the node and select Backup (Capture image), the node shuts down and we have to power it on manually.

Of course, when we power it on, it does not boot from PXE but goes straight to GRUB.

2) From the OS prompt, running tcpdump and dhclient, I can see the DHCPDISCOVER and DHCPOFFER exchange working.

However, if we manually force the node to boot from PXE, it does not get a DHCP offer from the CMU server (media cable error / timeout).

 

We suspect a switch problem. These are the checks and where we stand:

1. Verify that broadcast is enabled and is redirected to the switch. -----> DONE

2. Verify that spanning tree is disabled on all ports connected to a node. ------> Due to business requirements we cannot disable it. Do we have another option?

3. Verify that « multicast IGMP snoop loop » is disabled on the switch. ------> DONE

Point 2 is not clear to us, and we are still looking for an answer or an alternative.

Can you recommend one?

Thanks.-

 

 

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 2 of 14 (1,016 Views)


Ok, Problem 2 is partially solved.

There was an issue with our Mellanox InfiniBand fibre cards; we replaced them and that solved it.

 

Now back to problem 1:

 

Why does selecting BACKUP in CMU just shut the server down instead of rebooting it?

 

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 3 of 14 (1,005 Views)


Hello,

 

It looks like the node is taking more than 30 seconds to shut down.

Please increase the CMU_ILO_OS_SHUTDOWN_TIMEOUT value to 60 seconds in the /opt/cmu/etc/cmuserver.conf file on the management node.

 

CMU_ILO_OS_SHUTDOWN_TIMEOUT=60
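
To confirm which value is actually in effect before and after editing the file, something like the following may help. This is only a sketch: it assumes cmuserver.conf uses plain KEY=VALUE lines like the one above, and conf_value is a hypothetical helper name, not a CMU tool.

```shell
# Hypothetical helper, not part of CMU: print the value of a KEY=VALUE setting.
# Assumption: the file holds plain KEY=VALUE lines and comments start with '#'.
conf_value() {
    key="$1"
    file="${2:-/opt/cmu/etc/cmuserver.conf}"
    # Match only uncommented lines; if the key appears twice, keep the last one.
    sed -n "s/^[[:space:]]*${key}=\(.*\)\$/\1/p" "$file" | tail -n 1
}

# Usage on the management node:
#   conf_value CMU_ILO_OS_SHUTDOWN_TIMEOUT
```

If the function prints nothing, the key is either missing or commented out in the file that was checked.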

 

Then try a backup again. Let us know how it goes.

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 4 of 14 (983 Views)


Hi Chintala,

 

Thanks for the answer.

Raising the timeout value hasn't helped.

We still have the "no rebooting" problem.

If we select "BACKUP", the node doesn't reboot; it shuts down directly.

Setting the node from CMU to netboot directly does launch it into PXE/TFTP mode.

However, the backup then fails to complete.

The error logs I can see at the moment are:

 

/opt/cmu/tmp/cmu_backup_err_1904676650556995234.tmp
error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'odin'. Debug information : NEW=1400684378 - ORIG=1400683897
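
The two numbers in the debug string look like Unix epoch timestamps (an assumption; nothing in the thread confirms it). Under that reading, their difference is how long CMU waited, which can be compared with the 480-second CMU_NETBOOT_TIMEOUT. A minimal sketch, where netboot_elapsed is a hypothetical helper name:

```shell
# Hypothetical helper: seconds between ORIG and NEW in a CMU netboot error line.
# Assumption: NEW and ORIG are Unix epoch seconds.
netboot_elapsed() {
    # $1: the error line (or any string containing NEW=<n> and ORIG=<n>)
    new=$(printf '%s\n' "$1" | sed -n 's/.*NEW=\([0-9][0-9]*\).*/\1/p')
    orig=$(printf '%s\n' "$1" | sed -n 's/.*ORIG=\([0-9][0-9]*\).*/\1/p')
    echo $((new - orig))
}

netboot_elapsed "Debug information : NEW=1400684378 - ORIG=1400683897"
# prints 481
```

A result just above 480 would be consistent with the wait running to the full timeout rather than failing early.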

 

/opt/cmu/tmp/power_osoff_hadwrk03p.err

Wed May 21 16:51:08 CEST 2014 ssh succeeded, waiting 60 seconds to let system shut down...
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

 

 

 

Why is it complaining about the RIBCL version if the installed one is higher?

 

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 5 of 14 (976 Views)


Hello,

 

Is it a customer cluster or an internal cluster?

If it is a customer cluster, please raise a support call with your local HP support center and give us the case ID.

Our team will help you.

If it is an internal cluster, please let us know your HP email ID so that we can get in touch with you.

Meanwhile, please provide the following details:

What is the OS on the head node?

 

From the log

 

>>>error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'odin'. Debug 

 

Why are these two hostnames different (hadwrk03p and odin)? Is it a typo? Ideally both should be the same.

Is it trying to power on the proper node?

How long does the node take to shut down completely when you manually run /sbin/shutdown -h now?

Is it more or less than 60 seconds? If it is more than 60 seconds, increase the timeout to that value and try again.

On the SL4540 node, are you using hpvsa, i.e. is Dynamic RAID mode (B120i) enabled on the node?

Also, is it possible to get access to the cluster?

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 6 of 14 (968 Views)


Also, please send the complete contents of the following files:

 

 /opt/cmu/tmp/ILO_power_osoff_<nodename>.output

/opt/cmu/tmp/power_osoff_<nodename>.err

 

/opt/cmu/tmp/ILO_power_on_<nodename>.output

/opt/cmu/tmp/power_on_<nodename>.err

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 7 of 14 (957 Views)


Thanks again Chintala!

 

-We've just purchased CMU for a customer cluster, so we are opening a case (not possible so far, since I'm still waiting for the serial number/SAID).

-The hostname mismatch in the previous error was a typo, please disregard it.

-Shutdown does not take longer than 20 seconds.

-On the SL4540 we are using hpvsa (the original installation was made with "blacklist=ahci dd").

I still can't figure out why the node is powered off instead of rebooted, and why it cannot complete a TFTP boot.

 

The logs:

  /opt/cmu/tmp/ILO_power_osoff_<nodename>.output --------> OK

/opt/cmu/tmp/power_osoff_<nodename>.err--------> does not exist

 

/opt/cmu/tmp/ILO_power_on_<nodename>.output --------> OK

/opt/cmu/tmp/power_on_<nodename>.err --------> does not exist

 

added:

/opt/cmu/tmp> cat cmu_backup_err_1181849120419151806.tmp

------------------------------------------------------------------------------------------------------------------------

 

Thu May 22 10:49:34 CEST 2014 /opt/cmu/hardware/ILO/cmu_ILO_power_osoff called with -n hadwrk03p -i 172.22.20.34 -e /opt/cmu/tmp/power_osoff_hadwrk03p.err
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

 


----------------


14:12:25 hadmae1p(root):/opt/cmu/tmp> cat ILO_power_on_hadwrk03p.output
Thu May 22 10:49:48 CEST 2014 /opt/cmu/hardware/ILO/cmu_ILO_power_on called with -n hadwrk03p -i 172.22.20.34 -e /opt/cmu/tmp/power_on_hadwrk03p.err
Enter the username: Enter the password: <?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='The RIBCL version is incorrect. The correct version is 2.0 or later.'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>

<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>


-------------


14:13:56 hadmae1p(root):/opt/cmu/tmp> cat cmu_backup_err_1181849120419151806.tmp
error: CMU_NETBOOT_TIMEOUT(480 seconds) reached while waiting for hadwrk03p to network boot: check 'hadwrk03p'. Debug information : NEW=1400749085 - ORIG=1400748604

 

 

Thanks for your help!

 

 

 

 

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 8 of 14 (951 Views)


Thank you for the information.

Is it happening only with this node, or with other nodes in the cluster as well?

Have you tried resetting the iLO of the backup node?

Can you try the following from the management node (make sure the backup node is powered on):

 

# /opt/cmu/bin/cmu_power -p BOOT -n <backup_nodename>

 

Then see whether the node comes up or not.

Try the same steps on another node in the cluster.

Get the power logs (mentioned in my previous post) for those nodes.

Next:

-------

Can you manually shut down the backup node and then start the backup process?

Let us know what happens on the console of the backup node.

Also, is it possible to get access to the cluster?

 

 

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 9 of 14 (943 Views)


Morning Chintala (at least here in Spain XD),

Thanks for the recommendations, they are helping me get to know the software better.

The 1st action was successful: I can manually reboot the node.

 

09:55:47 hadmae1p(root):/opt/cmu/bin> ./cmu_power -p BOOT -n hadwrk03p

powering off hadwrk03p ...

spawning 1 task(s) ................
waiting for 1 task(s) ................ { last:hadwrk03p }
powering on hadwrk03p ...

spawning 1 task(s) ................
waiting for 1 task(s) ................ { last:hadwrk03p }
./cmu_power finished

 

2nd action: manually shutting the node down, then running the backup from CMU.

Manual shutdown: OK.

Launching the backup from CMU: partially OK.

The node powers on, but goes directly into GRUB (it's supposed to boot into PXE/TFTP mode, correct?)

ILO_power_osoff_hadwrk03p.output

ILO_power_on_hadwrk03p.output

Both seem OK; no errors or discrepancies found.

cmu_backup_err from /opt/cmu/tmp:

10:22:58 hadmae1p(root):/opt/cmu/tmp> cat cmu_backup_err_7263375511447792300.tmp
<hadwrk03p> : error netbooting node: check CMU configuration, boot order, and BMC access. Debug info : NEW=1400832591 - ORIG=1400832373 : TIMEOUT=480

 

Sorry, we are not able to provide access to our customer network.

 

-How can I launch a command that makes the node boot into PXE/TFTP, and check what's happening on the server side?

 

Thanks for your support,

Nicolas.-

 

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 10 of 14 (935 Views)


Morning Nicolas,

Good to know that the node powers on when you power it on manually.

We still need to figure out why the node does not power on when we start a backup process, given that it powers on successfully with /opt/cmu/bin/cmu_power. (In fact, the CMU backup process uses the same command you tried.)

 

>>>The node powers on, but goes directly into GRUB (it's supposed to boot into PXE/TFTP mode, correct?)

 

Is the admin NIC (the one in the CMU database) PXE enabled?

Is PXE boot set before HDD in the node boot order? If not, you can set it through the iLO -> Virtual Power -> Bootorder.

 

When you start the backup process, is the backup node sending any DHCP requests? If yes, are the requests (packets) reaching the head node? You can check this by looking for the node's MAC address in /var/log/messages on the head node.
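
A minimal sketch of that check, assuming the DHCP server on the head node logs to /var/log/messages and that its log lines include the client MAC address. The function name dhcp_hits and the MAC in the usage comment are placeholders.

```shell
# Hypothetical helper: count log lines that mention a node's MAC address.
# Assumption: DHCP activity on the head node is logged to /var/log/messages.
dhcp_hits() {
    mac="$1"
    logfile="${2:-/var/log/messages}"
    # -i: MAC addresses may be logged in upper or lower case; -c: print a count.
    grep -i -c -- "$mac" "$logfile"
}

# Usage on the head node (replace with the node's real MAC address):
#   dhcp_hits 00:11:22:33:44:55
```

A count of 0 during a backup attempt would suggest the node's requests are not reaching the head node at all, which points toward the switch or the NIC rather than the CMU configuration.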

 

Also, is spanning tree enabled on the switch side (I saw it in your earlier posts)?

Is it possible to disable spanning tree? We have seen difficulties with PXE booting in the past with spanning tree enabled.

If it is not possible to disable spanning tree, you need to set the switch ports connected to the nodes to PortFast/edge-port mode to work around STP. With classic STP, a port can spend around 30 seconds in the listening and learning states after link-up before it forwards traffic, which is long enough for the PXE DHCP requests to time out; PortFast/edge-port mode skips that delay.

Once that is done at the switch level, manually shut down the backup node, then start the backup process again.

  

>>>How can I launch a command that makes the node boot into PXE/TFTP, and check what's happening on the server side?

 

You can view the console of the backup node from the management node by running:

 

# /opt/cmu/bin/cmu_console <nodename> vsp

 

Before that, make sure that the node's virtual serial port (VSP) is set to COM1 (ttyS0).

If it is set to COM2, change it to COM1 in the BIOS.

 

Let us know how it goes.

 

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 11 of 14 (924 Views)


Hi again!!!

 

Good news, CMU seems to work now.

Steps taken:

-Changed the boot sequence to put the NIC on top.

-Changed the VSP serial port configuration.

-No further changes made at switch level (STP is still on).

I don't fully understand why, but changing the serial port now seems to make the node boot into PXE.

DHCP offers are being sent/received. TFTP is working.

Launching Backup from CMU now boots the node into PXE (previously it would only shut it down).

However... and please tell me it isn't true, there's a new twist:

error retrieving fstab file: is root partition 'sda1' correct?

 

sda1 is /boot partition

Everything else is being handled under LVM.

 

So my question is (after reading posts and the user's guide I have not been able to find the answer):

Will we be able to use CMU to back up nodes that have their filesystems under LVM?

Please give me good news on this and tell me LVM is supported XD

Thanks!

Nicolas.-

 

 

 

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 12 of 14 (914 Views)


Hello Nicolas,

Good to know that the node is powering on and PXE/TFTP is working.

Unfortunately, I don't have good news for you. :-(

Currently, CMU doesn't support LVM on compute nodes. This is mentioned in the user guide, section 2.2.2 "Preinstallation limitations". The user guide is under /opt/cmu/www.

While taking a backup, you need to select the ROOT partition number (not the /boot partition).

From your earlier posts, you are using HPVSA. To make the backup work, you need to blacklist the ahci module.

This is because the ahci module loads before the RAID driver (hpvsa) inside the HP Insight CMU netboot environment during backup and cloning operations. ahci detects the B120i (SL4540 Gen8) as a normal SATA controller, so the RAID setup is not recognized on the nodes. To blacklist ahci, add modprobe.blacklist=ahci to the /opt/cmu/etc/bootopts/default file. This workaround is necessary only for Dynamic Smart Arrays based on the B120i.


For example:
APPEND root=/dev/nfs CMU_CONSOLE ramdisk_blocksize=512 CMU_VENDOR_ARGS ip=::::::bootp modprobe.blacklist=ahci

 

Do you have any other disks connected to other controllers, such as a P420i?

If yes, please blacklist the hpsa module as well by adding modprobe.blacklist=hpsa. Otherwise the hpsa module will load before hpvsa, and the OS disk sda may be detected as sdX inside the CMU netboot environment.
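
After editing the file, a quick check along these lines can confirm the options are really there. It is only a sketch: has_blacklist is a hypothetical helper name, and it matches just the exact spelling modprobe.blacklist=<module> used in this post, not a comma-separated list.

```shell
# Hypothetical helper: succeed if the boot options file blacklists a module.
# Assumption: the option is written exactly as modprobe.blacklist=<module>.
has_blacklist() {
    module="$1"
    file="${2:-/opt/cmu/etc/bootopts/default}"
    # -F: treat the pattern as a fixed string (the dot is literal); -q: status only.
    grep -qF -- "modprobe.blacklist=${module}" "$file"
}

# Usage on the management node:
#   has_blacklist ahci && echo "ahci is blacklisted"
#   has_blacklist hpsa || echo "hpsa is NOT blacklisted"
```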

 

Let us know if you face issues.

 

Occasional Advisor
NicolasR
Posts: 8
Registered: ‎05-21-2014
Message 13 of 14 (909 Views)


Thanks Chintala for your support.

 

Unfortunately we are using a Hadoop/LVM architecture because we have several disks.

In this case, that is a B120i (for the OS) + P420i (for big data), with filesystems of up to 24 TB.

We also tried setting the modprobe blacklist options, with no success (I guess your suggestion only applies when LVM is not in use, which is not our case).

So, seeing that we are not going to be able to handle this scenario, we will propose some changes and let you know how we proceed. It would be a pity not to use CMU as our backup/restore tool.

In any case, your help has been very useful.

Best Regards,

Nicolas.-

Advisor
Chintala
Posts: 12
Registered: ‎09-19-2013
Message 14 of 14 (901 Views)


Is it mandatory to use LVM on the OS disks as well?

You can have non-LVM disks for the OS (attached to the B120i controller) and an LVM architecture for the big data disks (attached to the P420 controller).

While taking the backup, select only the root partition of the OS disk. I believe a backup of the big data disks is not necessary, as they contain different data on different nodes.

 

 

Hope this helps !

 
