01-27-2004 10:52 PM
I'm not sure if I understand MC/SG logic correctly. Imagine the following failure situation: the primary node loses all network connections.
I believe that...
The primary node still runs its packages and keeps the cluster lock disk. The secondary node believes the primary node is down, tries to get the lock disk, fails, and then either stops its cluster software or TOCs. There is no standard way to force the primary node to stop its packages and release the cluster lock in case of a network failure.
Thanks and points in advance for your comments!
01-27-2004 10:58 PM
I am afraid the situation is as you say. If the primary node loses all network traffic and is the first to get the lock, then the second node stops its cluster software or TOCs. What you can do to prevent this from happening is:
Create a second, separate heartbeat LAN (crossover cable) or configure a serial heartbeat interface.
You can also take a look at the use of an arbitrator node.
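For reference, here is a sketch of what the per-node section of the cluster ASCII configuration file might look like with a dedicated crossover heartbeat LAN plus a serial heartbeat. The interface names, IP addresses, and device file are only examples; check your own hardware and the generated `cmquerycl` output:

```
NODE_NAME               node1
  NETWORK_INTERFACE     lan0          # data LAN, also carrying heartbeat
    HEARTBEAT_IP        192.168.1.10
  NETWORK_INTERFACE     lan1          # dedicated crossover heartbeat LAN
    HEARTBEAT_IP        10.0.0.1
  SERIAL_DEVICE_FILE    /dev/tty0p0   # optional RS-232 serial heartbeat
```

With two independent heartbeat paths, a failure of the data LAN alone no longer looks like a node death to the other side.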
01-27-2004 11:05 PM
You could look at using a serial heartbeat, although the general advice is not to use this, or you could put in a dedicated heartbeat LAN on a separate hub, on a secure power source.
The other option would be to use a quorum server rather than a lock disk. Be aware, though, that the quorum server can only be addressed via one subnet, so if that subnet were also to die you would lose connectivity to the QS. But again, that is in all probability a multiple point of failure.
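A quorum server is configured cluster-wide in the ASCII file in place of the lock-disk parameters. A sketch, with an example hostname and the timing values left at plausible defaults (verify the units and defaults against your Serviceguard release notes):

```
# Quorum server instead of FIRST_CLUSTER_LOCK_PV
QS_HOST                 qs-host.example.com
QS_POLLING_INTERVAL     300000000     # microseconds
QS_TIMEOUT_EXTENSION    2000000       # microseconds
```

The QS host itself must sit outside the cluster, reachable by all member nodes over that single subnet.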
SG is designed to guard against SPOF or Single Point of Failure.
01-27-2004 11:54 PM
When HB traffic entirely ceases, active nodes attempt to reform a cluster, all the while continuing to operate their packages. Those that "vote" into the new cluster continue to operate their packages without disturbance. Those that fail to get into the new cluster have to TOC (reboot).
In the situation you described - an entire loss of network connections on the primary - whichever node gets to the cluster lock disk first has essentially voted itself the new cluster, even if its NICs are dead.
As Melvyn said, Serviceguard provides a serial-heartbeat feature which enables Serviceguard to keep the current cluster running until SG can discriminate which node has all dead network paths. SG will then force -that- node out of the cluster via TOC.
G. Vrijhoeven mentioned the use of a dedicated crossover HB network - another method to ensure proper package ownership. This method allows nodes to continue to pass HB, giving nodes time to detect failed package subnets, which in turn can lead SG to perform a package failover to an adoptive node that still has access to the network.
Using a quorum server as an arbitration device would ensure the primary node does not succeed in keeping cluster ownership - since a network connection to the quorum server must be functional in order to remain in the cluster after loss of HB paths. A complete network failure on the primary would also prevent access to the quorum server, resulting in a TOC.
01-27-2004 11:58 PM
Just to clarify that when all is running well in a two node cluster, neither node owns the "lock disk".
The "lock disk" is used to break a tie in a two node cluster when the nodes cannot communicate.
So what would actually happen in your scenario is that it would start with a situation where both nodes were communicating and neither node owned the lock disk.
At the point where the primary node lost all of its network connectivity, each node would at some point sense that the other had died (from its point of view), and the two would race to get the lock disk.
If the primary node got there first, then it would indeed form a one-node cluster and try to run the packages. However, if the secondary node got there first, it would take over and the primary node would kill itself with a TOC.
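The tiebreak above can be sketched in a few lines of Python. This is only a toy model of the arbitration logic, not Serviceguard code: the first node to claim the lock disk forms the surviving one-node cluster, every later arrival must TOC.

```python
import threading

class LockDisk:
    """Toy stand-in for the SG cluster lock disk: first claimant wins."""
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_claim(self, node):
        with self._lock:
            if self.owner is None:
                self.owner = node
                return True
            return False

def arbitrate(nodes_in_arrival_order, lock_disk):
    """Nodes that lost heartbeat contact race for the lock.
    The winner reforms the cluster; losers must TOC (reboot)."""
    outcome = {}
    for node in nodes_in_arrival_order:
        won = lock_disk.try_claim(node)
        outcome[node] = "forms cluster" if won else "TOC"
    return outcome
```

Note that the model says nothing about network health: if the primary reaches the disk first, it "wins" even with every NIC dead, which is exactly the scenario in the original question.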
Kent M. Ostby
01-28-2004 04:22 AM
This is by design.
02-02-2004 10:10 PM
Thanks for the replies. Kent's reply was the most useful for me. A dedicated HB LAN is a good solution, but it requires additional NICs. Use of a serial heartbeat is prohibited by the docs if you have more than one heartbeat NIC. It would probably work, but I wouldn't like to run something as an 'unsupported configuration'.