05-17-2012 07:34 AM
Hello peers, I am running a two-node HP-UX 11.31 cluster. The Serviceguard version is 11.20.
In my package configuration file I have set service_restart to unlimited for my oracle_monitor and oracle_monitor_listener services, on the assumption that I would never want to stop monitoring these Oracle processes. However, during testing I killed the Oracle pmon process and expected the monitor script to recognise the failure and move the package to the other node. Instead, the service process is restarted over and over and the package remains in a hung state. Does this mean you would never want to use the 'unlimited' option? I thought it would distinguish between a valid monitor failure and the service just dying.
Hope someone can clear this up for me.
05-22-2012 01:18 AM
The unlimited restart option is meant for situations where the service process is not a monitor but is itself the service you want to make fault-tolerant, i.e. simple applications where the entire application consists of a single process. Since Oracle is more complex, it cannot really be started this way.
Using the services as monitor scripts is actually an extension of the original concept. When the monitor scripts are running as services, they have a very limited way of communicating to Serviceguard: basically, if the monitor keeps running, Serviceguard understands it is fine, and if it dies, something is wrong.
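Because a monitor can only signal a problem by exiting, the restart setting decides what Serviceguard does with that exit. A rough sketch of a monitor-style service entry for a modular package follows; the script path is hypothetical, and the value none is my assumption of what fits a pure monitor, so check it against the package configuration template for your release.

```
# Monitor-style service: the first exit of the monitor is treated as a failure.
# The path below is a hypothetical placeholder, not taken from the original post.
service_name       oracle_monitor
service_cmd        "/etc/cmcluster/pkg/oracle_monitor.sh"
service_restart    none
```

With unlimited in place of none, Serviceguard simply starts the monitor again each time it exits, which matches the endless restarts described in the question.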
Production-quality software should never "just die" without a good reason, monitor scripts included. If your monitor script frequently "just dies" for no good reason, you should find out why it dies and add some logic to the script so that it avoids needlessly dying in that situation.
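As a minimal sketch of that kind of logic, the loop below keeps running while a health check succeeds and gives up only after several consecutive failed checks, so a single transient error does not end the monitor. It is written in Python purely for illustration (real Serviceguard monitors are typically shell scripts); the names monitor and process_running, the threshold of 3, and the 30 second interval are all assumptions, not part of any Serviceguard toolkit.

```python
import os
import subprocess
import time


def process_running(pattern):
    """Return True if some process other than this one has `pattern` in its ps -ef line."""
    out = subprocess.run(["ps", "-ef"], capture_output=True, text=True, check=False).stdout
    own_pid = str(os.getpid())
    for line in out.splitlines():
        fields = line.split()
        # In `ps -ef` output the second column is the PID; skip our own entry.
        if len(fields) > 1 and fields[1] != own_pid and pattern in line:
            return True
    return False


def monitor(check, max_failures=3, interval=30.0, sleep=time.sleep):
    """Loop while `check()` succeeds; return 1 after `max_failures` consecutive failures."""
    failures = 0
    while True:
        try:
            healthy = bool(check())
        except Exception:
            # An error inside the check counts as one failed check
            # instead of crashing the monitor itself.
            healthy = False
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            return 1  # the caller exits with this status, signalling a real failure
        sleep(interval)
```

A script built on this might end with `sys.exit(monitor(lambda: process_running("ora_pmon_ORCL")))`, where the SID ORCL is hypothetical. The point of the design is that the monitor exits only when it has decided the database is really down, so its exit is a meaningful signal to Serviceguard.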