01-14-2014 07:08 AM
I would like to submit an issue we have on our newest ServiceGuard cluster running on a pair of BL860 i2 blades.
This has been reported to HP but so far very little progress has been made. Since I really can't see what's so specific about our configuration that makes us have this problem, I would suppose that others have been bitten by this too, and I would love to hear what you've done to work around it.
Sorry, this is going to be a lengthy post because I need to provide all the relevant information. Please bear with me and many thanks in advance for those who will take the time to read it.
Any information you can provide on these issues, including "we have this too", is more than welcome.
Problem 1: need to explicitly and manually set a default route for every ServiceGuard package running on the machine (except those packaging SRP containers) to maintain their network connectivity with other VLANs.
A bit of background information first: these servers have 3 active LAN interfaces per server:
- lan2 in network 10.149.160.0/24: production traffic
- lan4 in network 10.149.247.0/24: administrative traffic
- lan5 in network 192.168.2.0/24: ServiceGuard heartbeat traffic
Initially, only one default gateway was defined to 10.149.160.254 in /etc/rc.config.d/netconf, hence on interface lan2.
ServiceGuard packages running on this machine have IP addresses in network 10.149.160.0/24 hence they come up as secondary lan2:X interfaces.
Such a configuration makes the administrative IP address 10.149.247.X unreachable from any of our IP networks (we have 10.149.0.0/16) except from 10.149.247.0/24 itself, unless we force ip_strong_es_model=2 in the network stack parameters, which is not the default.
We used to do this on our two older clusters, but this one is supposed to host SRP containers which *require* ip_strong_es_model=1 so that's not an option.
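For reference, here is how we used to force the weak end-system model on the older clusters. This is a sketch from memory of the usual ndd/nddconf incantation on 11iv3 (double-check the tunable against your own `ndd -h` output):

```
# Check the current end-system model:
#   1 = strong ES model (the default, and required by SRP)
#   2 = weak ES model (what we ran on the older clusters)
ndd -get /dev/ip ip_strong_es_model

# Force the weak model at runtime (NOT an option on an SRP host):
ndd -set /dev/ip ip_strong_es_model 2

# To persist across reboots, the matching entry in /etc/rc.config.d/nddconf:
#   TRANSPORT_NAME[0]=ip
#   NDD_NAME[0]=ip_strong_es_model
#   NDD_VALUE[0]=2
```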
Therefore we have added a second default gateway to 10.149.247.254 in the network boot parameters /etc/rc.config.d/netconf. This solves the network connectivity for the administrative (lan4) IP.
The heartbeat interface (lan5) is obviously not affected: it doesn't get any traffic from outside its own private network.
However, we have noticed that:
- SG packages running a SRP container work fine (no IP connectivity issue)
- "plain" SG packages (e.g. running an Oracle DB engine) are unreachable from any other network than 10.149.160.0/24. We need to explicitly add a separate default gateway for each and every package.
For example: assume a package whose IP address is 10.149.160.123, which comes up as lan2:1. Full IP connectivity can only be achieved if the following command is issued during package startup:
/usr/sbin/route add default 10.149.160.254 1 source 10.149.160.123
The new default route appears as follows in "netstat -rn":
default 10.149.160.254 UG 0 lan2:1 1500
The "normal" SG script that takes care of bringing up the package's IP address (namely /etc/cmcluster/scripts/sg/package_ip.sh) DOES NOT do this. However, the additional script run when a package hosting a SRP container is started (/etc/cmcluster/package_name/srp_route_script) DOES do it. So someone at HP must have realized this was required, but why hasn't it been backported to the main SG scripts?
We've eventually resorted to patching package_ip.sh to add the needed default gateway at package startup and remove it at package shutdown, but this really is an ugly hack I'm not proud of.
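For anyone curious about the hack, this is the gist of what we patched in (the variable names are placeholders, and I believe route delete accepts the same source modifier as route add, but check route(1M) on your release):

```
# At package startup, after the relocatable IP has been configured:
#   pkg_ip  = package IP, e.g. 10.149.160.123
#   gateway = default gateway on that subnet, e.g. 10.149.160.254
/usr/sbin/route add default $gateway 1 source $pkg_ip

# At package shutdown, before the relocatable IP is removed:
/usr/sbin/route delete default $gateway 1 source $pkg_ip
```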
Problem 2: outgoing connections from application processes within SG packages (including ones made by standard HP-UX commands such as remsh) have completely unpredictable source IP addresses.
This is an entirely new issue. No such behaviour has ever taken place on the two other BL860 clusters running HP-UX 11iv3. We observe that outgoing connections made from processes running within SG packages have unpredictable and changing source IP addresses. Since all packages have IP addresses within 10.149.160.0/24, we would expect the source IP address to be the one set to lan2 at boot. This CERTAINLY was the case on the older machines.
We observe that these IP addresses can be ANY of the addresses set on lan2 i.e. the address of any active SG package. It does vary over time too. Starting a SG package tends to make the source IP address for outgoing connections made by any process running on this machine "stick" to the address of the newly started package... until another one is started.
This makes things like maintaining .rhosts files on remote servers that receive remsh or rcp commands from a SG package completely impossible, since we would have to account for every possible package address becoming the source IP of the connections they get.
The only reply we've had so far from HP: "application processes need to be bound to their socket by their IP address only and not by ANY address" is completely unacceptable for many reasons such as:
- we're not going to hardcode the IP address assigned to the relevant SG package into the source of any of our applications
- for several applications, the source code is no longer available (or never was), and/or they're HP-PA applications running under ARIES
- this also affects standard HP-UX commands such as remsh, rcp, ftp etc. We're not going to recode these, or are we?
Thanks for your time reading this,
01-21-2014 07:35 AM
This document discusses routing in an SRP environment:
This link may get you to it:
01-22-2014 10:23 AM
DB engine should run in a container.
01-28-2014 07:34 AM
Thanks for your replies,
@Stephen: have you actually read this document? It deals with the creation of a SRP package and Oracle configuration, but says little if anything about the IP routing issues.
@Laurent: you seem to imply that you can't have a mix of "regular" SG packages and SG-packaged SRP containers on the same server. Where do you get this information from? Our HP support folks certainly haven't complained the slightest bit about us having both on the same box...
As for the problem supposedly already being present, although less likely, before SRP packages: in that case there must be orders of magnitude of difference. Our SG packages have been running for quite a few years on the two other clusters with ip_strong_es_model=2, making hundreds if not thousands of connections per day, and we have *never* encountered such a problem. The source IP address of these outgoing connections has always been predictable and equal to the IP address of the server itself on that interface (in my example, the IP address of lan2).
Now it tends to take the IP address of the last package started on the server, many times per day.
This problem is already known to HP because SG daemons themselves can be affected, as I've been told. Sometimes intra-cluster connections made by these daemons are rejected by the target node because the source IP is not recognized as belonging to the source node. This can cause host panics due to safety timer expiration. A local HP support folk told me that (quote) "an upcoming release of Serviceguard will address the issue by forcing daemons to bind their sockets explicitly to the native IP address of the server instead of to 'any'" (end quote).
This document covers the routing issues quite nicely although it doesn't provide a solution.