10-09-2013 07:20 PM - edited 10-09-2013 08:24 PM
I have been given the unenviable task of chasing down a bug over a decade old.
We are currently using TCP/IP Services V5.4 on an AlphaServer running OpenVMS 7.3-2. We use this to process realtime data, which is pre-processed on a Solaris-based server and then sent to the Alpha via RPC. Since this is a realtime process, our "client" is really a bit of a server, with the callback mechanism being used to receive messages.
Every so often -- not VERY often, but often enough to worry some of our users -- we have a deserialization failure on receipt of what should be a double-precision float. Because we're working with a large collection of legacy code originally developed for the VAX, we compile everything to use D-floats. The failure, then, occurs in xdr_double_D.
One of our senior developers, now retired, must have noticed this back in 2001, because at that point he wrote a wrapper routine to detect these failures on decode operations. The wrapper simulates a successful deserialization but substitutes a value large enough to be obviously bogus, which it increments every time a failure is detected. He then modified our callback to notice these bogus values and log them so he could keep track of how often this was happening. Unfortunately, no one else understood this mechanism anymore, so it took me a little while to figure out what was going on, and the original developer never did any further work on the problem in the five-plus years he remained with us.
It may or may not be coincidental that we see these failures most often (perhaps 100% of the time, but I've been unable to verify that yet) when the expected value is 0.0.
The problem only recently came to the forefront when some of our users were pumping simulated data through the system, and certain parameters were artificially held at 0.0 when in real life they'd only pass through it momentarily. They would like to know they can rely on the results of their simulation, and I'd like to be able to tell them so.
We do not have the VMS source code licensed, but tracking down some of the original ONC RPC source from Sun, I see it was originally written to work on a VAX, where D-floats were the only floating-point format. If the current xdr_double_D works by the same method as the original xdr_double compiled for a VAX, the only way it can fail is by passing on a failure from one of its two calls to XDR_GETLONG, which is itself a macro calling a function pointed to by the XDR stream handle. Conversion after that point from the serialized format (IEEE) to a D-float is straightforward bit manipulation and, as originally written, cannot fail.
Can anyone give me any insight into possible underlying causes for an unsuccessful XDR_GETLONG call? Or does perhaps someone know of any other causes for this failure that might be unique to the VMS implementation for the Alpha?
I am also trying to track this down on the Solaris side with the contractor who provides that server, but I'd like to be able to guarantee him that the problem is not on the VMS side.
A possibly related issue is that we sometimes see values "wobble" around their known actual values by a low-order bit or two, which doesn't affect the values users see at the precision they typically use, but which we can see in the hex. Given that floating-point conversion ought to be pure bit-twiddling, with nothing but integer arithmetic used to adjust exponent biases, I don't see how I can ascribe this to the usual floating-point rounding errors.
Thanks for any responses I might receive!