10-05-2010 06:33 AM
We have an ES40 Alpha server with four EV6.7 (21264A) 667 MHz processors and 3 GB of RAM, running Tru64 UNIX V5.1B (Rev. 2650).
Periodically the system just resets, and because we have the AUTO_ACTION parameter set to reboot, it reboots. There are no errors in the binary error log and no messages in the messages files.
If anyone can provide input on why this may be occurring, it would be greatly appreciated.
10-05-2010 10:50 AM
Do you really mean "periodically", that is,
"at regular intervals", like, say, every day
at 00:01, or do you really mean
"occasionally", as in "repeatedly and
Are you getting crash dumps?
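One rough way to answer the periodic-versus-occasional question is to compare the gaps between reset times. The following is only a sketch: the function names, the 10% tolerance, and the sample timestamps are all hypothetical, and the reset times would have to be gathered by hand from whatever record of boot times is available.

```python
from datetime import datetime
from statistics import mean, pstdev


def interval_hours(timestamps):
    """Return the gaps, in hours, between consecutive reset times."""
    times = sorted(datetime.strptime(t, "%m-%d-%Y %H:%M") for t in timestamps)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(times, times[1:])]


def looks_periodic(timestamps, tolerance=0.1):
    """True if the gaps vary by less than `tolerance` relative to their mean."""
    gaps = interval_hours(timestamps)
    if len(gaps) < 2:
        return False  # too few resets to judge either way
    avg = mean(gaps)
    if avg == 0:
        return False
    # Coefficient of variation: spread of the gaps relative to their average.
    return pstdev(gaps) / avg < tolerance


# Hypothetical reset times, NOT taken from the system discussed in this thread.
daily = ["10-01-2010 00:01", "10-02-2010 00:01",
         "10-03-2010 00:02", "10-04-2010 00:01"]
scattered = ["09-03-2010 14:20", "09-05-2010 02:45",
             "09-19-2010 21:10", "10-05-2010 06:10"]

print(looks_periodic(daily))      # gaps of about 24 hours each
print(looks_periodic(scattered))  # gaps ranging from about 1.5 to 15 days
```

Regular gaps would point toward something scheduled (a cron job, a timed power event), while widely scattered gaps fit better with an intermittent hardware fault.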
10-05-2010 11:21 PM
Also, check further back over recent weeks/months in the binary errlog and see if there were any hardware-related warnings.
10-05-2010 11:27 PM
This happens at irregular intervals for no apparent reason that we can diagnose. The binary errlog is clean, there are no crash dumps, and there are no messages in the messages file.
10-06-2010 04:24 AM
no sign of an organized crash or shutdown,
then I'd tend to suspect a power problem
(supply interruption, or failing power
supply). I'm not sure how one might easily
test that bad-power-supply hypothesis,
however, other than swapping out the
10-07-2010 12:11 AM
I had several cases like yours. The problem was a bad memory DIMM.
Schedule some machine downtime and run memory tests.
>>> memexer 3
It can take a very long time to find a failing DIMM.
10-10-2010 03:37 PM
Use the console and RMC to check the system health via the env and status commands. Clear any ALERTS and look for anything out of spec. See these guides for reference:
After the next unplanned restart, use the RMC env command to see which alerts are set.
Run '# consvar -s auto_action halt'; then, the next time the system halts, look in the NVRAM error logs via SRM commands.
Capture the following...
P00>>> show power
P00>>> cat el
P00>>> sho fru
P00>>> show error