11-26-2006 09:27 PM
I used to work with HP-UX PA-RISC systems, and when there was an air-conditioning problem I would see a clear 'OVERTEMP_CRIT WARNING' message in the syslog.
Now I'm mostly dealing with HP-UX Itanium systems, and in the same scenario the syslog only shows this message:
EMS : EMS Event Notification
Value: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3")
event details: /opt/resmon/bin/resdata -R 293011458 -r /system/events/ia64_corehw/core_hw -n 293011457 -a
I know that I can retrieve more detail from the ELM message:
>-- Event Monitoring Service Event Notification --<
/system/events/ia64_corehw/core_hw is >= 3.
Its current value is MAJORWARNING(3).
Event data from monitor:
Event Time..........: Thu Nov 23 18:34:19 2006
Event #.............: 101011
System Temperature is at non-recoverable level.
Is there a way to get those 'old', clear messages back into the syslog on HP-UX Itanium?
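In the meantime, a stopgap I've considered (not an official HP mechanism) is to scan syslog for the terse EMS lines and re-log a clearer message, e.g. from cron. A minimal sketch, assuming the EMS line format shown above; the wording of the re-logged message is my own:

```shell
# Hypothetical workaround: translate a terse EMS notification into a clearer
# message. The sample line mirrors the syslog format quoted above.
line='EMS : Value: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3")'

case "$line" in
  *MAJORWARNING*ia64_corehw*)
    msg="WARNING: core hardware monitor reports MAJORWARNING (possible overtemp)"
    echo "$msg" ;;   # on a live system: logger -p daemon.warning "$msg"
esac
```

On a live system the message could be fed back into syslog with logger(1M), so it lands next to the EMS event.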
11-28-2006 02:34 AM
On PA systems, you will get an EMS event 33 from dm_core_hw when the first temperature threshold (Low) is reached, and you will get the OVERTEMP_CRIT message from envd in syslog at around the same time.
You then get an EMS event 34 when the next threshold (Mid) is reached, and envd should initiate a shutdown (according to the configuration in /etc/envd.conf).
If the last threshold (Hi) is reached, because the system is for some reason still running, it will be hard powered off.
On IPF systems, things have changed. At the Low threshold you should still see OVERTEMP_CRIT in syslog. You will also get an EMS event, from either fpl_em or ia64_corehw (depending on whether or not the system is cell-based). At the Mid overtemp threshold the system is shut down by the firmware, not by envd, and no events or messages get logged. This is a 'soft' shutdown. If the High threshold is reached, again it's a hard power-off.
What system type do you have, and what version of the OnlineDiags (STM)? There were some problems where envd was not getting notified at the Low threshold (so no OVERTEMP_CRIT) on non-cellular systems, but this should be fixed in current versions of the OnlineDiags.
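For reference, the shutdown behaviour on PA systems is driven by /etc/envd.conf, which pairs an event keyword with an action command. A sketch from memory (the exact event names and action commands are assumptions; check envd(1M) on your release):

```
# /etc/envd.conf (sketch) -- one "EVENT  action" pair per line
OVERTEMP_CRIT   /usr/bin/logger -p daemon.crit "envd: critical overtemp"
OVERTEMP_EMERG  /usr/sbin/reboot -qh
```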
11-28-2006 03:20 AM
Even better: always have at least N+1 cooling capacity, so that you can tolerate the failure of any one unit without problems. The problem with relying on a warning scheme is getting someone to actually shut down the equipment in a timely manner. The computer may shut itself down, but other devices such as disk and tape drives will continue to run even if the temperature excursion is extreme.
... and better still is to have a main breaker with an auxiliary trip coil, connected to a thermal switch that disconnects all power should a preset value be exceeded. This keeps you out of trouble should more than one of your HVAC units fail.
11-28-2006 03:39 AM
The question was about the apparent differences in warnings for overtemp conditions. It was NOT about different temperatures at which the warnings appear.
11-29-2006 08:57 PM
Thanks for the input.
I agree that in most cases logging temperature errors makes little sense, because the window for intervention is too small.
However, these systems are located on the other side of the planet, and the warnings give us a good way to find out what was going on.
11-29-2006 09:13 PM
C.46.05 is a pretty old version (HWE0409), and you should update to a recent version (the current version is HWE0609, C.54.00).
The fix to the problem that envd was not reporting OVERTEMP_CRIT was included in the HWE0603 (C.51.00) release, so for any earlier releases (such as the one you have) the problem will be seen (on some systems; I think it was non-cell systems that were affected, but my memory is a little hazy). See JAGaf79895 in http://www.docs.hp.com/en/diag/ems/emr_0603_1123.h
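To check which OnlineDiags release is installed, one can query SD-UX with swlist. A sketch; the bundle name 'OnlineDiag' and the sample revision string below are assumptions used to illustrate the parsing, so verify against your own depot:

```shell
# Hypothetical version check. On a live HP-UX system one would run:
#   swlist -l bundle OnlineDiag
# Here a sample output line stands in for the real command's output.
sample='OnlineDiag   B.11.23.06.09   HPUX 11.23 Support Tools Bundle'
rev=$(echo "$sample" | awk '{print $2}')
echo "OnlineDiag revision: $rev"
```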
07-16-2007 08:55 PM
Your comment of Nov 28 at 16.45 is particularly interesting to me, as I am trying to make the management here aware of the risks they are taking by not sorting out the air conditioning. Is there any documentation, or are there test results, regarding the long-term damage that high temperatures can do to systems?
07-16-2007 11:44 PM
> Your comment on Nov 28 at 16.45 is particularly interesting to me as I am trying to persuade the management here the risks they are taking by not sorting out the air cond. Is there any documentation or test results with regards to the long term damage that high temperatures can do to systems?
The documentation is exactly the same as what happens when your building burns down. Is there some sort of documentation stating that items in the building will be damaged? How much?
The situation is self-evident. Once the temperature goes over 95 °F (35 °C), bad things happen. They start with mysterious, intermittent firewall and router issues. Then tape drives start having errors or jamming with tangled tapes. Blade servers start crashing, because they already run very hot.
And after you bring the temperature down, the problems continue: electronic components do not heal themselves; they remain damaged forever. The damage appears as intermittent crashes and failures that diagnostics seldom locate, because the components are only slightly damaged. Most important of all: some equipment tolerates high temperatures much better than others. High-end HP servers have multi-speed and redundant fans to keep things cool, but your network equipment, big disk arrays, or other servers may not have similar protection. So the HP server will protect itself, but that is little comfort when the equipment around it has been damaged.
Ask your management if they would like their paycheck calculated on machines that intermittently drop a digit or change the sign of a few numbers. Or to have their bank give them a statement where just a few checks are missing and a few others from some other accounts have been added.
When will this happen? When the computer room is too warm for humans to inhabit, generally above 95 °F as mentioned before. Serious damage can also occur because equipment is placed too close together, so that inlet temperatures are well over 100 °F in certain spots. Add up the total cost to replace everything in the computer room. Would it be $100K, or a million dollars, maybe more? And even if the insurance company pays for the damaged equipment, how long would the data center be useless -- days? Weeks?
Quibbling over 50 thousand dollars to provide reliable cooling and prevent loss of a million dollars' worth of computer equipment is sheer folly.
07-17-2007 01:31 AM
As you well know, management doesn't tend to take kindly to sarcasm; they just want the facts. And that's what I'm after really: technical facts.
Such as: having had the temperature at an average of 35 °C for a number of weeks, what kind of long-term damage could we expect after the room is back to normal?
E.g. tape drives, hard drives, interfaces, motherboards, CPUs, etc.
If anyone can point me in the direction of some sort of tests and their results, I would be most grateful.
07-17-2007 02:36 AM
Firstly, Bill: unfortunately the internal temperature of most kit these days runs close to 30 °C in any case, and that's with decent cooling. Hot spots are already common, however.
You are right, though, that it's not good to trust the temperature readings. LTO-3 drives often report temperatures above the tape's maximum recommended value because, due to a design fault, the temperature sensor is on the opposite side of the unit, in an area of restricted airflow.
We did some unintentional testing in our computer room: once the airflow stopped, the room temperature rose to 40 °C (a 19 °C climb) within 5 minutes. No alarms were triggered, though, because the temperature sensor was in the outflow duct rather than in the room itself, and since there was no circulation...
Because of the high humidity in this area, we have had situations where summer temperatures in equipment rooms rose to in excess of 70 °C after air conditioners froze over. At that temperature solder starts to soften on printed circuit boards.
In overtemp situations you do have an increased risk of failure. We had a high incidence of dead-on-arrival stock, which we ascribed largely to the temperatures inside the delivery vehicles; all equipment is now shipped overnight for arrival during the early morning hours, or, if it's local, in the late afternoon.
The biggest heat-related problems are not so much with semiconductors, which these days are designed to tolerate higher temperatures for short periods, as with magnetic and mechanical equipment: disk drives develop bad sectors, platters start to warp, magnetic tapes stretch, and bearings seize. Having said that, we do see higher failure rates on equipment that has been overheated.
We have had a recommendation from a vendor to allow a cooling-off period after a high-heat shutdown before powering the equipment back up; the same applies to relocating equipment. It's not always feasible, but it's something to consider.