Super Advisor
Franky Leeuwerck_2
Posts: 192
Registered: ‎02-18-2004
Message 1 of 15 (711 Views)
Accepted Solution

critical temperature warning difference ia64 and PA-RISC ?

Hi,

I used to work with HP-UX PA-RISC systems, and whenever there was an air-conditioning problem I would see a clear 'OVERTEMP_CRIT WARNING' message in the syslog.

Now I mostly deal with HP-UX Itanium systems, and in the same scenario the syslog shows only this message:

EMS [4471]: EMS Event Notification
Value: "MAJORWARNING (3)" for Resource: "/system/events/ia64_corehw/core_hw" (Threshold: >= " 3")
event details: /opt/resmon/bin/resdata -R 293011458 -r /system/events/ia64_corehw/core_hw -n 293011457 -a

I now know that I can retrieve more detail from the ELM message:

>-- Event Monitoring Service Event Notification --<

/system/events/ia64_corehw/core_hw is >= 3.
Its current value is MAJORWARNING(3).
Event data from monitor:

Event Time..........: Thu Nov 23 18:34:19 2006
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: myHost.mydomain
Summary:
System Temperature is at non-recoverable level.


Is there a way to get those 'old', clear messages back into the syslog on an HP-UX Itanium system?

Regards
Franky
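
One possible approach to the question above, sketched under assumptions: a small filter that condenses the EMS notification text into a single readable line that could then be handed to syslog. The function names and the one-line output format are illustrative and are not part of EMS; only the field labels are taken from the notification quoted above.

```python
import re

# Matches dotted "Label.......: value" lines as they appear in the notification.
FIELD_RE = re.compile(r"^\s*([A-Za-z#][A-Za-z# ]*?)\.+:\s*(.*)$")

# Sample text copied from the notification quoted in this post.
SAMPLE = """\
Event Time..........: Thu Nov 23 18:34:19 2006
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011
System..............: myHost.mydomain
Summary:
System Temperature is at non-recoverable level.
"""


def parse_ems_event(text):
    """Return a dict of the labelled fields found in an EMS notification."""
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
        elif line.strip() == "Summary:":
            # The summary text is on the first non-empty line that follows.
            for following in lines[i + 1:]:
                if following.strip():
                    fields["Summary"] = following.strip()
                    break
    return fields


def format_syslog_line(fields):
    """Build one human-readable line (hypothetical format) from the fields."""
    return "{}: {} event {}: {}".format(
        fields.get("Monitor", "unknown"),
        fields.get("Severity", "UNKNOWN"),
        fields.get("Event #", "?"),
        fields.get("Summary", "no summary"),
    )


if __name__ == "__main__":
    print(format_syslog_line(parse_ems_event(SAMPLE)))
```

The resulting line could be passed on to syslog, for example through the logger command. How to hook such a filter into EMS notification delivery is not covered here.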
Trusted Contributor
Calandrello
Posts: 277
Registered: ‎12-22-2004
Message 2 of 15 (711 Views)

Friend,
if the system temperature reaches a certain threshold, the server will go into halt mode.
Honored Contributor
Andrew Merritt_2
Posts: 1,023
Registered: ‎04-02-2002
Message 3 of 15 (711 Views)

Hi Franky,
On PA systems, you will get an EMS event 33 from dm_core_hw when the first temperature threshold (Low) is reached, and you will get the OVERTEMP_CRIT message from envd in syslog at around the same time.

You then get an EMS event 34 when the next threshold (Mid) is reached, and envd should initiate a shutdown (according to the configuration in /etc/envd.conf).

If the last level (Hi) is reached, because the system is still running for some reason, it will be powered off.

On IPF systems, things have changed. At the Low threshold, you should still be seeing OVERTEMP_CRIT in syslog. You will also get an EMS event, from either fpl_em or ia64_corehw (it depends on whether the system is cell-based or not). At the Mid overtemp threshold, the system will be shut down by the firmware, not by envd, and no events or messages get logged. This is a 'soft' shutdown. If the High threshold is reached, again it's a hard power off.

What system type do you have, and what version of the OnlineDiags (STM)? There were some problems where envd was not getting notified of the Low threshold (OVERTEMP_CRIT) on non-cellular systems, but this should be fixed in current versions of the OnlineDiags.

Andrew
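
The threshold behaviour described above can be summarised as a small lookup table. This is only a restatement of the post in code form; the dictionary layout and the function name are illustrative assumptions, not anything provided by EMS or envd.

```python
# (platform, threshold) -> (what gets logged, what happens to the system)
# Contents restate the description in this post; nothing here queries a system.
OVERTEMP_BEHAVIOUR = {
    ("PA", "low"): (
        "EMS event 33 from dm_core_hw; OVERTEMP_CRIT from envd in syslog",
        "system keeps running",
    ),
    ("PA", "mid"): (
        "EMS event 34",
        "envd initiates a shutdown according to /etc/envd.conf",
    ),
    ("PA", "high"): (
        "not stated in the post",
        "system is powered off",
    ),
    ("IPF", "low"): (
        "OVERTEMP_CRIT in syslog; EMS event from fpl_em or ia64_corehw",
        "system keeps running",
    ),
    ("IPF", "mid"): (
        "no events or messages",
        "soft shutdown by the firmware, not by envd",
    ),
    ("IPF", "high"): (
        "not stated in the post",
        "hard power off",
    ),
}


def describe(platform, threshold):
    """Return a one-line description for a platform/threshold pair."""
    logged, action = OVERTEMP_BEHAVIOUR[(platform.upper(), threshold.lower())]
    return "{} {}: logged = {}; action = {}".format(
        platform.upper(), threshold.lower(), logged, action
    )


if __name__ == "__main__":
    for key in OVERTEMP_BEHAVIOUR:
        print(describe(*key))
```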


Acclaimed Contributor
A. Clay Stephenson
Posts: 17,825
Registered: ‎07-16-1998
Message 4 of 15 (711 Views)

NOTE: You are asking the fundamentally wrong question because computers are lousy thermometers (and clocks). Go buy yourself a digital thermometer that has either a serial interface or a network connection. Even inexpensive models these days actually have a web interface. This will allow you to write your monitoring script one time and never have to worry about changes in software or models.

Even better, always have at least N + 1 cooling capacity so that you can tolerate the failure of any 1 unit without problems. The problem with relying upon a warning scheme is getting someone to actually shut down the equipment in a timely manner. The computer may shut itself down, but what about other devices such as disk and tape drives, which will continue to run even if the temperature excursion is extreme?

... and even better still is to have an auxiliary trip coil equipped main breaker connected to a thermal switch that will disconnect all power should a preset value be exceeded. This keeps you out of trouble should more than 1 of your HVAC units fail.
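
The write-once monitoring script suggested at the start of this post might look roughly like the sketch below. It assumes a thermometer that returns a plain-text Celsius reading over HTTP; the URL, the response format, and the 30 C / 35 C limits are all hypothetical and would need to match the actual device and site.

```python
import urllib.request

# Hypothetical address of a networked thermometer; replace with the real one.
THERMOMETER_URL = "http://thermometer.example/temperature"


def read_temperature_c(url=THERMOMETER_URL, timeout=5.0):
    """Fetch a reading, assuming the device answers with e.g. '23.5'."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return float(response.read().decode("ascii").strip())


def check_temperature(read_fn, warn_c=30.0, crit_c=35.0):
    """Classify one reading; the limits are assumptions, not vendor values."""
    temperature = read_fn()
    if temperature >= crit_c:
        status = "CRITICAL"
    elif temperature >= warn_c:
        status = "WARNING"
    else:
        status = "OK"
    return status, temperature


if __name__ == "__main__":
    try:
        status, temperature = check_temperature(read_temperature_c)
        print("{}: room temperature is {:.1f} C".format(status, temperature))
    except (OSError, ValueError) as error:
        # An unreachable or unreadable sensor is itself worth reporting.
        print("UNKNOWN: could not read thermometer: {}".format(error))
```

Passing the reader in as a function keeps the classification independent of the device, so only read_temperature_c would need to change for a different thermometer model.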



If it ain't broke, I can fix that.
Honored Contributor
Andrew Merritt_2
Posts: 1,023
Registered: ‎04-02-2002
Message 5 of 15 (711 Views)

A. Clay, I understand the point you are making, but you're not answering the question that was asked.

The question was about the apparent differences in warnings for overtemp conditions. It was NOT about different temperatures at which the warnings appear.

Andrew
Super Advisor
Franky Leeuwerck_2
Posts: 192
Registered: ‎02-18-2004
Message 6 of 15 (711 Views)

Hi everyone,

Thanks to all of you for your answers, Clay too, even though that was not the question.

Franky
Acclaimed Contributor
A. Clay Stephenson
Posts: 17,825
Registered: ‎07-16-1998
Message 7 of 15 (711 Views)

and the correct answer is still "Big Woo" because by the time any of you have done anything about this, it's almost certainly too late and permanent damage has already been done. One of the common exercises done in an intermediate electronics lab is to measure the characteristics of a semiconductor device such as a junction transistor and then subject the device to an intentional over-temperature condition and measure those same characteristics after the device has cooled back down to the original temperature. The characteristics have been permanently (and irreversibly) changed. That's why measuring does little good and even monitoring schemes with a (real) thermometer are of limited value because by the time someone gets there, it's often much too late and potentially millions of dollars of damage has been done.
If it ain't broke, I can fix that.
Super Advisor
Franky Leeuwerck_2
Posts: 192
Registered: ‎02-18-2004
Message 8 of 15 (711 Views)

I'll come back to this topic tomorrow; right now I urgently need to pick up the kids.

See you..
Super Advisor
Franky Leeuwerck_2
Posts: 192
Registered: ‎02-18-2004
Message 9 of 15 (711 Views)

Hi Clay,

Thanks for the input.
I agree that in most cases logging temperature errors makes no sense, because the time available to intervene is too short.
However, these systems are located on the other side of the planet, and the warnings give us a good idea of what was going on.

Franky
Super Advisor
Franky Leeuwerck_2
Posts: 192
Registered: ‎02-18-2004
Message 10 of 15 (711 Views)

Andrew,

The STM version we run is:
Support Tools Manager, Version C.46.05, Product Number B4708AA


Franky
Honored Contributor
Andrew Merritt_2
Posts: 1,023
Registered: ‎04-02-2002
Message 11 of 15 (710 Views)

Hi Franky,
C.46.05 is a pretty old version (HWE0409), and you should update to a recent version (the current version is HWE0609, C.54.00).

The fix to the problem that envd was not reporting OVERTEMP_CRIT was included in the HWE0603 (C.51.00) release, so for any earlier releases (such as the one you have) the problem will be seen (on some systems; I think it was non-cell systems that were affected, but my memory is a little hazy). See JAGaf79895 in http://www.docs.hp.com/en/diag/ems/emr_0603_1123.htm

Andrew
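
As a rough illustration of the version check described above, the sketch below compares an STM version string against C.51.00, the release named in this post as containing the envd fix. The parsing assumes version strings always have the form C.NN.NN, as in the two examples in this thread; the function names are made up for the example.

```python
def parse_stm_version(version):
    """Turn a string such as 'C.46.05' into a comparable tuple (46, 5)."""
    prefix, major, minor = version.strip().split(".")
    if prefix != "C":
        raise ValueError("unexpected STM version format: {!r}".format(version))
    return int(major), int(minor)


# Release that included the OVERTEMP_CRIT fix, according to the post (HWE0603).
FIX_RELEASE = parse_stm_version("C.51.00")


def has_envd_overtemp_fix(version):
    """True if the given STM version is at or after the fix release."""
    return parse_stm_version(version) >= FIX_RELEASE


if __name__ == "__main__":
    for candidate in ("C.46.05", "C.51.00", "C.54.00"):
        print(candidate, has_envd_overtemp_fix(candidate))
```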
Advisor
Matthew Holdsworth
Posts: 20
Registered: ‎07-10-2007
Message 12 of 15 (710 Views)

A.Clay,

Your comment on Nov 28 at 16.45 is particularly interesting to me, as I am trying to persuade the management here of the risks they are taking by not sorting out the air conditioning. Is there any documentation or test results with regard to the long-term damage that high temperatures can do to systems?
Honored Contributor
Bill Hassell
Posts: 14,226
Registered: ‎05-29-2000
Message 13 of 15 (710 Views)

Matthew writes:

> Your comment on Nov 28 at 16.45 is particularly interesting to me, as I am trying to persuade the management here of the risks they are taking by not sorting out the air conditioning. Is there any documentation or test results with regard to the long-term damage that high temperatures can do to systems?

The documentation is exactly the same as what happens when your building burns down. Is there some sort of documentation stating that items in the building will be damaged? How much?

The situation is self-evident. Once the temperature goes over 95 degrees F (35 C), bad things will happen. They will start with mysterious, intermittent firewall and router issues. Then tape drives start having errors or start jamming with tangled tapes. Blade servers start crashing because they already run very hot.

And after you bring the temperature down, the problems continue -- electronic components do not heal themselves, they remain damaged forever. And the damage will appear as intermittent crashes and failures that diagnostics seldom locate because the components are only slightly damaged. And MOST important: some equipment is much better at tolerating high temperatures than others. High end HP servers have multi-speed and redundant fans to keep things cool. But your network equipment, big disk arrays or other servers may not have similar protection. So the HP will protect itself but it's useless because of other equipment that has been damaged.

Ask your management if they would like their paycheck calculated on machines that intermittently drop a digit or change the sign of a few numbers. Or to have their bank give them a statement where just a few checks are missing and a few others from some other accounts have been added.

When will this happen? When the computer room is too warm for humans to inhabit, generally above 95 F as mentioned before. Also, serious damage can occur because equipment is placed too close together so that inlet temperatures are well over 100 F in certain spots. Add up the total cost to replace everything in the computer room. Would it be $100K or a million dollars, maybe more? And even if the insurance company pays for the damaged equipment, how long would the data center be useless -- days? weeks?

Quibbling over 50 thousand dollars to provide reliable cooling and prevent loss of a million dollars' worth of computer equipment is sheer folly.
Advisor
Matthew Holdsworth
Posts: 20
Registered: ‎07-10-2007
Message 14 of 15 (710 Views)

Bill,

I think you know full well that management don't tend to take kindly to sarcasm; they just want the facts. And that's what I'm after, really: technical facts.

For example: having had the temperature at an average of 35 deg. C for a number of weeks, what kind of long-term damage could we expect after the room is back to normal?
E.g. tape drives, hard drives, interfaces, motherboards, CPUs, etc.

If anyone can point me in the direction of some tests and their results, I would be most grateful.
Honored Contributor
Andrew Young_2
Posts: 504
Registered: ‎01-01-2003
Message 15 of 15 (710 Views)

Hi.

Firstly, Bill: unfortunately, the internal temperature of most kit these days runs close to 30C in any case, and that's with decent cooling. Hot spots are already common.

You are right, however, that it's not good to trust the temperature readings. LTO3 drives often report temperatures above the tape's maximum recommended value because, due to a design fault, the temperature sensor is on the opposite side of the unit in an area of restricted airflow.

We did some unintentional testing in our computer room: once the airflow stopped, temperatures in the room rose to 40C (a 19C climb) within 5 minutes. No alarms were triggered, though, because the room temperature sensor was in the outflow duct and not in the room itself, and since there was no circulation....

Because of the high humidity we have in this area, we have had situations where summer temperatures in equipment rooms have risen to in excess of 70C after air conditioners have frozen over. At that temperature the solder starts melting on printed circuit boards.

But in overtemp situations you do have an increased risk of failure. We have a high incidence of dead-on-arrival stock, and we ascribe it purely to the temperatures inside the delivery vehicles. All equipment is now shipped overnight for arrival during the early morning hours or, if it's local, in the late afternoon.

The biggest heat-related problems are not so much with semiconductors, which these days are designed with higher temperature thresholds over short periods of time, but with magnetic and moving equipment. Disk drives develop bad sectors, platters start to warp, magnetic tapes start to stretch, and bearings seize. Having said that, we do experience higher failure rates on overtemp equipment.

We have had a recommendation from a vendor to allow a cooling-off time after high-heat shutdowns before starting the equipment up. This also applies to relocating equipment. It's not always feasible, but it is something to consider.

HTH

Andrew Y
Si hoc legere scis, nimis eruditionis habes