superdome EMS report I/O link error (128 Views)
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 1 of 16 (128 Views)
Accepted Solution

superdome EMS report I/O link error

Hi, my Superdome's EMS has been reporting an I/O link error for months, but the system has been running normally. I suspect that the REO link cable is defective. The problem details are as follows:

FRU Physical Location: 0x00ffff01ffffff93
FRU Source = 9 (cell)
Source Detail = 3 (coherency controller)
Cabinet Location = 0
Cell Location = 1

RIN_ERR_PRI_MODE..........: 0x0000000000000008
REO input single wire error

CECC_DATA_MSB_0...........: 0x0000000000000383
CECC_DATA_LSB_0...........: 0xc064d030322e0c4e
CECC_DATA_MSB_1...........: 0x0000000000000383
CECC_DATA_LSB_1...........: 0xc064d030322e0c4e


>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Sun Dec 16 11:42:00 2007

dds2 sent Event Monitor notification information:

/system/events/core_hw/core_hw is >= 1.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Sun Dec 16 11:41:59 2007
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 85
System..............: dds2

Summary:
I/O link interface to cell controller recovered errors


Description of Error:

The cell controller (CC) chip has detected and corrected multiple errors
in data transferred to it from the I/O bus adapter (REO) chip to which it
is connected.

Probable Cause / Recommended Action:

The inbound I/O link cable is unreliable.
Contact your HP support representative to check the inbound I/O link
cable.

There may be a problem with the CC chip or cell board.
Contact your HP support representative to check the cell board.

There may be a problem with the I/O backplane.
Contact your HP support representative to check the I/O backplane.

Additional Event Data:
System IP Address...: 10.93.4.12
Event Id............: 0x47649e8800000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 3
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/SD32000
OS Version......................: B.11.11
STM Version.....................: A.29.00
EMS Version.....................: A.03.20
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#85
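For anyone who wants to pull the fields out of a notification like the one above, here is a minimal sketch in Python. It assumes only the dotted "Key.....: value" layout visible in this post; the function name `parse_ems_fields` is made up for illustration and is not part of EMS or STM.

```python
import re

# Matches lines such as "Event #.............: 85": a key, a run of two or
# more dots, a colon, then the value. This layout is inferred from the
# notification text in this thread, not from any EMS specification.
FIELD_RE = re.compile(r"^\s*(?P<key>[^.:]+?)\s*\.{2,}:\s*(?P<value>.*)$")

def parse_ems_fields(text):
    """Return a dict of the dotted 'Key.....: value' fields in the text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key").strip()] = match.group("value").strip()
    return fields

# Lines copied from the notification above.
sample = """\
Event Time..........: Sun Dec 16 11:41:59 2007
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 85
System..............: dds2
"""

fields = parse_ems_fields(sample)
print(fields["Severity"], fields["Event #"])
```

Lines without the dotted separator (the summary, the description, the recommended actions) are simply ignored, so the whole notification can be passed in unchanged.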



There is also an error in the HPMC trace file:

10215: ------- Analyzing CC1, RIN_ERR_PRI_MODE CSR:
10216: RIN_ERR_PRI_MODE_CSR = 0x0000000000000008
10217: RIN_ERR_ENABLE_MASK CSR = 0x000000001fffffff
10218: RIN_FE_UPGRADE_CONFIG CSR = 0x0000000000000fc0
10219: RIN_DR_UPGRADE_CONFIG CSR = 0x0000000000000000
10220: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10221: 1 or more cycles. Corrected by HW.
10222: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10223: connection or is defective. [12270208]
10224: Possible Fix 1: Reseat or replace RIO link cable.
10225: Possible Cause 2: RIO chip is defective.
10226: Possible Fix 2: Replace HIOB connected to cell 1.
10227: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10228: Possible Fix 3: Replace Cell board 1.
10229:
10230:
10231: ------- Analyzing CC1, RIN_ERR_SEC_MODE CSR:
10232: RIN_ERR_SEC_MODE_CSR = 0x0000000000000008
10233: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10234: 1 or more cycles. Corrected by HW.
10235: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10236: connection or is defective. [12270208]
10237: Possible Fix 1: Reseat or replace RIO link cable.
10238: Possible Cause 2: RIO chip is defective.
10239: Possible Fix 2: Replace HIOB connected to cell 1.
10240: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10241: Possible Fix 3: Replace Cell board 1.
10242:
10243:
10244: Note: CC1, RIN GSM Hdr Log is NOT valid.
10245: Note: CC1, RIN Uncor. Hdr Log is NOT valid.
10246: Note: CC1, RIN Uncor. ECC Data Log is NOT valid.
10247: Note: CC1, RIN Uncor. ECC Cyc Log is NOT valid.
10248: Note: CC1, RIN FE Hdr Log is NOT valid.
10249: Note: CC1, RIN No Pres. Log is NOT valid.
10250:
10251: Note: CC1, RIN Cor. ECC Data MSB Log0 is valid.
10252: Note: CC1, RIN Single ECC Wire Log is valid.
10253:
10254: ------- Analyzing CC1 RIN_SGL_ECC_WIRE_LOG:
10255: RIN_SGL_ECC_WIRE_LOG CSR = 0x0000000004002000
10256: CC1 RIN block corrected single wire error in RIO link wire number 13.
10257: CC1 RIN block corrected single bit error in RIO link data row 2
------- Analyzing cell 1 RIO logs:
10260:
10261: Warning: RIO 0 Link PRIMARY_ERROR_LOG CSR connected to cell 1 not
10262: stored. - Analysis skipped.
10263:
10264: Note: CC 1, RIO 0, Rope Unit 0 RU_PRI_ERR_LOG CSR not stored. -
10265: Analysis skipped.
10266: Note: CC 1, RIO 0, Rope Unit 1 RU_PRI_ERR_LOG CSR not stored. -
10267: Analysis skipped.
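As a quick cross-check of the CSR values in the trace, the sketch below lists which bits are set in each register. It is plain bit arithmetic, not an HP tool, and the helper name `set_bits` is made up. Bit 13 of RIN_SGL_ECC_WIRE_LOG agrees with the "wire number 13" that the analyzer reports; how the other set bit relates to "data row 2" is not explained anywhere in this thread, so the sketch does not try to interpret it.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]

# Values copied from the HPMC trace above.
wire_log = 0x0000000004002000  # RIN_SGL_ECC_WIRE_LOG
pri_mode = 0x0000000000000008  # RIN_ERR_PRI_MODE

print(set_bits(wire_log))  # [13, 26]
print(set_bits(pri_mode))  # [3]
```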


What do you think? Thanks.
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 2 of 16 (128 Views)

Re: superdome EMS report I/O link error

Hi,

Call HP and have them send a CE to site.
I would recommend that the CE runs Scan-on-the-fly (SOTF) from the Support Management Station (SMS). This may give more clues as to where the problem actually is, i.e. cell board, REO, backplane or I/O backplane.
If SOTF finds errors, it may be necessary to arrange a complete outage on the machine, depending on what the problem turns out to be.
(Don't try to reseat the REO cable while the Superdome is powered on, as you may damage the backplane or bend its pins.)
Note that it is very unusual for REO cables to fail.
Regards,
Phil
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 3 of 16 (128 Views)

Re: superdome EMS report I/O link error

Thank you, Phil. I'll take your recommendation. In addition, the problem has existed for several years and keeps occurring about once per day, but the Superdome is running normally. The HP CE didn't solve the problem during the warranty period. Maybe they didn't pay enough attention to it.
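To put a number on how often the event fires, a rough sketch like the one below could tally "Event Time" lines per day from saved notification text. It assumes the timestamp format shown in the notification earlier in this thread and an English locale for the day and month names; the function name `events_per_day` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

def events_per_day(log_text):
    """Count 'Event Time' lines per calendar day in EMS notification text."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("Event Time"):
            continue
        # Everything after the first colon is the timestamp.
        stamp = line.split(":", 1)[1].strip()
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        counts[when.date().isoformat()] += 1
    return counts

# The first line is from the notification in this thread; the other two
# are made-up entries for illustration only.
sample = """\
Event Time..........: Sun Dec 16 11:41:59 2007
Event Time..........: Mon Dec 17 11:40:12 2007
Event Time..........: Mon Dec 17 23:05:47 2007
"""

for day, count in sorted(events_per_day(sample).items()):
    print(day, count)
```

Running the same tally before and after a hardware change would show whether the daily events have really stopped.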
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 4 of 16 (128 Views)

Re: superdome EMS report I/O link error

Hi Phil, I want to know how to use SOTF to diagnose the Superdome REO problem.
This is urgent. Thanks.
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 5 of 16 (128 Views)

Re: superdome EMS report I/O link error

Hi,

It is a diagnostic test (called JUST) that you run from the SMS. How you run the tests also depends on what sort of SMS you have (Unix SMS or PC-based SMS).
It is quite detailed and should be run by HP CEs.
You also need to decode the diagnostic output to work out what the problem may be.
If you make a mistake in the sequence of events for setting up the JUST tests (starting the correct daemons etc.), you can cause all nPARs/vPARs to crash.

I would strongly suggest you get HP onsite to do this.

Regards,
Phil
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 6 of 16 (128 Views)

Re: superdome EMS report I/O link error

Thanks. I know JUST, but I don't know how to use SOTF. My SMS is an A500, and the Superdome can be shut down.
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 7 of 16 (128 Views)

Re: superdome EMS report I/O link error


If you know JUST, and you CAN shut it down, then do the offline version - much better.

A500 SMS - so it must be a legacy 'Dome, I guess?

Log on to the SMS with the hduser account
(the password is HP proprietary - so if you know JUST then I guess you know the password)

ONLY do this if the partitions are DOWN !!
run
# just -s eg, priv-01, priv-02 etc

once at the JUST prompt on the SMS
...select the tests you wish to run.

Don't forget to power off the whole machine (+IOX chassis) at the breakers for 1 minute after you've run the test...then back on again.

Cheers,
Phil
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 8 of 16 (128 Views)

Re: superdome EMS report I/O link error


.......not to mention that you need to decode all of that stuff if it picks up the errors 8-(

I still recommend you get HP to do it
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 9 of 16 (128 Views)

Re: superdome EMS report I/O link error

Is it Aclts?
reo_link_ac_test -dt :0:9:cc -rt :0:9:reo
right?
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 10 of 16 (128 Views)

Re: superdome EMS report I/O link error


Is it Aclts?
>> This is not the password for the hduser account

reo_link_ac_test -dt :0:9:cc -rt :0:9:reo
>> I'm not familiar with this....
What does that do??
....where are you running this command from??
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 11 of 16 (128 Views)

Re: superdome EMS report I/O link error

Aclts is a script that depends on JUST.
reo_link_ac_test is the command used to test the REO link.

Thank you very much, Phil.
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 12 of 16 (128 Views)

Re: superdome EMS report I/O link error


Sorry, the scripts etc that you have mentioned are not something that I am familiar with.
Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 13 of 16 (128 Views)

Re: superdome EMS report I/O link error

Hi everybody, I've solved this problem by replacing the inbound REO cable of the Superdome. Without doubt, replacing the REO cable is the hardest job on a Superdome; it took me a whole night. Fortunately, the error no longer occurs, and I'm very happy about that. On the other hand, I find it a little shameful: the problem had been present, and had been reported to HP, since the Superdome was installed in 2003. The error occurred every day, and there was also an error on cell 1 during the power-on self-test. The problem was so obvious, yet over several years the HP CE never actually dealt with it and just told the customer that it was not important. I don't know the reason, but I don't think a problem should persist for such a long time, especially with a professional service like HP's.
Anyway, thanks everybody for your help.
Honored Contributor
Phil uk
Posts: 253
Registered: ‎04-18-2007
Message 14 of 16 (128 Views)

Re: superdome EMS report I/O link error


Well done changing the REO cable - not a nice job to do.
It is unusual for a REO cable to have a problem - but I suspect it was bad from day 1, so well done.

Sorry HP let you down.

Regards,
Phil
Acclaimed Contributor
Torsten.
Posts: 22,953
Registered: ‎10-02-2001
Message 15 of 16 (128 Views)

Re: superdome EMS report I/O link error

respect, changing this cable is a tough challenge ...

Hope this helps!
Regards
Torsten.

Occasional Advisor
Genesis_1
Posts: 15
Registered: ‎02-14-2007
Message 16 of 16 (128 Views)

Re: superdome EMS report I/O link error

I also suspect the REO cable was already defective when it was manufactured.