03-23-2008 09:38 PM
FRU Physical Location: 0x00ffff01ffffff93
FRU Source = 9 (cell)
Source Detail = 3 (coherency controller)
Cabinet Location = 0
Cell Location = 1
REO input single wire error
>---------- End Event Monitoring Service Event Notification ----------<
>------------ Event Monitoring Service Event Notification ------------<
Notification Time: Sun Dec 16 11:42:00 2007
dds2 sent Event Monitor notification information:
/system/events/core_hw/core_hw is >= 1.
Its current value is SERIOUS(4).
Event data from monitor:
Event Time..........: Sun Dec 16 11:41:59 2007
Event #.............: 85
I/O link interface to cell controller recovered errors
Description of Error:
The cell controller (CC) chip has detected and corrected multiple errors
in data transferred to it from the I/O bus adapter (REO) chip to which it
Probable Cause / Recommended Action:
The inbound I/O link cable is unreliable.
Contact your HP support representative to check the inbound I/O link
There may be a problem with the CC chip or cell board.
Contact your HP support representative to check the cell board.
There may be a problem with the I/O backplane.
Contact your HP support representative to check the I/O backplane.
Additional Event Data:
System IP Address...: 10.93.4.12
Event Id............: 0x47649e8800000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 3
Associated OS error log entry id(s):
Additional System Data:
System Model Number.............: 9000/800/SD32000
OS Version......................: B.11.11
STM Version.....................: A.29.00
EMS Version.....................: A.03.20
Latest information on this event:
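For anyone who wants to script against notifications like the one above, here is a minimal Python sketch that collects the dotted `Key....: value` lines into a dictionary. The helper names `parse_ems_fields` and `event_id_timestamp` are made up for illustration and are not part of EMS or STM. The second helper rests on an observation, not on HP documentation: read as a Unix timestamp, the upper 32 bits of this Event Id give 03:42:00 UTC on 16 Dec 2007, which would line up with the 11:42:00 notification time if the system clock runs at UTC+8.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "Event #.............: 85":
# a key, a run of two or more dots, a colon, then the value.
FIELD_RE = re.compile(r"^\s*(?P<key>.+?)\.{2,}:\s*(?P<value>.*)$")

def parse_ems_fields(text):
    """Collect 'Key....: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key").strip()] = match.group("value").strip()
    return fields

def event_id_timestamp(event_id_hex):
    """Read the upper 32 bits of the Event Id as Unix time.

    This interpretation is an assumption based on the values in this
    thread, not on documented EMS behaviour.
    """
    seconds = int(event_id_hex, 16) >> 32
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

sample = """\
Event Time..........: Sun Dec 16 11:41:59 2007
Event #.............: 85
System IP Address...: 10.93.4.12
Event Id............: 0x47649e8800000000
OS Version......................: B.11.11
"""

fields = parse_ems_fields(sample)
print(fields["Event #"], fields["OS Version"])
print(event_id_timestamp(fields["Event Id"]).isoformat())
```

Lines without the dotted separator (descriptions, recommended actions) are simply skipped, so the sketch only recovers the structured fields.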
There is an error in the HPMC trace file:
10215: ------- Analyzing CC1, RIN_ERR_PRI_MODE CSR:
10216: RIN_ERR_PRI_MODE_CSR = 0x0000000000000008
10217: RIN_ERR_ENABLE_MASK CSR = 0x000000001fffffff
10218: RIN_FE_UPGRADE_CONFIG CSR = 0x0000000000000fc0
10219: RIN_DR_UPGRADE_CONFIG CSR = 0x0000000000000000
10220: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10221: 1 or more cycles. Corrected by HW.
10222: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10223: connection or is defective. 
10224: Possible Fix 1: Reseat or replace RIO link cable.
10225: Possible Cause 2: RIO chip is defective.
10226: Possible Fix 2: Replace HIOB connected to cell 1.
10227: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10228: Possible Fix 3: Replace Cell board 1.
10231: ------- Analyzing CC1, RIN_ERR_SEC_MODE CSR:
10232: RIN_ERR_SEC_MODE_CSR = 0x0000000000000008
10233: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10234: 1 or more cycles. Corrected by HW.
10235: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10236: connection or is defective. 
10237: Possible Fix 1: Reseat or replace RIO link cable.
10238: Possible Cause 2: RIO chip is defective.
10239: Possible Fix 2: Replace HIOB connected to cell 1.
10240: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10241: Possible Fix 3: Replace Cell board 1.
10244: Note: CC1, RIN GSM Hdr Log is NOT valid.
10245: Note: CC1, RIN Uncor. Hdr Log is NOT valid.
10246: Note: CC1, RIN Uncor. ECC Data Log is NOT valid.
10247: Note: CC1, RIN Uncor. ECC Cyc Log is NOT valid.
10248: Note: CC1, RIN FE Hdr Log is NOT valid.
10249: Note: CC1, RIN No Pres. Log is NOT valid.
10251: Note: CC1, RIN Cor. ECC Data MSB Log0 is valid.
10252: Note: CC1, RIN Single ECC Wire Log is valid.
10254: ------- Analyzing CC1 RIN_SGL_ECC_WIRE_LOG:
10255: RIN_SGL_ECC_WIRE_LOG CSR = 0x0000000004002000
10256: CC1 RIN block corrected single wire error in RIO link wire number 13.
10257: CC1 RIN block corrected single bit error in RIO link data row 2
------- Analyzing cell 1 RIO logs:
10261: Warning: RIO 0 Link PRIMARY_ERROR_LOG CSR connected to cell 1 not
10262: stored. - Analysis skipped.
10264: Note: CC 1, RIO 0, Rope Unit 0 RU_PRI_ERR_LOG CSR not stored. -
10265: Analysis skipped.
10266: Note: CC 1, RIO 0, Rope Unit 1 RU_PRI_ERR_LOG CSR not stored. -
10267: Analysis skipped.
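As a cross-check on the CSR values in the trace, here is a small Python sketch that lists which bits are set in a register value. The helper name `set_bits` is made up for illustration. For RIN_SGL_ECC_WIRE_LOG, bit 13 is set, which coincides with the "wire number 13" the analyzer reports on line 10256. The trace does not say how the other set bit (26) maps to "data row 2", so the sketch does not try to interpret it; the register layout itself is not documented anywhere in this thread.

```python
def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]

# Values copied from trace lines 10255 and 10216.
rin_sgl_ecc_wire_log = 0x0000000004002000
rin_err_pri_mode_csr = 0x0000000000000008

print(set_bits(rin_sgl_ecc_wire_log))  # [13, 26]
print(set_bits(rin_err_pri_mode_csr))  # [3]
```

The single bit set in RIN_ERR_PRI_MODE_CSR is bit 3, which may be what the "(3)" in the "Problem:" line refers to, but that is a guess.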
What do you think? Thanks.
03-24-2008 11:32 PM
Call HP and have them send a CE to site.
I would recommend that the CE run Scan-on-the-fly (SOTF) from the SuperDome Management Station (SMS). This may give more clues as to where the problem actually is, i.e. the cell board, REO, backplane, or I/O backplane.
If SOTF finds errors, it may be necessary to arrange a complete outage on the machine, depending on what the problem turns out to be.
(Don't try to reseat the REO cable while the SuperDome is powered on as you may damage the backplane or bend pins on the backplane.)
Note, it is very unusual for REO cables to fail.
03-25-2008 08:32 AM
03-28-2008 06:48 AM
It is a diagnostic test (called JUST) that you run from the SMS. How you run the tests depends on what sort of SMS you have (Unix SMS or PC-based SMS).
It is quite detailed and should be run by HP CEs.
You also need to decode the output of the diagnostics to work out what the problem may be.
If you make a mistake in the sequence of steps for setting up the JUST tests (starting the correct daemons, etc.), you can cause all nPARs/vPARs to crash.
I would strongly suggest you get HP onsite to do this.
03-28-2008 07:17 AM
If you know JUST, and you CAN shut it down, then do the offline version - much better.
A500 SMS - so it must be a Legacy 'Dome, I guess?
Log on to the SMS with the hduser account.
(The password is HP proprietary, so if you know JUST then I guess you know the password.)
ONLY do this if partitions are DOWN !!
# just -s
Once at the JUST prompt on the SMS
...select the tests you wish to run.
Don't forget to power off the whole machine (plus the IOX chassis) at the breakers for 1 minute after you've run the test, then power it back on again.
03-28-2008 07:51 AM
Is it Aclts?
>> This is not the password for the hduser account
>> I'm not familiar with this....
What does that do??
....where are you running this command from??
04-03-2008 11:23 AM
Anyway, thanks everybody for your help.
04-03-2008 12:08 PM
Well done changing the REO - not a nice job to do.
Unusual for a REO to have a problem - but I suspect it was bad from day 1, so well done.
Sorry HP let you down.
04-03-2008 02:28 PM
Hope this helps!
There are only 10 types of people in the world -
those who understand binary, and those who don't.
No support by private messages. Please ask the forum!