Occasional Collector
Tim_Peer
Posts: 3
Registered: ‎07-11-2011
Message 1 of 10 (841 Views)
Accepted Solution

System Crash: ES40

An unusual event occurred this weekend with the XFC. At first glance, one could suspect a bug for which a patch already exists. However, this is the first such event since the OS was installed in 2002. Should I bet that the cause is memory/hardware related and not the XFC image? I ran Diag and only the crash event was posted.

 

SYSGEN> VCC_FLAGS = 2

 

Bugcheck Type:     SSRVEXCEPT, Unexpected system service exception
Node:              ES40    (Standalone)
CPU Type:          AlphaServer ES40
VMS Version:       V7.3-1

 

Failing PC:        FFFFFFFF.80330EBC    SYS$XFCACHE+1CEBC

 

CPU ID         00
CPU State      rc,pa,pp,cv,pv,pmv,pl
CPU Type       EV67  Pass 2.2.3/.5 (21264A)
PAL Code       1.98-104

with 9 GB of memory

Honored Contributor
Volker Halle
Posts: 5,205
Registered: ‎04-26-2004
Message 2 of 10 (826 Views)

Re: System Crash: ES40

Tim,

 

before considering swapping hardware, ALWAYS try to find out what exactly caused a crash. Assume it's software first. Only if the crash cannot be explained from a software point of view should you start to assume it may be hardware.

 

Could you post the CLUE file of this crash as an attachment in your next reply, or send it to me by private mail?

 

You'll find the CLUE file in CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS

 

Volker.

 

-----

An OpenVMS crashdump analysis a day

makes the Windows headaches go away.

Honored Contributor
P Muralidhar Kini
Posts: 897
Registered: ‎03-14-2010
Message 3 of 10 (781 Views)

Re: System Crash: ES40

Hi Tim,

If it's an inline BUGCHECK from XFC, then you would have a crash type of

 >> ASSERTFAIL, System ASSERT failure detected

In such cases, you can get some additional data related to the crash from XFC trace buffers

SDA>XFC SHOW TRACE/RAW

In this case, it's not an inline BUGCHECK

>>Bugcheck Type: SSRVEXCEPT, Unexpected system service exception

Please check whether the XFC trace buffers have any information that could give some clues. Execute the following command in the crash dump

SDA>XFC SHOW TRACE/RAW

 

Also,

>> VMS Version: V7.3-1

It looks like you are running an unsupported version of OpenVMS. The versions supported on Alpha are V7.3-2 and later.

 

Regards,

Murali

Let There Be Rock - AC/DC
Honored Contributor
Volker Halle
Posts: 5,205
Registered: ‎04-26-2004
Message 4 of 10 (776 Views)

Re: System Crash: ES40

Tim,

 

I've had a short look at your CLUE file. Here is the most important footprint information:

 

Bugcheck Type:     SSRVEXCEPT, Unexpected system service exception
Node:                ES40    (Standalone)
CPU Type:        AlphaServer ES40
VMS Version:   V7.3-1 
Failing PC:       FFFFFFFF.80330EBC    XFCREADSTART_C+00ACC
Failing PS:       00000000.00000201
Module:            SYS$XFCACHE    (Link Date/Time: 18-JUL-2002 19:55:19.70)
Offset:               0001CEBC

Signal Array:                      64-bit Signal Array:
Arg Count    = 00000005            Arg Count      = 00000005
Condition    = 0000000C            Condition      = 00000000.0000000C
Argument #2  = 00000000            Argument #2    = 00000000.00000000
Argument #3  = 00000054            Argument #3    = 00000000.00000054
Argument #4  = 80330EBC            Argument #4    = FFFFFFFF.80330EBC
Argument #5  = 00000201            Argument #5    = 00000000.00000201

Failing Instruction:
XFCREADSTART_C+00ACC:  LDL R16,#X0054(R0)

Instruction Stream (last 20 instructions):
XFCREADSTART_C+00A7C:  BEQ R1,#X000008
XFCREADSTART_C+00A80:  BIS R31,R9,R16
XFCREADSTART_C+00A84:  BIS R31,R4,R17
XFCREADSTART_C+00A88:  JSR R26,(R26)
XFCREADSTART_C+00A8C:  LDL R21,#X0108(R6)
XFCREADSTART_C+00A90:  STL R31,#X010C(R6)
XFCREADSTART_C+00A94:  ADDL R21,R4,R4
XFCREADSTART_C+00A98:  STL R4,#X0108(R6)
XFCREADSTART_C+00A9C:  BR R31,#X000027
XFCREADSTART_C+00AA0:  CMPULE R5,#X00,R22
XFCREADSTART_C+00AA4:  CMPULE R4,#X00,R23
XFCREADSTART_C+00AA8:  BIS R31,R31,R24
XFCREADSTART_C+00AAC:  BIS R22,R23,R22
XFCREADSTART_C+00AB0:  BLBS R22,#X000022
XFCREADSTART_C+00AB4:  S8ADDL R24,R31,R25  <<< sets up R25...
XFCREADSTART_C+00AB8:  LDAH R19,#X0001(R31)
XFCREADSTART_C+00ABC:  ZAPNOT R25,#X0F,R25 <<< ...
XFCREADSTART_C+00AC0:  ADDQ R6,R25,R25         <<< R25 set up
XFCREADSTART_C+00AC4:  LDA R19,#X8000(R19)
XFCREADSTART_C+00AC8:  LDQ R0,#X0190(R25)   <<< load R0 here
XFCREADSTART_C+00ACC:  LDL R16,#X0054(R0)   <<< ACCVIO here due to R0=0
XFCREADSTART_C+00AD0:  LDQ R1,#X0068(R0)
XFCREADSTART_C+00AD4:  SLL R16,#X09,R17
XFCREADSTART_C+00AD8:  AND R1,#X10,R1
XFCREADSTART_C+00ADC:  ADDL R31,R17,R17

   R0  =  00000000.00000000   

   R6  =  FFFFFFFF.817F3D50   VCC_CTX
 

   R24 =  00000000.00000001  

   AI  =  FFFFFFFF.817F3D58  
   RA  =  FFFFFFFF.80330F08   XFCREADSTART_C+00B18
   PV  =  00000000.00000015  
   R28 =  0000029B.8FC90000  
   FP  =  00000000.7FFA1820  
   PC  =  FFFFFFFF.80330EBC   XFCREADSTART_C+00ACC
   PS  =  00000000.00000201   Kernel Mode, IPL 2
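As a quick sanity check on the symbolization in the listing above, the offsets can be cross-checked against each other. This is only a sketch in Python using the values printed in the CLUE output. The variable names are mine, and "module base" here just means the address the SYS$XFCACHE offset is measured from, which I am assuming rather than reading from the dump.

```python
# Cross-check the symbolic offsets reported by CLUE against the raw addresses.
failing_pc = 0xFFFFFFFF80330EBC   # XFCREADSTART_C+00ACC, SYS$XFCACHE+1CEBC
return_addr = 0xFFFFFFFF80330F08  # RA, reported as XFCREADSTART_C+00B18

routine_base = failing_pc - 0x00ACC   # start of XFCREADSTART_C
module_base = failing_pc - 0x1CEBC    # address the module offset is relative to

def vms_hex(value):
    """Format a 64-bit value the way SDA prints it: HHHHHHHH.LLLLLLLL."""
    return f"{value >> 32:08X}.{value & 0xFFFFFFFF:08X}"

print("routine base:", vms_hex(routine_base))
print("module base: ", vms_hex(module_base))
# RA must resolve to the same routine base if the symbolization is consistent.
print("RA consistent:", routine_base + 0x00B18 == return_addr)
```

If the last line prints True, the PC and RA symbolization agree with each other, which is what one would expect from an intact dump.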

 

The crash happened in [XFC]XFC_READ routine XfcReadCopyToUserBuffer when trying to obtain ulSize.

The problem is R0=0 as loaded from #X0190(R25). I need to further verify the code stream loading the current value into R25...

From a software point of view, this does NOT look like a hardware problem at first glance.
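For anyone following along, here is how the signal array above ties back to R0=0. It is a minimal Python sketch, assuming the usual ACCVIO argument order (condition, reason mask, faulting virtual address, PC, PS). The function and variable names are made up for illustration, and the reason mask is deliberately left undecoded.

```python
# Decode the 32-bit signal array from the CLUE listing (assumed ACCVIO layout).
SS_ACCVIO = 0x0000000C  # condition value for an access violation

signal_array = [0x00000005, 0x0000000C, 0x00000000,
                0x00000054, 0x80330EBC, 0x00000201]

def decode_accvio(sig):
    """Split the signal array into named fields; refuse anything but ACCVIO."""
    count, condition, reason_mask, faulting_va, pc, ps = sig
    if count != len(sig) - 1:
        raise ValueError("argument count does not match array length")
    if condition != SS_ACCVIO:
        raise ValueError("not an access violation")
    return {"reason_mask": reason_mask, "faulting_va": faulting_va,
            "pc": pc, "ps": ps}

fields = decode_accvio(signal_array)

# Failing instruction: LDL R16,#X0054(R0), so VA = R0 + 0x54.
displacement = 0x0054
implied_r0 = fields["faulting_va"] - displacement

print(f"faulting VA = {fields['faulting_va']:08X}")
print(f"implied R0  = {implied_r0:08X}")
```

The faulting address equals the displacement of the LDL, so the base register must have been zero, which matches the R0 value in the register dump.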

 

Note that you're running XFC from V7.3-1 SSB-version, no XFC patches installed. And there are some, but none of them directly describes this crash footprint.

 

I would not consider this crash footprint to be a 'known' problem. I've not seen that exact footprint before, and I've seen thousands of crashes...

 

If this system has been running for 9 years without ever seeing this bugcheck - you should check $ TYPE CLUE$HISTORY about the exact crash history of this node - you might just get lucky for the next 9 years...

 

Volker.

Respected Contributor
Bob Blunt
Posts: 314
Registered: ‎05-01-2003
Message 5 of 10 (766 Views)

Re: System Crash: ES40

Tim,

 

      Based on the replies you've gotten so far and some of the comments, it sounds like you really need, at the very least, to bring up the patch level on your system.  Given that you're using V7.3-1, the best we can suggest would be to cautiously get your patches updated and seriously consider what might keep you from bringing your system up to, at least, V7.3-2.  There are still many, many patches for V7.3-2, but the version itself is still seeing some maintenance.  If you can't move from V7.3-1, you're really hurting yourself if your patches aren't current...even if your crash isn't an exact match to the footprints published as solved in the patch release notes.  I've always held the opinion that a mix-and-match set of patches might not show the same footprints as the latest combination, and any customer's situation might trigger (Volker hinted at this) a condition that hadn't been reported to engineering.  It just takes the right combination of events, software, configuration, etc.

 

To add to what Volker said... IF a hardware problem causes a crash you'll *usually* see some entry in the errorlog before the crash itself.  You might also receive information on the console shortly before the crash.  A hardware failure can be so catastrophic that you'll have little in the errorlog but in those cases you usually would have to glean the gory details from another member of the cluster sharing the system disk or from the stored registers the console saves when the event occurs.  Capturing this information requires special steps immediately after the machine halts and, in many cases, there isn't a lot you can do with the system anyway.  Power cycling can clear the details of what the console saves so checking with HP is recommended when a system halts like that.

 

And while the most current bit-to-text errorlog translation tools are recommended you can still get a very good "in the ballpark" read from DECevent so you don't absolutely positively HAVE to use SEA or whatever it is today.

 

bob

Occasional Collector
Tim_Peer
Posts: 3
Registered: ‎07-11-2011
Message 6 of 10 (751 Views)

Re: System Crash: ES40

Many thanks to all who took time to review and comment on this issue.

 

Since this is the first such event in 10 years or so, I will make no further changes other than upgrading OpenVMS to a more recent release.

 

Very best regards,

 

Tim Peer

Honored Contributor
Volker Halle
Posts: 5,205
Registered: ‎04-26-2004
Message 7 of 10 (742 Views)

Re: System Crash: ES40

[ Edited ]

Tim,

 

the instructions executed immediately preceding the ACCVIO seem to have been executed correctly:

 

here_is_a_label:                     (R24 is ulIndex used in a for loop)

XFCREADSTART_C+00AB4: S8ADDL R24,R31,R25 <<< sets up R25 with 00000008
XFCREADSTART_C+00AB8: LDAH R19,#X0001(R31)
XFCREADSTART_C+00ABC: ZAPNOT R25,#X0F,R25 <<< keep low-order 4 bytes of R25, clear high-order 4 bytes
XFCREADSTART_C+00AC0: ADDQ R6,R25,R25 <<< R25 set up finished here
XFCREADSTART_C+00AC4: LDA R19,#X8000(R19)
XFCREADSTART_C+00AC8: LDQ R0,#X0190(R25) <<< load R0 here
XFCREADSTART_C+00ACC: LDL R16,#X0054(R0) <<< ACCVIO here due to R0=0
...

code branches back to beginning of for loop (see above) later.

 

Combined with the values of the registers as reported by SDA> CLUE REGISTER

 

R0 = 00000000.00000000

R6 = FFFFFFFF.817F3D50 VCC_CTX

R24 = 00000000.00000001

AI = FFFFFFFF.817F3D58 (=R25)

 

everything adds up correctly.
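To show the arithmetic behind "adds up", here is a small Python sketch that replays the three instructions that build R25, using the register values from the dump. The helper functions are my own simplified models of the Alpha S8ADDL, ZAPNOT and ADDQ semantics, not anything taken from the XFC sources.

```python
MASK64 = (1 << 64) - 1

def sext32(value):
    """Sign-extend the low 32 bits of value to 64 bits."""
    value &= 0xFFFFFFFF
    return (value | 0xFFFFFFFF00000000) if value & 0x80000000 else value

def s8addl(ra, rb):
    """S8ADDL: (Ra * 8 + Rb), truncated to a longword and sign-extended."""
    return sext32(ra * 8 + rb)

def zapnot(ra, byte_mask):
    """ZAPNOT: keep the bytes whose mask bit is set, zero all the others."""
    result = 0
    for byte in range(8):
        if byte_mask & (1 << byte):
            result |= ra & (0xFF << (8 * byte))
    return result

def addq(ra, rb):
    """ADDQ: 64-bit add with wraparound."""
    return (ra + rb) & MASK64

# Register values from SDA> CLUE REGISTER
R6 = 0xFFFFFFFF817F3D50   # VCC_CTX
R24 = 0x0000000000000001  # ulIndex
R31 = 0                   # always reads as zero on Alpha

r25 = s8addl(R24, R31)          # index scaled by 8
r25 = zapnot(r25, 0x0F)         # low-order 4 bytes kept
r25 = addq(R6, r25)             # context base + scaled index
ldq_address = addq(r25, 0x190)  # effective address of the LDQ into R0

print(f"R25         = {r25 >> 32:08X}.{r25 & 0xFFFFFFFF:08X}")
print(f"LDQ address = {ldq_address >> 32:08X}.{ldq_address & 0xFFFFFFFF:08X}")
```

The computed R25 matches the AI value in the register dump, so the address calculation itself looks fine. What remains unexplained is why the location it points to held zero.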

 

If you do a SDA> EXA 817F3D58+190 in the dump, you should see a quadword of zeroes.

 

More detailed analysis would require access to the dump file itself and following the for loop in the dump.

 

There is probably some corruption or an invalid link in some XFC data structure, which caused this crash.

 

Volker.

HP Pro
Ian Miller.
Posts: 4,371
Registered: ‎06-03-2003
Message 8 of 10 (667 Views)

Re: System Crash: ES40

there is a cost running old versions of VMS - sometimes it takes a while to notice this cost.

____________________
Purely Personal Opinion
Honored Contributor
Volker Halle
Posts: 5,205
Registered: ‎04-26-2004
Message 9 of 10 (664 Views)

Re: System Crash: ES40

Ian,

 

this comment of yours is based purely on speculation that whatever problem caused this crash would not have caused it in 'more recent versions of OpenVMS'.

 

Without knowing the real cause of this crash, no one can tell whether it could have been prevented by a patch or an upgrade.

 

And think about the downtime and effort that would have been needed over the past 9 years to keep this system on current patch levels and versions.

 

Volker.

HP Pro
Ian Miller.
Posts: 4,371
Registered: ‎06-03-2003
Message 10 of 10 (656 Views)

Re: System Crash: ES40

[ Edited ]

Although it is not relevant to this issue, Tim mentioned upgrading, so I thought I would mention that there is a cost in not upgrading, which has to be balanced against the cost of upgrading.

 

____________________
Purely Personal Opinion