Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 1 of 12 (469 Views)

HP-UX_IA64_B.11.31_BUS_ADRALN

I am working on an HP-UX IA64 B.11.31 machine.

One of my executables crashes with BUS_ADRALN. The problem is sporadic; the executable runs fine most of the time.

The following is a snippet from the stack trace in gdb:

Program terminated with signal 10, Bus error.
BUS_ADRALN - Invalid address alignment. Please refer to the following link that helps in handling unaligned data: http://docs.hp.com/en/7730/newhelp0610/pragmas.htm#pragma-pack-ex3
#0 0xc000000000211ab0:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
(gdb) db
Undefined command: "db". Try "help".
(gdb) bt
#0 0xc000000000211ab0:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
#1 0xc000000000178810:0 in pthread_kill+0x9d0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000003f80e0:0 in raise+0xe0 () from /usr/lib/hpux64/libc.so.1
#3 0xc00000001e5a2d80:0 in skgesigOSCrash () at skgesig.c:376
#4 0xc00000001f666900:0 in kpeDbgSignalHandler () at kpedbg.c:1074
#5 0xc00000001e5a3220:0 in skgesig_sigactionHandler () at skgesig.c:799
#6 <signal handler called>
#7 Foccur32 () at Foccur32.c:87
#8 0xc00000001498c020:0 in _tmaff_delallflds () at affinity.c:725
#9 0xc00000001498b570:0 in _tmaff_acall () at affinity.c:117
#10 0xc00000001478f7a0:0 in _tpacall_internal () at tmacall.c:588
#11 0xc0000000147a2a30:0 in _tpcall_internal () at tmcall.c:349
#12 0xc0000000147a0ed0:0 in _tpcall_ () at tmcall.c:157
#13 0xc0000000147a3790:0 in tpcall () at tmcall.c:474
#14 0xc000000002a3bc90:2 in inline ztux_flags () at blbn_trx_tux.c:1078
#15 0xc000000002a3bc80:2 in ztux_sync (l_name=<not available>,
l_service=<not available>, l_request_buf=<not available>,
l_request_buf_len=<not available>, l_response_buf=<not available>,
l_response_buf_len=<not available>, l_flags=<not available>)
at blbn_trx_tux.c:1215
#16 0xc00000001467b300:0 in zfn_call (l_fn=<not available>,
---Type <return> to continue, or q <return> to quit---
l_tx_buf_len=<not available>, l_rx_buf_len=<not available>) at blbn_trx.c:14340

 

I looked at similar posts about BUS_ADRALN in the community, but was not able to follow them.

I do understand that some address is misaligned, but I do not know which address it is or how to find that out.

 

Thanks in advance for your help.

HP Pro
boukari
Posts: 57
Registered: ‎02-25-2011
Message 2 of 12 (453 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Hello,

Most processors (x86 and friends excepted) require accesses to multi-byte data to be aligned on a multiple of the data size. For example, reading a 4-byte integer from address 0x04 is fine, but reading one from 0x03 raises a fault.

This is because the load/store hardware is easier to implement if every access falls on a multiple of the size of the data being accessed.
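
As a minimal sketch of that rule in C (the helper names `is_aligned` and `load_u32_any` are made up for illustration and are not from any library mentioned in this thread): the first helper tests whether an address is a multiple of a power-of-two size, and the second reads a 32-bit value from any byte address by copying it with `memcpy` instead of dereferencing a possibly misaligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when p is a multiple of `align`.
   `align` must be a power of two (1, 2, 4, 8, ...). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: reads a 32-bit value from any byte address.
   memcpy has no alignment requirement, so this never issues a
   misaligned 4-byte load the way *(uint32_t *)p could. */
static uint32_t load_u32_any(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Dereferencing `(uint32_t *)p` directly is only safe when `is_aligned(p, 4)` holds; the `memcpy` form works either way, at the cost of a few extra instructions.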

DATA STRUCTURE ALIGNMENT:

http://en.wikipedia.org/wiki/Data_structure_alignment


Regards,

BCS SW/HW GSC Engineer (L1)
IEEE Student Member
LPI 3 CORE & High Availability
VCP vSphere 5 Datacenter
Novell CLA and Data Center specialist Certified
.....
Microsoft Partner & Microsoft student Partner
Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 3 of 12 (433 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

I do understand that an address is misaligned, but what I do not understand is which variable's address is misaligned and how to find that out.

 

It would be great if someone could help me figure that out. Once I know which variable is causing the problem, I can take corrective action.

Honored Contributor
Matti_Kurkela
Posts: 6,271
Registered: ‎12-02-2001
Message 4 of 12 (429 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

With just the gdb backtrace and no other information, it will be impossible to tell you which variable had a misaligned address... but it just might be possible to identify the location in the source code where it happened.

 

Note that entry #6 in your backtrace is <signal handler called>. I think this is the CPU detecting a misaligned access attempt and jumping to the appropriate signal handler instead of continuing to run the program. So entries #0..#6 would be from the signal handler code that killed the process, and entry #7 would be the one closest to the actual error location.

 

Now, entry #7 is listed as "Foccur32 () at Foccur32.c:87". That is, function Foccur32(), at line 87 of the source file Foccur32.c. Does this mean anything to you?

 

If possible, look at the source code of the Foccur32() function to determine what variables it uses, and how it uses them. You might be able to use gdb to peek at the values of those variables at the time of the crash, and even see which addresses those variables had.

 

You may also need to examine the parameters given to the Foccur32() function: entries #8..#16 will describe where the Foccur32() function was called from.

MK
Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 5 of 12 (425 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN


tpcall(l_name,
l_service,
l_request_buf,
l_request_buf_len,
l_response_buf,
l_response_buf_len,
ztux_flag(l_flags));

 

From the Oracle documentation, we have the following information for tpcall() and Foccur32(). Since the implementation is hidden, I am finding it difficult to work out which address was actually misaligned. Is there some other way to find that out?

Acclaimed Contributor
Dennis Handly
Posts: 25,298
Registered: ‎03-06-2006
Message 6 of 12 (415 Views)

Re: HP-UX IA64 B.11.31 BUS_ADRALN

>entry #7 would be the one closest to the actual error location.

 

Yes, this IS the fault location.

You now need to go into the debugger to debug from the corefile.

Use the frame command to go to the proper frame.

Then do the following:

bt

info reg

disas $pc-16*8 $pc+16

 

From the instruction with the fault, you should be able to find the register with the address and then check its value.

 

>You may also need to examine the parameters given to the Foccur32() function

 

Yes, the misaligned value could be from a parm.

Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 7 of 12 (402 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Thanks Dennis, I do get something now.

(gdb) disas $pc-16*8 $pc+16
Dump of assembler code from 0xc00000000e8a3b40:0 to 0xc00000000e8a3bd0:0:
;;; DOC Line Information: [Line, Column Start, Column End] [Line, Column] [Line]
;;; File: Foccur32.c
;;; Line: 59
0xc00000000e8a3b40:0 <Foccur32+0xc0>: (p2) mov r44=5
0xc00000000e8a3b40:1 <Foccur32+0xc1>: (p5) mov r44=r32
0xc00000000e8a3b40:2 <Foccur32+0xc2>: (p5) chk.s.i r40,Foccur32+576
0xc00000000e8a3b50:0 <Foccur32+0xd0>: nop.m 0x0
0xc00000000e8a3b50:1 <Foccur32+0xd1>:
(p2) br.call.dptk.many b0=0xc00000000e891dc0
0xc00000000e8a3b50:2 <Foccur32+0xd2>: (p3) br.cond.dpnt.many Foccur32+560;;
;;; Line: 64
0xc00000000e8a3b60:0 <Foccur32+0xe0>: (p2) mov r1=r34
0xc00000000e8a3b60:1 <Foccur32+0xe1>: (p2) mov r8=-1
0xc00000000e8a3b60:2 <Foccur32+0xe2>: nop.i 0x0
0xc00000000e8a3b70:0 <Foccur32+0xf0>: nop.m 0x0
0xc00000000e8a3b70:1 <Foccur32+0xf1>:
(p5) br.call.dptk.many b0=0xc00000000e893ce0
0xc00000000e8a3b70:2 <Foccur32+0xf2>: (p2) br.cond.dpnt.many Foccur32+512;;
;;; Line: 73
0xc00000000e8a3b80:0 <Foccur32+0x100>: adds r9=8,r8
0xc00000000e8a3b80:1 <Foccur32+0x101>: cmp.ne.unc p6=r0,r8
---Type <return> to continue, or q <return> to quit---
0xc00000000e8a3b80:2 <Foccur32+0x102>: mov b0=r36
0xc00000000e8a3b90:0 <Foccur32+0x110>: mov r1=r34
0xc00000000e8a3b90:1 <Foccur32+0x111>: mov r8=0;;
0xc00000000e8a3b90:2 <Foccur32+0x112>: mov.i ar.pfs=r35
;;; Line: 78
0xc00000000e8a3ba0:0 <Foccur32+0x120>: (p6) ld4 r9=[r9]
0xc00000000e8a3ba0:1 <Foccur32+0x121>: nop.i 0x0;;
;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;
0xc00000000e8a3bb0:0 <Foccur32+0x130>: cmp.geu.unc p6=r42,r41
;;; Line: 86
0xc00000000e8a3bb0:1 <Foccur32+0x131>: nop.m 0x0
0xc00000000e8a3bb0:2 <Foccur32+0x132>: (p6) br.cond.dpnt.many Foccur32+464;;
;;; Line: 87
0xc00000000e8a3bc0:0 <Foccur32+0x140>: ld4 r9=[r42]
0xc00000000e8a3bc0:1 <Foccur32+0x141>: adds r8=4,r42;;
0xc00000000e8a3bc0:2 <Foccur32+0x142>: extr.u r10=r9,25,7
End of assembler dump.
(gdb) info line 87

 

Since the signal was generated at line 87, the last three lines should have the answer. The registers used are r9, r8 and r10. However, in the output of 'info reg' I do not find any registers named r9, r8 or r10; there are registers named pr[0-63], gr[0-47], br[0-7], rsc, etc.

I have attached the output of 'info reg'. Could you let me know which registers I should look at?

 

Acclaimed Contributor
Dennis Handly
Posts: 25,298
Registered: ‎03-06-2006
Message 8 of 12 (387 Views)

Re: HP-UX IA64 B.11.31 BUS_ADRALN

;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;
0xc00000000e8a3bb0:0 <Foccur32+0x130>: cmp.geu.unc p6=r42,r41
;;; Line: 87
0xc00000000e8a3bc0:0 <Foccur32+0x140>: ld4 r9=[r42]

 

>the signal was generated at line 87, the last three lines should have the answer.

 

They tell you why it aborted: r42 isn't aligned. Its value is 0x600000000051f2fa.

 

>The registers used are r9, r8 and r10.  there are registers with name pr[0-63],gr[0-47], br[0-7]

 

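Those are the target registers, so they are not helpful. The rN registers in the disassembly are the grN registers in 'info reg' (r => gr).

The value of r42 is the sum of r32 (the first parm) and r9. It is r9, holding 0xa, that makes the sum misaligned.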
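
To make the arithmetic concrete, here is a small C sketch (the helper name `misalignment` is made up for illustration). It computes how far an address sits past the previous multiple of the access size. The faulting instruction is `ld4`, a 4-byte load, so the address must be a multiple of 4. The base value 0x600000000051f2f0 used in the comment is not quoted in the thread; it is simply r42 minus the 0xa offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes by which `addr` overshoots the
   previous multiple of `size`. `size` must be a power of two.
   A result of 0 means the address is suitably aligned. */
static unsigned misalignment(uint64_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (unsigned)(addr & (uint64_t)(size - 1));
}

/* r42 = r32 + r9 = 0x600000000051f2f0 + 0xa = 0x600000000051f2fa.
   misalignment(0x600000000051f2faULL, 4) is 2: the address is 2 bytes
   past a 4-byte boundary, so ld4 traps. The same address is fine for a
   2-byte load, since misalignment(..., 2) is 0. */
```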

Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 9 of 12 (380 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

Thanks Dennis, we now have an address that we know is misaligned. From the lines:

 

;;; Line: 73

0xc00000000e8a3b80:0 <Foccur32+0x100>: adds r9=8,r8

;;; Line: 78
0xc00000000e8a3ba0:0 <Foccur32+0x120>: (p6) ld4 r9=[r9]

;;; Line: 83
0xc00000000e8a3ba0:2 <Foccur32+0x122>: (p6) add r42=r9,r32;;

 

it looks like r9 gets its value from r8, but it is not clear where r8 got its value from. Since we do not have any code after Frame 13, i.e. after the call to

tpcall(l_name,
l_service,
l_request_buf,
l_request_buf_len,
l_response_buf,
l_response_buf_len,
ztux_flag(l_flags));

 

can you suggest a way to work out which variable might have caused the alignment issue?

Acclaimed Contributor
Dennis Handly
Posts: 25,298
Registered: ‎03-06-2006
Message 10 of 12 (370 Views)

Re: HP-UX IA64 B.11.31 BUS_ADRALN

>it r9 gets its value from r8,

 

r8 is a pointer.  The code adds 8 to that pointer and then loads an int from the resulting address.

That int is then treated as a byte offset and used to load a second int, which is the misaligned one. Perhaps it is in some packed structure?  So if you don't have control over this data structure, you are going to have to follow the directions given by gdb when it detected that alignment trap.

 

Since you are processing this data structure, you know exactly what it does.

Of course if I had the source to Foccur32, I could make better guesses.  :-)

 

>we do not have anything code after Frame 13 i.e. after call to

 

What direction is "after"?  Please list your code by using start and ending frame numbers.  I.e. do you own frame 7?

Occasional Advisor
Phalgun
Posts: 8
Registered: ‎03-17-2014
Message 11 of 12 (354 Views)

Re: HP-UX_IA64_B.11.31_BUS_ADRALN

>r8 is a pointer.  It seems to increment that pointer by 8 and then extracts an int

 

I am not good at assembly. Could you please help me understand how you arrived at this conclusion?

 

>What direction is "after"?  Please list your code

Starting from the bottom, my code ends at Frame 14, and Frame 13 to Frame 0 are in the third-party (Oracle Tuxedo) library.

 

Also, I read in other similar threads on BUS_ADRALN that this issue sometimes occurs because certain members of a data structure sit at misaligned offsets. I have a hunch that this could be the issue here as well, but what causes members of a structure to be misaligned? Should the compiler not take care of padding enough bytes and aligning the addresses of members?

Acclaimed Contributor
Dennis Handly
Posts: 25,298
Registered: ‎03-06-2006
Message 12 of 12 (350 Views)

Re: HP-UX IA64 B.11.31 BUS_ADRALN

>could you help me understand how did you arrive at this conclusion? 

 

0xc00000000e8a3b80:0 <Foccur32+0x100>: adds r9=8,r8

0xc00000000e8a3ba0:0 <Foccur32+0x120>: (p6) ld4 r9=[r9]

 

Adds 8 to pointer in r8.  Then loads an int from that address.
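
For readers who do not read Itanium assembly, the sequence from line 73 through line 87 can be approximated in C roughly as below. This is only a guess reconstructed from the disassembly, not the real Foccur32 source: every name (`load_indexed_word`, `hdr`, `buf`, `end`, `type_bits`) is invented, the sketch treats a null `hdr` as an error where the real code merely skips the first load, and its bounds check is stricter than the single `cmp.geu` in the listing. It also uses `memcpy` for the final load, which is exactly what the original code does not do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction. hdr ~ r8, buf ~ r32 (first parm),
   end ~ r41. Returns 0 on success, -1 when the offset is unusable. */
static int load_indexed_word(const unsigned char *buf, const unsigned char *end,
                             const unsigned char *hdr, uint32_t *word_out)
{
    uint32_t offset;
    size_t avail;

    assert(word_out != NULL && end >= buf);
    if (hdr == NULL)                 /* cmp.ne.unc p6=r0,r8 */
        return -1;

    /* adds r9=8,r8 ; (p6) ld4 r9=[r9] : read an int 8 bytes into hdr */
    memcpy(&offset, hdr + 8, sizeof offset);

    /* (p6) add r42=r9,r32 ; cmp.geu.unc p6=r42,r41 ; branch if past end.
       The listing only compares r42 with r41; this check also makes
       sure all four bytes fit. */
    avail = (size_t)(end - buf);
    if (offset > avail || avail - offset < sizeof *word_out)
        return -1;

    /* ld4 r9=[r42] : the faulting load. With offset 0xa, buf + offset is
       not a multiple of 4; memcpy tolerates that, ld4 does not. */
    memcpy(word_out, buf + offset, sizeof *word_out);
    return 0;
}

/* extr.u r10=r9,25,7 : take the 7 bits starting at bit 25. */
static unsigned type_bits(uint32_t word)
{
    return (unsigned)((word >> 25) & 0x7Fu);
}
```

Read this way, the question becomes where the int stored at `hdr + 8` comes from, since that offset (0xa in the core file) is what pushes the second load off a 4-byte boundary.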

 

>Starting from the bottom my code ends at Frame 14,

 

Since you don't own the code where it aborts, about the only thing you can do is to look at the abort location and see if that address is passed in from your function, N levels away.

 

You need to turn off optimization in your code.

You could also see if gdb can tell you which symbol or module that address belongs to:

info sym 0x600000000051f2fa

info mod 0x600000000051f2fa

 

>what causes the certain members of a structure to be at misaligned?

 

Either the user asked for packed structs or the user manually assembled the "struct" with byte moves.

 

>Should the compiler not take care of padding enough bytes and aligning the address of members?

 

The compiler will do exactly what it is told.  Either the data is from a file or network or the user asked for packed structs.
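
A small C sketch of the packed-struct case, assuming a compiler that accepts the `#pragma pack(1)` / `#pragma pack()` spelling used by gcc and clang (the HP aC++ syntax described at the link gdb prints may differ). The struct and function names are invented for illustration. With natural layout the compiler pads the 4-byte member out to offset 4; with packing it lands at offset 1, so a plain 4-byte load of that member is misaligned unless the code copies the bytes out instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Natural layout: 3 padding bytes after `tag` keep `length` on a
   4-byte boundary (on the usual ABIs where uint32_t aligns to 4). */
struct natural_rec {
    uint8_t  tag;
    uint32_t length;
};

/* Packed layout: no padding, so `length` starts at byte offset 1. */
#pragma pack(1)
struct packed_rec {
    uint8_t  tag;
    uint32_t length;
};
#pragma pack()

/* Hypothetical accessor: fetches `length` from a raw packed record
   without a misaligned 4-byte load, by copying the bytes. */
static uint32_t packed_length(const unsigned char *raw)
{
    uint32_t len;
    assert(raw != NULL);
    memcpy(&len, raw + offsetof(struct packed_rec, length), sizeof len);
    return len;
}
```

The same situation arises without any pragma when a record is read from a file or the network into a byte buffer and then accessed through a cast pointer at an odd offset.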

 

You may want to just give up on fiddling and just tell the hardware to do misaligned loads.  If your program then works, that's the problem.

http://h21007.www2.hp.com/portal/download/files/unprot/aCxx/HTML_Online_Help/pragmas.htm#pragma-pack...
