Declaring and indexing an integer array of qwords in assembly

Question:

I have a question regarding how to initialize an array in assembly. I tried:

.bss
#the array
unsigned:    .skip 10000
.data
#these are the values that I want to put in the array
par4:   .quad 500 
par5:   .quad 10
par6:   .quad 15

That's how I declared my array and the values that I want to put in it. This is how I tried to store them into the array:

movq $0 , %r8

movq par4 , %rax
movq %rax , unsigned(%r8)
incq %r8

movq par5 , %rax
movq %rax , unsigned(%r8)
incq %r8

movq par6 , %rax
movq %rax , unsigned(%r8)

I tried printing the elements to check that everything is okay. Only the last one prints correctly; the other two have some weird values.

Maybe this is not the way I should declare and work with it?

Answer:

First of all, unsigned is the name of a type in C, so it's a poor choice of name for an array. Let's call it arr instead.

You want to treat that block of space in the BSS as an array of qword elements, so each element is 8 bytes. That means you need to store to arr+0, arr+8, and arr+16. (The total size of your array is 10000 bytes, which is 10000/8 = 1250 qwords.)
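
As a rough model of that layout in C (the helper names are made up for illustration and are not part of the original code), element i of a qword array sits at byte offset i * 8, and a 10000-byte block holds 1250 such elements:

```c
#include <stddef.h>
#include <stdint.h>

/* Size of one array element: a qword is 8 bytes. */
enum { QWORD_SIZE = sizeof(uint64_t) };

/* Byte offset of element `index` from the start of the array (hypothetical helper). */
static size_t qword_offset(size_t index)
{
    return index * QWORD_SIZE;
}

/* Number of whole qword elements that fit in `bytes` bytes (hypothetical helper). */
static size_t qword_capacity(size_t bytes)
{
    return bytes / QWORD_SIZE;
}
```

In the assembly, the offsets 0, 8 and 16 are what has to end up in the addressing mode, either as a byte offset in a register or as an index scaled by 8.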

But you're using %r8 as a byte offset, not a scaled index. That's generally a good thing, all else equal; indexed addressing modes are slower in some cases on some CPUs. The problem is that you only increment it by 1 with inc, instead of advancing it by 8 with add $8, %r8.

So you're actually storing to arr+0, arr+1, and arr+2, with 8-byte stores that overlap each other. Each later store overwrites all but the least-significant byte of the one before it. x86 is little-endian, so the resulting contents of memory are effectively this, followed by the rest of the unwritten bytes that stay zero:

# static array that matches what you actually stored
arr: .byte 500 & 0xFF, 10, 15, 0, 0, 0, 0, 0, 0, 0, ...
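
To check that byte pattern without running the assembly, here is a small C model of the three overlapping stores. It is only a sketch: the helper names and the 16-byte buffer are assumptions for illustration, and the little-endian layout is written out by hand so the result does not depend on the host CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE 16u   /* small stand-in for the 10000-byte array */

/* Write `value` as 8 little-endian bytes at byte offset `off`, like movq %rax, arr(%r8). */
static void store_le64(uint8_t *mem, size_t off, uint64_t value)
{
    assert(off + 8 <= MEM_SIZE);
    for (size_t i = 0; i < 8; i++)
        mem[off + i] = (uint8_t)(value >> (8 * i));
}

/* Read 8 little-endian bytes at byte offset `off` back into a 64-bit value. */
static uint64_t load_le64(const uint8_t *mem, size_t off)
{
    assert(off + 8 <= MEM_SIZE);
    uint64_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value |= (uint64_t)mem[off + i] << (8 * i);
    return value;
}

/* Replay the question's three stores: the offset advances by 1 byte, not 8. */
static void replay_buggy_stores(uint8_t *mem)
{
    size_t off = 0;                 /* plays the role of %r8 */
    store_le64(mem, off, 500);      /* fills bytes 0..7 */
    off += 1;                       /* incq %r8 */
    store_le64(mem, off, 10);       /* fills bytes 1..8: only the low byte of 500 survives */
    off += 1;                       /* incq %r8 */
    store_le64(mem, off, 15);       /* fills bytes 2..9: only the low byte of 10 survives */
}
```

After replay_buggy_stores on a zeroed buffer, the first three bytes are 0xF4, 10 and 15. Assuming the elements were printed by reloading qwords from the same byte offsets 0, 1 and 2 (the question does not show that code), load_le64 returns 985844, 3850 and 15, which matches the report that only the last value looks right.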

You could of course just use .quad in the .data section to declare a static array with the contents you want. But with only the first 3 elements non-zero, putting an array that large in the BSS makes sense, instead of having the OS page in the zeros from disk.

If you're going to fully unroll instead of using a loop over your 3-element qword array starting at par4, you don't need to increment a register at all. You also don't need the initializers to be in data memory; you can just use immediates, because they all fit as sign-extended 32-bit values.
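
The "fits as a sign-extended 32-bit immediate" condition can be written as a simple range check. The C helper below is a hypothetical illustration of that rule, not something the assembler exposes:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `value` survives being truncated to 32 bits and sign-extended back to 64,
   which is how movq $imm32, mem treats its immediate. */
static bool fits_sign_extended_imm32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

500, 10 and 15 all pass this check. A value outside that range, such as 0x80000000, cannot be stored to memory with a single movq and would have to be loaded into a register first.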

  # these are assemble-time constants, not associated with a section
.equ par4, 500
.equ par5, 10
.equ par6, 15

.text  # already the default section but whatever

.globl _start
_start:
    movq    $par4, arr(%rip)            # use RIP-relative addressing when there's no register
    movq    $par5, arr+8(%rip)
    movq    $par6, arr+16(%rip)

    xor     %edi, %edi    # exit status 0
    mov     $60, %eax     # __NR_exit
    syscall               # Linux exit(0)

.bss
    arr:   .skip 10000

You can run that under GDB and examine memory to see what you get. (Build it with gcc -nostdlib -static foo.s.) In GDB, start the program with starti (to stop at the entry point), then single-step with si. Use x/4g &arr to dump the contents of memory at arr as an array of 4 qwords.

Or if you did want to use a register, you might as well increment a pointer instead of an index.

    lea     arr(%rip), %rdi           # or mov $arr, %edi in a non-PIE executable
    movq    $par4, (%rdi)
    add     $8, %rdi                  # advance the pointer 8 bytes = 1 element
    movq    $par5, (%rdi)
    add     $8, %rdi
    movq    $par6, (%rdi)

Or a scaled index:

## Scaled-index addressing
    movq    $par4, arr(%rip)
    mov     $1, %eax
    movq    $par5, arr(,%rax,8)       # [arr + rax*8]; 32-bit absolute address, so non-PIE only
    inc     %eax
    movq    $par6, arr(,%rax,8)

Fun trick: you could do just a byte store instead of a qword store to set the low byte and leave the rest zero. (That works for 10 and 15, which fit in one byte; 500 does not.) This would save code size, but if you did a qword load right away, you'd get a store-forwarding stall: roughly 10 cycles of extra latency for the store/reload to merge data from the cache with the store from the store buffer.
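
A narrower store only produces the right qword if the value fits in the bytes written and the remaining bytes are already zero, as they are in the BSS. As a sketch of that reasoning (the helper name min_store_width is made up for this example), the narrowest usable store for a given value can be computed like this:

```c
#include <stdint.h>

/* Smallest store width in bytes (1, 2, 4 or 8) that still yields `value`
   when written to the low end of a zeroed little-endian qword. */
static unsigned min_store_width(uint64_t value)
{
    if (value <= UINT8_MAX)  return 1;   /* movb is enough */
    if (value <= UINT16_MAX) return 2;   /* needs movw */
    if (value <= UINT32_MAX) return 4;   /* needs movl */
    return 8;                            /* needs a full qword store */
}
```

For the three values in the question this gives 1 for 10 and 15, and 2 for 500, so the byte-store trick covers only the second and third elements.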

Or if you did still want to copy 24 bytes from par4 in .rodata, you could use SSE. x86-64 guarantees that SSE2 is available.

    movaps   par4(%rip), %xmm0
    movaps   %xmm0, arr(%rip)          # copy par4 and par5

    mov      par6(%rip), %rax          # aka par4+16
    mov      %rax, arr+16(%rip)

.section .rodata          # read-only data.
.p2align 4         # align by 2^4 = 16 for movaps
  par4:  .quad 500
  par5:  .quad 10
  par6:  .quad 15

.bss
.p2align 4        # align by 16 for movaps
  arr: .skip 10000
# or use .lcomm arr, 10000  without even switching to .bss

Or with SSE4.1, you can load+expand small constants so you don't need a whole qword for each small number that you're going to copy into the BSS array.

    pmovzxwq   initializers(%rip), %xmm0       # zero-extend 2 words into 2 qwords
    movaps     %xmm0, arr(%rip)
    movzwl     initializers+4(%rip), %eax      # zero-extending word load
    mov        %rax, arr+16(%rip)

.section .rodata
  initializers: .word 500, 10, 15
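
For comparison, here is a scalar C model of what the packed zero-extending load above does, one element at a time instead of two per instruction. The function name is an assumption for illustration; the array names mirror the labels in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Zero-extend `count` 16-bit words from `src` into 64-bit qwords in `dst`. */
static void zero_extend_words(uint64_t *dst, const uint16_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];    /* uint16_t -> uint64_t conversion zero-extends */
}
```

With initializers holding 500, 10 and 15 as 16-bit words, the source data takes 6 bytes of read-only storage instead of the 24 bytes needed for three .quad values, and the destination still ends up with three full qwords.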
