In what situation would the AVX2 gather instructions be faster than individually loading the data?

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating-point array is permuted and added to another. In C, this can be implemented as

void vectortest(double *a, double *b, unsigned int *ind, unsigned int N)
{
    unsigned int i;   /* matches the type of N, avoiding a signed/unsigned comparison */
    for(i=0;i<N;++i)
    {
        a[i]+=b[ind[i]];
    }
}

I compile this function with g++ -O3 -march=native. Now, I implement this in assembly in three ways. For simplicity I assume that the length of the arrays N is divisible by four. The simple, non-vectorized implementation:

align 4
global vectortest_asm
vectortest_asm:
        ;;  double * a = rdi
        ;;  double * b = rsi
        ;;  unsigned int * ind = rdx
        ;;  unsigned int N = rcx

        push rax
        xor rax,rax

loop:   sub rcx, 1
        mov eax, [rdx+rcx*4]    ;eax = ind[rcx]
        vmovq xmm0, [rdi+rcx*8]         ;xmm0 = a[rcx]
        vaddsd xmm0, [rsi+rax*8]        ;xmm0 += b[rax] ( and b[rax] = b[eax] = b[ind[rcx]])
        vmovq [rdi+rcx*8], xmm0
        cmp rcx, 0
        jne loop

        pop rax

        ret

The loop vectorised without the gather instruction:

loop:   sub rcx, 4

        mov eax,[rdx+rcx*4]    ;load the indices from ind and the corresponding values from b into xmm1-xmm4
        vmovq xmm1,[rsi+rax*8]
        mov eax,[rdx+rcx*4+4]
        vmovq xmm2,[rsi+rax*8]

        mov eax,[rdx+rcx*4+8]
        vmovq xmm3,[rsi+rax*8]
        mov eax,[rdx+rcx*4+12]
        vmovq xmm4,[rsi+rax*8]

        vmovlhps xmm1,xmm2     ;now collect them all to ymm1
        vmovlhps xmm3,xmm4
        vinsertf128 ymm1,ymm1,xmm3,1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop

And finally, an implementation using vgatherdpd:

loop:   sub rcx, 4
        vmovdqu xmm2,[rdx+4*rcx]           ;load the offsets from array ind to xmm2
        vpcmpeqw ymm3,ymm3                 ;set ymm3 to all ones, since it acts as the mask in vgatherdpd
        vgatherdpd ymm1,[rsi+8*xmm2],ymm3  ;now gather the elements from array b to ymm1

        vaddpd ymm1, ymm1, [rdi+rcx*8]
        vmovupd [rdi+rcx*8], ymm1

        cmp rcx, 0
        jne loop

I benchmark these functions on a machine with a Haswell CPU (Xeon E3-1245 v3). Some typical results are (times in seconds):

Array length 100, function called 100000000 times.
Gcc version: 6.67439
Nonvectorized assembly implementation: 6.64713
Vectorized without gather: 4.88616
Vectorized with gather: 9.32949

Array length 1000, function called 10000000 times.
Gcc version: 5.48479
Nonvectorized assembly implementation: 5.56681
Vectorized without gather: 4.70103
Vectorized with gather: 8.94149

Array length 10000, function called 1000000 times.
Gcc version: 7.35433
Nonvectorized assembly implementation: 7.66528
Vectorized without gather: 7.92428
Vectorized with gather: 8.873

The gcc version and the non-vectorized assembly version are very close to each other. (I also checked the assembly output of gcc, which is quite similar to my hand-coded version.) The vectorization gives some benefit for small arrays, but is slower for large arrays. The big surprise (to me at least) is that the version using vgatherdpd is so slow. So, my question is: why? Am I doing something stupid here? Can someone provide an example where the gather instruction would actually give a performance benefit over just doing multiple load operations? If not, what is the point of having such an instruction at all?

The test code, complete with a makefile for g++ and nasm, is available at https://github.com/vanhala/vectortest.git in case somebody wants to try this out.

Unfortunately the gathered load instructions are not particularly "smart" - they seem to generate one bus cycle per element, regardless of the load addresses, so even if you happen to have contiguous elements there is apparently no internal logic for coalescing the loads. So in terms of efficiency a gathered load is no better than N scalar loads, except that it uses only one instruction.

The only real benefit of the gather instructions is when you are implementing SIMD code anyway, and you need to load non-contiguous data to which you are then going to apply further SIMD operations. In that case a SIMD gathered load will be a lot more efficient than the bunch of scalar code that would typically be generated by e.g. _mm256_set_xxx() (or a sequence of contiguous loads, permutes, etc., depending on the actual access pattern).