使用SSE/AVX/AVX2检查__m128i的所有字节是否匹配单个字节
我正在寻找有效的方法来计算以下函数:
I'm looking for efficient ways to compute the following function:
输入: __ m128i数据,uint8_t输入
;
输出:布尔值,指示 data
中的任何字节是否在 in
中.
Output: boolean indicating if any byte in data
is in
.
我实际上是在使用它们为容量为8的字节实现时空有效的堆栈.我最有效的解决方案是首先计算一个 __ m128i tmp
,所有字节均作为 in
.然后检查 tmp \ xor数据
中的任何字节是否为零字节.
I'm essentially using them to implement a space/time efficient stack for bytes with capacity 8. My most efficient solution is to first compute a __m128i tmp
with all bytes as in
. Then check if any bytes in tmp\xor data
is a zero byte.
是的,AVX2具有高效的字节广播功能.具有全零掩码的SSSE3 pshufb
同样便宜,但是您必须创建随机控制向量.AVX512BW/F甚至具有单指令 vpbroadcastb/w/d/qx/y/zmm,r32 .(使用可选的遮罩,因此您可以根据需要将一些零或与现有矢量合并,例如,使用一位掩码将其插入某个位置.)
Yes, AVX2 has efficient byte-broadcast. SSSE3 pshufb
with an all-zero mask is just as cheap, but you have to create the shuffle-control vector. AVX512BW/F even has single-instruction vpbroadcastb/w/d/q x/y/zmm, r32
. (With optional masking, so you can zero some or merge with an existing vector if you want, e.g. insert at a position using a single-bit mask.)
幸运的是,编译器在实现 _mm_set1_epi8
时知道如何执行此操作,因此我们可以将其留给编译器.
Fortunately, compilers know how to do this when implementing _mm_set1_epi8
so we can leave it to the compiler.
然后将其简化为通常的 pcmpeqb
/ pmovmskb
以获得一个整数,该整数将具有一个 1
位用于匹配元素,您可以分支.
Then it just boils down to the usual pcmpeqb
/pmovmskb
to get an integer which will have a 1
bit for matching elements, which you can branch on.
// 0 for not found, non-zero for found. (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
__m128i k = _mm_set1_epi8(needle);
__m128i cmp = _mm_cmpeq_epi8(data, k); // vector mask
return _mm_movemask_epi8(cmp); // integer bitmask
}
正如你所期望,所有的编译器让这款采用该ASM(的 Godbolt )
contains(long long __vector(2), unsigned char):
vmovd xmm1, edi
vpbroadcastb xmm1, xmm1
vpcmpeqb xmm0, xmm0, xmm1
vpmovmskb eax, xmm0
ret
除了MSVC,MSVC首先浪费了 movsx eax,dl
上的一条指令.(Windows x64在RDX中传递了第二个arg,而x86-64 System V在RDI中传递了第一个 integer arg.)
Except MSVC, which wastes an instruction on movsx eax, dl
first. (Windows x64 passes the 2nd arg in RDX, vs. x86-64 System V passing the first integer arg in RDI.)
如果没有AVX2,则使用SSSE3或更高版本会得到类似的东西
Without AVX2, you'll get something like this with SSSE3 or above
# gcc8.3 -O3 -march=nehalem
contains(long long __vector(2), unsigned char):
movd xmm1, edi
pxor xmm2, xmm2
pshufb xmm1, xmm2 # _mm_shuffle_epi8(needle, _mm_setzero_si128())
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
ret
或仅使用SSE2(x86-64的基准):
Or with just SSE2 (baseline for x86-64):
contains(long long __vector(2), unsigned char):
mov DWORD PTR [rsp-12], edi
movd xmm1, DWORD PTR [rsp-12] # gcc's tune=generic strategy is still store/reload /facepalm
punpcklbw xmm1, xmm1 # duplicate to low 2 bytes
punpcklwd xmm1, xmm1 # duplciate to low 4 bytes
pshufd xmm1, xmm1, 0 # broadcast
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
ret
相关:
SIMD/SSE:如何检查所有向量元素是否均为非零( pxor
+ ptest
+ jcc
= 4 uos vs. pcmpeqb
+ pmovmskb
+宏融合的 test/jcc
= 3 oups.)
SIMD/SSE: How to check that all vector elements are non-zero (pxor
+ptest
+jcc
= 4 uops vs. pcmpeqb
+pmovmskb
+ macro-fused test/jcc
= 3 uops.)
非索引-SSE/AVX寄存器的零字节(查找匹配位置)
如何使用SIMD计算字符出现次数(像memchr一样,但是使用AVX2进行匹配计数而不是找到第一个匹配项.有效地积累了计数,并且有效地实现了水平和.)
How to count character occurrences using SIMD (like memchr but counting matches instead of finding the first, using AVX2. With efficient accumulation of the counts, and efficient horizontal sums.)