将 ASCII 格式的大十进制数(128 位)转换为二进制(十六进制)

问题描述:

我希望这是我在这里提出的与此主题相关的最后一个问题!

I'm hoping this is going to be my final question on here relating to this subject!

我正在寻找一种方法,将编码为 ASCII 的巨大十进制数转换为其 128 位十六进制(二进制)表示.

I'm looking for a way to convert huge decimal numbers that are encoded as ASCII into their 128bit hexadecimal (binary) representation.

这些实际上是以十进制表示的 IPv6 地址.

These are actually IPv6 addresses presented in their decimal notation.

例如:55844105986793442773355413541572575232"解析为:0x2a032f00000000000000000000000000

for example: "55844105986793442773355413541572575232" resolves to: 0x2a032f00000000000000000000000000

我的大部分代码都在 x86-32 MASM 汇编中,所以我宁愿保持这种方式而不是在不同语言之间切换.

The majority of my code is in x86-32 MASM assembly, so I'd rather keep it this way than chopping between different languages.

我有可以在 python 中运行的代码,但如上所述,我想在 x86 asm 中拥有所有内容.

I've got code that works in python, but as above, I'd like to have everything in x86 asm.

Hex 是二进制的 ASCII 序列化格式.您首先要在寄存器中将 ASCII 十进制转换为二进制整数.然后将该二进制文件转换为十六进制.十六进制 != 二进制.

Hex is an ASCII serialization format for binary. You're going to want to first convert from ASCII-decimal to binary integers in registers. Then convert that binary to hex. Hex != binary.

二进制 ->十六进制很容易;每个二进制字节分别转换为两个 ASCII 十六进制数字.(或每个 dword 到 8 个十六进制数字).请参阅如何将二进制整数转换为十六进制字符串?了解简单循环以及使用 SSE2、SSSE3、AVX2 的有效方法、AVX512F 或 AVX512VBMI,一次将 64 位输入转换为 16 字节的十六进制,或者使用 AVX2 甚至可以一步完成整个 128 位/16 字节的输入并生成所有 32 字节的十六进制数字.

binary -> hex is easy; each binary byte converts separately to two ASCII hex digits. (Or each dword to 8 hex digits). See How to convert a binary integer number to a hex string? for a simple loop, and for efficient ways using SSE2, SSSE3, AVX2, AVX512F, or AVX512VBMI to convert 64 bits of input at a time into 16 bytes of hex, or with AVX2 even do your whole 128-bit / 16-byte input in one step and produce all 32 bytes of hex digits.

那只剩下十进制-ASCII ->unsigned __int128 input 问题.使用 shld/.../shl(从高位双字开始)进行 128 位移位,并使用 add/adc/adc/adc 添加(从低 dword 开始)很简单,因此您可以实现通常的 total = total * 10 + digit (NASM 程序集将输入转换为整数?)但具有扩展精度的 128 位整数数学.保存 128 位整数需要 4 个 32 位寄存器.

That just leaves the decimal-ASCII -> unsigned __int128 input problem. 128-bit shift with shld/.../shl (starting with the high dword) and add with add/adc/adc/adc (starting with the low dword) are straightforward, so you can implement the usual total = total * 10 + digit (NASM Assembly convert input to integer?) but with extended-precision 128-bit integer math. It takes 4x 32-bit registers to hold a 128-bit integer.

t*10 实现为 t*2 + t*8 = (t*2) + (t*2)*4 首先使用 3xcode>shld 和 add eax,eax,或 add eax,eax + 3x adc same,same.然后复制并再移位2,然后将两个128位数字相加.

Implement t*10 as t*2 + t*8 = (t*2) + (t*2)*4 by first doubling using either 3x shld and add eax,eax, or add eax,eax + 3x adc same,same. Then copy and shift by another 2, then add the two 128-bit numbers together.

但是只有 7 个 GP 整数寄存器(不包括堆栈指针),您必须将某些内容溢出到内存中.而且您还希望将字符串输入指针放在寄存器中.

But with only 7 GP integer registers (not counting the stack pointer), you'd have to spill something to memory. And you also want your string input pointer in a register.

因此,您可能希望在 4x 寄存器中左移 1,然后将它们溢出到内存中并在寄存器中再移 2.然后从溢出它们的堆栈缓冲区中 add/3xadc .这使您可以将 4 个 reg 中的 128 位整数乘以 10,而无需使用任何额外的寄存器.

So probably you'd want to left-shift by 1 in your 4x registers, then spill them to memory and shift by another 2 in registers. Then add/3xadc from the stack buffer where you spilled them. That lets you multiply a 128-bit integer in 4 regs by 10 without using any extra registers.

    ; input:  total = 128-bit integer in  EBX:ECX:EDX:EAX
     ; 16-byte tmp buffer at [esp]
    ; result: total *= 10  in-place
    ; clobbers: none

    ; it's traditional to keep a 64-bit integer in EDX:EAX, e.g. for div or from mul
    ; I chose EBX:ECX for the high half so it makes an easy-to-remember pattern.

;;; total *= 2  and copy to tmp buf
    add   eax, eax             ; start from the low element for carry propagation
    mov   [esp + 0], eax
    adc   edx, edx
    mov   [esp + 4], edx
    adc   ecx, ecx
    mov   [esp + 8], ecx
    adc   ebx, ebx
    mov   [esp + 12], ebx

;;; shift that result another 2 to get   total * 8
    shld  ebx, ecx, 2        ; start from the high element to pull in unmodified lower bits
    shld  ecx, edx, 2
    shld  edx, eax, 2
    shl   eax, 2

;;; add total*2 from memory to total*8 in regs to get  total*10
    add   eax, [esp + 0]
    adc   edx, [esp + 4]
    adc   ecx, [esp + 8]
    adc   ebx, [esp + 12]

乱序执行在这里非常很有帮助.请注意,在shld 块中,指令实际上依赖于前面的shld.他们从未修改的较低元素中提取位.第一个add eax,eax一运行,shl eax,2就可以运行了(如果前端已经下发了)

Out-of-order execution is very helpful here. Notice that in the shld block, the instructions don't actually depend on the previous shld. They pull in bits from unmodified lower elements. As soon as the first add eax,eax runs, shl eax,2 can run (if the front-end has already issued it).

寄存器重命名使运行该 SHL 成为可能,而不会因 WAR(读后写)危险而停顿.shld edx, eax, 2 也需要 EAX 作为输入,但寄存器重命名的重点是让 CPU 跟踪 EAX 的版本与 shl eax 的输出分开,2.

Register renaming makes it possible to run that SHL without stalling for a WAR (Write-after-read) hazard. The shld edx, eax, 2 also needs EAX as an input, but the whole point of register renaming is to let the CPU track that version of EAX separately from the output of the shl eax,2.

这让我们可以编写不使用很多架构寄存器(仅这 4 个)的代码,但仍然利用更多的物理寄存器让 shld/shl 块以与程序顺序相反的顺序执行,因为输入准备就绪来自 add/adc 块.

This lets us write code that doesn't use many architectural registers (just these 4), but still takes advantage of more physical registers to let the shld/shl block execute in the opposite order from program order, as inputs become ready from the add/adc block.

这很棒,因为这意味着最终的 add/adc 块(从内存中添加)已按照其需要的顺序准备好输入,而无需序列化任一指令链的延迟.这很好,因为 shld 在当前的 Intel CPU(如 Haswell/Skylake)上有 3 个周期延迟,而在 Sandybridge/IvyBridge 上为 1 个.(这是在 Nehalem 和更早版本上具有 2c 延迟的 2-uop 指令).但是在 Haswell/Skylake 上,它仍然是 1 uop,每时钟 1 个吞吐量.(仅端口 1)

This is great because it means that the final add/adc block (adding from memory) has its inputs ready in the order it needs them, without serializing the latencies of either chain of instructions. This is good because shld has 3 cycle latency on current Intel CPUs (like Haswell/Skylake), up from 1 on Sandybridge/IvyBridge. (It was a 2-uop instruction with 2c latency on Nehalem and earlier). But on Haswell/Skylake it's still 1 uop with 1-per-clock throughput. (port 1 only)

Ryzen 具有较慢的 shld:6 uops,3 个周期延迟,每 3 个周期一个吞吐量.(https://agner.org/optimize/)

Ryzen has slower shld: 6 uops, 3 cycle latency, one-per-3-cycle throughput. (https://agner.org/optimize/)

即使按照程序顺序,每个块是单独完成的,我们也可以有效地同时运行 3 个添加或移位链.一旦我们添加了第 4 个区块的新数字,它也可以在飞行中.

We effectively can have 3 add or shift chains in flight at once even though in program order each block is done separately. And once we add in the new digit with a 4th block, it can be in flight, too.

示例循环.输入 EBX:ECX:EDX = 0 和 EAX = 第一个数字,准备检查第二个字符是否为数字,然后执行 total = t*10 + digit.

Example loop. Enter it with EBX:ECX:EDX = 0 and EAX = the first digit, ready to check the 2nd character for being a digit and then do total = t*10 + digit.

.digit_loop:
    ... earlier block    ; total *= 10

    add    eax, ebp      ; total += digit
    adc    edx, 0
    adc    ecx, 0
    adc    ebx, 0

.loop_entry_point:
    inc    esi
    movzx  ebp, byte ptr [esi]    ; load a new input digit

    sub    ebp, '0'               ; ASCII digit -> 0..9 integer
    cmp    ebp, 9                 ; unless it was out of range
    jbe   .digit_loop
;else fall through on a non-digit.

; ESI points at the first non-digit
; EBX:ECX:EDX:EAX holds the 128-bit binary integer.

您可以在重新加载 total*2 之前将 total += digit 向上移动,以更好地隐藏存储转发延迟.

You could move the total += digit up to before the reload of total*2 to better hide store-forwarding latency.

另一个可能的选择是 4x mul 和部分产品的必要 add/adc.如果您可以假设 mulx 的 BMI2 在不影响标志的情况下相乘,那么这可能会很好,这样您就可以将 mulx 与 adc 交错.但是你需要在寄存器中使用 10.

Another possible option is 4x mul and the requisite add/adc of the partial products. That might be nice if you can assume BMI2 for mulx to multiply without affecting flags so you can interleave mulx with adc. But then you'd need 10 in a register.

另一种选择是使用 XMM 寄存器进行 SSE2 64 位整数数学运算.或用于 64 位 MMX regs 的 MMX.但是,处理 64 位元素边界很不方便,因为只有标量整数才有加进位.但可能仍然值得,因为您只有一半的操作.

Another option is to use XMM registers for SSE2 64-bit integer math. Or MMX for 64-bit MMX regs. Dealing with 64-bit-element boundaries is inconvenient, though, because only scalar integer has add-with-carry. But possibly still worth it because you only have half the number of operations.

最好将 9 位整数组转换为 32 位十进制数,然后将扩展精度乘以 1e9 进行组合.(比如最后 9 位数字,之前的 9 位数字等)所以你没有所有这些 adc/store+reload 工作每个数字.这将意味着在最后进行大量乘法以组合多达四(?)组数字.

It might be better to convert 9-digit groups of integers to 32-bit decimal, then do extended-precision multiplies by 1e9 to combine. (Like the last 9 digits, the 9 digits before that, etc.) So you don't have all this adc / store+reload work for every digit. That would mean a significant amount of multiplying at the end to combine up to four(?) groups of digits.

或者可能只是用单个寄存器处理前 9 位数字(正常方式),然后用第二个循环扩大到两个寄存器,然后在第 18 个之后扩大到 4 个数字.这对于结果小于 9 位的数字很有用,只使用快速 1 寄存器累加器.

Or maybe just process the first 9 digits with a single register (the normal way), then widen to two registers with a 2nd loop, then widen to four for digits after the 18th. That would be good for numbers that turn out to be shorter than 9 digits, only ever using the fast 1-register accumulator.