Why is float division faster than integer division in C++?
Consider the following code snippets in C++ (Visual Studio 2015):
First block:
const int size = 500000000;
int sum = 0;
int *num1 = new int[size];  // initialized with values between 1 and 250
int *num2 = new int[size];  // initialized with values between 1 and 250
for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}
Second block:
const int size = 500000000;
int sum = 0;
float *num1 = new float[size];  // initialized with values between 1 and 250
float *num2 = new float[size];  // initialized with values between 1 and 250
for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}
I expected the first block to run faster because it uses integer operations. However, the second block is considerably faster, even though it uses floating-point operations. Here are the results of my benchmark:
Division:
Type Time
uint8 879.5ms
uint16 885.284ms
int 982.195ms
float 654.654ms
Floating-point multiplication is also faster than integer multiplication. Here are the results of my benchmark:
Multiplication:
Type Time
uint8 166.339ms
uint16 524.045ms
int 432.041ms
float 402.109ms
My system spec: Intel Core i7-7700 CPU, 64 GB RAM, Visual Studio 2015.
int32_t division requires fast division of 31-bit numbers, whereas float division requires fast division of 24-bit mantissas (the leading one in the mantissa is implied and not stored in a floating-point number) and a faster subtraction of 8-bit exponents.
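The split into a mantissa division and an exponent subtraction can be illustrated in software. The following is only a sketch, not how the hardware divider is implemented: divide_by_parts is a hypothetical helper name, it assumes finite, nonzero, normal inputs, and std::frexp normalizes the mantissa to [0.5, 1) rather than the [1, 2) range used by the IEEE 754 encoding.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: computes a / b by treating mantissa and exponent separately.
// Assumes finite, nonzero, normal inputs (no NaN, infinity, zero or subnormal handling).
float divide_by_parts(float a, float b) {
    int exp_a = 0;
    int exp_b = 0;
    // std::frexp splits x into m * 2^e with 0.5 <= |m| < 1.
    float man_a = std::frexp(a, &exp_a);
    float man_b = std::frexp(b, &exp_b);

    float man_q = man_a / man_b;  // division of the short mantissas
    int exp_q = exp_a - exp_b;    // cheap integer subtraction of the exponents

    // Recombine: man_q * 2^exp_q.
    return std::ldexp(man_q, exp_q);
}
```

For example, divide_by_parts(250.0f, 8.0f) splits 250 into 0.9765625 * 2^8 and 8 into 0.5 * 2^4, divides the mantissas to get 1.953125, subtracts the exponents to get 4, and recombines them into 31.25.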
It may be worth mentioning that SSE and AVX instructions provide only floating-point division, not integer division. SSE instructions/intrinsics can easily be used to quadruple the speed of your float calculation.
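As a rough sketch of what using the SSE intrinsics could look like, the function below divides two float arrays element by element, four lanes at a time. divide_sse is a hypothetical helper name, the code assumes an x86 or x86-64 target where <xmmintrin.h> is available, and the actual speedup depends on the compiler and on memory bandwidth, so it should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 or x86-64 target

// Hypothetical helper: out[i] = a[i] / b[i], processing four floats per packed division.
void divide_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of four floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_div_ps(va, vb));  // four divisions at once
    }
    for (; i < n; ++i) {  // scalar tail for the remaining 0 to 3 elements
        out[i] = a[i] / b[i];
    }
}
```

The unaligned load and store variants are used so that the sketch works with arrays from plain new[]; with 16-byte aligned buffers, _mm_load_ps and _mm_store_ps could be used instead.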
If you look into Agner Fog's instruction tables, for example for Skylake, the latency of 32-bit integer division is 26 CPU cycles, whereas the latency of SSE scalar float division is 11 CPU cycles (and, surprisingly, it takes the same time to divide four packed floats).
Also note that in C and C++ there is no division on types narrower than int, so uint8_t and uint16_t operands are first promoted to int, and then an int division is performed. uint8_t division looks faster than int division because the promoted values have fewer bits set, which causes the division to complete faster.
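The promotion can be checked at compile time. This is a minimal sketch assuming a platform where int is wider than 16 bits (as on the system in the question); divide_u8 is a hypothetical helper used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Both operands are promoted to int before dividing, so the result type is int.
// Assumes int is wider than 16 bits; otherwise uint16_t would promote to unsigned int.
static_assert(std::is_same<decltype(std::uint8_t{1} / std::uint8_t{1}), int>::value,
              "uint8_t / uint8_t is computed as int");
static_assert(std::is_same<decltype(std::uint16_t{1} / std::uint16_t{1}), int>::value,
              "uint16_t / uint16_t is computed as int");

// Hypothetical helper: the division itself is an int division of the promoted values.
int divide_u8(std::uint8_t a, std::uint8_t b) {
    return a / b;  // integer division, truncates toward zero
}
```

If the static_asserts compile, the compiler has confirmed that no 8-bit or 16-bit division takes place at the language level; any speed difference comes from the values being divided, not from a narrower operation.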