How to optimize a nested for loop in Python

Problem description:

So I am trying to write a Python function to return a metric called the Mielke-Berry R value. The metric is calculated like so (with x the observed and y the forecasted array, both of length n, and MAE the mean absolute error):

R = 1 - (n^2 * MAE) / (sum_i sum_j |y_j - x_i|)

The current code I have written works, but because of the sum of sums in the equation, the only way I could think of to solve it was a nested for loop in Python, which is very slow...

Here is my code:

import numpy as np

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    for i in range(len(y)):
        for j in range(len(y)):
            total = total + abs(y[j] - x[i])
    # mae is a mean-absolute-error helper defined elsewhere in my code
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

The reason I converted the input arrays to lists is that I have heard (haven't yet tested) that indexing a numpy array with a Python for loop is very slow.

I feel like there may be some sort of numpy function to solve this much faster, does anyone know of anything?

Broadcasting in numpy

If you are not memory constrained, the first step to optimizing nested loops in numpy is to use broadcasting and perform the operations in a vectorized way:

import numpy as np

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    # Broadcasting builds the full (N, N) array of pairwise differences
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

But while in this case the looping happens in C instead of Python, it involves allocating an array of size (N, N).
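To get a feel for that allocation: with float64 inputs of length N, the broadcast difference materializes an N x N array before summing. A quick back-of-the-envelope estimate (plain Python; the 8 bytes is just the size of a float64):

```python
# Size of the (N, N) float64 intermediate that
# forecasted_array[:, np.newaxis] - observed_array would allocate
N = 10_000
bytes_per_float64 = 8
intermediate_bytes = N * N * bytes_per_float64
print(intermediate_bytes / 1e9, "GB")  # 0.8 GB for N = 10,000
```

So for N around 10^5 the intermediate alone would need roughly 80 GB, which is why this approach stops scaling.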

As noted above, broadcasting implies a huge memory overhead, so it should be used with care; it is not always the right way. Your first impression may be to use it everywhere, but do not. Not so long ago I was also confused by this fact; see my question Numpy ufuncs speed vs for loop speed. To avoid being too verbose, I will show this on your example:

import numpy as np

# Broadcast version
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Inner loop unrolled version
def mb_r_unroll(forecasted_array, observed_array):
    size = len(observed_array)
    total = 0.
    for i in range(size):  # There is only one loop
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total

Small-size arrays (broadcasting is faster)

forecasted = np.random.rand(100)
observed = np.random.rand(100)

%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Medium-size arrays (equal)

forecasted = np.random.rand(1000)
observed = np.random.rand(1000)

%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Large-size arrays (broadcasting is slower)

forecasted = np.random.rand(10000)
observed = np.random.rand(10000)

%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, for small arrays the broadcast version is 20x faster than the unrolled one, for medium-sized arrays they are roughly equal, but for large arrays it is 4x slower, because the memory overhead exacts its costly price.
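A natural compromise between the two extremes, sketched here purely as an illustration (it is not part of the original answer), is to broadcast in fixed-size row chunks: each chunk still runs its loop in C, but the intermediate never grows beyond (chunk, N):

```python
import numpy as np

def mb_r_chunked(forecasted_array, observed_array, chunk=256):
    """Sum of all pairwise |y_j - x_i|, broadcasting `chunk` rows at a time."""
    total = 0.0
    for start in range(0, len(observed_array), chunk):
        block = observed_array[start:start + chunk]
        # Intermediate here is at most (chunk, N) instead of (N, N)
        total += np.abs(forecasted_array - block[:, np.newaxis]).sum()
    return total
```

The chunk size trades memory for per-chunk overhead: chunk=1 degenerates into the unrolled loop above, chunk=N into full broadcasting.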

Another approach is to use numba and its powerful @jit function decorator. In this case, only a slight modification of your initial code is necessary. Also, to make the loops parallel you should change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):

from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)

You didn't provide the mae function, but to run the code in njit mode you must also decorate the mae function, or, if it is a number, pass it as an argument to the jitted function.
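For completeness, here is a minimal sketch of what mae might look like, assuming it is the usual mean absolute error (the name and behavior are guesses, since the question never shows it); written as a plain loop, it compiles under @njit without changes:

```python
def mae(forecasted_array, observed_array):
    """Mean absolute error, loop-based so numba's nopython mode
    can compile it (add @njit on top if numba is installed)."""
    total = 0.0
    for i in range(len(forecasted_array)):
        total += abs(forecasted_array[i] - observed_array[i])
    return total / len(forecasted_array)

print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```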

The Python scientific ecosystem is huge; I will just mention some other equivalent options to speed things up: Cython, Nuitka, Pythran, bottleneck and many others. Perhaps you are interested in GPU computing, but that is really another story.

On my computer, unfortunately an old one, the timings are:

import numpy as np
import numexpr as ne

forecasted_array = np.random.rand(10000)
observed_array   = np.random.rand(10000)

Initial version

%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numexpr

%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Broadcast version

%timeit mb_r_bcast(forecasted_array, observed_array)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Inner-loop-unrolled version

%timeit mb_r_unroll(forecasted_array, observed_array)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numba njit(parallel=True) version

%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As can be seen, the njit approach is 730x faster than your initial solution, and 24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). The simple approach with the inner loop unrolled also gives you a 60x speedup over your initial version. My specs are:

Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
Python 3.6.3
numpy 1.13.3
numba 0.36.1
numexpr 2.6.4

I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested it:

arr = np.arange(1000)
ls = arr.tolist()

%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And it turns out that you are right: iterating over the list is 2-5x faster. Of course, these results must be taken with a certain amount of irony :)