为什么df.lookup比df.min慢?

问题描述：

我想通过在idxmin之后使用lookup来节省一些时间，而不是调用min和idxmin.在我看来，第一个应该更高效，因为在第二个中，值需要被搜索两次(在最小值上，另一个在最小值的索引上，即O(NxM)的2倍)，而在第二个中，首先，搜索索引(O(NxM))，然后使用索引查找值(O(N))

I wanted to cut some time by using lookup after idxmin, instead of calling min and idxmin. In my head, the first should be more efficient because in the second the values need to be searched twice (on for the minimum value, and another for the index of the minimum value - i.e. 2 times O(NxM)), whereas in the first, the indexes are searched (O(NxM)) and then the indexes are used to locate the values (O(N))

请检查这个问题，这样您就可以理解上下文以及我的推理的更多细节.

Please check this question so you understand the context and more details on my reasoning.

结果开始出乎意料，因此我进行了一些测试:

The results starting to be unexpected so I proceded with some tests:

我使用了100000行x 10列的数据框(通过添加更多行，结果会变得更糟):

I used a dataframe of 100000 rows x 10columns (results get worse by adding more rows):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(100000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
df['min_column'] = df.filter(like='option').idxmin(1)

然后我做了一些计时:

%timeit -n 100 df.filter(like='option').min(1)
# 12.2 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n 100 df.lookup(df.index, df['min_column'])
# 46.9 ms ± 526 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

请注意，即使min_columns是针对lookup预先计算的，结果也比单纯寻找最小值要差4倍.

Notice that even though the min_columns was precalculated for the lookup, the results are 4 times worse than simply looking for the minimum.

其他尺寸的比较:

RowsxCols    min        lookup
100000x10    12.2ms     46.9ms
1000000x10   162ms      682ms
10000x1000   173ms      220ms
1000x10000   295ms      7.97ms

从上表中可以看出，通过添加行(1000000x10)不会使结果得到任何改善，并且在添加更多列(10000x1000)时只是一个很小的追赶.这种追赶是有道理的，但在我看来，它应该更大一些， 索引应该比搜索快 (请参阅更新的numpy结果)，并且仅在极端情况下(几乎不切实际，例如1000x10000)我开始看到优势.

From the above table, as expected the results don't get any better by adding rows (1000000x10), and just a small catch up when adding many more columns(10000x1000). This catch up makes sense, but in my head it should be way bigger, an index should be faster than a search (see updated numpy results), and only in extreme cases (almost unpractical, e.g. 1000x10000) I am starting seeing advantages.

对此行为有解释吗?

我用numpy测试了这一点，并且得到了预期的行为:

I tested the this with numpy, and I get the expected behaviour:

vals = np.random.randint(0,10,size=(100000, 10))
%timeit -n 100 np.min(vals, axis=1)
2.83 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

idx_min = np.argmin(vals, axis=1)
%timeit -n 100 vals[np.arange(len(idx_min)), idx_min]
1.63 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

比较结果(numpy):

Comparison results (numpy):

RowsxCols    min        indexing using []
100000x10    2.83ms     1.63ms
1000000x10   24.6ms     15.4ms
100000x100   14.5ms     3.38ms
10000x1000   11.1ms   0.377ms

答

如果您查看lookup函数的源代码实现，那么效率似乎并不高.源代码可以在这里找到:

If you look at the source code implementation of lookup function, it does not look to be very efficient. The source code can be found here:

http://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L3435-L3484

尤其是在if-else条件主体中，它确实存在

Particularly, in the main if-else condition body, it does

if not self._is_mixed_type or n > thresh:
        values = self.values
        ridx = self.index.get_indexer(row_labels)
        cidx = self.columns.get_indexer(col_labels)
        if (ridx == -1).any():
            raise KeyError('One or more row labels was not found')
        if (cidx == -1).any():
            raise KeyError('One or more column labels was not found')
        flat_index = ridx * len(self.columns) + cidx
        result = values.flat[flat_index]

result = np.empty(n, dtype='O')
for i, (r, c) in enumerate(zip(row_labels, col_labels)):
        result[i] = self._get_value(r, c)

我不确定if case的详细实现，但是您可能希望在大量行和大量列的情况下尝试使用此方法，并且从查找功能中可以获得更好的结果.

I am not sure about detailed implementation of if case, but you might want to try this on a very large number of rows and very large number of column cases and you might get some better results off of lookup function.

您可能应该尝试定义自己的查找表，以便您确切地知道运行时，而不是使用此查找功能

You probably should try to define your own lookup table so you'd know exactly the runtime rather than using this lookup function

为什么df.lookup比df.min慢?

相关推荐