在 x 和 y 坐标的 numpy 数组中查找最近点的索引

问题描述:

我有两个二维 numpy 数组:x_array 包含 x 方向的位置信息,y_array 包含 y 方向的位置.

I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.

然后我有一长串 x,y 点.

I then have a long list of x,y points.

对于列表中的每个点,我需要找到最接近该点的位置(在数组中指定)的数组索引.

For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.

基于这个问题,我天真地生成了一些有效的代码:在numpy数组中查找最近的值

I have naively produced some code which works, based on this question: Find nearest value in numpy array

import time
import numpy

def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array-y_point)**2 + (x_array-x_point)**2
    idy,idx = numpy.where(distance==distance.min())
    return idy[0],idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in xrange(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array,x_array,points[0,i],points[1,i]))
    return store


# Create some dummy data
y_array = numpy.random.random(10000).reshape(100,100)
x_array = numpy.random.random(10000).reshape(100,100)

points = numpy.random.random(10000).reshape(2,5000)

# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print 'Completed in: ',end-start

我正在一个大型数据集上执行此操作,并且非常想加快速度.谁能优化一下?

I'm doing this over a large dataset and would really like to speed it up a bit. Can anyone optimize this?

谢谢.

更新:解决方案遵循@silvado 和@justin(下)的建议

UPDATE: SOLUTION following suggestions by @silvado and @justin (below)

# Shoe-horn existing data for entry into KDTree routines
combined_x_y_arrays = numpy.dstack([y_array.ravel(),x_array.ravel()])[0]
points_list = list(points.transpose())


def do_kdtree(combined_x_y_arrays,points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

start = time.time()
results2 = do_kdtree(combined_x_y_arrays,points_list)
end = time.time()
print 'Completed in: ',end-start

上面的这段代码将我的代码(在 100x100 矩阵中搜索 5000 个点)加速了 100 倍.有趣的是,使用 scipy.spatial.KDTree (而不是 scipy.spatial.cKDTree)与我天真的解决方案相当的时间,所以绝对值得使用 cKDTree 版本......

This code above sped up my code (searching for 5000 points in 100x100 matrices) by 100 times. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave comparable timings to my naive solution, so it is definitely worth using the cKDTree version...

scipy.spatial 也有 kd 树实现:scipy.spatial.KDTree.

scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.

一般的做法是先利用点数据建立k-d树.其计算复杂度约为 N log N,其中 N 是数据点的数量.然后可以用 log N 复杂度完成范围查询和最近邻搜索.这比简单地循环遍历所有点(复杂度 N)要高效得多.

The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).

因此,如果您有重复的范围或最近邻查询,强烈建议使用 k-d 树.

Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.