用Python中的二维数组总结ndarray

问题描述:

我想使用包含在 2d 数组 idx 中的索引来总结一个 3d 数组 dat.

I want to summarize a 3d array dat using indices contained in a 2d array idx.

考虑下面的例子.对于沿 dat[:, :, i] 的每个边距,我想根据某个索引 idx 计算中位数.所需的输出 (out) 是一个二维数组,其行记录索引,列记录边距.以下代码有效,但效率不高.有什么建议吗?

Consider the example below. For each margin along dat[:, :, i], I want to compute the median according to some index idx. The desired output (out) is a 2d array, whose rows record the index and columns record the margin. The following code works but is not very efficient. Any suggestions?

import numpy as np
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])

out = np.empty((3, 3))
for i in np.unique(idx):
    out[i,] = np.median(dat[idx==i], axis = 0)
print(out)

输出:

[[ 1.5  2.5  3.5]
 [ 6.   7.   8. ]
 [ 9.  10.  11. ]]

为了更好地可视化问题,我将数组的 2x2 维称为行和列,将 3 维称为深度.我将沿第 3 维的向量称为像素"(像素的长度为 3),沿前两个维度的平面称为通道".

To visualize the problem better, I will refer to the 2x2 dimensions of the array as the rows and columns, and the 3 dimension as depth. I will refer to vectors along the 3rd dimension as "pixels" (pixels have length 3), and planes along the first two dimensions as "channels".

您的循环正在累积由掩码 idx == i 选择的一组像素,并取该组中每个通道的中值.结果是一个 Nx3 数组,其中 N 是您拥有的不同事件的数量.

Your loop is accumulating a set of pixels selected by the mask idx == i, and taking the median of each channel within that set. The result is an Nx3 array, where N is the number of distinct incides that you have.

有一天,广义ufuncs将在 numpy 中无处不在,并且 np.median 就是这样一个函数.那天,你将可以使用reduceat magic1 做类似的事情

One day, generalized ufuncs will be ubiquitous in numpy, and np.median will be such a function. On that day, you will be able to use reduceat magic1 to do something like

unq, ind = np.unique(idx, return_inverse=True)
np.median.reduceat(dat.reshape(-1, dat.shape[-1]), np.r_[0, np.where(np.diff(unq[ind]))[0]+1])

1请参阅将操作应用于 numpy 数组的不均匀分割部分以了解更多信息有关特定魔法类型的信息.

1 See Applying operation to unevenly split portions of numpy array for more info on the specific type of magic.

由于目前无法实现,您可以使用 scipy.ndimage.median 代替.此版本允许您计算数组中一组标记区域的中位数,这正是您使用 idx 所拥有的.此方法假定您的索引数组包含 N 个密集打包的值,所有这些值都在 range(N) 内.否则整形操作将无法正常进行.

Since this is not currently possible, you can use scipy.ndimage.median instead. This version allows you to compute medians over a set of labeled areas in an array, which is exactly what you have with idx. This method assumes that your index array contains N densely packed values, all of which are in range(N). Otherwise the reshaping operations will not work properly.

如果不是这种情况,请从转换 idx 开始:

If that is not the case, start by transforming idx:

_, ind = np.unique(idx, return_inverse=True)
idx = ind.reshape(idx.shape)

idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)

由于您实际上是为每个区域和通道计算单独的中值,因此您需要为每个通道设置一组标签.充实 idx 以便为每个通道设置一组不同的索引:

Since you are actually computing a separate median for each region and channel, you will need to have a set of labels for each channel. Flesh out idx to have a distinct set of indices for each channel:

chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)

现在 index 在每个通道中定义了一组相同的区域,您可以在 scipy.ndimage.median 中使用:

Now index has an identical set of regions defined in each channel, which you can use in scipy.ndimage.median:

out = scipy.ndimage.median(dat, index, index=range(offset * chan)).reshape(chan, offset).T

输入标签必须从零到 offset * chan 密集打包,index=range(offset * chan) 才能正常工作,并且 reshape 操作以获得正确数量的元素.最后的转置只是标签排列方式的产物.

The input labels must be densely packed from zero to offset * chan for index=range(offset * chan) to work properly, and the reshape operation to have the right number of elements. The final transpose is just an artifact of how the labels are arranged.

这是完整的产品,以及结果的 IDEOne 演示:

Here is the complete product, along with an IDEOne demo of the result:

import numpy as np
from scipy.ndimage import median

dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])

def summarize(dat, idx):
    idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
    chan = dat.shape[-1]
    offset = idx.max() + 1
    index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
    return median(dat, index, index=range(offset * chan)).reshape(chan, offset).T

print(summarize(dat, idx))