Compute pairwise Hamming distance between all rows of two integer matrices / data frames

Question:

I have two data frames: df1 holds reference data and df2 holds new data. For each row in df2, I need to find the best (and second-best) matching row in df1 in terms of Hamming distance.

I used the e1071 package to compute the Hamming distance. For two vectors x and y it can be computed, for example, as:

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)

y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)

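library(e1071)  ## provides hamming.distance()

## 32 x length(x) raw matrices: one column of bits per integer
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)

distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))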

The resulting distance is 25. However, I need to do this for all rows of df1 and df2. The trivial method takes a doubly nested loop and looks terribly slow.

Any ideas on how to do this more efficiently? In the end I need to append to df2:

a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the second-lowest distance;
a column with the second-lowest distance.

Thanks.

Answer

Fast Hamming distance between two integer vectors of equal length

As I said in my comment, we can do:

hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))

to compute the Hamming distance between two integer vectors of equal length, x and y. This uses only base R, yet it is more efficient than e1071::hamming.distance because it is vectorized.

For the example x and y in your post, this gives 25. (My other answer shows what to do if we want pairwise Hamming distances.)
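To see what hmd0 counts, here is a small check on hand-picked values (not the ones from the question). The reference function ref_hmd is made up for this sketch: it XORs the integers with bitwXor() and counts the set bits, which should agree with hmd0 for non-NA integer input.

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

## hypothetical reference: XOR the values, then count the set bits
ref_hmd <- function(x, y) sum(as.integer(intToBits(bitwXor(x, y))))

x <- c(1L, 7L, 255L)
y <- c(3L, 0L, 255L)

## 1 vs 3 differ in 1 bit, 7 vs 0 in 3 bits, 255 vs 255 in none
hmd0(x, y)
# [1] 4
ref_hmd(x, y)
# [1] 4
```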

Fast Hamming distance between a matrix and a vector

If we want the Hamming distance between a single y and multiple xs, i.e., between a vector and a matrix, we can use the following function.

hmd <- function(x,y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return: same length, a single distance
    return (sum(as.logical(xor(rawx,rawy))))
  } else if (nx < ny) {
    ## pivoting: make sure rawx is the longer one
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
  }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny  ## number of cycles, i.e. columns of the matrix
    ## rawy is recycled against each block of ny bits; sum mismatches per block
    return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
  }
}

Note:

hmd performs the computation column-wise, which is meant to be CPU-cache friendly. So for row-wise computation, transpose the matrix first;
there is no explicit loop here; instead, we use tapply().
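The column-wise layout can also be illustrated without tapply(). The sketch below is not the hmd function itself but a variant written for illustration (the name hmd_cols is made up): it unpacks every column of the transposed matrix into a column of bits and counts mismatches against the bits of y with colSums(). It assumes integer input with no NA values.

```r
hmd_cols <- function(xt, y) {
  stopifnot(is.integer(xt), is.integer(y), nrow(xt) == length(y))
  ## each column of bx holds the 32 * nrow(xt) bits of one observation
  bx <- matrix(as.integer(intToBits(xt)), nrow = 32L * nrow(xt))
  by <- as.integer(intToBits(y))
  ## by is recycled down every column of bx; count mismatches per column
  colSums(bx != by)
}

xt <- cbind(c(1L, 7L), c(3L, 0L), c(0L, 0L))  ## three observations, one per column
y  <- c(3L, 0L)
hmd_cols(xt, y)
# [1] 4 0 2
```

The first column differs from y in 1 + 3 bits, the second is identical, and the third differs in the two set bits of 3.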

Fast Hamming distance computation between two matrices / data frames

This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between every row of df1 and each row of df2. The argument p is an integer giving how many results to retain: p = 3 keeps the 3 smallest distances together with their row ids in df1.

foo <- function(df1, df2, p) {
  ## check p: at most nrow(df1) candidate rows exist for each row of df2
  if (p > nrow(df1)) p <- nrow(df1)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  k <- 1:p
  for (i in 1:n) {
    distance <- hmd(xt, yt[,i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    k <- k + p
    }
  ## recode "id" and "d" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  list(id = id, d = d)
  }

Note:

transposition is done at the beginning, for the reason given earlier;
a for loop is used here. This is still efficient because each iteration does a considerable amount of computation. It is also more elegant than the *apply family, since we need multiple outputs (the row ids id and the distances d).
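For moderately sized inputs, the loop can be traded for matrix algebra. This is a different technique from foo, sketched under the assumption that both tables hold non-NA integers and that a 0/1 matrix with 32 * ncol columns per table fits in memory. For 0/1 vectors a and b, the Hamming distance equals sum(a) + sum(b) - 2 * sum(a * b), so all pairs can come out of one tcrossprod(). The helper names to_bits and pairwise_hmd are made up for this example.

```r
## unpack an integer matrix / data frame into a 0/1 matrix, one row per observation
to_bits <- function(m) {
  m <- as.matrix(m)
  storage.mode(m) <- "integer"
  ## intToBits(t(m)) lists the bits row by row of m
  t(matrix(as.integer(intToBits(t(m))), nrow = 32L * ncol(m)))
}

## full distance matrix: rows index df2, columns index df1
pairwise_hmd <- function(df1, df2) {
  a <- to_bits(df1)
  b <- to_bits(df2)
  outer(rowSums(b), rowSums(a), "+") - 2 * tcrossprod(b, a)
}

## hand-picked toy data (not the data from the experiment)
df1 <- data.frame(V1 = c(1L, 3L, 0L), V2 = c(7L, 0L, 0L))
df2 <- data.frame(V1 = c(3L, 1L), V2 = c(0L, 7L))
D <- pairwise_hmd(df1, df2)
D
#      [,1] [,2] [,3]
# [1,]    4    0    2
# [2,]    0    4    4

## row ids in df1 of the best and second-best match for each row of df2
best2 <- t(apply(D, 1, order))[, 1:2, drop = FALSE]
```

The trade-off is memory: the bit matrices are 32 times wider than the input, whereas foo only ever holds the distances for one row of df2 at a time.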

Experiment

This part uses a small dataset to test and demonstrate our functions.

Some toy data:

set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols

Test hmd first (needs transposition):

hmd(t(as.matrix(df1)), df2[1, ])  ## df1 & first row of df2
# [1] 2 4 6 2 4

Test foo:

foo(df1, df2, p = 2)

# $id
#   min.1 min.2
# 1     1     4
# 2     2     3
# 3     5     2

# $d
#   mindist.1 mindist.2
# 1         2         2
# 2         1         3
# 3         1         3

If you want to append these columns to df2, cbind() the two returned data frames onto it.
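A minimal sketch of that last step, assuming the list returned by foo has the shape printed in the test. The df2 values and the res list here are typed in by hand for illustration rather than computed, so only the binding step is the point.

```r
## hypothetical new data: 3 rows, 2 integer columns
df2 <- data.frame(V1 = c(4L, 2L, 6L), V2 = c(1L, 5L, 3L))

## stand-in for the list returned by foo(df1, df2, p = 2), typed by hand
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

## bind row ids and distances onto df2, one row per row of df2
df2_out <- cbind(df2, res$id, res$d)

## optional: interleave so each id sits next to its distance
df2_out <- df2_out[, c("V1", "V2", "min.1", "mindist.1", "min.2", "mindist.2")]
```

The interleaved order matches the four columns requested in the question: best row id, best distance, second-best row id, second-best distance.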
