Computing pairwise Hamming distance between all rows of two integer matrices / data frames
I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of Hamming distance.
I used the e1071 package to compute the Hamming distance. The Hamming distance between two vectors x and y can be computed, for example, as:
library(e1071)  # provides hamming.distance()

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)  # 32 x length(x) matrix of bits
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a doubly nested loop and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
- a column with the row id from df1 that gives the lowest distance;
- a column with the lowest distance;
- a column with the row id from df1 that gives the second lowest distance;
- a column with the second lowest distance.
Thanks.
Fast computation of Hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute the Hamming distance between two integer vectors of equal length, x and y. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer shows what to do if we want pairwise Hamming distances.)
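As a quick sanity check, the sketch below applies hmd0 to two short vectors whose differing bits are easy to count by hand. The inputs a and b are made up for illustration and do not come from the question.

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# Illustrative inputs (not from the question):
#   1 vs 1 -> 0 differing bits
#   2 vs 3 -> 1 differing bit  (010 vs 011)
#   3 vs 7 -> 1 differing bit  (011 vs 111)
a <- c(1L, 2L, 3L)
b <- c(1L, 3L, 7L)
hmd0(a, b)
# [1] 2
```

The result is the total number of differing bits summed over all element pairs, which is the same quantity the e1071-based snippet in the question accumulates.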
Fast Hamming distance between a matrix and a vector
If we want to compute the Hamming distance between a single y and multiple xs, i.e., the Hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return
    return (sum(as.logical(xor(rawx,rawy))))
  } else if (nx < ny) {
    ## pivoting
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
  }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny  ## number of cycles
    return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
  }
}
Note:
- hmd performs the computation column-wise, which is meant to be CPU cache friendly. If we want a row-wise computation, we should transpose the matrix first;
- there is no explicit loop here; instead, we use tapply().
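To make the column-wise convention concrete, here is a minimal sketch of a naive reference computation: it loops over the columns of a small matrix and applies the two-vector distance to each one. The matrix X and the vector y are made-up illustration values; hmd(X, y) is expected to agree with this per-column result.

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# Illustrative inputs: each COLUMN of X is one observation
X <- matrix(1:6, nrow = 2)   # columns: (1,2), (3,4), (5,6)
y <- c(1L, 2L)

# Naive reference: one two-vector distance per column of X
naive <- vapply(seq_len(ncol(X)), function(j) hmd0(X[, j], y), integer(1))
naive
# [1] 0 3 2
```

A loop like this is handy for cross-checking a vectorized version on small inputs, even though it is not what you would run on large data.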
Fast Hamming distance computation between two matrices / data frames
This is what you want. The following function foo takes two data frames or matrices, df1 and df2, and computes the distance between every row of df1 and each row of df2. The argument p is an integer giving how many results you want to retain: p = 3 will keep the 3 smallest distances together with their row ids in df1.
foo <- function(df1, df2, p) {
  ## check p: there are at most nrow(df1) candidate rows to rank
  if (p > nrow(df1)) p <- nrow(df1)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  k <- 1:p
  for (i in 1:n) {
    distance <- hmd(xt, yt[,i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    k <- k + p
  }
  ## recode "id" and "d" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  list(id = id, d = d)
}
Note:
- the transposition is done at the beginning, for the reason given before;
- a for loop is used here. It is actually efficient, because a considerable amount of computation is done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (row ids id and distances d).
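The selection step inside the loop can be looked at in isolation. The sketch below uses a made-up distance vector to show how order() picks the p closest rows and their distances; ties keep their original order, so the earlier row comes first.

```r
# Illustrative distances from one new row to five reference rows
distance <- c(2, 4, 6, 2, 4)
p <- 2

minp <- order(distance)[1:p]   # row ids of the p smallest distances
minp
# [1] 1 4
distance[minp]                 # the corresponding distances
# [1] 2 2
```

Sorting all distances costs more than strictly needed when p is small, but it keeps the code short and is unlikely to dominate the bit-level work done in each iteration.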
Experiment
This part uses a small dataset to test / demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
#   min.1 min.2
# 1     1     4
# 2     2     3
# 3     5     2
#
# $d
#   mindist.1 mindist.2
# 1         2         2
# 2         1         3
# 3         1         3
If you want to append some columns to df2, you know what to do, right?
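For completeness, one possible way to do the append is sketched below. To keep the snippet self-contained, res is a hand-built stand-in with the same shape as the list returned by foo (its values are copied from the output shown above), and df2 holds made-up values; in practice res would be foo(df1, df2, p = 2).

```r
# Made-up new data (3 rows), standing in for the real df2
df2 <- data.frame(V1 = c(10L, 20L, 30L), V2 = c(40L, 50L, 60L))

# Stand-in for foo(df1, df2, p = 2): same structure as foo's return value
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

# Append ids and distances, then interleave them as (id, distance) pairs
out <- cbind(df2, res$id, res$d)
out <- out[, c("V1", "V2", "min.1", "mindist.1", "min.2", "mindist.2")]
names(out)
# [1] "V1"        "V2"        "min.1"     "mindist.1" "min.2"     "mindist.2"
```

The column reordering is optional; it only places each row id next to its distance, matching the order of the four columns asked for in the question.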