Tag all duplicate rows in R, as in Stata

Question:

Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of a dataset that are duplicates in terms of a given set of variables:

clear *
set obs 16
g f1 = _n
expand 104
bys f1: g f2 = _n
expand 2
bys f1 f2: g f3 = _n
expand 41
bys f1 f2 f3: g f4 = _n
des  // describe the dataset in memory

preserve
sample 10  // draw a 10% random sample
tempfile sampledata
save `sampledata', replace
restore

// append the duplicate rows to the data
append using `sampledata'
sort f1-f4

duplicates tag f1-f4, generate(dupvar)
browse if dupvar == 1  // check that all duplicate rows have been tagged

Edit

Here is what Stata produces (added on @Arun's request):

f1   f2   f3   f4   dupvar  
 1    1    1    1        0  
 1    1    1    2        0  
 1    1    1    3        1  
 1    1    1    3        1  
 1    1    1    4        0  
 1    1    1    5        0  
 1    1    1    6        0  
 1    1    1    7        0  
 1    1    1    8        1  
 1    1    1    8        1

Note that for (f1, f2, f3, f4) = (1, 1, 1, 3) there are two rows, and both of those are marked dupvar = 1. Similarly, for the two rows that are duplicates for (f1, f2, f3, f4) =(1, 1, 1, 8).

The base function duplicated tags only the second duplicate onwards. So, I wrote a function to replicate the Stata functionality in R, using ddply.

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
            f2 = factor(1:41),
            f3 = factor(1:2),
            f4 = factor(1:104))

# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])

# dummy data 
dfDup$data = rnorm(nrow(dfDup))

# function: use ddply (from the plyr package) to tag all duplicate rows in the data
library(plyr)
fnDupTag = function(dfX, indexVars) {
  dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    if(nrow(x) > 1) x$dup = 1 else x$dup = 0
    return(x)
  })
  return(dfDupTag)
}

# test the function
indexVars = paste0('f', 1:4)  # paste0() has no sep argument
dfTemp = fnDupTag(dfDup, indexVars)

But as in the linked question, performance is a huge issue. Another possible solution is

dfDup$dup = duplicated(dfDup[, indexVars]) | 
  duplicated(dfDup[, indexVars], fromLast = TRUE) 
# sort by all index variables; eval(parse(text = indexVars)) would only use the last one
dfDupSorted = dfDup[do.call(order, unname(dfDup[indexVars])), ]
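As a quick sanity check of the duplicated-based expression above, here is a minimal sketch on a toy data frame. The data and the names toy, keys and firstPass are made up for illustration and are not part of the original dataset. It shows what a single duplicated pass tags, and what the union of the forward and fromLast passes tags when a key occurs three times.

```r
# Toy data (hypothetical): key (1, 1) appears three times, key (2, 1) twice
toy <- data.frame(f1 = c(1, 1, 1, 2, 2, 3),
                  f2 = c(1, 1, 1, 1, 1, 1))
keys <- c("f1", "f2")

# duplicated() alone leaves the first copy of each group untagged
firstPass <- duplicated(toy[, keys])

# OR-ing with the fromLast = TRUE pass tags every copy, including middle ones
toy$dup <- firstPass | duplicated(toy[, keys], fromLast = TRUE)
print(toy)
```

On this toy input all three copies of (1, 1) and both copies of (2, 1) end up with dup = TRUE, while the unique row (3, 1) stays FALSE.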

I have a few questions:

1. Is it possible to make the ddply version faster?
2. Is the second version using duplicated correct? For more than two copies of the duplicated rows?
3. How would I do this using data.table? Would that be faster?

Answer

I'll answer your third question here. (I think the first question is more or less answered in your other post.)

## Assuming DT is your data.table
DT[, dupvar := 1L*(.N > 1L), by=c(indexVars)]

:= adds a new column dupvar by reference (and is therefore very fast because no copies are made). .N is a special variable within data.table, that provides the number of observations that belong to each group (here, for every f1,f2,f3,f4).

Take your time and go through ?data.table (and run the examples there) to understand the usage. It'll save you a lot of time later on.

So, basically, we group by indexVars and check whether .N > 1L, which is TRUE for every row of a group that has more than one observation. We multiply by 1L to return an integer instead of a logical value.
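To make the grouping logic concrete, here is a minimal sketch on a toy data.table (made-up data, with DT and indexVars redefined for illustration only). Besides the 0/1 flag from the answer, it adds a hypothetical column ndup holding the group size minus one. As far as I understand the Stata documentation, duplicates tag stores that count of surplus copies rather than a 0/1 flag, so the two only differ when a key occurs more than twice.

```r
library(data.table)

# Toy data (hypothetical): key (1, 1) appears three times, key (2, 1) twice
DT <- data.table(f1 = c(1, 1, 1, 2, 2, 3),
                 f2 = c(1, 1, 1, 1, 1, 1))
indexVars <- c("f1", "f2")

# 0/1 flag, as in the answer: 1 for every row whose key occurs more than once
DT[, dupvar := 1L * (.N > 1L), by = indexVars]

# Count of surplus copies per key (group size minus one); ndup is an assumed name
DT[, ndup := .N - 1L, by = indexVars]
print(DT)
```

Both columns are added by reference, so DT itself is modified and nothing needs to be reassigned.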

If you require, you can also sort it by the by-columns using setkey.
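A minimal sketch of that, again on made-up toy data: setkeyv is the variant of setkey that takes the column names as a character vector, which is convenient when the by-columns are already stored in a variable such as indexVars.

```r
library(data.table)

# Toy data (hypothetical), deliberately unsorted
DT <- data.table(f1 = c(2, 1, 2, 1), f2 = c(1, 2, 1, 1))
indexVars <- c("f1", "f2")

# setkeyv() takes the key columns as a character vector and sorts DT
# by reference, in ascending order
setkeyv(DT, indexVars)
print(DT)
```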

From the next version on (currently implemented in v1.9.3, the development version), there is also an exported function setorder that sorts the data.table by reference without setting a key. It can sort in ascending or descending order. (Note that setkey always sorts in ascending order only.)

That is, in the next version you can do:

setorder(DT, f1, f2, f3, f4)
## or equivalently
setorderv(DT, c("f1", "f2", "f3", "f4"))

In addition, the usage DT[order(...)] is also optimised internally to use data.table's fast ordering. That is, DT[order(...)] is detected internally and changed to DT[forder(DT, ...)], which is much faster than base's order. So, if you don't want to change it by reference, and want to assign the sorted data.table to another variable, you can just do:

DT_sorted <- DT[order(f1, f2, f3, f4)] ## internally optimised for speed
                                       ## but still copies!

HTH
