Conditionally removing duplicates in R

Question:

I have a dataset in which I need to conditionally remove duplicated rows based on the values in another column.

Specifically, I need to delete any row where size = 0, but only if its SampleID is duplicated.

SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)

I want to delete these rows:

SampleID    size
a           0
d           0

and keep:

SampleID   size
a          1
b          1
b          2
b          3
c          0
d          1
e          0

Note: the actual dataset is very large, so I am not looking for a way to just remove known rows by row number.

Answer:

Using the data.table framework: first convert your data frame to a data.table.

library(data.table)
setDT(data)  # converts data to a data.table by reference

Build the list of IDs whose size = 0 rows can be deleted, i.e. the IDs that have at least one non-zero row:

dropable_ids = unique(data[size != 0, SampleID])

Finally, keep the rows whose ID is not in the droppable list or whose size is non-zero:

data = data[!(SampleID %in% dropable_ids & size == 0), ]
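Putting the steps above together, here is a minimal self-contained sketch that runs on the example data from the question. The names `dt` and `result` are mine, introduced only for illustration; everything else follows the snippets above.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  SampleID = c("a", "a", "b", "b", "b", "c", "d", "d", "e"),
  size     = c(0, 1, 1, 2, 3, 0, 0, 1, 0)
)

# IDs that have at least one non-zero row; only their zero rows may be dropped
dropable_ids <- unique(dt[size != 0, SampleID])

# Drop rows that are both size 0 and belong to one of those IDs
result <- dt[!(SampleID %in% dropable_ids & size == 0)]
print(result)
```

On this data the two rows (a, 0) and (d, 0) are removed and the other seven are kept. One thing to be aware of: an ID whose rows are all zero never enters dropable_ids, so all of its rows are kept even if there are several of them.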

Please note that not(a and b) is equivalent to (not a) or (not b), but the data.table framework doesn't handle "or" well, which is why the filter is written in the not(a and b) form.
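The logical identity involved here is De Morgan's law: `!(a & b)` gives the same result as `!a | !b`. As a quick check, the sketch below builds both forms of the filter in base R on the question's data and confirms they select the same rows. The names `df`, `in_list`, `is_zero`, `keep_not_and` and `keep_or` are assumptions of mine, not part of the original answer.

```r
# Example data from the question, as a plain data.frame
SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size     <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)
df <- data.frame(SampleID, size)

# IDs with at least one non-zero row
dropable_ids <- unique(df$SampleID[df$size != 0])

in_list <- df$SampleID %in% dropable_ids  # a: ID is in the droppable list
is_zero <- df$size == 0                   # b: row has size 0

keep_not_and <- !(in_list & is_zero)      # not(a and b)
keep_or      <- !in_list | !is_zero       # (not a) or (not b)

# Both forms mark exactly the same rows to keep
stopifnot(identical(keep_not_and, keep_or))
df[keep_or, ]
```

Either form can be used as the row filter; which one reads better is mostly a matter of taste.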

Hope this helps.