Conditionally removing duplicates in R
I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0, but only if its SampleID is duplicated.
SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data <- data.frame(SampleID, size)
I want to delete these rows:
SampleID size
a 0
d 0
and keep:
SampleID size
a 1
b 1
b 2
b 3
c 0
d 1
e 0
Note: the actual dataset is very large, so I am not looking for a way to just remove a known row by row number.
Using the data.table framework: first convert your data set to a data.table:
library(data.table)
setDT(data)  # converts the data.frame to a data.table by reference
Build the list of IDs whose zero-size rows can be deleted, i.e. the IDs that have at least one non-zero size:
dropable_ids = unique(data[size != 0, SampleID])
Finally, keep the rows that are either not in the droppable list or have a non-zero size:
data = data[!(SampleID %in% dropable_ids & size == 0), ]
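Putting the steps above together on the example data from the question gives the following minimal sketch. It assumes the data.table package is installed; the name result is only illustrative and is used here so that the original data is not overwritten.

```r
library(data.table)

# Example data from the question
data <- data.table(
  SampleID = c("a", "a", "b", "b", "b", "c", "d", "d", "e"),
  size     = c(0, 1, 1, 2, 3, 0, 0, 1, 0)
)

# IDs that have at least one row with a non-zero size
dropable_ids <- unique(data[size != 0, SampleID])

# Drop zero-size rows, but only for the IDs found above
# ("result" is an illustrative name, not part of the original answer)
result <- data[!(SampleID %in% dropable_ids & size == 0)]
print(result)
```

On this input the rows a/0 and d/0 are dropped, while c/0 and e/0 stay because those IDs have no non-zero row. Note that an ID appearing several times with only zero sizes would also be kept by this filter.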
Please note that not(a and b) is equivalent to (not a) or (not b), so the filter could also be written with or. I used the negated and form because the data.table framework doesn't handle or well.
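By De Morgan's law, not(a and b) is the same as (not a) or (not b). The base R sketch below checks this on the example data by comparing the two row masks; the names a, b, keep_negated and keep_or are illustrative only.

```r
# Example data from the question
SampleID <- c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size <- c(0, 1, 1, 2, 3, 0, 0, 1, 0)

# IDs that have at least one non-zero size
dropable_ids <- unique(SampleID[size != 0])

a <- SampleID %in% dropable_ids  # row belongs to a droppable ID
b <- size == 0                   # row has size zero

keep_negated <- !(a & b)   # form used in the answer
keep_or      <- !a | !b    # equivalent "or" form

# Both masks select exactly the same rows
stopifnot(identical(keep_negated, keep_or))
```

Either mask can be used to subset the data; the choice between them is a matter of readability and, possibly, of how the filtering library optimizes the expression.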
Hope this helps.