根据时间差删除重复项并在重复项附近折叠
我有一个如下所示的数据帧
I have a data-frame like as shown below
DF = structure(list(Age_visit = c(48, 48, 48, 49, 49, 77), Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40",
"8/6/2169 9:41", "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30",
"19/6/2237 12:15"), Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
"NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
"NA-NA-NA NA:NA:NA"), person_id = c("21",
"21",
"21",
"21",
"21",
"21",
"31"
), enc_id = c("A21BC","A21BC",
"A22BC",
"A23BC",
"A24BC",
"A25BC",
"A31BC"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
数据帧
Age_visit Date_1 Date_2 person_id enc_id
<dbl> <chr> <chr> <chr> <chr>
1 48 8/6/2169 9:40 NA-NA-NA NA:NA:NA 21 A21BC
2 48 8/6/2169 9:40 NA-NA-NA NA:NA:NA 21 A21BC
3 48 8/6/2169 9:41 NA-NA-NA NA:NA:NA 21 A22BC
4 49 8/6/2169 9:42 NA-NA-NA NA:NA:NA 21 A23BC
5 49 24/7/2169 8:31 NA-NA-NA NA:NA:NA 21 A24BC
6 77 12/9/2169 10:30 NA-NA-NA NA:NA:NA 31 A31BC
我有两个要执行的规则/步骤。
I have two rules/steps to be implemented.
规则1(第1步)
首先,根据3列删除重复项,例如 Date_1 , person_id
, enc_id
First, remove duplicates based on 3 columns like Date_1
, person_id
, enc_id
DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),] # this will remove 1st row as it's a plain straight forward duplicate
规则2(步骤2)
从步骤1的输出中,折叠到重复的记录附近(请注意, DATE_1
和 enc_id
列),如果这些记录之间的时间差小于一小时,则基于一个记录。
From the output of step-1, collapse near duplicate records (notice tiny differences in DATE_1
and enc_id
columns) based on time into one single record if the time difference between these records is less than hour.
例如,如果您看到 person_id = 21
,则可以看到在第一步之后,他所有的 Date_1 时间值是同一天,但相差只有一分钟(9:40-> 9:41-> 9:42)。由于不到一个小时(60分钟),我们将它们全部折叠为一个记录,并仅保留第一条记录(9:40)。 我们对数据框中的每个主题进行此项检查
For ex, if you see person_id = 21
, you can see that after step-1, all his Date_1
time values are on the same day but the difference is only one minute (9:40 --> 9:41 --> 9:42). Since it's less than an hour (60 mins), we collapse all of them into one single record and retain only the first record (which is for 9:40). We do this check for each subject in the dataframe
我已根据几列删除了重复项,如下所示
I have removed the duplicates based on few columns like as shown below
DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]
我希望我的输出如下所示
I expect my output to be like as shown below
Age_visit Date_1 Date_2 person_id enc_id
<dbl> <chr> <chr> <chr> <chr>
1 48 8/6/2169 9:40 NA-NA-NA NA:NA:NA 21 A21BC
4 49 24/7/2169 8:31 NA-NA-NA NA:NA:NA 21 A24BC
5 77 12/9/2169 10:30 NA-NA-NA NA:NA:NA 31 A31BC
使用 data.table
的滚动联接选项:
DT[, c("rn", "hrago") := .(.I, Date_1 - 60 * 60)]
DT[DT[DT, on=.(person_id, Date_1=hrago), roll=-Inf, unique(rn)]]
输出:
Age_visit Date_1 person_id enc_id rn hrago
1: 48 2169-06-08 09:40:00 21 A21BC 1 2169-06-08 08:40:00
2: 49 2169-07-24 08:31:00 21 A24BC 5 2169-07-24 07:31:00
3: 77 2169-09-12 10:30:00 31 A31BC 6 2169-09-12 09:30:00
数据:
library(data.table)
DT <- fread("Age_visit Date_1 person_id enc_id
48 8/6/2169-9:40 21 A21BC
48 8/6/2169-9:40 21 A21BC
48 8/6/2169-9:41 21 A22BC
49 8/6/2169-9:42 21 A23BC
49 24/7/2169-8:31 21 A24BC
77 12/9/2169-10:30 31 A31BC")
DT[, Date_1 := as.POSIXct(Date_1, format="%d/%m/%Y-%H:%M")]
说明:
Explanation:
1) DT [DT,on =。(person_id, Date_1 = hrago),
是使用两个表中的 person_id
和 Date_1
的自联接
1) DT[DT, on=.(person_id, Date_1=hrago),
is a self-join using person_id
from both tables and Date_1
from right table and hrago
from left table.
2) roll = -Inf
如果未找到与左表中的观测值相同的匹配项,则向右滚动右表中的观测值
2) roll=-Inf
rolls the observation in the right table backwards if an identical match for the observation in the left table is not found
3) unique(rn)
从右表中获取唯一行,然后为这些行过滤表。
3) unique(rn)
takes the unique rows from the right table and then filter the table for these rows.