根据时间差删除重复项并在重复项附近折叠

问题描述：

我有一个如下所示的数据帧

I have a data-frame like as shown below

DF = structure(list(Age_visit = c(48, 48, 48, 49, 49, 77), Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40", 
                                                                     "8/6/2169 9:41", "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30", 
                                                                     "19/6/2237 12:15"), Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", 
                                                                                                            "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", 
                                                                                                            "NA-NA-NA NA:NA:NA"), person_id = c("21",
                                                                                                                                                  "21", 
                                                                                                                                                  "21", 
                                                                                                                                                  "21", 
                                                                                                                                                  "21", 
                                                                                                                                                  "21", 
                                                                                                                                                  "31"
                                                                                                            ), enc_id = c("A21BC","A21BC", 
                                                                                                                                       "A22BC", 
                                                                                                                                       "A23BC", 
                                                                                                                                       "A24BC", 
                                                                                                                                       "A25BC", 
                                                                                                                                       "A31BC"
                                                                                                            )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
                                                                                                            ))

数据帧

  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr> 
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
2        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
3        48 8/6/2169 9:41   NA-NA-NA NA:NA:NA  21        A22BC 
4        49 8/6/2169 9:42   NA-NA-NA NA:NA:NA  21        A23BC 
5        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA  21        A24BC 
6        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA  31        A31BC

我有两个要执行的规则/步骤。

I have two rules/steps to be implemented.

规则1（第1步）

首先，根据3列删除重复项，例如 Date_1 ， person_id ， enc_id

First, remove duplicates based on 3 columns like Date_1, person_id, enc_id

DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]  # this will remove 1st row as it's a plain straight forward duplicate

规则2（步骤2）

从步骤1的输出中，折叠到重复的记录附近（请注意， DATE_1 和 enc_id 列），如果这些记录之间的时间差小于一小时，则基于一个记录。

From the output of step-1, collapse near duplicate records (notice tiny differences in DATE_1 and enc_id columns) based on time into one single record if the time difference between these records is less than hour.

例如，如果您看到 person_id = 21 ，则可以看到在第一步之后，他所有的 Date_1 时间值是同一天，但相差只有一分钟（9:40-> 9:41-> 9:42）。由于不到一个小时（60分钟），我们将它们全部折叠为一个记录，并仅保留第一条记录（9:40）。 我们对数据框中的每个主题进行此项检查

For ex, if you see person_id = 21, you can see that after step-1, all his Date_1 time values are on the same day but the difference is only one minute (9:40 --> 9:41 --> 9:42). Since it's less than an hour (60 mins), we collapse all of them into one single record and retain only the first record (which is for 9:40). We do this check for each subject in the dataframe

我已根据几列删除了重复项，如下所示

I have removed the duplicates based on few columns like as shown below

DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]

我希望我的输出如下所示

I expect my output to be like as shown below

  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr> 
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA  21        A21BC 
4        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA  21        A24BC 
5        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA  31        A31BC

答

使用 data.table 的滚动联接选项：

DT[, c("rn", "hrago") := .(.I, Date_1 - 60 * 60)]
DT[DT[DT, on=.(person_id, Date_1=hrago), roll=-Inf, unique(rn)]]

输出：

   Age_visit              Date_1 person_id enc_id rn               hrago
1:        48 2169-06-08 09:40:00        21  A21BC  1 2169-06-08 08:40:00
2:        49 2169-07-24 08:31:00        21  A24BC  5 2169-07-24 07:31:00
3:        77 2169-09-12 10:30:00        31  A31BC  6 2169-09-12 09:30:00

数据：

library(data.table)
DT <- fread("Age_visit Date_1    person_id enc_id
48 8/6/2169-9:40    21        A21BC 
48 8/6/2169-9:40    21        A21BC 
48 8/6/2169-9:41    21        A22BC 
49 8/6/2169-9:42    21        A23BC 
49 24/7/2169-8:31   21        A24BC 
77 12/9/2169-10:30  31        A31BC") 
DT[, Date_1 := as.POSIXct(Date_1, format="%d/%m/%Y-%H:%M")]

说明：

Explanation:

1） DT [DT，on =。（person_id， Date_1 = hrago），是使用两个表中的 person_id 和 Date_1 的自联接

1) DT[DT, on=.(person_id, Date_1=hrago), is a self-join using person_id from both tables and Date_1 from right table and hrago from left table.

2） roll = -Inf 如果未找到与左表中的观测值相同的匹配项，则向右滚动右表中的观测值

2) roll=-Inf rolls the observation in the right table backwards if an identical match for the observation in the left table is not found

3） unique（rn）从右表中获取唯一行，然后为这些行过滤表。

3) unique(rn) takes the unique rows from the right table and then filter the table for these rows.

根据时间差删除重复项并在重复项附近折叠

相关推荐