Mark the start and end of groups

Problem description:

Consider a data.table of the following form:

     seller    buyer      month  
1: 50536344 61961225 1993-01-01  
2: 50536344 61961225 1993-02-01 
3: 50536344 61961225 1993-04-01 
4: 50536344 61961225 1993-05-01 
5: 50536344 61961225 1993-06-01

where I have (buyer, seller) pairs over time. I want to mark the start and end of each spell for every pair. For example, we see that the pair was active from January to February, absent in March, and active again from April to June. Hence, the following would be the expected output:

     seller    buyer      month  start    end
1: 50536344 61961225 1993-01-01   True  False
2: 50536344 61961225 1993-02-01  False   True
3: 50536344 61961225 1993-04-01   True  False
4: 50536344 61961225 1993-05-01  False  False
5: 50536344 61961225 1993-06-01  False   True


Assuming that month is of class Date (or similarly POSIXt, IDateTime, or another class with a diff method), you can use the diff function to do this. For classes other than Date, check the units that diff returns, since the threshold of 31 used here is in days.

# sort data.table
setkeyv(dt, c("seller", "buyer", "month"))
# define start
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
# define end
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]
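
For a reproducible check, the sketch below rebuilds the sample data from the question and applies the two steps above. The object name dt follows the answer; storing the ids as numeric and month as Date is an assumption about the original data.

```r
library(data.table)

# Sample data from the question; the column types are an assumption
dt <- data.table(
  seller = 50536344,
  buyer  = 61961225,
  month  = as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                     "1993-05-01", "1993-06-01"))
)

# Sort by pair and month so that diff() compares consecutive observations
setkeyv(dt, c("seller", "buyer", "month"))

# A row starts a spell if it is the first of its pair
# or follows a gap of more than 31 days
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]

# A row ends a spell if it is the last of its pair
# or precedes a gap of more than 31 days
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]

print(dt)
```

On this data, start should be TRUE for January and April, and end should be TRUE for February and June, which matches the expected output in the question.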

EDIT: Per the suggestion of @David Arenburg, you can of course define start and end in one go. This should be slightly faster, although I also find it a bit harder to read.

dt[, ":=" (start = c(TRUE, diff(month) > 31),
           end = c(diff(month) > 31, TRUE)), 
   by = list(seller, buyer)]

EDIT2: Some more explanation of what is happening: The first observation for each pair of seller and buyer will always be the start of a business relationship, so start = c(TRUE, ...). After that, a further observation will be the start of a business relationship if and only if the time since the previous observation is larger than a month (31 days), so diff(month) > 31. Putting the two together, you get c(TRUE, diff(month) > 31). A similar logic applies for the end, where you have to compare to the next observation instead of the previous one.
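
To make the logic concrete, here is a small base R walk-through for the single pair in the question. The names gap and is_break are only illustrative and do not appear in the answer's code.

```r
# Months observed for the single pair in the question
month <- as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                   "1993-05-01", "1993-06-01"))

# Days between consecutive observations (one element fewer than month)
gap <- as.numeric(diff(month))   # 31 59 30 31

# TRUE where two consecutive observations are more than 31 days apart
is_break <- gap > 31

# The first row always starts a spell; later rows start one right after a break
start <- c(TRUE, is_break)

# The last row always ends a spell; earlier rows end one right before a break
end <- c(is_break, TRUE)

print(data.frame(month, gap_before = c(NA, gap), start, end))
```

Only the February to April gap (59 days) exceeds the threshold, so February closes the first spell and April opens the second.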