Linear interpolation with dplyr, skipping groups where all values are missing

Problem description:

I'm trying to linearly interpolate values within a group using dplyr and approx(). Unfortunately, some of the groups have all missing values, so I'd like the approximation to skip those groups and proceed with the remainder. I don't want to extrapolate or use the nearest neighbouring observation's data.

Here's an example of the data. The first group (by id) has all values missing; the other should be interpolated.

data <- read.csv(text="
id,year,value
c1,1998,NA
c1,1999,NA
c1,2000,NA
c1,2001,NA
c2,1998,14
c2,1999,NA
c2,2000,NA
c2,2001,18")

library(dplyr)
dataIpol <- data %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(valueIpol = approx(year, value, year,
                            method = "linear", rule = 1, f = 0, ties = mean)$y)

But I get the error

Error: need at least two non-NA values to interpolate

I don't get this error if I remove the groups where every value is missing, but that's not feasible.

We can fix this by adding a filter step that keeps only the groups with the required number of data points:

library(dplyr)
dataIpol <- data %>%
  group_by(id) %>%
  arrange(id, year) %>%
  # keep only groups with at least two non-NA values, as approx() requires
  filter(sum(!is.na(value)) >= 2) %>%
  mutate(valueIpol = approx(year, value, year,
                            method = "linear", rule = 1, f = 0, ties = mean)$y)

Here we count the non-NA items in the value column within each group, and remove any group that has fewer than 2.
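Note that the filter step drops the all-missing groups from the result entirely. If those rows need to stay in the output, one possible variant is to make the interpolation conditional inside mutate() instead of filtering. This is only a sketch under that assumption, not part of the answer above: it reuses the example data and the valueIpol column name from the question, and fills groups with fewer than two non-NA values with NA.

```r
library(dplyr)

# Example data from the question: c1 is entirely missing, c2 has gaps.
data <- read.csv(text = "id,year,value
c1,1998,NA
c1,1999,NA
c1,2000,NA
c1,2001,NA
c2,1998,14
c2,1999,NA
c2,2000,NA
c2,2001,18")

dataIpol <- data %>%
  arrange(id, year) %>%
  group_by(id) %>%
  # The condition is evaluated once per group: interpolate only when
  # approx() has at least two non-NA points, otherwise return NA.
  mutate(valueIpol = if (sum(!is.na(value)) >= 2) {
    approx(year, value, xout = year, method = "linear", rule = 1)$y
  } else {
    NA_real_
  }) %>%
  ungroup()

print(dataIpol)
```

With rule = 1, approx() still returns NA for years outside the observed range of a group, so nothing is extrapolated; all eight rows remain in the result.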