How to avoid time leakage in a KNN model?

Problem description:

I am building a KNN model to predict housing prices. I'll go through my data and my model, and then my problem.

Data:

# A tibble: 81,334 x 4
   latitude longitude close_date          close_price
      <dbl>     <dbl> <dttm>                    <dbl>
 1     36.4     -98.7 2014-08-05 06:34:00     147504.
 2     36.6     -97.9 2014-08-12 23:48:00     137401.
 3     36.6     -97.9 2014-08-09 04:00:40     239105.

Model:

library(caret)
library(dplyr)  # provides the %>% pipe used below
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]

model <- train(
  close_price~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

My problem is time leakage. I am making predictions for a house using other houses that closed afterwards, and in the real world I wouldn't have access to that information.

I want to apply a rule to the model that says: for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.

Is it possible to prevent this time leakage, either in caret or in other KNN libraries (like class and kknn)?

In caret, createTimeSlices implements a variation of cross-validation adapted to time series (it avoids time leakage by rolling the forecasting origin). Documentation is here.
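To make the rolling origin concrete, here is a minimal sketch on a made-up six-element series (the vector `y` and the window sizes are illustrative assumptions, not taken from the housing data). It shows that every training set ends just before the row it is tested on.

```r
library(caret)

# Toy series of six observations, assumed to be already in time order.
y <- 1:6

# Growing training window (fixedWindow = FALSE), one-row test sets.
slices <- createTimeSlices(y, initialWindow = 3, horizon = 1, fixedWindow = FALSE)

# Training indices: 1:3, 1:4, 1:5. Test indices: 4, 5, 6.
str(unname(slices$train))
str(unname(slices$test))
```

With `fixedWindow = TRUE` the training window would instead keep a constant length of `initialWindow` rows and slide forward.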

In your case, depending on your precise needs, you could use something like this for a proper cross-validation:

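library(caret)
library(dplyr)

# Slices are built by row position, so the data must be in time order.
your_data <- your_data %>% arrange(close_date)

# Rolling-origin slices. initialWindow = 10 and horizon = 1 are illustrative:
# they create one resample per remaining row, and initialWindow should exceed
# the largest k being tried, so raise them for a large data set.
tr_ctrl <- createTimeSlices(
  your_data$close_price,
  initialWindow = 10,
  horizon = 1,
  fixedWindow = FALSE)

model <- train(
  close_price ~ ., data = your_data, method = "knn",
  # train() expects a trainControl object, so pass the slices as index/indexOut
  trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
  preProcess = c("center", "scale"),
  tuneLength = 10
)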
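As a possible alternative to building the slices by hand, trainControl also accepts method = "timeslice" and then creates the rolling-origin resamples internally. The sketch below runs on synthetic data: the `toy` data frame, the window sizes, and the `skip` value are all assumptions chosen only to keep the example small and fast.

```r
library(caret)

set.seed(1)
# Synthetic stand-in for the housing data (all values are made up),
# generated already sorted by close_date.
n <- 200
toy <- data.frame(
  latitude   = runif(n, 36, 37),
  longitude  = runif(n, -99, -97),
  close_date = as.POSIXct("2014-08-01", tz = "UTC") + sort(runif(n, 0, 30 * 86400))
)
toy$close_price <- 150000 + 50000 * (toy$latitude - 36) + rnorm(n, sd = 5000)

# Rolling origin: train on the first 100 rows, test on the next 20, then
# move the origin forward 20 rows at a time (skip = 19) to limit the fold count.
ctrl <- trainControl(method = "timeslice", initialWindow = 100, horizon = 20,
                     fixedWindow = FALSE, skip = 19)

fit <- train(close_price ~ latitude + longitude, data = toy, method = "knn",
             trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 5)
fit$bestTune
```

This route does not expose the index lists before training, so it suits cases where the slices need no further adjustment.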

If you have ties in the dates and want to avoid having deals closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:

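library(lubridate)  # for as_date()

# Keep only training rows that closed on an earlier day than every test row.
filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr])
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)

  i_tr[tr_is_ok]
}

# Map() always returns a list; mapply() could simplify the result to a matrix.
tr_ctrl$train <- Map(filter_train, tr_ctrl$train, tr_ctrl$test)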
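
The resampling above controls how the model is validated. If the rule "for each y, only use houses that closed before it" has to hold for every single prediction, one option is to compute the neighbours directly. The following is a rough base-R sketch, not a caret feature: `knn_past_only` is a hypothetical helper, it assumes the column names from the question, and its brute-force loop will be slow on the full 81,334 rows.

```r
# For each row, average the prices of the k nearest houses (by standardised
# latitude/longitude) that closed strictly earlier than that row.
knn_past_only <- function(df, k = 3) {
  # Scaling with all rows is a simplification; it uses future coordinates too.
  xy <- scale(as.matrix(df[, c("latitude", "longitude")]))
  pred <- rep(NA_real_, nrow(df))
  for (i in seq_len(nrow(df))) {
    past <- which(df$close_date < df$close_date[i])
    if (length(past) < k) next  # not enough history: leave the prediction NA
    # Euclidean distance from row i to every earlier row.
    d <- sqrt(rowSums(sweep(xy[past, , drop = FALSE], 2, xy[i, ])^2))
    nearest <- past[order(d)[seq_len(k)]]
    pred[i] <- mean(df$close_price[nearest])
  }
  pred
}

# Tiny made-up example: four houses closing on consecutive days.
toy_houses <- data.frame(
  latitude    = c(0, 0, 1, 1),
  longitude   = c(0, 1, 0, 0.2),
  close_date  = as.POSIXct("2014-08-01", tz = "UTC") + (0:3) * 86400,
  close_price = c(100, 200, 300, 400)
)
preds <- knn_past_only(toy_houses, k = 1)
preds  # the first house has no earlier sales, so its prediction is NA
```

Comparing these predictions with the actual close_price for different values of k gives a leakage-free way to choose k.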