Conditionally change data frame columns based on values in other columns

Question:

In a simulated dataset

n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried

df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1)  {
  age[x] <- age[x] + 15
  sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))

but this has no visible effect:

tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)


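Answer:

Using sapply inside within does not work the way you expect. within evaluates the expression and copies back only the variables that were assigned at the top level of that expression. In your code, the assignments to age[x] and sex[x] happen inside the anonymous function, so they change local copies that are discarded when the function returns. In addition, sapply(length(outcome), ...) calls the function only once, with x = 50, instead of once per row; seq_along(outcome) would be needed to visit every row. Hence, within does not modify the data frame.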
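The scoping behaviour can be seen on a much smaller example. This is only a sketch: the data frame `d` and its values are made up for illustration and are not part of the question's data.

```r
# Toy data frame; `d` and its values are illustrative assumptions.
d <- data.frame(age = c(30, 40, 50), outcome = c(0, 1, 1))

# Assignment inside an anonymous function: only a local copy of `age` changes,
# so the column seen by `within` stays as it was.
d_fun <- within(d, sapply(seq_along(outcome), function(x) {
  if (outcome[x] == 1) age[x] <- age[x] + 15
}))

# Assignment at the top level of the expression: `within` keeps the result.
d_top <- within(d, age[outcome == 1] <- age[outcome == 1] + 15)

identical(d_fun$age, d$age)  # TRUE: the column was not modified
d_top$age                    # 30 55 65
```

Note that the first call fails to change anything even though it uses seq_along, which suggests the function scope, not the loop index, is the main obstacle.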

Here is an easier way to modify the data frame without a loop or sapply:

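# Logical index of the subjects with the disease
idx <- df$outcome == "1"
df1 <- within(df, {
  # Make diseased subjects 15 years older
  age[idx] <- age[idx] + 15
  # Redraw their sex with an 80% chance of "m"
  sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))
})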

Now the data frames differ:

> tapply(df$age, df$outcome, mean)
       0        1 
60.46341 57.55556 
> tapply(df1$age, df$outcome, mean)
       0        1 
60.46341 72.55556 

> tapply(df$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
2 7 

> tapply(df1$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
1 8
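
The same change can probably also be written with plain index assignment on a copy, without within. A minimal sketch, assuming a reduced version of the question's data frame (only age, sex and outcome) and a hypothetical copy named `df2`; because the random draws differ from the full example, the numbers will not match the output above.

```r
# Reduced version of the question's data (assumption: only three columns).
n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Logical index of the subjects with the disease
idx <- df$outcome == "1"

# Work on a copy (`df2` is a hypothetical name) so `df` stays available
df2 <- df
df2$age[idx] <- df2$age[idx] + 15
df2$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))

# Compare the group means before and after the change
tapply(df$age, df$outcome, mean)
tapply(df2$age, df2$outcome, mean)
```

Both forms rely on the same vectorised logical index; which one to use is mostly a matter of taste.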