Conditionally change data frame columns based on values in other columns
Question:
In the simulated dataset
n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)
I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried
df1 <- within(df, sapply(length(outcome), function(x) {
  if (outcome[x] == 1) {
    age[x] <- age[x] + 15
    sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
  }
}))
but the following comparisons show no change:
tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)
Answer:
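Using sapply inside within does not work as you expect. within keeps only the variables that are assigned directly in its expression. In your code the assignments to age[x] and sex[x] happen inside the anonymous function, so they change local copies that are discarded when the function returns, and the value returned by sapply is never assigned to anything. Hence, within does not modify the data frame. Note also that sapply(length(outcome), ...) calls the function only once, with x = 50, not once per row.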
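The scoping behaviour described above can be reproduced on a much smaller example. This is only a sketch: the data frame toy and its columns a and b are hypothetical and stand in for the simulated data from the question.

```r
# Hypothetical toy data, not the simulated data set from the question.
toy <- data.frame(a = 1:3, b = c(0, 1, 1))

# Assignment inside an anonymous function: only a local copy of `a` changes,
# so within() sees no modified variable.
lost <- within(toy, sapply(seq_along(b), function(i) {
  if (b[i] == 1) a[i] <- a[i] + 10L
}))

# Assignment directly in the within() expression: the change is kept.
kept <- within(toy, a[b == 1] <- a[b == 1] + 10L)

lost$a  # 1 2 3 (unchanged)
kept$a  # 1 12 13
```

Only the second call assigns to a in the environment that within inspects afterwards, which is why only kept differs from toy.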
Here is an easier way to modify the data frame without a loop or sapply:
idx <- df$outcome == "1"
df1 <- within(df, {
  age[idx] <- age[idx] + 15
  sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))
})
Now the data frames differ:
> tapply(df$age, df$outcome, mean)
       0        1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean)
       0        1
60.46341 72.55556
> tapply(df$sex, df$outcome, summary)
$`0`
 f  m
24 17

$`1`
f m
2 7

> tapply(df1$sex, df$outcome, summary)
$`0`
 f  m
24 17

$`1`
f m
1 8
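The same kind of update can also be written without within, by copying the data frame and using indexed assignment on the copy. The sketch below uses a small hypothetical data frame d (its values are made up for illustration) so that it runs on its own. It is an alternative formulation, not the code that produced the output above.

```r
# Hypothetical small data set standing in for df; values are illustrative only.
set.seed(1)
d <- data.frame(
  age = c(30, 45, 60, 75),
  sex = c("f", "f", "m", "f"),
  outcome = factor(c(0, 1, 0, 1)),
  stringsAsFactors = FALSE
)

idx <- d$outcome == "1"          # logical index of the rows with the disease

d1 <- d                          # work on a copy so d stays unchanged
d1$age[idx] <- d1$age[idx] + 15  # shift age only where idx is TRUE
d1$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))

d1$age - d$age                   # 0 15 0 15
```

Rows with outcome 0 are left untouched, and the rows with outcome 1 get a newly sampled sex value, so their sex may or may not differ from the original.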