Split up a dataframe by number of rows

Question:

I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with. I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.

There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes. How might I do this?

Answer:

Make your own grouping variable.

d <- split(my_data_frame,rep(1:400,each=1000))
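To cover the "run the functions, then reassemble" part of the question, here is a minimal sketch on a toy data frame. The 12-row data and the `process_chunk()` helper are hypothetical stand-ins for illustration, not part of the original answer; substitute your own data and function.

```r
# Toy stand-in for my_data_frame: 12 rows, split into 3 chunks of 4 rows
my_data_frame <- data.frame(id = 1:12, x = (1:12) / 2)
d <- split(my_data_frame, rep(1:3, each = 4))

# process_chunk() is a hypothetical placeholder for your own function
process_chunk <- function(chunk_df) {
  chunk_df$y <- chunk_df$x * 2
  chunk_df
}

# Apply the function to every chunk, then stack the results back together
processed <- lapply(d, process_chunk)
result <- do.call(rbind, processed)
rownames(result) <- NULL  # drop the "chunk.row" row names that rbind adds
```

Because `split()` keeps the rows of each chunk in their original order, `rbind`-ing the list back together restores the original row order.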

You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.

edited for brevity, after Hadley's comments.

If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do

chunk <- 1000
n <- nrow(my_data_frame)
r  <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
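As a quick check of the uneven case, the same code can be run on a toy data frame whose row count is not a multiple of the chunk size. The 10-row data and the chunk size of 4 are made up for illustration.

```r
my_data_frame <- data.frame(id = 1:10)  # toy data: 10 rows
chunk <- 4
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
d <- split(my_data_frame, r)
sapply(d, nrow)  # chunk sizes: 4, 4 and 2
```

The last chunk simply holds the leftover rows, because `[1:n]` truncates the grouping vector to the length of the data.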

You could also use

r <- ggplot2::cut_width(1:n,chunk,boundary=0)

For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like

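library(dplyr)
full_number <- 1000                                  # rows per group
ngrps <- ceiling(nrow(my_data_frame) / full_number)  # number of groups
(my_data_frame
   %>% mutate(index = rep(1:ngrps, each = full_number)[seq_len(n())])
   %>% group_by(index)
   # %>% [mutate, summarise, do()] ...
)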
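A concrete version of the pipeline above might look like the following sketch. The toy data, the chunk size of 4, and the choice of `summarise()` with a per-chunk row count and mean are assumptions for illustration; any grouped `mutate()` or `summarise()` call could take its place.

```r
library(dplyr)

my_data_frame <- data.frame(id = 1:10, x = 1:10)  # toy data: 10 rows
full_number <- 4                                  # rows per group
ngrps <- ceiling(nrow(my_data_frame) / full_number)

result <- my_data_frame %>%
  # index is 1 for rows 1-4, 2 for rows 5-8, 3 for rows 9-10
  mutate(index = rep(1:ngrps, each = full_number)[seq_len(n())]) %>%
  group_by(index) %>%
  summarise(n_rows = n(), mean_x = mean(x))
```

With this approach the data never has to be physically split into a list of data frames; the grouping column does the bookkeeping and the result comes back as a single table.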

There are also many more answers here.