How to read a CSV file with a varying number of columns in R

Problem description:

I have a sparse data set in CSV format, one whose rows vary in their number of columns. Here is a sample of the file text.

12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco

When I use

read.csv("data.txt", header = FALSE)

R interprets the data set as having 3 columns, because the column count is determined from the first 5 rows. Is there any way to force R to put the data in more columns?

Deep in the ?read.table documentation there is the following:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

Therefore, let's define col.names to be length X (where X is the max number of fields in your dataset), and set fill = TRUE:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

read.table(dat, header = FALSE, sep = ",", 
  col.names = paste0("V",seq_len(7)), fill = TRUE)

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco
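In the output above, the padded fields come back as empty strings and each value keeps the space that followed its comma. Here is a minimal sketch, assuming you would rather have trimmed values and NA for the missing fields. strip.white and na.strings are standard read.table arguments; the shortened three-row sample and the names txt and df are only for illustration:

```r
# Three of the sample rows, supplied inline through the text argument
txt <- "12223, University
12227, bridge, Sky
16520, California, ocean, summer, golden gate, beach, San Francisco"

df <- read.table(text = txt, header = FALSE, sep = ",",
                 col.names = paste0("V", seq_len(7)), fill = TRUE,
                 strip.white = TRUE,        # drop the space after each comma
                 na.strings = "",           # read empty (padded) fields as NA
                 stringsAsFactors = FALSE)  # keep text columns as character

# Number of non-missing fields in each row
rowSums(!is.na(df))
```

Having NA instead of "" makes the ragged rows easier to filter or reshape later, for example with is.na().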

If the maximum number of fields is unknown, you can use the nifty utility function count.fields (which I found in the read.table example code):

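# dat must be a file path or a fresh textConnection here: a connection that
# has already been read is exhausted and has to be recreated before each call.
count.fields(dat, sep = ',')
# [1] 2 3 2 2 2 2 3 3 7
max(count.fields(dat, sep = ','))
# [1] 7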
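Putting the two steps together for a file on disk might look like the sketch below. It assumes the data lives in a file rather than a connection, so the file can be read twice. The temporary file stands in for data.txt, and path, n_cols and df are illustrative names:

```r
# Write a few of the sample rows to a temporary file (stand-in for data.txt)
path <- tempfile(fileext = ".txt")
writeLines(c("12223, University",
             "12227, bridge, Sky",
             "16520, California, ocean, summer, golden gate, beach, San Francisco"),
           path)

# First pass: find the widest row in the file
n_cols <- max(count.fields(path, sep = ","))

# Second pass: read with enough column names to hold the widest row
df <- read.table(path, header = FALSE, sep = ",",
                 col.names = paste0("V", seq_len(n_cols)), fill = TRUE)
dim(df)
```

Passing a path instead of a connection means each call opens the file afresh, so the order of the calls does not matter.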

Possibly helpful related reading: Only read limited number of columns in R