How can I extract a column from a CSV with quoted commas, using the shell?
I have a CSV file, but unlike in related questions, it has some columns containing double-quoted strings with commas, e.g.
foo,bar,baz,quux
11,"first line, second column",13.0,6
210,"second column of second line",23.1,5
(Of course it's longer, and the number of quoted commas in a field is not necessarily one or zero, nor is the text predictable.) The text might also have (escaped) double quotes within double quotes, or lack double quotes altogether for a typically quoted field. The only assumption we can make is that there are no quoted newlines, so we can split lines trivially using \n.
Now, I'd like to extract a specific column (say, the third one), for example to print it on standard output, one value per line. I can't simply use commas as field delimiters (and thus, e.g., use cut); rather, I need something more sophisticated. What could that be?
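To illustrate why a plain comma split is not enough, here is a minimal sketch in Python. The helper name naive_third_column is made up for this illustration; it does in Python what a comma-delimited cut would do in the shell.

```python
def naive_third_column(line):
    """Split on every comma, ignoring quoting (hypothetical helper, for illustration only)."""
    return line.split(",")[2]


# Second line of the sample above: the quoted field contains a comma,
# so the naive split cuts that field in two and the "third" piece is wrong.
line = '11,"first line, second column",13.0,6'
print(repr(naive_third_column(line)))
```

The value I actually want from that line is 13.0, but the naive split returns the tail of the quoted second field instead.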
Note: I'm using bash on a Linux system.
Here is a quick and dirty Python csvcut. The Python csv library already knows everything about various CSV dialects etc., so you just need a thin wrapper.
The first argument should be the 1-based index of the field you wish to extract, like
csvcut 3 sample.csv
to extract the third column from the (possibly quoted, etc.) CSV file sample.csv.
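#!/usr/bin/env python3
import csv
import sys

writer = csv.writer(sys.stdout)
# The command line takes a 1-based column number; Python indexing is zero-based
col = int(sys.argv[1]) - 1
for path in sys.argv[2:]:
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            # writerow() expects a sequence of fields, so wrap the single value
            writer.writerow([row[col]])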
To do: error handling, extraction of multiple columns. (Not hard per se; use row[2:5] to extract columns 3, 4, and 5; but I'm too lazy to write a proper command-line argument parser.)
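As a rough sketch of that to-do item, something along these lines could handle several columns plus some basic error handling. The function names parse_columns and extract_columns, and the comma-separated column spec such as "2,3", are assumptions made up for this example and not part of the script above. Padding short rows with empty strings is likewise just one possible choice.

```python
import csv
import io


def parse_columns(spec):
    """Turn a spec such as "1,3" into zero-based indices [0, 2] (hypothetical format)."""
    indices = []
    for part in spec.split(","):
        number = int(part)  # raises ValueError on non-numeric input
        if number < 1:
            raise ValueError(f"column numbers start at 1, got {number}")
        indices.append(number - 1)
    return indices


def extract_columns(lines, indices):
    """Yield the selected fields of each CSV row; missing fields become ''."""
    for row in csv.reader(lines):
        yield [row[i] if i < len(row) else "" for i in indices]


# Demonstration on the sample data from the question, held in memory.
sample = io.StringIO(
    'foo,bar,baz,quux\n'
    '11,"first line, second column",13.0,6\n'
    '210,"second column of second line",23.1,5\n'
)
for fields in extract_columns(sample, parse_columns("2,3")):
    print(fields)
```

To use it on real files, pass an open file handle (opened with newline="") in place of the in-memory sample, and feed the yielded lists to csv.writer(sys.stdout).writerows() if the output should be CSV again.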