How do I replace newlines inside quoted CSV fields with a space?
I have a large CSV file with some line feeds in quoted columns. I need to run a shell script (sed, awk, and perl are all fine) that replaces only the line feeds inside quotes with a space. The end-of-line terminators must be preserved, and I don't know the number of columns or which fields might contain these embedded line feeds.
Further examination of the file shows the following in the output of cat -v test_lf.csv:
"NORTH ?M-^@?OLMSTED"
"PORT?M-^@?ST?M-^@?LUCIE"
These appear in the CSV file, and Excel shows a line break where the ?M-^@? is.
I'd like to use tr to replace whatever that is with a space. How would I do that? What is that sequence?
I've now found that a small part of the file looks like the following in characters and in hex.
13:20:59 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -x
0000000 3431 3136 3439 3039 2c39 542c 4c45 3543
0000020 5f36 3430 2c47 4c46 332c 3934 3338 312c
0000040 3634 3931 3934 3930 222c 4f50 5452 80e2
0000060 53a8 e254 a880 554c 4943 2245 462c 2c4c
0000100 3433 3839 2c33 3737 2c32 3237 2c37 3535
0000120 2c2c 5441 334c 2c51 6e43 7463 222c 6f4e
0000140 80e2 4da8 6773 2c22 2c34 3832 312c 0d35
0000160 000a
0000161
13:21:50 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -c
0000000 1 4 6 1 9 4 9 0 9 , , T E L C 5
0000020 6 _ 0 4 G , F L , 3 4 9 8 3 , 1
0000040 4 6 1 9 4 9 0 9 , " P O R T
**
0000060 ** S T
** ** L U C I E " , F L ,
0000100 3 4 9 8 3 , 7 7 2 , 7 2 7 , 5 5
0000120 , , A T L 3 Q , C n c t , " N o
0000140
** ** M s g " , 4 , 2 8 , 1 5 \r
0000160 \n
I'm on a Mac, the file is UTF-8, and my locale is UTF-8. It appears that the bytes are swapped within each two-byte word (little-endian), so hexadecimal 3431 3136 is 1461 in the character representation. So, starting at byte 60 in this output, we have:
something, S, T, blank, something, something, L and in hex this is:
53a8 e254 a880 554c, where 53 is an S, 54 is a T, and 4c is an L. So, once the swap is undone, there is an e2 80 a8 sequence between the T and the L. This produces a line feed in the Excel spreadsheet field.
How do I find these bytes and replace them with a space?
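One way to sidestep the byte-swapped view is to ask od for single bytes with -tx1, which prints them in file order rather than as two-byte words. The following is only a sketch for a POSIX shell: dump_bytes is a hypothetical helper name, and the sample string merely mimics the PORT...ST fragment shown above (the octal escapes \342\200\250 stand for the bytes e2 80 a8).

```shell
# Print the bytes read from stdin in file order, one hex pair per byte.
# -An drops the offset column; -tx1 selects one-byte hexadecimal units,
# so no two-byte word swapping can occur.
dump_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Sample input (an assumption, not taken from the real file).
printf 'PORT\342\200\250ST' | dump_bytes
# prints: 50 4f 52 54 e2 80 a8 53 54
```

Read this way, the three bytes between the T and the S come out as e2 80 a8, which is the UTF-8 encoding of U+2028 (LINE SEPARATOR).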
I'd use the Perl module Text::CSV
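#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use open IO => ':encoding(utf8)';   # decode input and encode output as UTF-8
use open ':std';                    # apply the same layers to the standard handles
use Text::CSV;

my $file = shift @ARGV;
open my $fh, "<", $file or die "cannot open $file: $!\n";

# binary => 1 lets fields contain embedded newlines and non-ASCII characters
my $csv = Text::CSV->new({binary => 1});

# getline parses one logical record, even if it spans several physical lines
while (my $row = $csv->getline($fh)) {
    # replace each newline inside a field with a space
    # (only "\n" is matched; other break characters would have to be added to the pattern)
    my @no_newlines = map {s/\n/ /g; $_} @$row;
    $csv->combine(@no_newlines);
    say $csv->string();
}
close $fh;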
And then you can run it like:
/path/to/csvfixer.pl file.csv > fixed.csv
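If the only offending characters are the three bytes seen in the hex dump (e2 80 a8 once the od -x word swap is undone, which is the UTF-8 encoding of U+2028 LINE SEPARATOR), a plain byte-level substitution may be enough and needs no CSV parsing. This is a sketch under that assumption, not a general fix for real line feeds inside quotes; replace_line_separator is a hypothetical function name.

```shell
# Replace every UTF-8 encoded U+2028 (bytes e2 80 a8) on stdin with a space.
# The real line endings are left untouched.
replace_line_separator() {
    # Build the three-byte sequence from octal escapes (342 200 250 = e2 80 a8).
    sep=$(printf '\342\200\250')
    # LC_ALL=C makes sed treat its input as raw bytes rather than characters.
    LC_ALL=C sed "s/$sep/ /g"
}
```

It could then be run in the same way as the script above, for example replace_line_separator < test_lf.csv > fixed.csv. Note that tr is not a good fit here, because it maps single bytes and cannot collapse a three-byte sequence into one space.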