如何用空格替换csv引用字段中的换行符?

问题描述：

在带引号的列中有一些带有换行符的大型csv文件.我需要运行一个shell脚本，sed，awk，perl都很好，并且只用空格替换引号内的换行符.行尾必须保留，我不知道列数或哪些字段可以嵌入这些换行符.

Have a large csv file with some line feeds in quoted columns. I need to run a shell script, sed, awk, perl are fine and replace only the line feeds inside quotes with a space. The end of line must be preserved and I don't know the number of columns or which fields might have these embedded line feeds.

对文件的进一步检查显示这是cat -v test_lf.csv的结果

Further examination of the file shows this as a result of cat -v test_lf.csv

"NORTH ?M-^@?OLMSTED"
"PORT?M-^@?ST?M-^@?LUCIE"

CSV文件中的

在excel中显示换行符，其中?M-^ @?是.

in the csv file which in excel shows a line break where the ?M-^@? is.

我想用tr替换空格.我该怎么做?那是什么顺序?

I'd like to use tr to replace whatever that is with a space. How would I do that? What is that sequence?

我现在发现文件的一小部分看起来像下面的字符和十六进制.

I've now found that a small part of the file looks like the following in characters and in hex.

13:20:59 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -x
0000000      3431    3136    3439    3039    2c39    542c    4c45    3543
0000020      5f36    3430    2c47    4c46    332c    3934    3338    312c
0000040      3634    3931    3934    3930    222c    4f50    5452    80e2
0000060      53a8    e254    a880    554c    4943    2245    462c    2c4c
0000100      3433    3839    2c33    3737    2c32    3237    2c37    3535
0000120      2c2c    5441    334c    2c51    6e43    7463    222c    6f4e
0000140      80e2    4da8    6773    2c22    2c34    3832    312c    0d35
0000160      000a                                                        
0000161
13:21:50 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -c
0000000    1   4   6   1   9   4   9   0   9   ,   ,   T   E   L   C   5
0000020    6   _   0   4   G   ,   F   L   ,   3   4   9   8   3   ,   1
0000040    4   6   1   9   4   9   0   9   ,   "   P   O   R   T          **
0000060   **   S   T          **  **   L   U   C   I   E   "   ,   F   L   ,
0000100    3   4   9   8   3   ,   7   7   2   ,   7   2   7   ,   5   5
0000120    ,   ,   A   T   L   3   Q   ,   C   n   c   t   ,   "   N   o
0000140           **  **   M   s   g   "   ,   4   ,   2   8   ,   1   5  \r
0000160   \n

我在Mac上，文件为utf-8，我的语言环境为utf-8.似乎字节已交换(小尾数)，因此十六进制3431 3136在字符表示形式中为1463.因此，从此输出的字节60开始，我们有:S，T，空白，L，十六进制表示:53a8 e254 a880 554c，53是S，54是T，4c是L.因此，T和L之间有一个e2，a8、80序列.这将在Excel电子表格字段中产生换行.
如何查找并用空格替换这些字节?

I'm on a mac and the file is utf-8, and my locale is utf-8. It appears that the bytes are swapped (little-endian), so hexadecimal 3431 3136 is 1463 in the character representation. So, starting at byte 60 in this output, we have: something, S, T, blank, something, something, L and in hex this is: 53a8 e254 a880 554c, 53 is an S, 54 is a T, and 4c is an L . So, there is an e2, a8, 80 sequence between the T and the L. This produces a line-feed in the Excel spreadsheet field.
How do I find and replace these bytes by a space?

答

我将使用Perl模块文本:: CSV

I'd use the Perl module Text::CSV

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use open IO => ':encoding(utf8)';
use open ':std';
use Text::CSV;

my $file = shift @ARGV;
open my $fh, "<", $file or die "cannot open $file: $!\n";

my $csv = Text::CSV->new({binary => 1});

while (my $row = $csv->getline($fh)) {
    my @no_newlines = map {s/\n/ /g; $_} @$row;
    $csv->combine(@no_newlines);
    say $csv->string();
}

close $fh;

然后您可以像这样运行它:

And then you can run it like:

/path/to/csvfixer.pl file.csv > fixed.csv

如何用空格替换csv引用字段中的换行符?

相关推荐