Why does my tool output overwrite itself and how do I fix it?

Problem description:

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.

NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue, so that we can point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also, some of them have accepted answers that are just plain dangerous and should never be used.

Now back to the typical question that would result in a referral here:

I have a file containing 1 line:

what isgoingon

and when I print it using this awk script to reverse the order of the fields:

awk '{print $2, $1}' file

instead of seeing the output I expect:

isgoingon what

I get the field that should be at the end of the line appear at the start of the line, overwriting some text at the start of the line:

 whatngon

or I get the output split onto 2 lines:

isgoingon
 what

What could the problem be and how do I fix it?

The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF, and you are running a UNIX tool on it, so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file, while LF is \n and appears as $ with cat -vE.
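To check whether a given file is affected before changing anything, one option is to count the lines that end in CR. This is only a minimal sketch, assuming a POSIX shell with grep; count_crlf_lines is a hypothetical helper name, not a standard tool:

```shell
# Hypothetical helper: print the number of lines in "$1" that end in CR.
count_crlf_lines() {
    # Build a literal CR portably; printf '\r' emits the single byte 0x0D.
    cr=$(printf '\r')
    # grep -c counts matching lines; the trailing $ anchors the CR at end of line.
    grep -c "$cr"'$' "$1"
}

# Usage: a non-zero count suggests the file has DOS line endings.
# count_crlf_lines file
```

Note that grep exits with a non-zero status when the count is 0, which matters if the helper is called under set -e.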

So your input file wasn't really just:

what isgoingon

it was actually:

what isgoingon\r\n

as you can see with cat -vE:

$ cat -vE file
what isgoingon^M$

and od -c:

$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020

so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:

<what> <isgoingon\r>

Note the \r at the end of the second field. \r means Carriage Return, which is literally an instruction to return the cursor to the start of the line, so when you do:

print $2, $1

awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
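One way to see this without depending on how the terminal reacts to CR is to pipe the output through cat -v, which shows the CR as ^M. A minimal sketch, assuming a POSIX shell with awk and a cat that supports -v; the variable name reversed is just for illustration:

```shell
# Reproduce the problem without a file: feed awk one CRLF-terminated line
# and make the CR visible with cat -v instead of letting the terminal act on it.
reversed=$(printf 'what isgoingon\r\n' | awk '{print $2, $1}' | cat -v)
printf '%s\n' "$reversed"
# prints: isgoingon^M what
```

The ^M sits between the two fields, which is exactly where the cursor jumps back to the start of the line on a real terminal.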

To fix the problem, do any one of these:

dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file

Apparently dos2unix is also available as fromdos (from the tofrodos package) in some UNIX variants (e.g. Ubuntu).

Be careful if you decide to use tr -d '\r' as is often suggested, as that will delete all \rs in your file, not just those at the end of each line.
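The difference is easy to demonstrate on a line that has one CR inside the data and one at the end. This is a sketch under a few assumptions: a POSIX shell, a cat that supports -v, and a literal CR built with printf so the sed command does not depend on sed understanding \r. The variable names are illustrative only:

```shell
# Build a literal CR so the sed script works even where sed lacks \r support.
cr=$(printf '\r')

# tr removes every CR, including the one inside the data.
with_tr=$(printf 'a\rb\r\n' | tr -d '\r' | cat -v)

# sed with an end-of-line anchor removes only the CR that ends the line.
with_sed=$(printf 'a\rb\r\n' | sed "s/$cr\$//" | cat -v)

printf 'tr:  %s\nsed: %s\n' "$with_tr" "$with_sed"
# prints:
# tr:  ab
# sed: a^Mb
```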

Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:

gawk -v RS='\r\n' '...' file

but other awks will not allow that, as POSIX only requires awks to support a single-character RS and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs though, as the underlying C primitives will strip them on some platforms, e.g. cygwin.

One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:

"field1","field2.1
field2.2","field3"

is really:

"field1","field2.1\nfield2.2","field3"\r\n

so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings. If you want to do that, I recommend converting all of the intra-field linefeeds to something else first, e.g. this would convert all intra-field LFs to tabs and convert all line-ending CRLFs to LFs:

gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file

Doing something similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they are read.
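As one possible take on that exercise, the following sketch buffers physical lines until it reaches one that ends in CR, joining the pieces with tabs. It assumes a POSIX awk that actually sees the CRs (see the BINMODE caveat for platforms such as cygwin); join_crlf_records is a hypothetical helper name and it reads from standard input:

```shell
# Hypothetical helper: read CRLF-terminated records on stdin and write
# LF-terminated records with any LF inside a record replaced by a tab.
join_crlf_records() {
    awk '
        # Append the current physical line to the pending record.
        { buf = (pending ? buf "\t" $0 : $0); pending = 1 }
        # A trailing CR marks the real end of the record: strip it and print.
        /\r$/ { sub(/\r$/, "", buf); print buf; pending = 0 }
        # Flush a final record that had no CRLF terminator.
        END { if (pending) print buf }
    '
}

# Usage: join_crlf_records < file
```

It makes no attempt to parse CSV quoting; it relies only on the observation that real record endings are CRLF while embedded line breaks are bare LF.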

Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters included as separating fields when the default FS of " " is used, whose whitespace characters are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:

$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$

$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'

$

That's because leading/trailing field separator white space is ignored at the beginning/end of a line that has LF line endings, but \r is the final field on a line with CRLF line endings if the character before it was whitespace:

$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
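If converting the file first is not an option, one workaround is to strip the trailing CR inside the awk script before touching any fields, since modifying $0 makes awk re-split the record. A minimal sketch, assuming a POSIX awk and a cat that supports -v; the variable name last_field is illustrative:

```shell
# Remove the trailing CR from the record first; awk then re-splits $0,
# so the trailing blank is ignored again and $NF is the last real field.
last_field=$(printf 'x y \r\n' | awk '{sub(/\r$/, "")} {print $NF}' | cat -v)
printf '%s\n' "$last_field"
# prints: y
```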