Fixing a file consisting of both UTF-8 and Windows-1252

Question:

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 (aka iso-latin-1) or cp1252 (aka Windows-1252). Is there a way of recovering the original text?

Answer:

Yes!

Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.

Encoding::FixLatin provides a function named fix_latin, which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   # cp1252 byte D0, cp1252 byte 92, then the UTF-8 sequence D0 92 (U+0412)
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

One of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
encoded using iso-8859-1 or cp1252, followed by one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.

One of
[àáâãäåæçèéêëìíîï]
encoded using iso-8859-1 or cp1252, followed by two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.

One of
[ðñòóôõö÷]
encoded using iso-8859-1 or cp1252, followed by three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
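
To see why these cases are ambiguous, it may help to compare the raw bytes. The following is a minimal sketch using only the core Encode module; the sample strings are my own illustration, not taken from the question. It shows that the cp1252 text "Ã©" (one character from the first set followed by one from the second) produces exactly the same bytes as the UTF-8 text "é", so no decoder can tell which one was meant.

```perl
use strict;
use warnings;
use Encode qw( encode );

# "Ã©" (U+00C3 U+00A9) stored as cp1252 ...
my $as_cp1252 = encode("cp1252", "\x{C3}\x{A9}");

# ... and "é" (U+00E9) stored as UTF-8.
my $as_utf8 = encode("UTF-8", "\x{E9}");

# Both are the two bytes C3 A9, so the original text cannot be recovered
# from the bytes alone.
print $as_cp1252 eq $as_utf8 ? "identical bytes\n" : "different bytes\n";
```

Sequences like this are rare in natural-language text, which is why the heuristic holds up in practice.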

The same result can be produced using the core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   # The callback receives each byte that is not valid UTF-8.
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
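
For readers who want to see the character-level idea spelled out, here is a hand-rolled sketch. It is not how Encoding::FixLatin is implemented; the function name fix_mixed and the regular expression are my own assumptions. It treats every run of well-formed UTF-8 as UTF-8 and decodes each remaining byte as cp1252. Because it accepts only strictly well-formed UTF-8, its ambiguous cases may be slightly narrower than the ones listed above.

```perl
use strict;
use warnings;
use Encode qw( decode );

# One well-formed UTF-8 sequence (byte ranges from the Unicode standard).
my $utf8_seq = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: decode a byte string that mixes UTF-8 and cp1252.
sub fix_mixed {
    my ($bytes) = @_;
    my $text = '';
    # Walk the string: either a run of valid UTF-8, or one stray byte.
    while ($bytes =~ /\G(?:($utf8_seq+)|(.))/gs) {
        $text .= defined $1 ? decode("UTF-8", $1) : decode("cp1252", $2);
    }
    return $text;
}

# Same input as the examples above.
printf("U+%v04X\n", fix_mixed("\xD0 \x92 \xD0\x92\n"));
# prints U+00D0.0020.2019.0020.0412.000A
```

The first D0 and the lone 92 are not part of any valid UTF-8 sequence, so they are read as cp1252, while the final D0 92 pair is read as the single UTF-8 character U+0412.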

Each line only uses one encoding

fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether the line is valid UTF-8.

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         # Not valid UTF-8, so treat the whole line as cp1252.
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

The line is encoded using iso-8859-1 or cp1252,

At least one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
is present in the line,

All instances of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
are always followed by exactly one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

All instances of
[àáâãäåæçèéêëìíîï]
are always followed by exactly two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

All instances of
[ðñòóôõö÷]
are always followed by exactly three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

None of
[øùúûüýþÿ]
are present in the line, and

None of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
are present in the line except where previously mentioned.
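
A short sketch may make the difference from the character-level approach concrete. It wraps the per-line check in a helper, decode_line, which is a name I made up for this illustration, and the sample lines are likewise my own. A cp1252 line containing "Ã©" is decoded correctly as soon as anything else on the line is invalid as UTF-8 (here "ø"), whereas a line where every condition above holds is still misread.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode a line as UTF-8 if it is valid UTF-8,
# otherwise decode the whole line as cp1252.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode("cp1252", $bytes);
}

# cp1252 "cafÃ© ø": the lone F8 byte is invalid UTF-8, so the whole line
# is decoded as cp1252 and "Ã©" survives intact.
printf("U+%v04X\n", decode_line("caf\xC3\xA9 \xF8\n"));

# cp1252 "cafÃ©" on its own is also valid UTF-8, so it is read as "café".
# This is the failure case described by the conditions above.
printf("U+%v04X\n", decode_line("caf\xC3\xA9\n"));
```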

Notes:

Encoding::FixLatin installs the command-line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
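
As a rough idea of what a converter built on the second approach might look like, here is a minimal sketch using only core modules. The helper name convert_handle and the sample data are assumptions made for illustration. It reads lines from one handle, decodes each line as UTF-8 when valid and as cp1252 otherwise, and writes clean UTF-8 to the other handle. In-memory filehandles are used so the example is self-contained.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: copy $in to $out, re-encoding every line as UTF-8.
sub convert_handle {
    my ($in, $out) = @_;
    while (defined(my $bytes = <$in>)) {
        my $text = eval {
            decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        # Lines that are not valid UTF-8 are assumed to be cp1252.
        $text = decode("cp1252", $bytes) if !defined $text;
        print {$out} encode("UTF-8", $text);
    }
}

# One UTF-8 line followed by one cp1252 line with curly quotes.
my $mixed = "\xD0\x92\n" . "\x93quoted\x94\n";
my $fixed = '';

open(my $in,  '<:raw', \$mixed) or die "Cannot read sample: $!";
open(my $out, '>:raw', \$fixed) or die "Cannot write sample: $!";
convert_handle($in, $out);
close($out);

# $fixed now holds both lines as UTF-8 bytes.
```

To use it as a filter, one could pass \*STDIN and \*STDOUT instead, after calling binmode on both so that no I/O layer re-encodes the bytes.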
