Fixing a file consisting of both UTF-8 and Windows-1252
I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?
Yes!
Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.
Encoding::FixLatin provides a function named fix_latin
which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.
$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
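To illustrate what working at the character level means, here is a minimal sketch of the same idea in plain Perl. It is not Encoding::FixLatin's actual implementation; the names fix_mixed and $utf8_char are hypothetical. It assumes that any well-formed multi-byte UTF-8 sequence really is UTF-8 and that every other high byte is cp1252.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Well-formed multi-byte UTF-8 sequences (hypothetical helper pattern).
my $utf8_char = qr/
     [\xC2-\xDF][\x80-\xBF]               # 2-byte sequences
   | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlong forms
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte
   | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
   | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, excluding overlong forms
   | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte
   | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
/x;

# Walk the byte string one token at a time: a run of ASCII, one UTF-8
# sequence, or one stray byte that is assumed to be cp1252.
sub fix_mixed {
   my ($bytes) = @_;
   my $text = "";
   while ($bytes =~ /\G(?: ([\x00-\x7F]+) | ($utf8_char) | (.) )/xsg) {
      my ($ascii, $utf8, $other) = ($1, $2, $3);
      $text .= defined($ascii) ? $ascii
             : defined($utf8)  ? decode("UTF-8", $utf8)
             :                   decode("cp1252", $other);
   }
   return $text;
}

printf("U+%v04X\n", fix_mixed("\xD0 \x92 \xD0\x92\n"));
```

For the sample input, the lone \xD0 and \x92 are treated as cp1252, while the adjacent pair \xD0\x92 is treated as one UTF-8 character, so the code points printed should match the output above.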
Heuristics are employed, but they are fairly reliable. Only the following cases will fail:
One of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
encoded using iso-8859-1 or cp1252, followed by one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
One of
[àáâãäåæçèéêëìíîï]
encoded using iso-8859-1 or cp1252, followed by two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
One of
[ðñòóôõö÷]
encoded using iso-8859-1 or cp1252, followed by three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
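A short sketch of one such failing case, using only the core Encode module. The variable names are illustrative. The two characters "Ã©", when encoded as cp1252, produce exactly the bytes that UTF-8 uses for "é", so no byte-level check can tell which was intended.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# "Ã©" (U+00C3, U+00A9) encoded as cp1252 gives the bytes C3 A9.
my $cp1252_bytes = encode("cp1252", "\x{C3}\x{A9}");

# The same bytes are also valid UTF-8 for U+00E9 (é), so strict UTF-8
# decoding succeeds instead of signalling that the text was cp1252.
my $as_utf8   = decode("UTF-8",  $cp1252_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
my $as_cp1252 = decode("cp1252", $cp1252_bytes);

printf("bytes:     %v02X\n",   $cp1252_bytes);  # C3.A9
printf("as UTF-8:  U+%v04X\n", $as_utf8);       # U+00E9
printf("as cp1252: U+%v04X\n", $as_cp1252);     # U+00C3.00A9
```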
The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.
$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
Each line only uses one encoding
fix_latin
works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether the line is valid UTF-8.
$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }
      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A
Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:
The line is encoded using iso-8859-1 or cp1252,
At least one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
is present in the line,
All instances of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
are always followed by exactly one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of
[àáâãäåæçèéêëìíîï]
are always followed by exactly two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of
[ðñòóôõö÷]
are always followed by exactly three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿],
None of
[øùúûüýþÿ]
are present in the line, and
None of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>
¡¢£¤¥¦§¨©ª«¬<SHY>
®¯°±²³´µ¶·¸¹º»¼½¾¿]
are present in the line except where previously mentioned.
Notes:
- Encoding::FixLatin installs the command-line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
- fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
- The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
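As a rough sketch of how the per-line approach could be packaged for reuse, the helper below tries strict UTF-8 first and falls back to a single-byte encoding. The name decode_line and its optional second parameter are hypothetical. The fallback defaults to cp1252, and passing another encoding name that Encode knows covers the mixes with other single-byte encodings mentioned in the last note.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode one line of bytes, assuming the whole line
# is either valid UTF-8 or entirely in the fallback single-byte encoding.
sub decode_line {
   my ($bytes, $fallback) = @_;
   $fallback = "cp1252" if !defined($fallback);

   # FB_CROAK makes decode() die on malformed UTF-8;
   # LEAVE_SRC keeps $bytes intact for the fallback attempt.
   my $text = eval { decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
   return defined($text) ? $text : decode($fallback, $bytes);
}

# Sample lines: valid UTF-8, cp1252 with an accent, cp1252 curly quotes.
for my $line ("caf\xC3\xA9\n", "caf\xE9\n", "\x93hi\x94\n") {
   printf("U+%v04X\n", decode_line($line));
}
```

To turn this into a file converter, the sample loop could be replaced with one that reads lines from the input (for example with the <> operator) and prints each decoded line re-encoded as UTF-8.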