Ruby: converting string encoding from ISO-8859-1 to UTF-8 not working

Problem description:

I am trying to convert a string from ISO-8859-1 encoding to UTF-8, but I can't seem to get it to work. Here is an example of what I did in irb.

irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"

I am not sure why Norrlandsvägen in ISO-8859-1 gets converted into NorrlandsvÃ¤gen in UTF-8.

I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird workarounds I could think of, but nothing seems to work. Can someone please help me or point me in the right direction?

Ruby newbie here, still pulling my hair out like crazy, but grateful for all the replies... :)

Background of this question: I am writing a gem that will download an XML file from some websites (which use ISO-8859-1 encoding) and save it to storage, and I would like to convert it to UTF-8 first. But words like Norrlandsvägen keep coming out as NorrlandsvÃ¤gen. Really, any help would be greatly appreciated!

[UPDATE]: I realized that running tests like this in the irb console might give me different behavior, so here is what I have in my actual code:

def convert_encoding(string, originalEncoding) 
  puts "#{string.encoding}" # ASCII-8BIT
  string.encode(originalEncoding)
  puts "#{string.encoding}" # still ASCII-8BIT
  string.encode!('utf-8')
end

But the last line gives me the following error:

Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8

Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:

irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"

I have also tried assigning the result of string.encode(originalEncoding) to a new variable, but got an even weirder error:

newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')

The error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1

I am still quite lost in all of this encoding mess, but I am really grateful for all the replies and the help everyone has given me! Thanks a ton! :)

Answer:

You assign a string literal, which Ruby stores as UTF-8. It contains ä, and UTF-8 represents ä with two bytes.

string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# => 1
string.bytes
# => [195, 164]

Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. The string no longer contains ä. It now contains two characters, Ã and ¤.

string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# => 2
string.bytes
# => [195, 164]

Then you translate that into UTF-8. Since this is a translation rather than a reinterpretation, you keep the two characters, but they are now encoded in UTF-8:

string = string.encode('utf-8')
# => "Ã¤"
string.length
# => 2
string.bytes
# => [195, 131, 194, 164]
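
If a string has already been garbled this way, the two steps can in principle be run in reverse. The following is only a sketch of that idea, not something from the original exchange; the variable names are made up for illustration, and it assumes the damage came from exactly one mistaken ISO-8859-1 relabel followed by one transcode to UTF-8.

```ruby
# "\u00E4" is ä, written as an escape so the example does not depend on
# how this file itself is encoded.
original = "\u00E4"

# Reproduce the mistake: relabel the UTF-8 bytes as ISO-8859-1, then transcode.
garbled = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')
garbled_bytes = garbled.bytes   # the UTF-8 encodings of Ã and ¤

# Reverse it: transcode back to ISO-8859-1 (which recovers the two original
# bytes), then relabel those bytes as the UTF-8 they always were.
repaired = garbled.encode('ISO-8859-1').force_encoding('UTF-8')

puts garbled.length         # 2
puts repaired.length        # 1
puts repaired == original   # true
```

This only works when every garbled character exists in ISO-8859-1; otherwise the first encode call raises an error.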

What you are missing is that, in your console test, you never had an ISO-8859-1 string to begin with, the way you would from your web service. You had UTF-8 bytes mislabeled as ISO-8859-1, which is gibberish. Fortunately, this only affects your console tests; if you read the response from the website using the proper input encoding, it should all work.

For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:

string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
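
To connect this to the ASCII-8BIT error from the question, here is a rough sketch of what probably happens with downloaded data. The `raw` string is a hypothetical stand-in for a response body: it holds the ISO-8859-1 bytes of the word but is tagged as binary, which is the assumption being illustrated here.

```ruby
# Hypothetical stand-in for a downloaded body: the ISO-8859-1 bytes of
# "Norrlandsvägen" (ä is the single byte 0xE4), tagged as binary.
raw = "Norrlandsv\xE4gen".b

# Transcoding straight from binary fails, because Ruby cannot know which
# character the byte 0xE4 is supposed to be.
direct_failed = false
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  direct_failed = true
  puts "direct encode failed: #{e.message}"
end

# Relabel first (no bytes change), then transcode (bytes do change).
utf8 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8.length     # 14 characters
puts utf8.bytesize   # 15 bytes, since ä now takes two
```

The `dup` keeps `raw` untouched so the two approaches can be compared side by side; in real code, relabeling the body in place is usually fine.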

EDIT: For your specific problem, this should work:

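require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  # VERIFY_NONE skips certificate checks; prefer VERIFY_PEER outside quick tests.
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
# The body arrives tagged as binary (ASCII-8BIT). Relabel the bytes as the
# ISO-8859-1 they really are, then transcode them to UTF-8.
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')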
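
The same two steps could be wrapped in a helper shaped like the convert_encoding method from the question. This is a minimal sketch rather than a drop-in fix: the parameter name `original_encoding` and the sample `raw` bytes are assumptions for illustration, and it presumes the caller knows which encoding the bytes are really in.

```ruby
# Hypothetical rewrite of the question's convert_encoding helper.
# original_encoding names the encoding the bytes are actually in.
def convert_encoding(string, original_encoding)
  # dup leaves the caller's string untouched; force_encoding only relabels
  # the bytes; encode then returns a new, transcoded UTF-8 string.
  string.dup.force_encoding(original_encoding).encode('UTF-8')
end

raw = "Norrlandsv\xE4gen".b   # simulated ISO-8859-1 download, tagged binary
converted = convert_encoding(raw, 'ISO-8859-1')

puts converted.encoding          # UTF-8
puts converted.valid_encoding?   # true
```

Note that encode (without the bang) returns a new string instead of changing the receiver, so its result has to be used or assigned.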