

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as “, – and ’ (they display as ?).


Is it possible to convert these characters from UTF-8 to ISO-8859-1?


Here is a snippet of code I have written to attempt this:

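// Decode the incoming bytes as UTF-8 while reading
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();

String line = null;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();

// Characters with no ISO-8859-1 equivalent are replaced here (typically with '?')
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

// Note: this decodes the bytes using the platform default charset
return new String(latin1);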

I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with

byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");


I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
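As a quick way to check that hunch, here is a minimal sketch, assuming Java 9+ (for `Map.of`). The class name `SmartPunctuation` and its small replacement table are hypothetical, not part of any standard API. It tests whether `java.text.Normalizer` changes the text at all, and falls back to a hand-written mapping of common "smart" punctuation to plain equivalents.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper: maps a few "smart" punctuation marks to plain ASCII.
final class SmartPunctuation {
  // Deliberately small, illustrative table; extend it for your own data.
  private static final Map<Character, String> REPLACEMENTS = Map.of(
      '\u2018', "'",   // left single quotation mark
      '\u2019', "'",   // right single quotation mark
      '\u201C', "\"",  // left double quotation mark
      '\u201D', "\"",  // right double quotation mark
      '\u2013', "-",   // en dash
      '\u2014', "-");  // em dash

  private SmartPunctuation() {}

  // Replaces each mapped character; everything else is copied unchanged.
  static String toPlain(CharSequence text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      String replacement = REPLACEMENTS.get(ch);
      if (replacement != null) {
        out.append(replacement);
      } else {
        out.append(ch);
      }
    }
    return out.toString();
  }

  // True if compatibility normalization (NFKC) leaves the text as it was.
  static boolean normalizerLeavesUnchanged(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFKC).equals(text);
  }
}
```

For a string made only of curly quotes and dashes, `normalizerLeavesUnchanged` should return true, which is why the explicit table is needed. This approach loses the typographic distinction, so it only suits cases where plain punctuation is acceptable.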

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using numeric character references (escape sequences) as shown here:

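public final class HtmlEncoder {
  private HtmlEncoder() {}

  // Copies Basic Latin characters as-is and writes everything else
  // as a hexadecimal numeric character reference (&#x...;).
  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint).toUpperCase());
        out.append(";");
      }
    }
    return out;
  }
}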


String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());

Above, the character LEFT DOUBLE QUOTATION MARK ( U+201C ) is encoded as &#x201C;. A couple of other arbitrary code points are likewise encoded.
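The encoder above escapes everything outside Basic Latin, including characters such as é that ISO-8859-1 can represent directly. One possible variation, sketched here under the assumption that you only want to escape what the target charset cannot hold, asks a `CharsetEncoder` whether each character is encodable. The class name `Latin1HtmlEncoder` is hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical variation: escape only characters ISO-8859-1 cannot encode.
final class Latin1HtmlEncoder {
  private Latin1HtmlEncoder() {}

  static String escapeUnencodable(CharSequence sequence) {
    CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
    StringBuilder out = new StringBuilder(sequence.length());
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (encoder.canEncode(ch)) {
        // Representable in ISO-8859-1: copy unchanged.
        out.append(ch);
      } else {
        // Not representable (including surrogates): read the full code point,
        // skip its second char if it is a surrogate pair, and emit a reference.
        int codepoint = Character.codePointAt(sequence, i);
        i += Character.charCount(codepoint) - 1;
        out.append("&#x")
            .append(Integer.toHexString(codepoint).toUpperCase())
            .append(';');
      }
    }
    return out.toString();
  }
}
```

With this version, text like "café" passes through untouched while a smart quote still becomes a character reference. The trade-off is that the output is no longer pure ASCII, so the page really must be served as ISO-8859-1.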


Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
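To make the ordering concrete, here is a minimal self-contained sketch. The class `HtmlPipeline` and its `escapeMarkup` method are hypothetical stand-ins for whatever HTML escaping you already use, and `escapeNonLatin` is a simplified String-returning copy of the idea above. Markup characters are escaped first, and only then are non-Latin characters turned into references.

```java
// Hypothetical pipeline showing the order of the two escaping steps.
final class HtmlPipeline {
  private HtmlPipeline() {}

  // Step 1: escape characters that are special in HTML text and attributes.
  static String escapeMarkup(CharSequence text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      switch (ch) {
        case '&': out.append("&amp;"); break;
        case '<': out.append("&lt;"); break;
        case '>': out.append("&gt;"); break;
        case '"': out.append("&quot;"); break;
        default: out.append(ch);
      }
    }
    return out.toString();
  }

  // Step 2: replace everything outside Basic Latin with &#x...; references.
  static String escapeNonLatin(CharSequence text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(text, i);
        i += Character.charCount(codepoint) - 1;
        out.append("&#x")
            .append(Integer.toHexString(codepoint).toUpperCase())
            .append(';');
      }
    }
    return out.toString();
  }

  // Correct order: markup escaping first, then non-Latin escaping.
  static String toLatin1Html(CharSequence text) {
    return escapeNonLatin(escapeMarkup(text));
  }
}
```

Reversing the two calls would pass the freshly written `&#x201C;` through the markup step, turning its ampersand into `&amp;` so the browser shows the reference as literal text.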