How can I convert this escaped UTF-8 string from an Amazon MWS response into proper UTF-8?

Problem description:

In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:

<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>

The name is supposed to be Ramírez. The accented character í is Unicode code point U+00ED, which UTF-8 encodes as the two bytes \xC3\xAD.
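To make the byte-level claim concrete, here is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escape. The variable names are illustrative only.

```php
<?php
// U+00ED (í) written as a code point escape, so the source file encoding does not matter.
$iAcute = "\u{00ED}";

// UTF-8 stores this single code point as two bytes: 0xC3 0xAD.
$hex = bin2hex($iAcute);
echo $hex, "\n"; // c3ad

// strlen() counts bytes, so one character reports a length of 2.
$byteLength = strlen($iAcute);
echo $byteLength, "\n"; // 2
```

The escaped element in the response therefore carries one character reference per UTF-8 byte rather than one reference per character.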

However, PHP's SimpleXML mangles this string, transforming it into

RamÃ­rez Jones

(that is, Ã followed by an invisible soft hyphen, U+00AD). You can see the same thing happen if I simply paste the escaped string into the editor box here (evidently this site's ASP.NET underpinnings do the same thing as PHP).

When this mangled string is saved to MongoDB and then read back, it becomes

RamÃ-­rez Jones

For some reason a visible hyphen is inserted there, although believe it or not, if you select the text above and paste it back into an editor window on this site, it simply appears as RamÃ­rez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!

Here is some example code to show this problem:

$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
// Cast to string so we inspect the text content, not the element object.
$bad_string = (string) $elem->Name;
echo mb_detect_encoding($bad_string) . "\n";
echo $elem->Name->__toString() . "\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());

Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:

UTF-8
RamÃ­rez Jones
RamA-rez Jones

How can we avoid this problem? It's really screwing things up.

EDIT:

Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off the í).

REVISED FINAL SOLUTION:

It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:

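echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($xml, ENT_COMPAT, 'ISO-8859-1'));

This works because "&#xC3;&#xAD;" are numeric character references, which html_entity_decode() turns into the raw bytes \xC3\xAD when it targets ISO-8859-1. That was the default charset before PHP 5.4; later versions default to UTF-8, so the charset is passed explicitly here. Those two bytes are exactly the UTF-8 encoding of í.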
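As a rough illustration of why the target charset of html_entity_decode() matters for this trick, the sketch below decodes the same two references twice. The variable names are made up for the example, and the explicit ENT_COMPAT flag is only there to keep the call uniform across PHP versions.

```php
<?php
$escaped = '&#xC3;&#xAD;';

// Targeting ISO-8859-1: each reference becomes one byte, giving the UTF-8 bytes of í.
$asLatin1 = html_entity_decode($escaped, ENT_COMPAT, 'ISO-8859-1');

// Targeting UTF-8: each reference becomes its own character (Ã and a soft hyphen).
$asUtf8 = html_entity_decode($escaped, ENT_COMPAT, 'UTF-8');

echo bin2hex($asLatin1), "\n"; // c3ad
echo bin2hex($asUtf8), "\n";   // c383c2ad
```

Only the first result can be fed to iconv() as UTF-8 and come out as a plain "i".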

ALTERNATE SOLUTION

Strangely, this also works:

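// Declare the document as ISO-8859-1 so that saveXML() writes U+00C3 and U+00AD as the single bytes C3 AD.
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml = str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>', $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
// Read those bytes as UTF-8 (C3 AD is í) and transliterate to ASCII.
$xml = iconv('UTF-8', 'ASCII//TRANSLIT', $domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;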

SimpleXML does not decode the hex entities into bytes and then interpret those bytes as UTF-8, because that is not how XML or UTF-8 actually works: in XML, &#xC3; denotes the code point U+00C3 (Ã), not the byte 0xC3. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.

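// Turn hex character references in the range 0x80-0xFF back into raw bytes.
// Only these values occur as bytes of multi-byte UTF-8 sequences; ASCII
// references such as &#x26; are left alone for the XML parser to handle.
function decode_hexentities($xml) {
  return
    preg_replace_callback(
      '~&#x([89a-f][0-9a-f]);~i',
      function ($matches) { return chr(hexdec($matches[1])); },
      $xml
    );
}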

$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
// Repair the escaped bytes before the XML parser sees them.
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = (string) $elem->Name;
echo mb_detect_encoding($bad_string) . "\n";
echo $elem->Name->__toString() . "\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());

results in:

UTF-8
Ramírez Jones
Ramirez Jones
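For comparison, here is a small sketch of what SimpleXML returns for the unrepaired input. It inspects the four bytes that follow "Ram"; the variable names are illustrative.

```php
<?php
$xml = '<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$name = (string) (new SimpleXMLElement($xml))->Name;

// &#xC3; is U+00C3 (Ã) and &#xAD; is U+00AD (soft hyphen). Each is stored
// as two UTF-8 bytes, so four bytes appear where two were intended.
$hex = bin2hex(substr($name, 3, 4));
echo $hex, "\n"; // c383c2ad
```

The parser is behaving correctly here: a character reference names a code point, so the result is two characters rather than the single í.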

It does not work because the character is encoded twice: first as UTF-8 bytes, and then each byte as an XML character reference. The character í has the code point U+00ED and should be encoded in XML as &#xED;.

You can fix its encoding using either:

$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());

or

$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');

UPDATE:

Both ways suggested above work to fix the encoding. They actually convert the string from UTF-8 to ISO-8859-1, which maps each character back to a single byte and so incidentally fixes the issue at hand.
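A minimal sketch of that effect, assuming PHP 7 or later and the mbstring extension; the mangled string is built from code point escapes so it does not depend on the file encoding.

```php
<?php
// The mangled name as SimpleXML returns it: "Ram", U+00C3, U+00AD, "rez Jones".
$mangled = "Ram\u{00C3}\u{00AD}rez Jones";

// Mapping each code point back to a single ISO-8859-1 byte yields C3 AD,
// which a UTF-8 reader interprets as í.
$fixed = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');

var_dump($fixed === "Ram\u{00ED}rez Jones"); // bool(true)
```

The resulting bytes are nominally ISO-8859-1, but they are treated as UTF-8 from this point on.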

The solution provided by @Hazzit also works.

The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
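One possible way to approach the detection is sketched below. fix_double_encoded_utf8() is a hypothetical helper, not part of PHP or of Amazon's API, and it is only a heuristic: it undoes the double encoding when every non-ASCII character fits in one byte and the resulting bytes form well-formed UTF-8, and returns the input unchanged otherwise.

```php
<?php
/**
 * Hypothetical helper: undo one level of double encoding, but only when the
 * input looks double encoded. Otherwise return the input unchanged.
 */
function fix_double_encoded_utf8(string $s): string
{
    // Pure ASCII has nothing to fix.
    if (preg_match('/[^\x00-\x7F]/', $s) !== 1) {
        return $s;
    }
    // A code point above U+00FF cannot be a byte in disguise. preg_match()
    // returns false for invalid UTF-8, which also lands in this branch.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s) !== 0) {
        return $s;
    }
    // Map every code point back to a single byte.
    $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Accept the result only if those bytes form well-formed UTF-8.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}

$wasFixed  = fix_double_encoded_utf8("Ram\u{00C3}\u{00AD}rez Jones"); // double encoded
$untouched = fix_double_encoded_utf8("Ram\u{00ED}rez Jones");         // already correct
```

Both calls end up with the correctly encoded name. A string that legitimately contains a sequence such as Ã followed by a soft hyphen would be altered too, so this is a sketch to adapt rather than a complete answer.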

Stripping the accents with minimum loss of information

After you fix the encoding, use iconv() to replace the accented letters with similar letters from the ASCII subset, as you already did in the sample code. Of the functions used here, only iconv() can transliterate.

$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);

An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.