PHP输出编码与MySQL数据库中的UTF-8字符串有关

PHP输出编码与MySQL数据库中的UTF-8字符串有关

问题描述:

I know this question comes up in one form or another all the time on here, but I'm kind of at a loss on how to resolve it. I've got a PHP website that's running off of MySQL, that's showing some extended characters as a garbled mess. As far as I know it's all encoded as UTF-8, on every step from the content import to displaying it on the screen. Still, it's showing weird encoding issues. Here's the first test example (Natural Phënåm¥na, this is on purpose), which mb_detect_encoding identifies as UTF-8, which I can only get to display correctly with utf8_decode:

no utf8_decode: Natural Phënåm¥na
utf8_decode: Natural Phënåm¥na

Second example, which never even utf8_decodes properly (should be an ümlaut and “typographer's quotes” (extended characters added on purpose, as a test:

no utf8_decode: This pürson from “Vancouver, Canadaâ€
utf8_decode: This pürson from �??Vancouver, Canada�?�

My initial thought was it was doubly encoded, but I don't think that's what's going on. Everything is displaying correctly in MySQL when I do queries on the command line.

Here's a rundown of all the things I've investigated:

  • Content imported is verified to be UTF-8, imported with UTF-8 connection to MySQL
  • MySQL Database, tables, columns are UTF-8, utf_unicode_*
  • character_set_client, etc variables in MySQL set to utf8 on Amazon RDS
  • PHP PDO connection is UTF-8, NAME set to UTF-8
  • Both PHP header charset and HTML meta charset are UTF-8
  • mb_detect_encoding is returning UTF-8 for both strings

So after a few hours of troubleshooting, I'm kind of at a loss. On a whim I even tried setting the HTML header/meta and PHP headers to ISO-8559-1, but that's not doing the trick either.

I last spent a while battling with Amazon RDS to get the right variables set, but otherwise I'm out of ideas.

mysql> show variables like '%character%';
+--------------------------+-------------------------------------------+
| Variable_name            | Value                                     |
+--------------------------+-------------------------------------------+
| character_set_client     | utf8                                      |
| character_set_connection | utf8                                      |
| character_set_database   | utf8                                      |
| character_set_filesystem | utf8                                      |
| character_set_results    | utf8                                      |
| character_set_server     | utf8                                      |
| character_set_system     | utf8                                      |
| character_sets_dir       | /rdsdbbin/mysql-5.5.40.R1/share/charsets/ |
+--------------------------+-------------------------------------------+

So I'm wondering, are there steps I'm missing? Something obvious? Thanks in advance.

UPDATE

Here's my PHP output script, for further clarification on the "output" that I mentioned:

<?php header("Content-type: text/html; charset=utf-8"); ?>
<html>
<header>
    <meta charset="utf-8" />
    <title>My test</title>
</header>
    <body>
<?php


    try {
        $dbh = new PDO("mysql:host=localhost;dbname=database", 
        "user", "password", array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));
    }
    catch(PDOException $e) {
        echo $e->getMessage();
    }

    $sth = $dbh->prepare("my select statement");
$sth->execute();
$rows = $sth->fetchAll(PDO::FETCH_ASSOC);


foreach ($rows as $row) {
    echo mb_detect_encoding($row['name']);
    echo "<br>no utf8 decode: ". $row['name'] . "<br>
";
    echo "single utf8 decode: ". utf8_decode($row['name']) . "<br>
";
    echo "no utf8 decode: ". $row['description'] . "<br>
";
    echo "single utf8 decode: ". (utf8_decode($row['description'])) . "<br>
";
}

?>
</body>
</html>

UPDATE #2 I tried also just outputting these same characters into the browser directly from a PHP echo, and straight static HTML, and the characters display perfectly fine.

echo "“test ü ö”<br>"; ?>
<p>“test ü ö”</p>

You should not change all the character_set% fields, just the three that are affected by SET NAMES utf8;.

Don't use utf8_encode or decode.

You have probably messed up when storing.

This seems to recover the characters, but this not a viable fix:

CONVERT(CAST(CONVERT('pürson from “Vancouver, Canadaâ€' USING latin1)
             AS BINARY)
        USING utf8)
--> 'pürson from “Vancouver, Canada - spec',

In order to figure out what was done, please provide

SELECT col, HEX(col) FROM tbl WHERE ...

for some cell that is not rendering properly.

You mentioned that it is all in utf-8 in all the data flow, except when it is rendered on screen. I'm assuming on a browser, not a console. If it is so, check if the html has the <meta charset="utf-8"> inside the <head> tag. Like in the html5 boilerplate https://github.com/h5bp/html5-boilerplate/blob/master/dist/index.html

So it looks like somehow on the MySQL level it was double-encoding UTF-8 characters in some of these fields. I was finally able to ascertain it via this great blog post Getting out of MySQL Character Set Hell. Not 100% clear if it's being "double-encoded" when it's sent from Python, or when it hits the PHP API, but it's 90% of the answer, right there.