PHP - fgetcsv - Quote marks lost on input in a UTF-8 encoded web application

Problem description:

I'm trying to import data from CSV files into a web app that uses UTF-8 encoding. I'm using fgetcsv (though I don't have to, if there's a better way), and utf8_encode to try to convert characters from whatever encoding the file is in. When I call mb_detect_encoding on strings that come out of this particular file, I get 'ASCII'.

There are a few strange characters in the input. utf8_encode deals happily with é characters (where before they were coming out as black diamond question marks). However, it fails to translate double quotes and apostrophes, and instead just removes them.

Help much appreciated, thanks. I'm using CakePHP, in case that gives me some more options!

Edit - I meant utf8_encode, not utf8_decode.

You only need one call to iconv with the correct charset for the $in_charset parameter.

$utf8Text = iconv($inputCharset, 'UTF-8', $text);

You need to know the input charset; there's no way around it. Either specify that all input must be in ISO-8859-1 (or whatever you prefer), find out what charset your input actually uses (ask the author, check it yourself in an editor, whatever), or require that the input declares its encoding somewhere, somehow.
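
As a rough sketch of how this could fit together with fgetcsv: the hypothetical helper below (readCsvAsUtf8 is not part of PHP or CakePHP) reads rows from an open handle and runs every field through iconv. It assumes the input charset is already known; Windows-1252 is used only as an illustration, and the sample data is made up.

```php
<?php
// Hypothetical helper: read a CSV stream and convert every field to UTF-8.
// $inputCharset has to be known in advance; it is not detected here.
function readCsvAsUtf8($handle, string $inputCharset): array
{
    $rows = [];
    while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($fields === [null]) {
            continue; // fgetcsv returns [null] for a blank line
        }
        $rows[] = array_map(
            fn($field) => iconv($inputCharset, 'UTF-8', $field),
            $fields
        );
    }
    return $rows;
}

// Made-up Windows-1252 sample: 0xE9 is é and 0x92 is a curly apostrophe.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "name,note\nCaf\xE9,\"What\x92s your name?\"\n");
rewind($handle);

$rows = readCsvAsUtf8($handle, 'Windows-1252'); // assumed charset
fclose($handle);
```

With this sample, the fields in $rows come back as UTF-8 with both the é and the curly apostrophe intact. The PHP manual notes that fgetcsv takes the locale into account, so if fields already look mangled before conversion, the LC_CTYPE setting may be worth checking.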

Encoding is not black magic. You just need to be aware of what encoding some text is in and what encoding you want it to be in. Then use a function like iconv that can cleanly translate the characters from one encoding to another. utf8_encode and utf8_decode translate between ISO-8859-1 and UTF-8. Their names are chosen terribly, since they suggest they can automagically translate anything from and to UTF-8, but that's not the case.
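
To make the point about utf8_encode concrete, here is a small sketch that converts the same byte from two different source charsets. It assumes, purely for illustration, that the file is really Windows-1252, where byte 0x92 is a curly apostrophe. In ISO-8859-1, the only input utf8_encode understands, that byte is an invisible control character, which would be consistent with quotes seeming to disappear.

```php
<?php
// Byte 0x92: a right single quotation mark in Windows-1252,
// but the C1 control character U+0092 in ISO-8859-1.
$raw = "What\x92s your name?";

// Treating the input as ISO-8859-1, which is what utf8_encode always assumes.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Conversion using the (assumed) real charset of the file.
$asCp1252 = iconv('Windows-1252', 'UTF-8', $raw);

echo bin2hex(substr($asLatin1, 4, 2)), "\n"; // c292   (U+0092, not printable)
echo bin2hex(substr($asCp1252, 4, 3)), "\n"; // e28099 (U+2019, curly apostrophe)
```

Both results are valid UTF-8, so a validity check alone would not reveal which conversion was the right one.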

You can work around the strange characters with the function below, which converts every multi-byte UTF-8 sequence into a numeric HTML entity:

function htmlallentities($str){
  $res = '';
  $strlen = strlen($str);
  for($i=0; $i<$strlen; $i++){
    $byte = ord($str[$i]);
    if($byte < 128) // 1-byte char
      $res .= $str[$i];
    elseif($byte < 192); // invalid utf8
    elseif($byte < 224) // 2-byte char
      $res .= '&#'.((63&$byte)*64 + (63&ord($str[++$i]))).';';
    elseif($byte < 240) // 3-byte char
      $res .= '&#'.((15&$byte)*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
    elseif($byte < 248) // 4-byte char
      $res .= '&#'.((15&$byte)*262144 + (63&ord($str[++$i]))*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
  }
  return $res;
}

For example, for a curly apostrophe (’) I used the following code snippet:

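$value = "What’s your name?";
$value = htmlallentities($value); // input must already be UTF-8; gives "What&#8217;s your name?"
$str = "&#8217;"; // numeric entity for U+2019, the right single quotation mark
$str2 = "'";
$value = str_replace($str, $str2, $value);
$value = mysql_real_escape_string($value); // old mysql extension only; removed in PHP 7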
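
If the mbstring extension is available, the same kind of numeric entities can probably be produced without a hand-written loop. The sketch below is an alternative, not the method used above; the conversion map values are my own choice, and it assumes the input is already valid UTF-8.

```php
<?php
// Convert every code point from U+0080 upward into a decimal entity.
// Map format: [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$value   = "What\u{2019}s your name?"; // U+2019 is the curly apostrophe
$encoded = mb_encode_numericentity($value, $convmap, 'UTF-8');

// Swap the entity for a plain ASCII apostrophe.
$plain = str_replace('&#8217;', "'", $encoded);

echo $encoded, "\n"; // What&#8217;s your name?
echo $plain, "\n";   // What's your name?
```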

I'll be glad if these help you.