How to handle different encodings of xls files in PHP?

Question:

I'm developing a PHP script that parses data from xls files, using the phpexcelreader library. Mostly it works, but I stumbled upon a strange problem: some files are parsed incorrectly. It looks like xls files may use different character encodings internally. At least, when I pipe the output of my script through iconv -f cp1251 -t utf8, the strings are corrected.

Phpexcelreader has an option for specifying the output encoding, but it seems to lack the ability to detect the input encoding. Any ideas?

The _defaultEncoding property of the workbook object can be set to the charset used by the Excel file; the reader then uses it when handling conversion to UTF-16LE. However, the reader makes no effort to identify the file's internal charset itself.

You can add the detection yourself. First define

define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE',  0x0042);

among the other SPREADSHEET_EXCEL_READER_TYPE definitions, then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:

$length = $this->_GetInt2d($this->_data, $pos + 2);
$recordData = substr($this->_data, $pos + 4, $length);

// move stream pointer to next record
$pos += 4 + $length;

// offset: 0; size: 2; code page identifier
$codepage = $this->_GetInt2d($recordData, 0);
$codepage = $this->_CodePageNumberToName($codepage);

Recreate the _GetInt2d method (which seems to have been stripped from the code at some point) as

function _GetInt2d($data, $pos)
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}
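To make the record layout concrete, here is a minimal stand-alone sketch of the same little-endian read applied to a hand-built CODEPAGE record. The function name getInt2d and the sample bytes are illustrative assumptions, not part of phpexcelreader.

```php
<?php
// Hypothetical stand-alone copy of the helper above, outside the reader class.
function getInt2d(string $data, int $pos): int
{
    // Low byte first, then high byte shifted up: little-endian 16-bit integer.
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

// Hand-built sample record: type 0x0042, payload length 2, code page 1251 (0x04E3).
$record = "\x42\x00" . "\x02\x00" . "\xE3\x04";

$type     = getInt2d($record, 0);
$length   = getInt2d($record, 2);
$codepage = getInt2d(substr($record, 4, $length), 0);

printf("type=0x%04X length=%d codepage=%d\n", $type, $length, $codepage);
// prints: type=0x0042 length=2 codepage=1251
```

PHP's built-in unpack('v', ...) reads the same unsigned little-endian 16-bit value, so it could replace the hand-written helper if you prefer.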

and create a _CodePageNumberToName method to return the codepage name from its numeric value:

function _CodePageNumberToName($codePage = '1252')
{
    switch ($codePage) {
        case 367:   return 'ASCII';     break;  //  ASCII
        case 437:   return 'CP437';     break;  //  OEM US
        case 720:   throw new Exception('Code page 720 not supported.');
                                        break;  //  OEM Arabic
        case 737:   return 'CP737';     break;  //  OEM Greek
        case 775:   return 'CP775';     break;  //  OEM Baltic
        case 850:   return 'CP850';     break;  //  OEM Latin I
        case 852:   return 'CP852';     break;  //  OEM Latin II (Central European)
        case 855:   return 'CP855';     break;  //  OEM Cyrillic
        case 857:   return 'CP857';     break;  //  OEM Turkish
        case 858:   return 'CP858';     break;  //  OEM Multilingual Latin I with Euro
        case 860:   return 'CP860';     break;  //  OEM Portuguese
        case 861:   return 'CP861';     break;  //  OEM Icelandic
        case 862:   return 'CP862';     break;  //  OEM Hebrew
        case 863:   return 'CP863';     break;  //  OEM Canadian (French)
        case 864:   return 'CP864';     break;  //  OEM Arabic
        case 865:   return 'CP865';     break;  //  OEM Nordic
        case 866:   return 'CP866';     break;  //  OEM Cyrillic (Russian)
        case 869:   return 'CP869';     break;  //  OEM Greek (Modern)
        case 874:   return 'CP874';     break;  //  ANSI Thai
        case 932:   return 'CP932';     break;  //  ANSI Japanese Shift-JIS
        case 936:   return 'CP936';     break;  //  ANSI Chinese Simplified GBK
        case 949:   return 'CP949';     break;  //  ANSI Korean (Wansung)
        case 950:   return 'CP950';     break;  //  ANSI Chinese Traditional BIG5
        case 1200:  return 'UTF-16LE';  break;  //  UTF-16 (BIFF8)
        case 1250:  return 'CP1250';    break;  //  ANSI Latin II (Central European)
        case 1251:  return 'CP1251';    break;  //  ANSI Cyrillic
        case 0:                                 //  CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
        case 1252:  return 'CP1252';    break;  //  ANSI Latin I (BIFF4-BIFF7)
        case 1253:  return 'CP1253';    break;  //  ANSI Greek
        case 1254:  return 'CP1254';    break;  //  ANSI Turkish
        case 1255:  return 'CP1255';    break;  //  ANSI Hebrew
        case 1256:  return 'CP1256';    break;  //  ANSI Arabic
        case 1257:  return 'CP1257';    break;  //  ANSI Baltic
        case 1258:  return 'CP1258';    break;  //  ANSI Vietnamese
        case 1361:  return 'CP1361';    break;  //  ANSI Korean (Johab)
        case 10000: return 'MAC';       break;  //  Apple Roman
        case 32768: return 'MAC';       break;  //  Apple Roman
        case 32769: throw new Exception('Code page 32769 not supported.');
                                        break;  //  ANSI Latin I (BIFF2-BIFF3)
        case 65001: return 'UTF-8';     break;  //  Unicode (UTF-8)
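    }
    //  No case matched: fail loudly instead of silently returning null
    throw new Exception('Unknown codepage: ' . $codePage);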
}

Finally, store the returned value in the $_defaultEncoding property.
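Once the charset name is known, converting a single-byte string is an ordinary iconv call. The sketch below is only an illustration: cellToUtf8 is a hypothetical helper, and the sample bytes stand in for a cell read from a CP1251 workbook.

```php
<?php
// Hypothetical helper: convert a single-byte string read from the workbook
// to UTF-8, using the charset name derived from the CODEPAGE record.
function cellToUtf8(string $raw, string $charset): string
{
    $converted = iconv($charset, 'UTF-8', $raw);
    if ($converted === false) {
        // iconv returns false when it cannot convert the input.
        throw new RuntimeException("Cannot convert from $charset");
    }
    return $converted;
}

// The Russian word "Привет" as CP1251 bytes.
$raw = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo cellToUtf8($raw, 'CP1251'), "\n";
```

This is the same conversion the question performs on the command line with iconv -f cp1251 -t utf8, except that the source charset comes from the file rather than being hard-coded.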

Alternatively, switch to an Excel reader that can handle the codepage correctly in the first place.

My 2 cents:

I just replaced the _encodeUTF16 method with this one:

function _encodeUTF16($string, $check = false) {
    if ($check) {
        // Single-byte string: detect its encoding before converting
        $from = api_detect_encoding($string);
        $string = api_convert_encoding($string, $this->_defaultEncoding, $from);
        return $string;
    }
    $string = api_convert_encoding($string, $this->_defaultEncoding, 'UTF-16LE');
    return $string;
}

and changed line 568 to:

$retstr = ($asciiEncoding) ? $this->_encodeUTF16($retstr, true) : $this->_encodeUTF16($retstr);

The functions api_detect_encoding and api_convert_encoding can be found in this library:

http://code.google.com/p/chamilo/source/browse/main/inc/lib/internationalization.lib.php?repo=classic
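If pulling in that library is not an option, a much cruder fallback can be sketched with built-in functions only. This is not what api_detect_encoding does; it is a swapped-in heuristic that assumes the input is either valid UTF-8 or in one known single-byte charset. The helper name and the CP1251 default are assumptions for illustration.

```php
<?php
// Hypothetical helper, not part of phpexcelreader or Chamilo.
function toUtf8WithFallback(string $raw, string $fallbackCharset = 'CP1251'): string
{
    // With the /u modifier, preg_match returns 1 only if the subject is valid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }
    // Otherwise assume the caller-supplied single-byte charset.
    $converted = iconv($fallbackCharset, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallbackCharset");
    }
    return $converted;
}

echo toUtf8WithFallback("\xCF\xF0\xE8\xE2\xE5\xF2"), "\n"; // CP1251 bytes
echo toUtf8WithFallback('Привет'), "\n";                   // already UTF-8
```

The check is only a heuristic: some single-byte sequences happen to be valid UTF-8 as well, so reading the code page from the file remains the more reliable route.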

For Persian, I added one line after line 568 (line 336 in another version):

$retstr = ($asciiEncoding) ? $retstr : $this->_encodeUTF16($retstr);

$retstr=iconv("UTF-16LE","UTF-8", $retstr);

This code supports Persian, but English text no longer works.
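English text probably breaks because the added iconv call treats every string as UTF-16LE, including those the reader has flagged as single-byte through $asciiEncoding. A minimal sketch of a conversion that keeps both cases working is shown below. The function name, its parameters, and the CP1252 default are assumptions for illustration, not phpexcelreader API.

```php
<?php
// Hypothetical helper: pick the source charset per string instead of
// assuming UTF-16LE for everything.
function excelStringToUtf8(
    string $raw,
    bool $asciiEncoding,
    string $singleByteCharset = 'CP1252'
): string {
    // Single-byte strings use the workbook charset; the rest are UTF-16LE.
    $from = $asciiEncoding ? $singleByteCharset : 'UTF-16LE';
    $converted = iconv($from, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $from");
    }
    return $converted;
}

// The Persian word "salam" (U+0633 U+0644 U+0627 U+0645) as UTF-16LE bytes.
$persian = "\x33\x06\x44\x06\x27\x06\x45\x06";
echo excelStringToUtf8($persian, false), "\n";
echo excelStringToUtf8('Hello', true), "\n";
```

In the reader itself, the equivalent change would be to apply the UTF-16LE conversion only in the branch where $asciiEncoding is false.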