Make my script UTF-8 compatible?
I have quite a long script which involves chopping lots of large text files into individual words and processing them.
I lowercase everything then remove all characters except for letters and spaces with:
$content=preg_replace('/[^a-z\s]/', '', $content); // Remove non-letters
This is then exploded, and each word goes into an associative array as the key, with the number of occurrences as the value:
$words=array_count_values($content);
I want to convert the script to be able to work with languages other than English. Is PHP going to be OK with this? Can I use UTF-8 characters as array keys? And how would I preg_replace to remove everything except letters from any language? (All numbers, punctuation and random characters still need to be removed.)
Yes you can use UTF-8 characters as keys (is there anything that can't be a key in a PHP array? :)). Your regexp might look something like:
/\pL+/u
EDIT: Sorry, should be:
/[^\pL\p{Zs}]/u
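To show how that pattern could slot into the pipeline from the question, here is a minimal sketch. The function name count_words is hypothetical, and the sketch assumes the mbstring extension is available and the input is valid UTF-8. It uses \s rather than \p{Zs} so that tabs and newlines also survive as word separators.

```php
<?php
// Hypothetical helper: count word frequencies in a UTF-8 string.
function count_words(string $text): array
{
    // mb_strtolower lowercases non-ASCII letters too, unlike strtolower.
    $text = mb_strtolower($text, 'UTF-8');

    // Drop everything that is not a letter (\pL) or whitespace.
    $text = preg_replace('/[^\pL\s]/u', '', $text);

    // Split on runs of whitespace, discarding empty pieces.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keys are the UTF-8 words, values are their counts.
    return array_count_values($words);
}

print_r(count_words("Héllo, héllo wörld! 123"));
```

Note that \pL does not cover combining marks (\p{M}), so text stored in decomposed form would lose its accents with this pattern.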
This should work, for both your problems.
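<?php
$string = "Héllø";

// Byte-oriented ASCII pattern: the multibyte letters are stripped.
echo preg_replace('/[^a-z\s]/i', '', $string) . "\n";   // Hll

// Unicode-aware pattern: any letter (\p{L}) and whitespace are kept.
echo preg_replace('/[^\p{L}\s]/u', '', $string) . "\n"; // Héllø

// UTF-8 strings work as array keys.
$arr = array(
    $string => 5
);
print_r($arr);
?>
In the preg_replace patterns, the u flag makes PCRE treat the pattern and the subject as UTF-8, and the i flag makes matching case-insensitive. \p{L} matches any Unicode letter, so the negated class removes everything except letters and whitespace.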
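One practical consequence of the u flag is that malformed UTF-8 input makes preg_replace() return null instead of a string. The sketch below is one way to surface that; strip_non_letters is a hypothetical name, and the letters-and-whitespace pattern is an assumption about what the script needs.

```php
<?php
// Hypothetical wrapper: strip non-letters, failing loudly on bad input.
function strip_non_letters(string $text): string
{
    $result = preg_replace('/[^\p{L}\s]/u', '', $text);
    if ($result === null) {
        // With /u, malformed UTF-8 makes preg_replace() return null.
        throw new InvalidArgumentException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }
    return $result;
}

echo strip_non_letters("Héllø, wörld!"), "\n"; // Héllø wörld
```

Checking for null matters when processing many large files, because a single badly encoded file would otherwise silently turn into an empty word list.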
Ultimately, you won't be able to create an algorithm that works reliably for all languages. Unicode Standard Annex #29 provides a "Default Word Boundary Specification". I'm not sure it would be easy to implement in PHP, because the only source of character properties available in userland is PCRE (mbstring has this information, but it doesn't expose it). The annex also warns that the algorithm must be tailored for specific languages:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. [...]
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. [...]
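Where the intl extension is installed, ICU's word-break rules can be reached through IntlBreakIterator, which may be a workable alternative to splitting on whitespace. The following is only a sketch under that assumption: words_icu is a hypothetical name, and how well it handles a given language depends on the ICU build behind the extension.

```php
<?php
// Sketch: split text into words with ICU's word-break rules.
// Assumes the intl extension is loaded; words_icu is a hypothetical name.
function words_icu(string $text, string $locale = 'en'): array
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $words = [];
    foreach ($it->getPartsIterator() as $part) {
        // Keep only segments containing at least one letter; segments made
        // only of spaces, digits or punctuation are dropped.
        if (preg_match('/\pL/u', $part)) {
            $words[] = $part;
        }
    }
    return $words;
}

if (extension_loaded('intl')) {
    print_r(words_icu("Hello, wörld! 42"));
}
```

The resulting list can be passed to array_count_values() in the same way as the output of explode().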