Making my script UTF-8 compatible?

Problem description:

I have quite a long script which involves chopping lots of large text files into individual words and processing them.

I lowercase everything then remove all characters except for letters and spaces with:

$content=preg_replace('/[^a-z\s]/', '', $content); // Remove non-letters

This is then exploded and each word goes into an associative array as the key, with the number of occurrences as the value:

$words=array_count_values($content);

I want to convert the script to be able to work with languages other than English. Is PHP going to be OK with this? Can I use UTF-8 characters as array keys? And how would I preg_replace to remove everything except letters from any language? (All numbers, punctuation and random characters still need to be removed.)

Yes, you can use UTF-8 characters as keys (is there anything that can't be a key in a PHP array? :)). Your regexp might look something like:

/\pL+/u

EDIT: Sorry, should be:

/[^\pL\p{Zs}]/u

This should work for both of your problems.

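<?php
$string = "Héllø";

// ASCII-only class without the u flag: the bytes of é and ø are stripped.
echo preg_replace('/[^a-z\s]/i', '', $string) . "\n";
// With the u flag, \w also covers non-ASCII letters, so é and ø are kept.
echo preg_replace('/[^a-z\w\s]/ui', '', $string) . "\n";

// A UTF-8 string used as an array key.
$arr = array(
    $string => 5
);

print_r($arr);
?>

In preg_replace, the u flag makes PCRE treat the pattern and subject as UTF-8, and the i flag makes the match case-insensitive. \w matches word characters (letters, digits, and the underscore); uppercase \W matches the opposite, so the character class needs the lowercase form to keep non-ASCII letters. Because \w also keeps digits and underscores, a class built on \pL is the stricter choice when only letters should survive.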
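Putting the two answers together, here is a minimal sketch of the pipeline from the question (lowercase, strip, split, count). The function name `count_words` is hypothetical, and the sketch assumes the mbstring extension is available and that the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original script.
// Assumes the mbstring extension is loaded and $content is valid UTF-8.
function count_words(string $content): array
{
    // strtolower() only lowercases ASCII; mb_strtolower() handles other letters.
    $content = mb_strtolower($content, 'UTF-8');

    // Remove everything that is neither a Unicode letter nor whitespace.
    $content = preg_replace('/[^\pL\s]/u', '', $content);
    if ($content === null) {
        // With the u flag, preg_replace() returns null on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/u', $content, -1, PREG_SPLIT_NO_EMPTY);

    // Each distinct word becomes a key, with its frequency as the value.
    return array_count_values($words);
}

print_r(count_words("Héllø wörld, héllø 123!"));
```

For the sample string this should count `héllø` twice and `wörld` once, with the digits and punctuation dropped. Scripts that build words from combining marks may also need `\p{M}` in the character class, since `\pL` alone does not match those marks.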

Ultimately, you won't be able to create an algorithm that works reliably for all languages. Unicode Standard Annex #29 provides a "Default Word Boundary Specification" (which I'm not sure would be easy to implement in PHP, because the only source of character properties available in userland is PCRE; mbstring has this information, but it doesn't expose it), but it warns that the algorithm must be tailored for specific languages:

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. [...]

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. [...]
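On PHP builds that include the intl extension, ICU's word break iterator can be used instead of a hand-written regexp. The sketch below is one possible approach, not something the answers above propose. The function name `split_words` and the default locale are hypothetical, and it assumes `createWordInstance()` returns a rule-based iterator, which is what provides `getRuleStatus()`.

```php
<?php
// Hypothetical helper: splits text into words using ICU's word break rules.
// Assumes the intl extension is loaded and $text is valid UTF-8.
function split_words(string $text, string $locale = 'en'): array
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $words = [];
    $start = $it->first();
    // Boundaries are byte offsets into the UTF-8 string, so substr() applies.
    for ($end = $it->next(); $end !== IntlBreakIterator::DONE; $start = $end, $end = $it->next()) {
        // The rule status describes the segment ending at the current boundary.
        // Values from WORD_LETTER upward are letters, kana, and ideographs;
        // lower values are spaces, punctuation, and numbers.
        if ($it->getRuleStatus() >= IntlBreakIterator::WORD_LETTER) {
            $words[] = substr($text, $start, $end - $start);
        }
    }

    return $words;
}
```

The result can be passed to `array_count_values()` as in the question. How well it segments Thai, Lao, Khmer, or Myanmar text depends on the ICU data shipped with the PHP build, so the caveat quoted above still applies.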