Regex matching a word that starts with a Unicode character returns unexpected results

Problem description:

I want to check whether the word 'açilek' occurs in a piece of text. Running this:

$word = 'açilek';
$article='elma  and  açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat);

This succeeds, as expected. However, when I try to match the word 'çilek', the code reports no match, which is not expected:

$word = 'çilek';
$article='elma  and  çilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(0): no match, unexpected

Additionally, it matches 'çilek' when it is only part of a longer word, which is also unexpected:

$word = 'çilek';
$article='elma  and  açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(1): match, unexpected

Why am I seeing this behavior?


Answer

You need to use the /u modifier to make the regex (especially \b) Unicode-aware:

$mat=preg_match('/\b'. $word .'\b/u', $article);

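Otherwise, \b only treats positions between an ASCII word character (letter, digit, or underscore) and any other character as word boundaries. Without /u the subject is processed byte by byte, and the bytes that encode ç are not ASCII word characters, so there is a boundary between a and çilek but none between a space and çilek.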
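
To make the difference concrete, here is a minimal sketch that wraps the fixed pattern in a small helper. The function name containsWord is hypothetical (it is not from the question), the preg_quote() call is an extra precaution for words that might contain regex metacharacters, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper, not part of the original question:
// whole-word search in UTF-8 text using the /u modifier.
function containsWord(string $word, string $text): bool
{
    // preg_quote() escapes regex metacharacters in $word; '/' is the delimiter used here.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/u';

    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWord('açilek', 'elma  and  açilek word')); // bool(true)
var_dump(containsWord('çilek', 'elma  and  çilek word'));   // bool(true)
var_dump(containsWord('çilek', 'elma  and  açilek word'));  // bool(false)
```

With /u, ç counts as a letter, so the second call finds the standalone word and the third no longer matches inside açilek.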

Answer

Beware that, without the "u" modifier, the PCRE engine does not treat UTF-8 characters in the pattern or the subject as single characters, which may well break the matching. See:

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

$mat=preg_match('/\b'. $word .'\b/u', $article);
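
As a small illustration of what "not seen as such" means, the sketch below matches a single ç against the pattern ^.$ with and without the modifier. The variable names are made up for this example, and the "\u{00E7}" escape (PHP 7 or later) is used so the result does not depend on how the source file is encoded:

```php
<?php
// U+00E7 (ç) is encoded in UTF-8 as two bytes.
$char = "\u{00E7}";

// Without /u the dot consumes one byte, so a two-byte string is not "one character".
$withoutU = preg_match('/^.$/', $char);

// With /u the dot consumes one whole UTF-8 character.
$withU = preg_match('/^.$/u', $char);

var_dump(strlen($char)); // int(2)
var_dump($withoutU);     // int(0)
var_dump($withU);        // int(1)
```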
