Regex matching a word that starts with a Unicode character returns unexpected results
I want to check whether the word 'açilek' occurs in a piece of text. Running this:
$word = 'açilek';
$article='elma and açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat);
succeeds, as expected. However, when I try to match the word 'çilek', preg_match reports no match, which is not expected:
$word = 'çilek';
$article='elma and çilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(0): no match, unexpected
It also matches the word when it is only part of a longer word, which is unexpected too:
$word = 'çilek';
$article='elma and açilek word';
$mat=preg_match('/\b'. $word .'\b/', $article);
var_dump($mat); // int(1): match, unexpected
Why am I seeing this behavior?
You need to use the /u modifier to make the regex (especially \b) Unicode-aware:
$mat=preg_match('/\b'. $word .'\b/u', $article);
Otherwise, \b only considers positions between ASCII alphanumerics and ASCII non-alphanumerics as word boundaries. The bytes that encode ç are not ASCII alphanumerics, so \b matches between a and çilek but not between a space and çilek.
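To make the difference concrete, here is a minimal sketch that runs the question's two cases with and without /u. It assumes PHP 7 or later, a source file saved as UTF-8, and the default "C" locale. The helper name containsWord is hypothetical and not part of PHP; it only wraps preg_match and adds preg_quote so that a word containing regex metacharacters is matched literally.

```php
<?php
// Assumption: this file is saved as UTF-8, so 'ç' is the two bytes 0xC3 0xA7.

// Hypothetical helper (not a PHP built-in): true if $word occurs in $text
// as a whole word, using Unicode-aware word boundaries.
function containsWord(string $word, string $text): bool
{
    // preg_quote() escapes regex metacharacters in $word, plus the '/' delimiter.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/u';
    return preg_match($pattern, $text) === 1;
}

// Without /u, \b works on bytes and only ASCII letters, digits and '_' count as word characters.
$standaloneNoU = preg_match('/\bçilek\b/', 'elma and çilek word');  // 0: space and 0xC3 are both non-word
$embeddedNoU   = preg_match('/\bçilek\b/', 'elma and açilek word'); // 1: 'a' is a word char, 0xC3 is not

// With /u, ç is treated as a letter, so the boundaries fall where a reader expects them.
$standaloneU = preg_match('/\bçilek\b/u', 'elma and çilek word');   // 1
$embeddedU   = preg_match('/\bçilek\b/u', 'elma and açilek word');  // 0

var_dump($standaloneNoU, $embeddedNoU, $standaloneU, $embeddedU);
var_dump(containsWord('çilek', 'elma and çilek word'));  // bool(true)
var_dump(containsWord('çilek', 'elma and açilek word')); // bool(false)
```

Note that with /u both the pattern and the subject must be valid UTF-8; otherwise preg_match returns false rather than 0.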
Beware that the PCRE engine does not treat UTF-8 characters in patterns and metacharacters as such (and this may very well break the matching) if you don't provide the "u" switch. See the manual:
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
With the switch, the call becomes:
$mat=preg_match('/\b'. $word .'\b/u', $article);