Regular expression to match any UTF character except punctuation
I'm preparing a function in PHP to automatically convert a string so it can be used as a filename in a URL (*.html). Although ASCII should be used to be on the safe side, for SEO reasons I need to allow the filename to be in any language, but I don't want it to include punctuation other than a dash (-) and underscore (_); characters like *%$#@"' shouldn't be allowed.
Spaces should be converted to dashes.
I think that using a regex will be the easiest way, but I'm not sure how to handle UTF-8 strings.
My ASCII function looks like this:
function convertToPath($string)
{
    $string = strtolower(trim($string));
    $string = preg_replace('/[^a-z0-9-]/', '-', $string);
    $string = preg_replace('/-+/', "-", $string);
    return $string;
}
Thanks,
Roy.
I think that for SEO needs you should stick to ASCII characters in the URL.
In theory, many more characters are allowed in URLs. In practice, most systems only parse ASCII reliably.
Also, many automagically-parse-the-link scripts choke on non-ASCII characters, so allowing non-ASCII characters in your URLs drastically reduces the chance of your link showing up (correctly) in user-generated content. (If you want an example of such a script, take a look at the Stack Overflow script; it chokes on parentheses, for example.)
You could also take a look at: How to handle diacritics (accents) when rewriting ‘pretty URLs’
The accepted solution there is to transliterate the non-ASCII characters:
<?php
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
?>
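To show how that transliteration step might fit into a complete filename function, here is a minimal sketch. The function name slugifyAscii and the cleanup steps around iconv() are assumptions for illustration, not part of the linked answer. Note that the output of //TRANSLIT depends on the iconv implementation and the current locale, so non-ASCII input may come out differently on different systems.

```php
<?php
// Hypothetical helper: the name and the cleanup steps are illustrative assumptions.
function slugifyAscii(string $text): string
{
    // Approximate non-ASCII characters with ASCII ones.
    // The exact result depends on the iconv implementation and the locale.
    $ascii = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        // iconv() returns false when the conversion fails (e.g. invalid input).
        return '';
    }

    $ascii = strtolower(trim($ascii));
    // Replace each run of characters other than a-z, 0-9, "_" and "-" with one dash.
    $ascii = preg_replace('/[^a-z0-9_-]+/', '-', $ascii);
    // Drop dashes left over at either end.
    return trim($ascii, '-');
}
```

Called as slugifyAscii('Hello, World!'), this should return hello-world.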
Hope this helps
If UTF-8 mode is selected (the u pattern modifier), you can match all non-letters (according to the Unicode general category; see "Regular Expression Details" in the PHP documentation) by using
/\P{L}+/u
so I'd try the following (untested):
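function convertToPath($string)
{
    // mb_strtolower() is multi-byte safe; strtolower() is not.
    $string = mb_strtolower(trim($string), 'UTF-8');
    // The "u" modifier enables UTF-8 mode, so \P{L} (any non-letter) works on
    // characters rather than bytes. Note that it also replaces digits and underscores.
    $string = preg_replace('/\P{L}+/u', '-', $string);
    $string = preg_replace('/-+/', '-', $string);
    return $string;
}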
Be aware that you'll get problems with strtolower() on UTF-8 strings, as it will mess with your multi-byte characters; use mb_strtolower() instead.
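The question also asks to keep digits, dashes and underscores, which a plain non-letter match would replace. A possible variant, sketched under that assumption, uses a negated character class instead. The function name slugifyUnicode is hypothetical, and keeping combining marks (\p{M}) is my own assumption so that scripts which rely on them are not broken apart. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: a sketch of a Unicode-aware variant, not the answer's exact code.
function slugifyUnicode(string $string): string
{
    // Multi-byte safe lowercasing (needs the mbstring extension).
    $string = mb_strtolower(trim($string), 'UTF-8');

    // Replace each run of characters that are NOT letters (\p{L}), combining
    // marks (\p{M}), digits (\p{N}), "_" or "-" with a single dash.
    // The "u" modifier makes PCRE treat the pattern and subject as UTF-8.
    $string = preg_replace('/[^\p{L}\p{M}\p{N}_-]+/u', '-', $string);
    if ($string === null) {
        // preg_replace() returns null on error, e.g. invalid UTF-8 input.
        return '';
    }

    // Collapse repeated dashes and drop dashes at either end.
    $string = preg_replace('/-+/', '-', $string);
    return trim($string, '-');
}
```

For example, slugifyUnicode('Héllo Wörld!') should return héllo-wörld, with spaces and punctuation turned into dashes and the accented letters preserved.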