PHP: How do I remove all non-printable characters from a string?
I imagine I need to remove chars 0-31 and 127.
Is there a function or piece of code to do this efficiently?
7 bit ASCII?
If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything in the ranges 0-31 and 127-255 with this:
$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);
It matches any byte in the ranges 0-31 and 127-255 and removes it.
You fell into a Hot Tub Time Machine, and you're back in the eighties. If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment - just look for 0-31 and 127:
$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
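To see how the two patterns differ, here is a small sketch run against a hypothetical ISO-8859-1 (Latin-1) input. The sample string and the variable names are made up for illustration.

```php
<?php
// Hypothetical Latin-1 input: "\xE9" is "é" in ISO-8859-1,
// with a NUL, a TAB and a DEL mixed in.
$input = "ca\x00f\xE9\t!\x7F";

// 7 bit pattern: strips control chars AND every byte from 127 to 255.
$sevenBit = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $input);

// 8 bit pattern: strips only 0-31 and 127, keeps bytes 128-255.
$eightBit = preg_replace('/[\x00-\x1F\x7F]/', '', $input);

var_dump($sevenBit === "caf!");      // bool(true): the \xE9 byte is gone
var_dump($eightBit === "caf\xE9!");  // bool(true): the \xE9 byte survives
```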
UTF-8?
Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:
$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);
This just removes 0-31 and 127. It works for both ASCII and UTF-8 because they share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier, but the modifier makes life easier if you want to remove other chars...
If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0).
In a UTF-8 string, this would be encoded as the two bytes 0xC2 0xA0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:
$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
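A quick sketch of why the /u modifier matters once \xA0 is in the class. The sample string is hypothetical, and the "\u{...}" escapes assume PHP 7.0 or later. The empty pattern with /u is used here only as a UTF-8 validity check.

```php
<?php
// Hypothetical UTF-8 input: "café", a NO-BREAK SPACE (U+00A0), then "bar".
$input = "caf\u{00E9}\u{00A0}bar";

// With /u, \xA0 means the code point U+00A0, so the whole 2-byte sequence goes.
$withU = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $input);

// Without /u, \xA0 means the single byte 0xA0, so only half of the
// sequence 0xC2 0xA0 is removed and a stray 0xC2 byte is left behind.
$withoutU = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $input);

// preg_match('//u', ...) returns 1 only when the subject is valid UTF-8.
var_dump($withU === "caf\u{00E9}bar");        // bool(true)
var_dump(preg_match('//u', $withU) === 1);    // bool(true): still valid UTF-8
var_dump(preg_match('//u', $withoutU) === 1); // bool(false): now malformed
```

Note that with /u in place, preg_replace returns null if the subject is not valid UTF-8 to begin with, so it is worth checking the result before using it.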
Addendum: What about str_replace?
preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of the chars you want to remove and use str_replace, as noted by mgutt below, e.g.
//build an array we can re-use across several operations
$badchar=array(
// control characters
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing characters
chr(127)
);
//replace the unwanted chars
$str2 = str_replace($badchar, '', $str);
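If typing out every chr() call feels error-prone, the same array can be generated instead. This is just an alternative sketch of the list above, with a hypothetical sample string to show the result:

```php
<?php
// Build the same list programmatically: chr(0)..chr(31) plus chr(127).
$badchar = array_merge(array_map('chr', range(0, 31)), array(chr(127)));

// Hypothetical sample input containing a TAB, a newline and a DEL.
$str  = "Hello\tWorld\n\x7F!";
$str2 = str_replace($badchar, '', $str);

var_dump(count($badchar));          // int(33)
var_dump($str2 === "HelloWorld!");  // bool(true)
```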
Intuitively, this seems like it would be fast, but that's not always the case, so you should definitely benchmark to see if it saves you anything. I ran some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:
2 chars str_replace 5.3439ms preg_replace 2.9919ms preg_replace is 44.01% faster
4 chars str_replace 6.0701ms preg_replace 1.4119ms preg_replace is 76.74% faster
8 chars str_replace 5.8119ms preg_replace 2.0721ms preg_replace is 64.35% faster
16 chars str_replace 6.0401ms preg_replace 2.1980ms preg_replace is 63.61% faster
32 chars str_replace 6.0320ms preg_replace 2.6770ms preg_replace is 55.62% faster
64 chars str_replace 7.4198ms preg_replace 4.4160ms preg_replace is 40.48% faster
128 chars str_replace 12.7239ms preg_replace 7.5412ms preg_replace is 40.73% faster
256 chars str_replace 19.8820ms preg_replace 17.1330ms preg_replace is 13.83% faster
512 chars str_replace 34.3399ms preg_replace 34.0221ms preg_replace is 0.93% faster
1024 chars str_replace 57.1141ms preg_replace 67.0300ms str_replace is 14.79% faster
2048 chars str_replace 94.7111ms preg_replace 123.3189ms str_replace is 23.20% faster
4096 chars str_replace 227.7029ms preg_replace 258.3771ms str_replace is 11.87% faster
8192 chars str_replace 506.3410ms preg_replace 555.6269ms str_replace is 8.87% faster
16384 chars str_replace 1116.8811ms preg_replace 1098.0589ms preg_replace is 1.69% faster
32768 chars str_replace 2299.3128ms preg_replace 2222.8632ms preg_replace is 3.32% faster
The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge, and at 16kb and above the two were within a few percent of each other.
I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
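If you want to try this against your own data, something along these lines would do. This is not the script that produced the numbers above, just a minimal sketch: bench() and the two closures are hypothetical helpers, and random_bytes() assumes PHP 7.0 or later.

```php
<?php
// Time $iterations calls of $fn($subject) and return elapsed milliseconds.
function bench(callable $fn, $subject, $iterations = 10000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($subject);
    }
    return (microtime(true) - $start) * 1000;
}

$badchar = array_merge(array_map('chr', range(0, 31)), array(chr(127)));

$viaStrReplace = function ($s) use ($badchar) {
    return str_replace($badchar, '', $s);
};
$viaPregReplace = function ($s) {
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s);
};

foreach (array(16, 512, 8192) as $length) {
    // Random bytes stand in for real data; swap in your own samples.
    $subject = random_bytes($length);

    // Sanity check: both approaches must agree before timing them.
    if ($viaStrReplace($subject) !== $viaPregReplace($subject)) {
        throw new RuntimeException('Methods disagree');
    }

    printf(
        "%5d chars  str_replace %.4fms  preg_replace %.4fms\n",
        $length,
        bench($viaStrReplace, $subject),
        bench($viaPregReplace, $subject)
    );
}
```

Random bytes contain far more control characters than typical text, so real-world strings may shift the crossover point.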