PHP:使用过滤器删除XML中的无效utf-8字符

问题描述:

I have a large file, so I have created a filter for removing invalid utf-8 characters from XML.

class ValidUTF8XMLFilter extends php_user_filter {

    protected static $pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = preg_replace(self::$pattern, '$1', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

This filter will remove also utf-8 characters not only invalid in xml, but also in utf-8. The regex is taken from Multilingual form encoding. The class was taken from this answer: How to skip invalid characters in XML file using PHP and rewritten. The pattern in that answer won't work for invalid utf-8 characters, eg. 0x1D.

Will this filter work, in situation, where invalid bytes starts at the end of buffer and ends in beginning of next filtering? Is this situation possible?

No, I don't think it will work. It will strip valid sequences of code units that happen to be split between several buckets.

It should not consume potentially incomplete sequences in the end (and, if necessary, it should pass nothing and return PSFS_FEED_ME).