Comprehensive character replacement module in Python for non-unicode and non-ascii to HTML
Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? This comfort with the "ignore" argument during encoding or decoding is insane, but so is a '?' in every place where a character could not be translated.
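For example, here is a quick sketch of the behaviour being complained about (assuming Python 3, where str is already Unicode, and an invented sample string), showing the two built-in error handlers:

text = "naïve café"                      # sample string with non-ASCII characters
print(text.encode("ascii", "ignore"))    # b'nave caf'   -- offending characters silently dropped
print(text.encode("ascii", "replace"))   # b'na?ve caf?' -- each one replaced with '?'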
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested. I realize that the number of extant alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
The simplification this would mean for data transfer would be enormous.
I don't think what you want is really possible, but I think there is a decent option.
The unicodedata module has a 'normalize' function that can gracefully degrade text for you...
import unicodedata

def gracefully_degrade_to_ascii(text):
    # Decompose accented characters (NFKD), then drop anything
    # that still has no ASCII representation.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII or UTF-8 with this module (it's part of the standard library, too).
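As a rough illustration of what that degradation looks like (a sketch assuming Python 3, where the result is a bytes object, and an invented sample string): accented letters lose their accents, while characters with no ASCII decomposition are simply dropped.

print(gracefully_degrade_to_ascii("Éléphant – café"))
# b'Elephant  cafe'  -- accents are stripped by the NFKD decomposition;
# the en dash has no ASCII decomposition, so 'ignore' drops it entirely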