Comprehensive character replacement module in Python for non-unicode and non-ascii to HTML
Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? This comfort with the "ignore" argument during encoding or decoding is insane, but so is a '?' in every place where a character could not be translated.
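For example, here is a quick sketch of the behaviour being complained about (assuming Python 3, where str is already Unicode, and an invented sample string), showing the two built-in error handlers:

text = "naïve café"                      # sample string with non-ASCII characters
print(text.encode("ascii", "ignore"))    # b'nave caf'   -- offending characters silently dropped
print(text.encode("ascii", "replace"))   # b'na?ve caf?' -- each one replaced with '?'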
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested. I realize that the number of extant alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
The simplification this would mean for data transfer would be enormous.
I don't think what you want is really possible, but I think there is a decent option.
The unicodedata module has a 'normalize' function that can gracefully degrade text for you...
import unicodedata

def gracefully_degrade_to_ascii(text):
    # Decompose accented characters (NFKD), then drop anything
    # that still has no ASCII representation.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII or UTF-8 with this module (it's part of the standard library, too).
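As a rough illustration of what that degradation looks like (a sketch assuming Python 3, where the result is a bytes object, and an invented sample string): accented letters lose their accents, while characters with no ASCII decomposition are simply dropped.

print(gracefully_degrade_to_ascii("Éléphant – café"))
# b'Elephant  cafe'  -- accents are stripped by the NFKD decomposition;
# the en dash has no ASCII decomposition, so 'ignore' drops it entirely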