minidom cannot parse certain Unicode characters
minidom fails when asked to parse a string containing certain Unicode characters.
>>> c = u'\u0e42'
>>> c
u'\u0e42'
>>> print c
โ
>>> from xml.dom import minidom
>>> xmlstring = "<tag>"
>>> xmlstring += c
>>> xmlstring += "</tag>"
>>> xmlstring
u'<tag>\u0e42</tag>'
>>> minidom.parseString(xmlstring)
The error: UnicodeEncodeError: 'ascii' codec can't encode characters in position... .
I searched online and found advice to edit site.py under C:\Python26\Lib. In this function:
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
change the if 0 here to if 1.
After making the change I restarted Python and ran the same program. The error is now:
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    minidom.parseString(xmlstring)
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
  File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0e42' in position 5: character maps to <undefined>
Why does this happen? Is it a Python problem or a minidom problem? How do I fix it?
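------Solution--------------------
I'd advise against tampering with site.py. Changing if 0 to if 1 appears to just switch the default encoding from ascii to the locale's code page (cp1252, judging by the traceback), and cp1252 has no mapping for u'\u0e42' either. Since the parser wants a byte string, encode the text to UTF-8 yourself: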
>>> s = u'<tag>\u0e42</tag>'.encode('utf-8')
>>> s
'<tag>\xe0\xb9\x82</tag>'
>>> from xml.dom.minidom import parseString
>>> doc = parseString(s)
>>> doc.documentElement.firstChild.data
u'\u0e42'
>>> from xml.etree.ElementTree import fromstring
>>> root = fromstring(s)
>>> root.text
u'\u0e42'
>>>
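
The session above can be wrapped up as a small helper. This is only a sketch: the names THAI_CHAR, can_encode and parse_xml_text are made up for illustration and are not part of minidom. It first checks which codecs can represent the character at all, then encodes text to UTF-8 before handing it to the parser. It assumes the XML has no encoding declaration that contradicts UTF-8.

```python
from xml.dom import minidom

# The character from the question (Thai, U+0E42).
THAI_CHAR = u'\u0e42'


def can_encode(text, codec):
    """Return True if `text` can be encoded with `codec` (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def parse_xml_text(text):
    """Parse XML given as text or as a byte string (hypothetical helper).

    Text is encoded to UTF-8 explicitly, so the parser never has to fall
    back on the interpreter's default encoding.
    """
    if not isinstance(text, bytes):
        text = text.encode('utf-8')
    return minidom.parseString(text)


# Neither ascii nor cp1252 can represent the character; utf-8 can.
codec_support = dict(
    (codec, can_encode(THAI_CHAR, codec))
    for codec in ('ascii', 'cp1252', 'utf-8')
)

doc = parse_xml_text(u'<tag>' + THAI_CHAR + u'</tag>')
parsed_char = doc.documentElement.firstChild.data
```

Here codec_support shows why both error messages appeared (one for ascii, one for cp1252), and parsed_char comes back as the original character. The isinstance check against bytes is there so the same helper also accepts an already-encoded byte string unchanged.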