Python: CSV file with Chinese characters is unreadable when opened in Excel
I am trying to perform text analysis on Chinese texts. The program is provided below. The result contains unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters are correct: 人民日报社论. So what is wrong here? I cannot figure it out. I have tried several approaches, including adding a decoder and an encoder.
# -*- coding: utf-8 -*-
import os
import glob
import jieba
import jieba.analyse
import csv
import codecs
segList = []
raw_data_path = 'monthly_raw_data/'
file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]
jieba.load_userdict("customized_dict.txt")
for name in file_name:
    all_text = ""
    multi_line_text = ""
    with open(raw_data_path + name + ".txt", "r") as file:
        for line in file:
            if line != '\n':
                multi_line_text += line
        templist = multi_line_text.split('\n')
        for text in templist:
            all_text += text
        seg_list = jieba.cut(all_text,cut_all=False)
        temp_text = []
        for item in seg_list:
            temp_text.append(item.encode('utf-8'))
    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))
    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)
    segList.append(text_without_stopwords)
with open("results/result.csv", 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(segList)
For UTF-8 encoding, Excel requires a BOM (byte order mark) code point written at the start of the file; otherwise it assumes the ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here is an example that opens correctly in Excel:
#!python2
#coding:utf8
import csv
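data = [[u'American',u'美国人'],
        [u'Chinese',u'中国人']]
with open('results.csv','wb') as f:
    # Write the BOM first so Excel detects the file as UTF-8.
    f.write(u'\ufeff'.encode('utf8'))
    w = csv.writer(f)
    for row in data:
        # The Python 2 csv module writes byte strings, so encode each item.
        w.writerow([item.encode('utf8') for item in row])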
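The garbled text in the question is consistent with UTF-8 bytes being decoded with a locale-dependent ANSI code page. The following is a minimal Python 3 sketch of that effect. It assumes the ANSI code page is GBK (as on a Simplified Chinese Windows system); the question does not state the locale, and the variable names are illustrative only.

```python
import codecs

text = '人民日报社论'
utf8_bytes = text.encode('utf-8')

# Assumption: a reader that finds no BOM falls back to GBK.
# errors='replace' is used because not every byte pair is valid GBK.
misread = utf8_bytes.decode('gbk', errors='replace')

# With the BOM (U+FEFF encoded as EF BB BF) in front, a BOM-aware reader
# can identify the data as UTF-8 and recover the original text.
with_bom = codecs.BOM_UTF8 + utf8_bytes
decoded = with_bom.decode('utf-8-sig')

print(misread == text)   # the GBK reading differs from the original
print(decoded == text)   # the BOM-aware reading matches the original
```

Here `misread` begins with the same characters as the garbled output reported in the question, while `decoded` matches the original string.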
For completeness, Python 3 makes this easier. Note the newline='' parameter used instead of wb, and that the utf-8-sig encoding adds the BOM automatically. Unicode strings are written directly instead of needing to encode each item.
#!python3
#coding:utf8
import csv
data = [[u'American',u'美国人'],
        [u'Chinese',u'中国人']]
with open('results.csv','w',newline='',encoding='utf-8-sig') as f:
    w = csv.writer(f)
    w.writerows(data)
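One way to check the result without opening Excel is to inspect the first bytes of the file. The sketch below (Python 3) writes the same sample rows to a temporary file, whose path is an assumption made so the example does not touch the working directory, then confirms that the file starts with the UTF-8 BOM and that the rows read back unchanged.

```python
import codecs
import csv
import os
import tempfile

data = [['American', '美国人'],
        ['Chinese', '中国人']]

# Temporary location chosen for illustration only.
path = os.path.join(tempfile.mkdtemp(), 'results.csv')

# utf-8-sig writes the BOM before the first row.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(data)

# Read the raw bytes and check for the BOM (EF BB BF).
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading with utf-8-sig strips the BOM again.
with open(path, 'r', newline='', encoding='utf-8-sig') as f:
    round_trip = list(csv.reader(f))

print(has_bom, round_trip == data)
```

Reading the file back with plain utf-8 instead of utf-8-sig would leave U+FEFF attached to the first cell, so the same codec should be used on both sides.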
There is also the third-party module unicodecsv, which makes this easier in Python 2 as well:
#!python2
#coding:utf8
import unicodecsv
data = [[u'American',u'美国人'],
        [u'Chinese',u'中国人']]
with open('results.csv','wb') as f:
    w = unicodecsv.writer(f,encoding='utf-8-sig')
    w.writerows(data)
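Applied to the program in the question, the same idea might look like the Python 3 sketch below. To stay self-contained it does not call jieba: the `segmented` list is hypothetical sample data standing in for the output of `jieba.cut()`, `stop_words` stands in for the contents of stopwords.txt, and `write_segments` is an illustrative helper name, not part of any library.

```python
import csv
import os
import tempfile


def write_segments(rows, path):
    """Write rows of segmented words as UTF-8 with a BOM so Excel can read them."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        csv.writer(f).writerows(rows)


# Hypothetical sample data in place of the jieba.cut() results.
segmented = [['人民日报', '的', '社论'],
             ['中文', '的', '文本', '分析']]
stop_words = {'的'}  # in place of the lines read from stopwords.txt

# Keep the words as str; no per-item .encode('utf-8') is needed in Python 3.
seg_list = [[word for word in row if word not in stop_words]
            for row in segmented]

# Temporary location chosen for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), 'result.csv')
write_segments(seg_list, out_path)
```

Using a set for the stop words keeps the membership test fast when the texts are long; the rows may have different lengths, which csv.writer accepts.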