Problem reading a German CSV file in Python
I have a German CSV file that I want to read with pd.read_csv.
Data:
The original file looks like this:
So it has two columns (A, B), and the separator should be ';'.
Problem: When I run the command:
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
encoding='utf-8', header=None, sep=';')
I get the error:
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
Half-solution: I understand this could have several causes, but when I ran the command:
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
encoding='utf-8', header=None, sep='delimiter')
I got the following dataset:
0
0 Etat;Die ARD-Tochter Degeto hat sich verpflich...
1 Etat;App sei nicht so angenommen worden wie ge...
2 Etat;'Zum Welttag der Suizidprävention ist es ...
3 Etat;Mitarbeiter überreichten Eigentümervertre...
4 Etat;Service: Jobwechsel in der Kommunikations...
So I only get one column instead of the two desired columns.
Target: Any idea how to load my dataset correctly, so that it looks like this:
0 1
0 Etat Die ARD-Tochter Degeto hat sich verpflich...
1 Etat App sei nicht so angenommen worden wie ge...
Hints/attempts:
When I run the search function over my data in Excel, I also do not find any ; in it.
It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).
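To confirm which lines carry extra separators, one option is to count the ; characters per line before involving pandas. The sketch below is only an illustration: the in-memory sample and the helper name find_irregular_lines are hypothetical stand-ins for the real file, and the count ignores CSV quoting rules.

```python
import io

# Hypothetical sample standing in for articles.csv; the third line has an extra ';'.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)


def find_irregular_lines(handle, sep=";", expected_fields=2):
    """Return (line_number, field_count) for lines whose field count differs."""
    irregular = []
    for line_number, line in enumerate(handle, start=1):
        # Naive count: every separator adds one field (quoting is not considered).
        field_count = line.rstrip("\n").count(sep) + 1
        if field_count != expected_fields:
            irregular.append((line_number, field_count))
    return irregular


irregular_lines = find_irregular_lines(sample)
print(irregular_lines)
```

For the real file, the same helper could be given a handle from open(path, encoding='utf-8') instead of the sample.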
One possible solution is to read the file into a one-column DataFrame, using a separator that does not occur in the data (such as delimiter), and then use Series.str.split with the n parameter and expand=True to build the new DataFrame:
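import pandas as pd

# Read each line into a single column by using a separator that does not occur in the data.
# A multi-character sep is treated as a regular expression, which requires the Python engine.
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter', engine='python')
# A more general solution is to use a character that does NOT exist in the data, such as the yen sign ¥:
# dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                       encoding='utf-8', header=None, sep='¥')
# Split on the first ';' only, so any later ';' stays inside column B.
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A', 'B']
print(df)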
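The same split can be tried without the real file. The following is a minimal sketch of a variation on the approach above, not the exact method: it reads the lines directly instead of going through pd.read_csv with a placeholder separator, and the in-memory sample is hypothetical. It assumes the first ; on each line always separates column A from column B.

```python
import io

import pandas as pd

# Hypothetical sample standing in for articles.csv; the second line has an extra ';'.
raw = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht; so angenommen worden\n"
)

# Keep every line as one string, without its trailing newline.
lines = pd.Series([line.rstrip("\n") for line in raw])

# n=1 splits on the first ';' only, so later ';' characters remain in column B.
df = lines.str.split(";", n=1, expand=True)
df.columns = ["A", "B"]
print(df)
```

Reading the lines directly avoids having to pick a separator that is guaranteed to be absent from the data, at the cost of bypassing the CSV parser's handling of quoted fields.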