regexp_tokenize and Arabic text
Problem description:
I'm using regexp_tokenize() to return tokens from an Arabic text without any punctuation marks:
import re, string, sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)
It worked fine, but the problem is when I try to print the text.
The output for the text ايمان،سعد:
?يم
?ن
?
?
?
But if the text is in English, even with Arabic punctuation marks, it prints the right result.
The output for the text hi،eman:
hi
eman
Answer:
When you use raw_input, the symbols are coded as bytes.
You need to convert it into a Unicode string with H.decode('utf8').
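As a rough illustration (a sketch only, assuming Python 2 and UTF-8 input as in the question), the byte string returned by raw_input does not line up with Arabic characters until it is decoded:

# -*- coding: utf-8 -*-
# Sketch: the same text as a byte string vs. a Unicode string (Python 2,
# UTF-8 source and terminal assumed).
s = 'ايمان،سعد'           # byte string: the UTF-8 bytes raw_input would return
print len(s)               # 18 -- counts bytes, so the regex sees byte fragments
u = s.decode('utf8')       # decode the bytes into a Unicode string
print len(u)               # 9  -- counts characters, including the Arabic comma '،'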
And you may keep your regex:
tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
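Putting it together, a minimal sketch of the corrected script could look like the following (Python 2, as in the question). The only essential change is the .decode('utf8') call; the coding declaration and the ur'' pattern prefix are precautions I'm adding so the Arabic punctuation marks are compared by code point rather than by byte:

# -*- coding: utf-8 -*-
# Minimal sketch of the fixed script (Python 2, UTF-8 terminal assumed).
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    # Split on Arabic/Latin sentence punctuation; gaps=True returns the
    # text between the marks rather than the marks themselves.
    return regexp_tokenize(Input, ur'[،؟!.؛]\s*', gaps=True)

H = raw_input('H:').decode('utf8')    # bytes from the terminal -> Unicode
Cleand = PreProcess_text(H)
print u'\n'.join(Cleand)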