How do I create my own corpus in the Python Natural Language Toolkit?

Question:

I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so that I can access them using the existing nltk.corpus methods. Does anyone have any suggestions?

Many thanks, James.

As the README says, the names corpus is not in the public domain, so you should email any changes you make to the corpus author (the address is in that file). Apart from that detail of law and courtesy, you can simply replace either or both of those files with your own. They are in a very simple format: one name per line, with comments allowed (and ignored) on lines that start with '#'.

To install a totally new corpus rather than just tweaking an existing one, you could start with the docs given here.
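If you would rather not touch the installed names corpus at all, one possible approach is to point one of NLTK's own corpus readers at a directory that holds your two files. The following is a minimal sketch, assuming NLTK is installed and that `WordListCorpusReader` (the reader class for one-word-per-line files) suits your data. The temporary directory, the helper `load_names_corpus`, and the sample names are made up for illustration; in practice you would pass the folder that contains your real male.txt and female.txt.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader


def load_names_corpus(root):
    """Wrap male.txt and female.txt found under `root` in an NLTK corpus reader.

    `load_names_corpus` is a hypothetical helper, not part of NLTK.
    """
    return WordListCorpusReader(root, ["male.txt", "female.txt"])


# Hypothetical sample data so the sketch runs on its own;
# replace corpus_dir with the directory holding your real files.
corpus_dir = tempfile.mkdtemp()
samples = [("male.txt", ["James", "Oliver"]), ("female.txt", ["Alice", "Maria"])]
for filename, names in samples:
    with open(os.path.join(corpus_dir, filename), "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")

my_names = load_names_corpus(corpus_dir)

# Same style of access as the built-in corpora: list the files, read the words.
male_names = my_names.words("male.txt")
female_names = my_names.words("female.txt")
print(my_names.fileids())
print(male_names)
print(female_names)
```

This keeps your extended lists separate from the files shipped with NLTK, so a later update of the NLTK data will not overwrite them. How comment lines are treated depends on the reader and the NLTK version, so it is worth checking the output if your files contain lines starting with '#'.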