是什么产生了"maxent_treebank_pos_tagger/english.pickle"?

Problem description:

The nltk package's built-in part-of-speech tagger does not seem to be optimized for my use-case (here, for instance). The source code here shows that it's using a saved, pre-trained classifier called maxent_treebank_pos_tagger.
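
For concreteness, the tagger I mean is the one behind nltk.pos_tag; in older NLTK releases that call loaded the pickle above under the hood (newer releases ship a different default model):

import nltk

# Requires nltk.download("punkt") and a tagger model.
text = "The built-in tagger mislabels my domain-specific text."
print(nltk.pos_tag(nltk.word_tokenize(text)))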

What created maxent_treebank_pos_tagger/english.pickle? I'm guessing that there is a tagged corpus out there somewhere that was used to train this tagger, so I think I'm looking for (a) that tagged corpus and (b) the exact code that trains the tagger based on the tagged corpus.
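
I assume the general recipe looks something like the sketch below, built from NLTK's own MaxentClassifier and its bundled 10% sample of the Penn Treebank; this is only my guess at the shape of the training code, not the actual script that produced english.pickle (the real feature extractor and corpus differ):

from nltk.corpus import treebank
from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

# NLTK ships only a 10% sample of the Penn Treebank
# (nltk.download("treebank")); the distributed pickle was presumably
# trained on the full WSJ subset, which is not freely redistributable.
train_sents = treebank.tagged_sents()

tagger = ClassifierBasedPOSTagger(
    train=train_sents,
    classifier_builder=lambda toks: MaxentClassifier.train(toks, max_iter=10),
)
print(tagger.tag(["What", "created", "this", "pickle", "?"]))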

In addition to lots of googling, I have so far tried looking at the .pickle object directly for any clues inside it, starting like this:

from nltk.data import load

# nltk.data.load() resolves relative paths against the entries in
# nltk.data.path, so the resource name should not include "nltk_data/".
x = load("taggers/maxent_treebank_pos_tagger/english.pickle")
dir(x)
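
The dir() listing is noisy; checking the object's type is more telling, and assuming the pickle deserializes to a standard NLTK tagger (a TaggerI subclass), it should also answer to tag():

print(type(x))  # the class the pickle deserializes to
print(x.tag(["This", "is", "a", "test", "."]))  # TaggerI interface, assuming a standard tagger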

Answer:

The NLTK source is https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L83

The original source of NLTK's MaxEnt POS tagger is https://github.com/arne-cl/nltk-maxent-pos-tagger, which documents it as follows:

Training data: Wall Street Journal subset of the Penn Treebank corpus

Features: Ratnaparkhi (1996)

Algorithm: Maximum Entropy

Accuracy: see "What is the accuracy of nltk pos_tagger?" (a rough evaluation sketch follows)
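
As a rough illustration of that last point, the following sketch scores whatever model nltk.pos_tag currently loads against the gold tags in NLTK's bundled Penn Treebank sample; the sample overlaps the original training data, so the printed figure is optimistic:

import nltk
from nltk.corpus import treebank

# Requires nltk.download("treebank") plus a tagger model.
gold_sents = treebank.tagged_sents()[:200]

correct = total = 0
for sent in gold_sents:
    words = [w for (w, _) in sent]
    for (w, guess), (_, gold) in zip(nltk.pos_tag(words), sent):
        correct += guess == gold
        total += 1

print("accuracy: %.3f" % (correct / total))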