What created "maxent_treebank_pos_tagger/english.pickle"?
The nltk package's built-in part-of-speech tagger does not seem to be optimized for my use-case (here, for instance). The source code here shows that it's using a saved, pre-trained classifier called maxent_treebank_pos_tagger.
What created maxent_treebank_pos_tagger/english.pickle? I'm guessing that there is a tagged corpus out there somewhere that was used to train this tagger, so I think I'm looking for (a) that tagged corpus and (b) the exact code that trains the tagger based on the tagged corpus.
In addition to lots of googling, so far I tried to look at the .pickle object directly to find any clues inside it, starting like this:
from nltk.data import load

# Load the pickled tagger through NLTK's resource loader, which searches
# the nltk_data directories on nltk.data.path for this resource path.
x = load("taggers/maxent_treebank_pos_tagger/english.pickle")
dir(x)  # list the object's attributes for clues about how it was built
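
Continuing from the snippet above, one can also check the object's concrete class and, if it turns out to be one of NLTK's ClassifierBasedTagger instances (an assumption on my part), pull out the underlying classifier:

from nltk.tag.sequential import ClassifierBasedTagger

print(type(x))  # the tagger's concrete class

# If the pickle holds a ClassifierBasedTagger (an assumption), its
# classifier() accessor exposes the trained model wrapped inside it.
if isinstance(x, ClassifierBasedTagger):
    print(type(x.classifier()))

print(x.tag(["This", "is", "a", "test"]))  # sanity check: it tags tokens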
The NLTK source is https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L83
The original source of NLTK's MaxEnt POS tagger is from https://github.com/arne-cl/nltk-maxent-pos-tagger
Training Data: Wall Street Journal subset of the Penn Treebank corpus
Features: Ratnaparkhi (1996)
Algorithm: Maximum Entropy
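
For part (b) of the question, NLTK's public API can reproduce the general recipe even if not the exact pickle: train a classifier-based POS tagger whose underlying model is maximum entropy. The sketch below is an approximation, not a recreation of english.pickle: it uses the freely distributed ~10% Penn Treebank sample that ships with NLTK instead of the full licensed WSJ corpus, and NLTK's built-in feature detector rather than Ratnaparkhi's (1996) exact feature set.

from nltk.corpus import treebank
from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

# Run nltk.download("treebank") first if the corpus sample is missing.
# This sample is roughly 10% of the WSJ portion of the Penn Treebank;
# the shipped english.pickle was trained on the full, licensed corpus.
train_sents = treebank.tagged_sents()

# Train a classifier-based tagger backed by a maximum-entropy model.
# max_iter is kept small so the sketch finishes in reasonable time.
tagger = ClassifierBasedPOSTagger(
    train=train_sents,
    classifier_builder=lambda feats: MaxentClassifier.train(feats, max_iter=10),
)

print(tagger.tag(["This", "is", "a", "test"]))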