nltk wordpunct_tokenize vs word_tokenize

Question:

Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4 and there's nothing on the doc string of wordpunct_tokenize that explains the difference. I couldn't find this info either in the documentation of nltk (perhaps I didn't search in the right place!). I would have expected that first one would get rid of punctuation tokens or the like, but it doesn't.

wordpunct_tokenize is based on a simple regexp tokenization. It is defined as

wordpunct_tokenize = WordPunctTokenizer().tokenize

which you can find in the NLTK source. Basically it uses the regular expression \w+|[^\w\s]+ to split the input.
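
If you want to see that pattern at work without going through NLTK, here is a minimal sketch using only Python's re module (the helper name wordpunct_like is just an illustrative stand-in, not an NLTK API):

import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace -- the same pattern quoted above.
WORDPUNCT_RE = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    return WORDPUNCT_RE.findall(text)

print(wordpunct_like("I'm a dog and it's great!"))
# ['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!']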

word_tokenize, on the other hand, is based on a TreebankWordTokenizer; see the NLTK documentation. It basically tokenizes text the way the Penn Treebank does. Here is a silly example that should show how the two differ.

sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
 'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
 'Hey', "',", 'she', "'", 'll', 'say', '!']

As we can see, wordpunct_tokenize splits at pretty much every special symbol and treats the pieces as separate units. word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be all that smart, though, since it fails to separate the initial single quote from 'Hey'.
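
If you want to convince yourself of the TreebankWordTokenizer connection, here is a small sketch (for a single sentence only; word_tokenize also runs sentence splitting first and may apply a few extra quote rules depending on the NLTK version, so longer or fancier texts can differ):

from nltk import word_tokenize
from nltk.tokenize import TreebankWordTokenizer

sent = "I'm a dog and it's great!"

# For one simple sentence the two calls should produce the same tokens.
print(word_tokenize(sent))
print(TreebankWordTokenizer().tokenize(sent))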

Interestingly, if we write the sentence like this instead (single quotes as string delimiters and double quotes around "Hey"):

sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'

we get

>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
 ',', 'she', "'ll", 'say', '!']

So word_tokenize does split off the double quotes, but it also converts them to `` and ''. wordpunct_tokenize doesn't do this:

>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
 'Hey', '",', 'she', "'", 'll', 'say', '!']
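
If you would rather keep plain double quotes in the word_tokenize output, one option is a small post-processing step (just a sketch, not an NLTK feature) that maps the Penn Treebank quote tokens back:

from nltk import word_tokenize

sent = 'I\'m a dog and it\'s great! "Hey", she\'ll say!'

# word_tokenize emits `` for an opening double quote and '' for a
# closing one; map both back to a plain " character.
tokens = ['"' if tok in ('``', "''") else tok for tok in word_tokenize(sent)]
print(tokens)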