Python re.split() vs nltk word_tokenize and sent_tokenize

Question:

I am going through this question.

I am just wondering whether NLTK would be faster than regex for word/sentence tokenization.

By default, nltk.word_tokenize() uses the Penn Treebank tokenizer.

Do note that str.split() doesn't produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

str.split() is usually used to separate strings on a specified delimiter, e.g. in a tab-separated file you can use str.split('\t'), or str.split('\n') when your text file has one sentence per line.
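To make the contrast concrete, here is a small sketch that uses only the standard library. The regex pattern and the sample row are illustrative assumptions, not anything taken from NLTK; the pattern just separates runs of word characters from single punctuation marks, which is much cruder than the Treebank rules.

```python
import re

sent = "This is a foo, bar sentence."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sent.split()

# Hypothetical regex tokenizer: a run of word characters, or a single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
regex_tokens = TOKEN_RE.findall(sent)

# Delimiter-based splitting, the case str.split() is really meant for.
row = "id\tlabel\ttext"
fields = row.split("\t")

print(whitespace_tokens)
print(regex_tokens)
print(fields)
```

On this one sentence the regex happens to agree with word_tokenize(), but it has no handling for contractions, quotes or abbreviations.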

Now let's do some benchmarking in Python 3:

import time
from nltk import word_tokenize

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print ('str.split():\t', time.time() - start)

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print ('word_tokenize():\t', time.time() - start)

[out]:

str.split():     0.05451083183288574
str.split():     0.054320573806762695
str.split():     0.05368804931640625
str.split():     0.05416440963745117
str.split():     0.05299568176269531
str.split():     0.05304527282714844
str.split():     0.05356955528259277
str.split():     0.05473494529724121
str.split():     0.053118228912353516
str.split():     0.05236077308654785
word_tokenize():     4.056122779846191
word_tokenize():     4.052812337875366
word_tokenize():     4.042144775390625
word_tokenize():     4.101543664932251
word_tokenize():     4.213029146194458
word_tokenize():     4.411528587341309
word_tokenize():     4.162556886672974
word_tokenize():     4.225975036621094
word_tokenize():     4.22914719581604
word_tokenize():     4.203172445297241

If we try another tokenizer, the ToktokTokenizer from nltk.tokenize, in the same loop:

[out]:

toktok:  1.5902607440948486
toktok:  1.5347232818603516
toktok:  1.4993178844451904
toktok:  1.5635688304901123
toktok:  1.5779635906219482
toktok:  1.8177132606506348
toktok:  1.4538452625274658
toktok:  1.5094449520111084
toktok:  1.4871931076049805
toktok:  1.4584410190582275

(Note: the source of the text file is https://github.com/Simdiva/DSL-Task)
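The timing loops above all share one shape, so they can be folded into a small helper. This is only a sketch: time_tokenizer and the stand-in corpus are hypothetical names introduced here, and the regex is an illustrative pattern rather than one of NLTK's.

```python
import re
import timeit

# Stand-in corpus (an assumption); use data.split('\n') to time the real file.
lines = ["This is a foo, bar sentence."] * 1000

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # illustrative pattern, not NLTK's


def time_tokenizer(tokenize, lines, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` full passes."""
    def run():
        for line in lines:
            tokenize(line)
    return min(timeit.repeat(run, number=1, repeat=repeat))


results = {
    "str.split()": time_tokenizer(str.split, lines),
    "regex findall": time_tokenizer(TOKEN_RE.findall, lines),
}
for name, seconds in results.items():
    print(f"{name}:\t{seconds:.6f}")
```

Any callable that takes a string can be passed in, so with NLTK installed the same helper should also accept word_tokenize or ToktokTokenizer().tokenize.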

If we look at the native Perl implementation, the Python and Perl times for the ToktokTokenizer are comparable. Do note that in the Python implementation the regexes are pre-compiled, while in the Perl one they aren't; but the proof is still in the pudding:

alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’

100%[===============================================================================================================================>] 2,690       --.-K/s   in 0s      

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]

alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’

100%[===============================================================================================================================>] 3,483,550    363KB/s   in 7.4s   

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]

alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.703s
user    0m1.693s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.715s
user    0m1.704s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.700s
user    0m1.686s
sys 0m0.012s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.727s
user    0m1.700s
sys 0m0.024s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.734s
user    0m1.724s
sys 0m0.008s

(Note: when timing tok-tok.pl, we had to redirect the output to a file, so the timing here includes the time the machine takes to write the output, whereas the nltk.tokenize.ToktokTokenizer timing doesn't include any time for writing to a file.)
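On the pre-compilation point made before the Perl timings, the difference in Python can be measured with a sketch like the one below. The punctuation-padding rule is a made-up stand-in, not one of the Toktok regexes. Keep in mind that Python's re module caches recently compiled patterns, so the uncompiled version mostly pays for a cache lookup rather than a full recompilation.

```python
import re
import timeit

line = "This is a foo, bar sentence."
PATTERN = r"([,.!?])"  # illustrative rule: pad punctuation with spaces


def tokenize_uncompiled(text):
    # The pattern string is passed to re.sub() on every call.
    return re.sub(PATTERN, r" \1 ", text).split()


COMPILED = re.compile(PATTERN)


def tokenize_precompiled(text):
    # The pattern object is built once, outside the function.
    return COMPILED.sub(r" \1 ", text).split()


# Both variants must produce the same tokens before timing means anything.
assert tokenize_uncompiled(line) == tokenize_precompiled(line)

print('uncompiled:\t', timeit.timeit(lambda: tokenize_uncompiled(line), number=10000))
print('precompiled:\t', timeit.timeit(lambda: tokenize_precompiled(line), number=10000))
```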

With regard to sent_tokenize(), it's a little different: comparing speed benchmarks without considering accuracy is a little quirky.

Consider this:

  • If a regex "splits" a text file/paragraph into 1 sentence, then it is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...

  • If the sentences in a file are already separated by \n, then it is simply a case of comparing str.split('\n') vs re.split('\n'), and NLTK would have nothing to do with the sentence tokenization ;P

For information on how sent_tokenize() works in NLTK, see:

  • training data format for nltk punkt
  • Use of PunktSentenceTokenizer in NLTK

So to compare sent_tokenize() effectively against other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy and have a dataset with human-evaluated sentences in a tokenized format.
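One way such an accuracy check could look is sketched below. Exact-match precision and recall are an assumed metric chosen for simplicity, and one_sentence, naive_regex_split and precision_recall are hypothetical helpers; the gold sentences are a short excerpt from the task text used further down.

```python
import re


def one_sentence(text):
    # Degenerate "tokenizer": does no work and returns the whole text.
    return [text]


def naive_regex_split(text):
    # Illustrative rule: split after ., ! or ? when whitespace follows.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def precision_recall(predicted, gold):
    # A predicted sentence counts as correct only if it matches a gold one exactly.
    gold_set = set(gold)
    correct = sum(1 for s in predicted if s in gold_set)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


text = ("And so toward the end of the year he went abroad to be initiated "
        "into the higher secrets of the order.What is to be done in these "
        "circumstances? To favor revolutions, overthrow everything, repel "
        "force by force?")
gold = [
    "And so toward the end of the year he went abroad to be initiated "
    "into the higher secrets of the order.",
    "What is to be done in these circumstances?",
    "To favor revolutions, overthrow everything, repel force by force?",
]

for splitter in (one_sentence, naive_regex_split):
    print(splitter.__name__, precision_recall(splitter(text), gold))
```

The do-nothing splitter is the fastest possible and scores zero on both measures, which is why speed figures on their own say little about a sentence tokenizer.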

Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

We want to get this:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you will get 0 correct results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>> 
>>> output = text.split('\n')
>>> sum(1 for sent in output if sent in answer.split('\n'))
0
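For comparison, a hand-written regex can recover some of these boundaries, including the ones with no space after the punctuation, but it is easy to find inputs where it breaks. The rule below is an illustrative assumption and not how sent_tokenize() works; regex_sent_split is a hypothetical helper.

```python
import re

# Illustrative rule (an assumption, not NLTK's algorithm): split after
# ., ! or ? whenever the next non-space character is an uppercase letter.
# Splitting on a possibly empty match needs Python 3.7 or newer.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s*(?=[A-Z])")


def regex_sent_split(text):
    return [s for s in BOUNDARY_RE.split(text.strip()) if s]


excerpt = ("What is to be done in these circumstances? To favor revolutions, "
           "overthrow everything, repel force by force?No! We are very far "
           "from that.")

# Splitting on newlines returns the excerpt unchanged, as one piece.
newline_pieces = excerpt.split("\n")

# The regex finds all four sentences here, even across "force?No!".
regex_pieces = regex_sent_split(excerpt)

# The same rule wrongly splits after an abbreviation.
abbreviation_pieces = regex_sent_split("Dr. Smith arrived. He sat down.")

print(len(newline_pieces), newline_pieces)
print(len(regex_pieces), regex_pieces)
print(len(abbreviation_pieces), abbreviation_pieces)
```

The quoted dialogue at the end of the task text would also trip this rule, since a sentence there starts with a quotation mark rather than an uppercase letter.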