Count the occurrences of each word in a text - Python
I know that I can find a word in a text/array with this:
if word in text:
    print('success')
What I want to do is read a word in a text and keep counting how many times that word is found (it is a simple counter task). But the thing is, I do not really know how to keep track of the words that have already been read. In the end: count the number of occurrences of each word.
I have thought of saving the words in an array (or even a multidimensional array, so I can save the word and the number of times it appears, or two separate arrays), adding 1 every time a word already in that array appears again.
So then, when I read a word, can I check that it has not been read yet with something similar to this:
if word not in wordsInText:
    print('success')
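For what it's worth, here is a minimal sketch of the dictionary-based counting idea described above, assuming the text has already been split into words; the names text and counts are just placeholders for illustration:

text = "one fish two fish red fish blue fish"
counts = {}
for word in text.split():
    # first time we see this word, start its count at zero
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

print(counts)  # {'one': 1, 'fish': 4, 'two': 1, 'red': 1, 'blue': 1}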
Now that we have established what you're trying to achieve, I can give you an answer. The first thing you need to do is convert the text into a list of words. While the split method might look like a good solution, it will skew the actual counts whenever a word is followed by a full stop, a comma, or any other punctuation character. A good solution for this problem is NLTK. Assume the text you have is stored in a variable called text. The code you are looking for would look something like this:
from itertools import chain
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
text = "This is an example text. Let us use two sentences, so that it is more logical."
# tokenize each sentence into words, then flatten the per-sentence lists into one word list
wordlist = list(chain(*[word_tokenize(s) for s in sent_tokenize(text)]))
print(Counter(wordlist))
# Counter({'.': 2, 'is': 2, 'us': 1, 'more': 1, ',': 1, 'sentences': 1, 'so': 1, 'This': 1, 'an': 1, 'two': 1, 'it': 1, 'example': 1, 'text': 1, 'logical': 1, 'Let': 1, 'that': 1, 'use': 1})
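For comparison, here is a short sketch of what a plain split would give on the same example; note the punctuation that stays attached to the tokens, which is exactly the miscounting problem mentioned above (the output shown in the comment is approximate):

from collections import Counter

text = "This is an example text. Let us use two sentences, so that it is more logical."

# naive whitespace split: 'text.' and 'logical.' keep their full stops,
# so they would not match 'text' or 'logical' appearing elsewhere in a larger text
print(Counter(text.split()))
# e.g. Counter({'is': 2, 'This': 1, 'an': 1, 'example': 1, 'text.': 1, 'Let': 1, 'us': 1, 'use': 1, 'two': 1, 'sentences,': 1, 'so': 1, 'that': 1, 'it': 1, 'more': 1, 'logical.': 1})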