Python Natural Language Processing (2): Accessing Text Corpora and Lexical Resources

I. Accessing Text Corpora

  A text corpus is a large body of text. It usually contains many individual texts, but for convenience of processing we concatenate them end to end and treat them as a single text.

1. The Gutenberg Corpus

  NLTK includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). Example:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>>

The output lists the texts of this corpus that ship with NLTK. We can work with any of them.

1) Counting words. Example:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>>

2) Concordancing a text. Example:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
 that Emma could not but feel some surprise , and a little displeasure , on he
>>>

3) Getting the characters, words, and sentences of a text. Example:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set([w.lower() for w in gutenberg.words(fileid)])
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt

>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]


Come, said my soul,
Such verses for my Body let us write, (for we are one,)
That should I after return,
Or, long, long hence, in other spheres,
There to some group of mates the chants resuming,
(Tallying Earth's soil, trees, winds, tumultuous waves,)
Ever with pleas'd smile I may keep on,
Ever and ever yet the verses owning--as, first, I here and now
Signing for Soul and Body, set to them my name,

Walt Whitman



[BOOK I.  INSCRIPTIONS]

}  One's-Self I Sing

One's-self I sing, a simple separate person,
Yet utter the word Democratic, the word En-Masse.

Of physiology from top to toe I sing,
Not physiognomy alone nor brain alone is worthy for the Muse, I say
    the Form complete is worthier far,
The Female equally with the Male I sing.

Of Life immense in passion, pulse, and power,
Cheerful, for freest action form'd under the laws divine,
The Modern Man I sing.



}  As I Ponder'd in Silence

As I ponder'd in silence,
Returning upon my poems, c"
>>>
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'], ['Come',
',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body', 'let',
'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should',
'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',',
'in', 'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates',
'the', 'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil',
',', 'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with',
'pleas', "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and',
'ever', 'yet', 'the', 'verses', 'owning', '--', 'as', ',', 'first', ',', 'I',
'here', 'and', 'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set',
'to', 'them', 'my', 'name', ','], ...]

Here raw() gives the original character content of the file, without any linguistic processing; words() divides the text into words; and sents() divides it into sentences, where each sentence is stored as a list of words. Besides words(), raw(), and sents(), most NLTK corpus readers offer a variety of other access methods.
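For example, sents() lets us ask structural questions directly, such as finding the longest sentence in a text. A small sketch using the same gutenberg reader (output omitted, as the result is a very long sentence from Macbeth):

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> longest_len = max(len(s) for s in macbeth_sentences)   # length of the longest sentence
>>> longest = [s for s in macbeth_sentences if len(s) == longest_len]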

2. Web and Chat Text

Project Gutenberg contains thousands of books; they are relatively formal and represent established literature. Beyond that, NLTK also ships a small collection of web text, including content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal ads, and wine reviews. Example of accessing these texts:

 
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s   %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt   Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt   SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt   White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt   PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt   25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt   Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>

3. Instant Messaging Chat Sessions Corpus

This corpus was originally collected by the Naval Postgraduate School for research on the automatic detection of Internet predators. It contains over 10,000 posts split into 15 files, where each file holds several hundred posts gathered from a chat room on a given date for users of a given age range. The file names encode the date, chat room, and number of posts.
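A minimal sketch of accessing it, assuming NLTK's standard nps_chat reader and one of its published file names:

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]          # one post, as a list of tokens (output truncated here)
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', ...]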

4. The Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, built from 500 texts from different sources, categorized by genre, such as news and editorial. It is mainly used to study systematic differences between genres, a kind of linguistic inquiry known as stylistics. We can access the corpus as a list of words or a list of sentences.

1) Reading by a specific category or file

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews',
'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end',
'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which',
'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''",
'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted',
'.'], ...]
>>>

2) Comparing the use of modal verbs across genres

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s:%d" % (m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
>>>
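This counts modals in the news genre only. To compare several genres at once, a conditional frequency distribution can pair each genre with its words and then tabulate the counts. A sketch using NLTK's standard ConditionalFreqDist API (the genre selection here is illustrative):

>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in genres
...     for word in brown.words(categories=genre))
>>> cfd.tabulate(conditions=genres, samples=modals)   # one row per genre, one column per modal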

5. The Reuters Corpus

 The Reuters Corpus contains 10,788 news documents, totaling 1.3 million words. The documents are classified into 90 topics and split into a training set and a test set; this split is designed for training and testing algorithms that automatically detect a document's topic. Unlike the Brown Corpus, the categories in the Reuters Corpus overlap, since a news story often covers more than one topic. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. Example:

>>> from nltk.corpus import reuters
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...]
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871',
'test/15875', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]
>>>
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French',
'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]
>>>

6. The Inaugural Address Corpus

This corpus is actually a collection of 56 texts, one for each presidential inaugural address. A notable property of this collection is its time dimension.

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt',
'1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt',
'1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt',
'1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt',
'1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt',
'1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt',
'1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt',
'1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt',
'1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt',
'1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt',
'1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt',
'1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt',
'1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt',
'1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt',
'1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt',
'2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821',
'1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857',
'1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893',
'1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929',
'1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965',
'1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001',
'2005', '2009']
>>>

Note that the year of each text appears in its file name. To extract the year, use fileid[:4], which takes the first four characters.

>>> import nltk
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()

[Plot: counts of words beginning with america or citizen in each inaugural address, by year]

The example above shows how the words america and citizen have been used over time. Every word in the Inaugural Address Corpus that starts with america or citizen is counted. Each address is counted separately and plotted, so the change in usage over time can be observed. Note that the counts are not normalized for document length.
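If normalization is wanted, one simple option is to divide each year's count by the total number of words in that address. A rough sketch continuing the session above (the dictionaries below are illustrative helpers, not part of NLTK):

>>> lengths = {fileid[:4]: len(inaugural.words(fileid))
...            for fileid in inaugural.fileids()}
>>> america_rate = {year: cfd['america'][year] / lengths[year]
...                 for year in lengths}   # per-word frequency in each address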

7. Annotated Text Corpora

 Many text corpora contain linguistic annotations such as part-of-speech tags, named entities, syntactic structures, and semantic roles. NLTK provides convenient ways to access several of these corpora, and it distributes corpora and corpus samples that are free to download for teaching and research.
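For instance, the Brown Corpus ships with part-of-speech tags, which its reader exposes through tagged_words() (output shown truncated):

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]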

8. Corpora in Other Languages

NLTK also includes corpora in many other languages. For example, udhr contains the Universal Declaration of Human Rights in over 300 languages. The fileids of this corpus include information about the character encoding used in each file, such as UTF8 or Latin1. We can use a conditional frequency distribution to examine differences in word length across language versions of the Universal Declaration of Human Rights in the udhr corpus. Example:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>>
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
>>>

[Plot: cumulative frequency of word lengths for the six udhr languages]

9. Basic Corpus Functions Defined in NLTK

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified files
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified files
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of the locally installed corpus
readme()                     the contents of the README file of the corpus
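A few of these functions applied to the Gutenberg corpus, as a quick illustration (the exact path and encoding returned will vary with the local installation, so outputs are bound to variables rather than shown):

>>> from nltk.corpus import gutenberg
>>> emma_chars = len(gutenberg.raw('austen-emma.txt'))   # 887071, matching the table earlier
>>> enc = gutenberg.encoding('austen-emma.txt')          # file encoding, if known
>>> path = gutenberg.abspath('austen-emma.txt')          # location on disk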

10. Loading Your Own Corpus

>>> from nltk.corpus import *
>>> corpus_root = r"E:\corpora"    # local directory holding your texts (the stock nltk data lives under D: here)
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()            # list the files found under corpus_root
['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt',
'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'luo.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
'whitman-leaves.txt']
>>>

Here 'aaaaaaaaaaa.txt' is a file we added ourselves. Once your own corpus has loaded successfully, you can apply the various corpus functions to its contents, as sketched below.
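A minimal sketch continuing the session above (the file name is the custom file from the listing; outputs depend on its contents, so they are bound to variables):

>>> custom_words = wordlists.words('aaaaaaaaaaa.txt')   # the file's tokens
>>> custom_sents = wordlists.sents('aaaaaaaaaaa.txt')   # its sentences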
