Natural Language Processing with Python, Chapter 2

xiaoxiao 2021-02-28

Chapter 2: Accessing Text Corpora and Lexical Resources

import nltk
from nltk.corpus import brown

# Count words per genre across every category in the Brown corpus.
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

# Same idea, restricted to two genres and built from an explicit list of pairs.
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
cfd = nltk.ConditionalFreqDist(genre_word)
cfd['news']

# Output:
FreqDist({u'sunbonnet': 1, u'Elevated': 1, u'narcotic': 2, u'four': 73, u'woods': 4, u'railing': 1, u'Until': 5, u'aggression': 1, u'marching': 2, u'increase': 24, u'eligible': 4, ...})

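Generators

A list comprehension creates a list directly. Memory is finite, though, so a list can only grow so large. Building a list of one million elements takes a lot of storage, and if we only need the first few elements, the space held by all the others is wasted. So if the elements can be computed by some rule, can we compute each one as the loop reaches it? That way the full list never has to be built, which saves a great deal of space. In Python, this compute-as-you-iterate mechanism is called a generator.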
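To make the contrast concrete, here is a minimal sketch that compares a list comprehension with a generator expression. The names `squares_list` and `squares_gen` are made up for illustration, and the exact sizes reported by `sys.getsizeof` vary by Python version, so the example only checks that the generator object is the smaller one.

```python
import sys

# List comprehension: all one million squares are computed and stored up front.
squares_list = [n * n for n in range(1000000)]

# Generator expression: same syntax with parentheses; nothing is computed yet.
squares_gen = (n * n for n in range(1000000))

# Pull only the first three values; the rest are never computed.
first_three = [next(squares_gen) for _ in range(3)]
print(first_three)  # [0, 1, 4]

# getsizeof measures the container object itself, not the integers it refers to.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(gen_size < list_size)  # True
```

A generator can only be walked once: after the three `next` calls above, the next value it yields is 9, and the earlier values cannot be revisited without creating a new generator.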

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)             # a generator, not a list
cfd = nltk.ConditionalFreqDist(bigrams)  # consumes the generator
print(bigrams)
print(cfd)

# Output:
<generator object bigrams at 0x000000000751F090>
<ConditionalFreqDist with 2789 conditions>

cfd['living']

# Output:
FreqDist({u',': 1, u'.': 1, u'creature': 7, u'soul': 1, u'substance': 2, u'thing': 4})
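The bigram example above can be imitated with the standard library alone, which may help show what is going on. This is a rough sketch, not NLTK's implementation: `bigram_pairs` and `conditional_counts` are hypothetical helper names, and the token list is a made-up sentence rather than the Genesis corpus.

```python
from collections import Counter, defaultdict

def bigram_pairs(tokens):
    """Lazily yield adjacent (w1, w2) pairs, one at a time."""
    previous = None
    first = True
    for token in tokens:
        if not first:
            yield (previous, token)
        previous = token
        first = False

def conditional_counts(pairs):
    """Map each condition (first word) to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table

# Made-up sample text, only for illustration.
tokens = "the living creature and the living soul and the living creature".split()

pairs = bigram_pairs(tokens)       # a generator object; no pairs exist yet
table = conditional_counts(pairs)  # iterating here consumes the generator

print(table["living"].most_common())  # [('creature', 2), ('soul', 1)]

# The generator is now exhausted, so a second pass yields nothing.
leftover = list(pairs)
print(leftover)  # []
```

This also shows why printing `bigrams` in the session above gives a generator object rather than a list of pairs: the pairs are produced only while something iterates over them, and once the counts are built they are gone.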

When reposting, please credit the original address: https://www.6miu.com/read-57084.html
