Topic Pattern Recognition Based on Text Patterns

xiaoxiao  2021-02-28  46

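     The previous posts introduced several different classifiers. Classification of that kind corresponds to supervised learning as used elsewhere. Sometimes, though, we do not know the topic categories in advance, and the problem then corresponds to unsupervised learning. If that can be made to work, we can first identify topics with machine learning and then add manual labels, which gives us a powerful and practical topic library.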

   Let's look at how to implement this. The main steps are:

(1)  Load the data, including the input data to be classified, along with the stop words, the stemmer, and the tokenizer.

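def load_data(input_file):
    # Read the input file and return one document per line
    data = []
    with open(input_file, 'r') as f:
        for line in f.readlines():
            # Strip only the trailing newline, so a last line without one is not truncated
            data.append(line.rstrip('\n'))
    return data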
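As a quick check of the loading step, the sketch below writes a small temporary file and reads it back, one document per line. The helper name `load_lines` and the file contents are assumptions made for illustration and are not part of the original post. It uses `splitlines()` so that a final line without a trailing newline keeps its last character.

```python
import os
import tempfile


def load_lines(input_file):
    """Return the lines of a text file, one document per list entry."""
    with open(input_file, 'r', encoding='utf-8') as f:
        # splitlines() drops line endings without cutting the last character
        # of a final line that has no trailing newline
        return f.read().splitlines()


# Write two sample sentences to a temporary file (hypothetical content)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('He spent a lot of time studying cryptography.\n')
    tmp.write('His movement off the ball is really great.')
    path = tmp.name

docs = load_lines(path)
os.remove(path)
print(len(docs))
```

Each entry of `docs` is then one raw document, ready for the preprocessing step.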

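(2)  Preprocess the data. The snippets below assume these imports:

      from nltk.tokenize import RegexpTokenizer
      from nltk.corpus import stopwords
      from nltk.stem.snowball import SnowballStemmer
      from gensim import models, corpora

  ①  Tokenize the lowercased text with a regular expression:
      tokens = RegexpTokenizer(r'\w+').tokenize(input_text.lower())
  ②  Load the English stop words:
      stop_words_english = stopwords.words('english')
  ③  Remove the stop words, using the list from step ②:
      tokens_stopwords = [x for x in tokens if x not in stop_words_english]
  ④  Define a stemmer:
      stemmer = SnowballStemmer('english')
  ⑤  Stem the remaining tokens:
      tokens_stemmed = [stemmer.stem(x) for x in tokens_stopwords]

(3)  Build a dictionary from the preprocessed documents, where processed_tokens holds the stemmed token list of every document:
      dict_tokens = corpora.Dictionary(processed_tokens)

(4)  Build the document-term matrix, which is the form the learning step expects:
      corpus = [dict_tokens.doc2bow(text) for text in processed_tokens]

(5)  Run topic modeling with LDA, setting its parameters:
      ldamodel = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dict_tokens, passes=25)

(6)  Once the topics are identified, print the recognition rule for each one:
      item = ldamodel.print_topics(num_topics=num_topics, num_words=num_words)
      print(item)  # item holds the recognition rules of the two topic models

    Topic 0 ==> 0.063*"need" + 0.062*"order" + 0.037*"encrypt" + 0.037*"modern"

    Topic 1 ==> 0.052*"need" + 0.031*"train" + 0.031*"develop" + 0.031*"younger"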
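To make steps (3) and (4) more concrete, here is a minimal pure-Python sketch of the idea behind a token dictionary and a bag-of-words corpus. Each distinct token gets an integer id, and each document becomes a list of (id, count) pairs. The function names `build_dictionary` and `to_bow` and the sample tokens are assumptions made for illustration. In the post this role is played by gensim's `corpora.Dictionary` and `doc2bow`, whose id assignment may differ from this sketch.

```python
from collections import Counter


def build_dictionary(processed_tokens):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in processed_tokens:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document to a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical, already preprocessed documents
processed_tokens = [
    ['need', 'order', 'encrypt', 'need'],
    ['need', 'train', 'develop'],
]

token2id = build_dictionary(processed_tokens)
corpus = [to_bow(doc, token2id) for doc in processed_tokens]
print(token2id)
print(corpus)
```

The sparse (id, count) form keeps only the tokens that actually occur in a document, which is why it is a convenient input for a topic model over a large vocabulary.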

 The test data in this example is:

   data = ['He spent a lot of time studying cryptography. ',
           'You need to have a very good understanding of modern encryption systems in order to work there.',
           "If their team doesn't win this match, they will be out of the competition.",
           'Those codes are generated by a specialized machine. ',
           'The club needs to develop a policy of training and promoting younger talent. ',
           'His movement off the ball is really great. ',
           'In order to evade the defenders, he needs to move swiftly.',
           'We need to make sure only the authorized parties can read the message.']
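Applied to the test data, the tokenization and stop-word steps from (2) can be approximated with the standard library alone. The sketch below is a rough stand-in, not the post's NLTK pipeline: the function name `preprocess` and the tiny `STOP_WORDS` set are assumptions for illustration, and stemming is left out because it relies on NLTK's `SnowballStemmer`.

```python
import re

# Small hand-picked stop-word set (an assumption for this sketch; the post
# uses NLTK's full English list via stopwords.words('english'))
STOP_WORDS = {'a', 'are', 'by', 'have', 'in', 'of', 'there', 'those',
              'to', 'very', 'you'}


def preprocess(input_text):
    """Lowercase the text, keep runs of word characters, drop stop words."""
    tokens = re.findall(r'\w+', input_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# Two sentences taken from the test data
data = [
    'You need to have a very good understanding of modern encryption '
    'systems in order to work there.',
    'Those codes are generated by a specialized machine. ',
]

processed_tokens = [preprocess(text) for text in data]
for tokens in processed_tokens:
    print(tokens)
```

Words such as "need", "order", "modern", and "encryption" survive the filtering, which fits the terms that appear in the topic rules once stemming is applied.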

When reposting, please cite the original URL: https://www.6miu.com/read-2625447.html
