Common words that appear in virtually every document, such as "the" and "a" in English, are called stop words. In text feature extraction, stop words are routinely filtered out via a blacklist, which helps improve model performance.
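As a quick illustration (a minimal sketch with an invented toy corpus, not part of the original experiment), scikit-learn's CountVectorizer accepts `stop_words='english'` to apply its built-in English blacklist, which removes words like "the" from the learned vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "a dog chased the cat"]

# Default configuration: stop words are kept in the vocabulary
plain = CountVectorizer()
plain.fit(corpus)

# stop_words='english' filters tokens on scikit-learn's built-in English list
filtered = CountVectorizer(stop_words='english')
filtered.fit(corpus)

print(sorted(plain.vocabulary_))     # includes 'the'
print(sorted(filtered.vocabulary_))  # 'the' has been filtered out
```

Content words such as "cat" and "dog" survive in both vocabularies; only the blacklisted common words are dropped.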
Next, we revisit the "20 Newsgroups text classification" problem, focusing on how to use the two text-feature quantization models described above and comparing their performance.

Python source:

# coding=utf-8
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Download the data; subset='all' fetches the full dataset
news = fetch_20newsgroups(subset='all')

# Split the data: 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# Vectorize the text (the default configuration does not filter English stop words)
count_vec = CountVectorizer()
X_count_train = count_vec.fit_transform(X_train)
# Transform (not fit_transform) the test set so it shares the training vocabulary
X_count_test = count_vec.transform(X_test)

# Train a Multinomial Naive Bayes classifier with the default configuration
mnb_count = MultinomialNB()
mnb_count.fit(X_count_train, y_train)

# Evaluate performance
print('The accuracy of classifying 20newsgroups using Naive Bayes '
      '(CountVectorizer without filtering stopwords):',
      mnb_count.score(X_count_test, y_test))
y_count_predict = mnb_count.predict(X_count_test)
print(classification_report(y_test, y_count_predict, target_names=news.target_names))

# Vectorize with TF-IDF (again without filtering stop words)
tfidf_vec = TfidfVectorizer()
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)

# Train and evaluate the same classifier on the TF-IDF features
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, y_train)
print('The accuracy of classifying 20newsgroups with Naive Bayes '
      '(TfidfVectorizer without filtering stopwords):',
      mnb_tfidf.score(X_tfidf_test, y_test))
y_tfidf_predict = mnb_tfidf.predict(X_tfidf_test)
print(classification_report(y_test, y_tfidf_predict, target_names=news.target_names))

Result: With TfidfVectorizer and stop words left in, the default Naive Bayes classifier achieves higher test-set accuracy than with CountVectorizer, and precision, recall, and F1 all improve. This indicates that when the training corpus is large, using TfidfVectorizer to suppress the interference of common words with the classification decision improves model performance.

Next, we verify the earlier claim: stop words are routinely filtered out via a blacklist during text feature extraction, and doing so improves model performance.

Python source:

# coding=utf-8
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Download the data; subset='all' fetches the full dataset
news = fetch_20newsgroups(subset='all')

# Split the data: 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# Initialize both vectorizers with English stop-word filtering enabled
count_filter_vec = CountVectorizer(analyzer='word', stop_words='english')
tfidf_filter_vec = TfidfVectorizer(analyzer='word', stop_words='english')

X_count_filter_train = count_filter_vec.fit_transform(X_train)
# Calling fit_transform on the test set would cause a dimension mismatch;
# transform with the already-fitted vectorizer instead
X_count_filter_test = count_filter_vec.transform(X_test)

X_tfidf_filter_train = tfidf_filter_vec.fit_transform(X_train)
X_tfidf_filter_test = tfidf_filter_vec.transform(X_test)

mnb_count_filter = MultinomialNB()
mnb_count_filter.fit(X_count_filter_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes '
      '(CountVectorizer with filtering stopwords):',
      mnb_count_filter.score(X_count_filter_test, y_test))
y_count_filter_predict = mnb_count_filter.predict(X_count_filter_test)

# Score and predict with the TF-IDF model on the TF-IDF test features
mnb_tfidf_filter = MultinomialNB()
mnb_tfidf_filter.fit(X_tfidf_filter_train, y_train)
print('The accuracy of classifying 20newsgroups using Naive Bayes '
      '(TfidfVectorizer with filtering stopwords):',
      mnb_tfidf_filter.score(X_tfidf_filter_test, y_test))
y_tfidf_filter_predict = mnb_tfidf_filter.predict(X_tfidf_filter_test)

print(classification_report(y_test, y_count_filter_predict, target_names=news.target_names))
print(classification_report(y_test, y_tfidf_filter_predict, target_names=news.target_names))

Result: The output again shows that TfidfVectorizer's feature extraction and quantization approach has the advantage. Comparing the runs, the models that filter stop words during feature extraction outperform their unfiltered counterparts by roughly 3%-4% in overall performance on average.
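The reason TF-IDF suppresses common words can be seen directly in the IDF weights the vectorizer learns. In a minimal sketch below (the toy corpus is invented, not from the experiment above), a word that appears in every document receives the smallest possible IDF under scikit-learn's smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, while rarer words get larger weights:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the game was a great game",
    "the election was over",
    "the president spoke today",
]

vec = TfidfVectorizer()
vec.fit(corpus)

# idf_ is aligned with the column indices stored in vocabulary_
idf_the = vec.idf_[vec.vocabulary_['the']]    # appears in all 3 documents
idf_game = vec.idf_[vec.vocabulary_['game']]  # appears in only 1 document

# 'the' occurs in every document, so idf = ln(4/4) + 1 = 1.0, the minimum;
# 'game' occurs in one document, so idf = ln(4/2) + 1 ≈ 1.693
print(idf_the, idf_game)
```

Because every term frequency is multiplied by its IDF, ubiquitous words like "the" contribute far less to the final feature vector than discriminative words, which is exactly the "suppression" effect observed in the experiment above.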