TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

xiaoxiao, 2021-02-28

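I am using TfidfVectorizer from scikit-learn to extract features from text data. I have a CSV file with a Score column (either +1 or -1) and a Review column (text). I load this data into a DataFrame so that I can run the vectorizer on it. The code is as follows: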

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train_new.csv", names=['Score', 'Review'], sep=',')

# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

This raises the following error:

ValueError: np.nan is an invalid document, expected byte or unicode string.
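The error means at least one Review value is missing. The commented-out check `df['Review'] == np.nan` in the question's code cannot find such rows, because NaN compares unequal to everything, including itself. Below is a minimal sketch of locating them with `isna()` instead; the small DataFrame is a hypothetical stand-in for train_new.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_new.csv; the second review is missing.
df = pd.DataFrame({
    "Score": [1, -1, 1],
    "Review": pd.Series(["great product", np.nan, "would buy again"], dtype=object),
})

# Comparing with == never matches, because NaN != NaN.
eq_mask = df["Review"] == np.nan

# isna() is the reliable way to flag missing values.
nan_mask = df["Review"].isna()

print(eq_mask.any())   # whether the == comparison found anything
print(df[nan_mask])    # the rows that actually trigger the ValueError
```

Once the offending rows are visible, it is easier to decide whether to drop them or convert them to strings.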

Solution:

x = v.fit_transform(df['Review'].values.astype('U')) ## Even astype(str) would work
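One side effect worth knowing: casting with `astype('U')` turns each NaN into the literal string 'nan', which the vectorizer then treats as an ordinary token. If that is not wanted, replacing missing values with an empty string via `fillna('')` is one possible alternative. This is not part of the original answer. The sketch below compares the two on hypothetical sample data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample reviews; the second one is missing.
reviews = pd.Series(["great product", np.nan, "would buy again"], dtype=object)

# Option 1 (the fix above): cast everything to unicode strings.
# NaN becomes the string 'nan'.
as_unicode = reviews.values.astype("U")
v1 = TfidfVectorizer()
x1 = v1.fit_transform(as_unicode)

# Option 2 (an alternative): replace NaN with an empty document.
filled = reviews.fillna("")
v2 = TfidfVectorizer()
x2 = v2.fit_transform(filled)

print("nan" in v1.vocabulary_)  # is 'nan' a learned token after astype('U')?
print("nan" in v2.vocabulary_)  # is 'nan' a learned token after fillna('')?
print(x1.shape, x2.shape)
```

Both options keep the row count unchanged, so the resulting matrix still lines up with the Score column. Dropping the rows with `dropna` would require dropping the matching scores as well.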

As the documentation shows:

fit_transform(raw_documents, y=None)

Parameters:
    raw_documents : iterable
        an iterable which yields either str, unicode or file objects