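TensorFlow's built-in class tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None) returns an object that converts the words of a document into numeric indices. max_document_length is the length of each document (here, each sentence) after conversion: longer documents are truncated and shorter ones are padded with 0. min_frequency (default 0) is the minimum number of times a word must occur to be kept in the vocabulary.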
    import numpy as np
    from tensorflow.contrib import learn

    # Five sample SMS messages, already lower-cased and stripped of punctuation.
    texts = ['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat',
             'ok lar joking wif u oni',
             'free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry questionstd txt ratetcs apply overs',
             'u dun say so early hor u c already then say',
             'nah i dont think he goes to usf he lives around here though']
    texts2 = texts[0:5]  # not used below

    # Each document becomes a row of exactly 20 word ids.
    vocab_processor = learn.preprocessing.VocabularyProcessor(20, min_frequency=1)
    transformed_texts = np.array([x for x in vocab_processor.transform(texts)])
    print(transformed_texts)

Output:

    [[   1    2    3 ...   18   19   20]
     [  21   22   23 ...    0    0    0]
     [  27   28    8 ...   32   41   28]
     ...
     [7687  302    8 ...    0    0    0]
     [ 128 3066  205 ...  166   68   54]
     [3173   64 1156 ...    0    0    0]]

The elided rows and the ids in the thousands show that this output comes from a run on a larger corpus than the five sentences listed above.
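Because tf.contrib is not available in TensorFlow 2.x, the idea behind the class can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper names build_vocab and to_ids are hypothetical, tokenization is plain whitespace splitting, and the `>=` comparison against min_frequency is an assumption about the threshold. It only shows the two behaviors described above: mapping words to integer ids, and truncating or zero-padding every document to max_document_length.

```python
from collections import Counter


def build_vocab(texts, min_frequency=0):
    """Assign ids 1, 2, 3, ... to tokens in order of first appearance.

    Tokens seen fewer than `min_frequency` times are left out (assumed
    threshold, see the note above). Id 0 is reserved for padding and
    for words that are not in the vocabulary.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {}
    for text in texts:
        for tok in text.split():
            if counts[tok] >= min_frequency and tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_ids(texts, vocab, max_document_length):
    """Convert each text to a list of exactly `max_document_length` ids."""
    rows = []
    for text in texts:
        # Unknown words map to 0; long documents are cut off.
        ids = [vocab.get(tok, 0) for tok in text.split()][:max_document_length]
        # Short documents are padded with 0 on the right.
        ids += [0] * (max_document_length - len(ids))
        rows.append(ids)
    return rows


texts = ['ok lar joking wif u oni',
         'u dun say so early hor u c already then say']
vocab = build_vocab(texts)
ids = to_ids(texts, vocab, max_document_length=8)
for row in ids:
    print(row)
```

With max_document_length=8, the first sentence (6 words) is padded with two zeros and the second (11 words) is cut to its first 8 ids, giving [1, 2, 3, 4, 5, 6, 0, 0] and [5, 7, 8, 9, 10, 11, 5, 12].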