Feeding your own data when training seq2seq

xiaoxiao, 2021-02-28

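Add the following code to translate.py (https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/translate.py), so that its result replaces the train_set the script would otherwise build: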

from functools import reduce  # built in on Python 2, must be imported on Python 3

# Special tokens and their reserved ids.
_PAD = b"_PAD"
_GO = b"_GO"
_EOS = b"_EOS"
_UNK = b"_UNK"
PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3


def vectorize_data(data, word_idx):
    # Map every word to its id; words missing from word_idx map to UNK_ID.
    Q = []
    for line in data:
        ss = []
        for word in line:
            if word not in word_idx:
                ss.append(UNK_ID)
            else:
                ss.append(word_idx[word])
        Q.append(ss)
    return Q


def load_data(file):
    # Non-blank lines alternate: a Chinese sentence, then its English translation.
    with open(file) as f:
        lines = f.readlines()
    chinese_data = []
    english_data = []
    index = 0
    for line in lines:
        if line == "\n":
            continue
        words_list = [word.strip("\n") for word in line.split(' ')]
        if index % 2 == 0:
            chinese_data.append(words_list)
        else:
            english_data.append(words_list)
        index += 1
    return chinese_data, english_data


chinese_data, english_data = load_data('./data.txt')

# Build the Chinese vocabulary; ids 0-3 are reserved for the special tokens.
temp = reduce(lambda x, y: x + y, chinese_data)
chinese_vocab = set(temp)
chinese_word_idx = dict((c, i + 4) for i, c in enumerate(chinese_vocab))
chinese_word_idx[_PAD] = PAD_ID
chinese_word_idx[_GO] = GO_ID
chinese_word_idx[_EOS] = EOS_ID
chinese_word_idx[_UNK] = UNK_ID
sentence_max_word_number_chinese = max(map(len, chinese_data))

# Build the English vocabulary the same way.
temp = reduce(lambda x, y: x + y, english_data)
english_vocab = set(temp)
english_word_idx = dict((c, i + 4) for i, c in enumerate(english_vocab))
english_word_idx[_PAD] = PAD_ID
english_word_idx[_GO] = GO_ID
english_word_idx[_EOS] = EOS_ID
english_word_idx[_UNK] = UNK_ID
sentence_max_word_number_english = max(map(len, english_data))

# Convert the sentences to id sequences; each target sentence ends with EOS_ID.
chinese_ids = vectorize_data(chinese_data, chinese_word_idx)
english_ids = vectorize_data(english_data, english_word_idx)
for line in english_ids:
    line.append(EOS_ID)

# Put each pair into the first bucket it fits (_buckets is defined in translate.py).
data_set = [[] for _ in _buckets]
for chinese_line, english_line in zip(chinese_ids, english_ids):
    for bucket_id, (source_size, target_size) in enumerate(_buckets):
        if len(chinese_line) < source_size and len(english_line) < target_size:
            data_set[bucket_id].append([chinese_line, english_line])
            break

train_set = data_set  # replaces the original train_set
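One detail of the bucketing loop is easy to miss: a pair that does not fit even the largest bucket is silently dropped. The sketch below isolates that step so the effect is visible. It assumes the default bucket list from translate.py, [(5, 10), (10, 15), (20, 25), (40, 50)]; the helper name assign_buckets and the sample id sequences are hypothetical and exist only for illustration.

```python
# Assumed to match the default _buckets in translate.py.
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]


def assign_buckets(pairs, buckets):
    """Place each (source_ids, target_ids) pair in the first bucket it fits.

    Returns the per-bucket data set and the number of pairs that fit nowhere.
    """
    data_set = [[] for _ in buckets]
    dropped = 0
    for source_ids, target_ids in pairs:
        for bucket_id, (source_size, target_size) in enumerate(buckets):
            if len(source_ids) < source_size and len(target_ids) < target_size:
                data_set[bucket_id].append([source_ids, target_ids])
                break
        else:
            # No bucket was large enough for this pair.
            dropped += 1
    return data_set, dropped


# Hypothetical id sequences; 2 stands for EOS_ID at the end of each target.
pairs = [
    ([4, 5, 6], [7, 8, 9, 2]),      # 3 source ids, 4 target ids
    ([4] * 12, [7] * 12 + [2]),     # 12 source ids, 13 target ids
    ([4] * 60, [7] * 10 + [2]),     # 60 source ids: longer than any bucket
]
data_set, dropped = assign_buckets(pairs, _buckets)
print([len(bucket) for bucket in data_set], dropped)
```

If the dropped count is large for your own corpus, the bucket sizes probably need to be increased or the long sentences split.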

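What the data file looks like (one sentence per line, each Chinese line followed by its English translation):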

纽约比加州早三个小时
New York is 3 hours ahead of California
但这没有让加州变慢
but it does not make California slow
有人 22岁毕业了
Someone graduated at the age of 22
但等了五年才找到好的工作
but waited 5 years before securing a good job
有人 25岁当上 CEO
Someone became a CEO at 25
却在 50岁去世
and died at 50
然而另一个人 50岁当上 CEO
While another became a CEO at 50
然后活到 90岁
and lived to 90 years
有人依然单身
Someone is still single
然而也有人已经结婚
while someone else got married
奥巴马 55岁退休
Obama retires at 55
但川普 70岁开始
but Trump starts at 70
本来世界上每个人在自己的时区工作
Absolutely everyone in this world works based on their Time Zone
身边有人可能看似走在你前面
People around you might seem to go ahead of you
有人可能看似在你后面
some might seem to be behind you.
但每个人正在以他们的速度奔跑在他们自己的时区
But everyone is running their own RACE in their own TIME.
不要嫉妒或嘲笑他们
Don’t envy them or mock them
他们在他们的时区你在你的
They are in their TIME ZONE and you are in yours
生命是关于等待正确的时机行动
Life is about waiting for the right moment to act
所以放轻松
So RELAX
你没有落后
You’re not LATE
你没有领先
You’re not EARLY
你非常准时在命运为你安排的时区
You are very much ON TIME, and in your TIME ZONE Destiny set up for you.
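The loader depends on strict alternation: once blank lines are skipped, even-numbered lines are read as Chinese and odd-numbered lines as English, and tokens are separated by spaces. A Chinese sentence with no spaces therefore becomes a single token. The sketch below shows the same pairing on an in-memory sample, assuming Python 3. The helper parse_pairs is hypothetical, and it uses split() rather than split(' '), so it also tolerates repeated spaces.

```python
def parse_pairs(text):
    """Split alternating source/target lines into two lists of token lists."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        # A missing translation would shift every following pair by one line.
        raise ValueError("expected an even number of non-blank lines")
    source = [line.split() for line in lines[0::2]]
    target = [line.split() for line in lines[1::2]]
    return source, target


sample = """有人 25岁当上 CEO
Someone became a CEO at 25

却在 50岁去世
and died at 50
"""
source, target = parse_pairs(sample)
print(source)
print(target)
```

Checking for an even line count up front is a cheap guard: a single missing line in the file would otherwise pair every later Chinese sentence with the wrong English one.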

When reposting, please credit the original address: https://www.6miu.com/read-250373.html
