When an RNN repeatedly applies the input-to-hidden state-transition operation to compress a sequence of arbitrary length into a fixed-length representation, it runs into the problem of being overly sensitive to perturbations of the hidden state.
Mathematical formalization of dropout:
y = f(W · d(x)), where

d(x) = mask ⊙ x, if in the training phase
d(x) = (1 − p) · x, otherwise

Here p is the dropout rate and mask is a binary vector generated from a Bernoulli distribution with probability 1 − p.

RNNDROP changes the conventional practice of "sampling a different mask at every time step to drop hidden units" and proposes a new strategy (shown in the figure below) with two characteristics: 1) it generates the dropout mask only at the beginning of each training sequence and fixes it through the sequence; 2) it drops both the non-recurrent and recurrent connections.

Reference: Moon T, Choi H, Lee H, et al. RNNDROP: A novel dropout for RNNs in ASR. IEEE Automatic Speech Recognition and Understanding (ASRU), 2016: 65-70.
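As a reference point, here is a minimal NumPy sketch of the d(x) above together with the per-sequence masking idea of RNNDROP; the function and variable names (dropout, seq_mask, the toy recurrence) are illustrative assumptions, not taken from the paper or any released code.

```python
import numpy as np

def dropout(x, p, train=True, mask=None):
    """d(x) from the formula above: mask the units during training,
    rescale by (1 - p) at test time. If `mask` is given it is reused,
    which gives the per-sequence masking used by RNNDROP."""
    if not train:
        return (1.0 - p) * x
    if mask is None:
        mask = (np.random.rand(*x.shape) > p).astype(x.dtype)
    return mask * x

# Per-sequence masking: sample the mask once at the start of the sequence
# and keep it fixed for every time step of that sequence.
p = 0.5
hidden_dim = 8
seq_mask = (np.random.rand(hidden_dim) > p).astype(np.float32)

h = np.zeros(hidden_dim, dtype=np.float32)
for x_t in np.random.randn(10, hidden_dim).astype(np.float32):
    # Toy recurrence: the same seq_mask drops the recurrent connection at every step.
    h = np.tanh(x_t + dropout(h, p, train=True, mask=seq_mask))
```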
A simple RNN and its dropout:

RNN: h_t = f(W_h · [x_t, h_{t-1}] + b_h)
With dropout: h_t = f(W_h · [x_t, d(h_{t-1})] + b_h), where d(·) is the dropout function.

LSTM: c_t = f_t ⊙ c_{t-1} + i_t ⊙ d(g_t)
GRU: h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ d(g_t)

In principle, masks can be applied to any subset of the gates, cells, and states. A sketch of the LSTM variant follows below.

Reference: Semeniuta S, Severyn A, Barth E. Recurrent Dropout without Memory Loss. 2016.
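Below is a minimal sketch of the LSTM variant (dropout applied only to the candidate update g_t), assuming a single stacked weight matrix W that produces all four gate pre-activations; the function name and parameter layout are assumptions for illustration, not Semeniuta et al.'s implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_recurrent_dropout(x_t, h_prev, c_prev, W, b, p=0.25, train=True):
    """One LSTM step with dropout applied only to the candidate update g_t,
    i.e. c_t = f_t * c_{t-1} + i_t * d(g_t): the cell memory itself is never dropped."""
    pre = W @ np.concatenate([x_t, h_prev]) + b   # all four gate pre-activations, stacked
    i, f, o, g = np.split(pre, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    if train:
        g = (np.random.rand(*g.shape) > p) * g    # d(g_t): Bernoulli mask during training
    else:
        g = (1.0 - p) * g                         # test-time rescaling, as in d(x) above
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```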
For a multi-layer LSTM network, dropout is applied stochastically to the vertical connections only, that is, it decides whether the hidden state of an LSTM unit in layer L is allowed to flow into the corresponding unit in layer L+1. The dashed lines in the figure mark the connections subject to dropout (a sketch follows below).

(Figure: information flow after the dropout operation.)

Reference: Zaremba W, Sutskever I, Vinyals O. Recurrent Neural Network Regularization. ICLR 2015.
Code: https://github.com/wojzaremba/lstm
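To make the "vertical connections only" idea concrete, here is a sketch in which plain tanh cells stand in for the LSTM units of the paper; the params layout and function names are assumptions, and the released Lua/Torch code linked above is the authoritative version.

```python
import numpy as np

def dropout(x, p, train=True):
    """Same d(x) as above: Bernoulli mask during training, rescale by (1 - p) otherwise."""
    if not train:
        return (1.0 - p) * x
    return (np.random.rand(*x.shape) > p) * x

def rnn_cell(x_t, h_prev, W, U, b):
    return np.tanh(W @ x_t + U @ h_prev + b)

def stacked_rnn_forward(xs, params, p=0.5, train=True):
    """Zaremba-style regularization on a stack of recurrent cells: dropout is applied
    only to the vertical connection a layer passes up to the next layer, while every
    recurrent h_{t-1} -> h_t connection inside a layer is left untouched."""
    hs = [np.zeros(pr["U"].shape[0]) for pr in params]   # one hidden state per layer
    outputs = []
    for x_t in xs:
        inp = x_t
        for l, pr in enumerate(params):
            hs[l] = rnn_cell(inp, hs[l], pr["W"], pr["U"], pr["b"])
            inp = dropout(hs[l], p, train)   # the dashed (vertical) connection
        outputs.append(inp)
    return outputs
```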
In the figure, dashed lines denote connections without dropout, while solid lines of different colours denote different dropout masks.

Conventional dropout for RNNs: use different masks at different time steps.
Variational-inference-based dropout: uses the same dropout mask at each time step, including the recurrent layers.

Concretely (as the solid-line colours in panel (b) of the figure show): for each connection matrix, a Bernoulli mask is sampled once, and that same mask is then reused at every subsequent time step (see the sketch below).

Reference: Gal Y. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. 2015.
Code: http://yarin.co/BRNN
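A minimal sketch of this per-sequence masking for a simple tanh RNN (the paper works with LSTMs; this simplified version shows only the fixed input and recurrent masks); the function name and the test-time rescaling convention follow the d(x) above and are assumptions, not Gal's released code.

```python
import numpy as np

def variational_rnn_forward(xs, W, U, b, p=0.5, train=True):
    """Variational (Gal-style) dropout for a simple RNN: one Bernoulli mask for the
    input connection and one for the recurrent connection are sampled per sequence
    and reused at every time step; at test time the inputs are rescaled by (1 - p)."""
    hidden = U.shape[0]
    if train:
        mask_x = (np.random.rand(xs.shape[1]) > p).astype(float)   # fixed input mask
        mask_h = (np.random.rand(hidden) > p).astype(float)        # fixed recurrent mask
    else:
        mask_x = np.full(xs.shape[1], 1.0 - p)
        mask_h = np.full(hidden, 1.0 - p)
    h = np.zeros(hidden)
    hs = []
    for x_t in xs:
        h = np.tanh(W @ (mask_x * x_t) + U @ (mask_h * h) + b)
        hs.append(h)
    return np.stack(hs)
```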
Zoneout (applied to an LSTM): at each time step, a random subset of units copies its activations from the previous time step instead of being updated:

c_t = d_t^c ⊙ c_{t-1} + (1 − d_t^c) ⊙ (f_t ⊙ c_{t-1} + i_t ⊙ g_t)
h_t = d_t^h ⊙ h_{t-1} + (1 − d_t^h) ⊙ (o_t ⊙ tanh(f_t ⊙ c_{t-1} + i_t ⊙ g_t))

where d_t^c and d_t^h are binary random vectors of 0s and 1s.
Reference: Krueger D, Maharaj T, Kramár J, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. ICLR 2017.
Code: http://github.com/teganmaharaj/zoneout
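For illustration, a sketch of one zoneout LSTM step under the equations above; the function name, weight layout, and zoneout rates z_c / z_h are assumptions, and the test-time branch uses the expected (convex-mix) update, which mirrors the paper's described inference behaviour but should be checked against the released code linked above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zoneout_lstm_step(x_t, h_prev, c_prev, W, b, z_c=0.15, z_h=0.05, train=True):
    """One LSTM step with zoneout: d_t^c and d_t^h are Bernoulli vectors, and a unit
    with d = 1 copies its previous value instead of taking the new update."""
    pre = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(pre, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c_prev + i * g          # f_t * c_{t-1} + i_t * g_t
    h_new = o * np.tanh(c_new)          # o_t * tanh(c_new)
    if train:
        d_c = (np.random.rand(*c_prev.shape) < z_c).astype(float)  # 1 -> keep old cell value
        d_h = (np.random.rand(*h_prev.shape) < z_h).astype(float)  # 1 -> keep old hidden value
        c_t = d_c * c_prev + (1.0 - d_c) * c_new
        h_t = d_h * h_prev + (1.0 - d_h) * h_new
    else:
        # Assumed test-time behaviour: use the expectation of the stochastic update.
        c_t = z_c * c_prev + (1.0 - z_c) * c_new
        h_t = z_h * h_prev + (1.0 - z_h) * h_new
    return h_t, c_t
```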
