Internal covariate shift: Inputs of each layer change during training, as the previous parameters change.
This slows down the training and requires lower learning rate et careful.
Covariate shift : The input distribution to a learning system change
The saturated regime of non-linearity of Sigmoid activate function-
BN is a transformation which applies on layer inputs x so as to normalize the distribution of mini-batch. (LeCun et al., 1998b; Wiesler & Ney, 2011) proved that the network converges faster if its inputs are whitened, that means inputs are zero mean, unit variances and decorrelated, convergence could be faster But the full whitening is costly and no everywhere differentiable, we make two simplifications: Normalize each scalar feature independently, not layer inputs and outputs jointly. x̂ =1N∑i=0x Such operation speeds up convergence, even when the features are not decorrelated. Note that simply normalizing the input layer may change what the layer can represent. To adresse this, we make sure that the transformation insert in the network can represent the identity transform by introduce paires of learnable parameters γ(k) , β(k) to scale and shift the normalized value: yk=γ(k)x̂ (k)+β(k) how to make sure the network will learn to represent a identity network?Since we use SGD, the normalization could be applied only on batch of data, not on the whole set. So in our case, each mini-batch produces estimates of the mean and variance of each activation. - 啊Let x be a layer input, treated as a vector, and X be the set of these inputs over the training dataset, the normalization can then be written as a transformation: x̂ =Norm(x,X) , that is
x̂ =1N∑i=0x which depends not only on x itself bu also all training exemples - each of which depends on θ在一般的神经网络中，每个网络层的输出可以作为一个n为向量，n 是神经元的个数。从BN的原理可以看出，它是对一个神经元的操作，我们可以选择对所有神经元进行BN，也可以选择只对部分神经元进行该操作。对每个被归一化的神经元，都有一个可学习的参数对 (λ,β) 。通常我们对所有神经元进行批量归一化。
但是在CNN中，如果每个神经元都有一个可学习的参数对，那么参数数量将急剧增加：假设输入层是一个大小为 （n∗c∗p∗g） 的张量，c是channel数量, n是特征图的数量，p，g是特征图的高度和宽度，那么加入BN后，参数数量增加了 n∗c∗p∗g∗2 个，这便太恐怖了。因此在CNN中，我们采用权值共享的方式，对每个feature map，采用同样的 (λ,β) ， 这样参数便只有 n∗c∗2 个了。