[Paper Note] Batch normalization (unfinished)

Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Notation:

Internal covariate shift: the distribution of each layer's inputs changes during training, as the parameters of the preceding layers change.

This slows down training by requiring lower learning rates and careful parameter initialization.

Covariate shift: the input distribution to a learning system changes.

Saturated regime: the region of a saturating nonlinearity, such as the sigmoid activation function, where the gradient is close to zero.
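To make the saturated regime concrete, here is a minimal sketch (not from the paper) comparing the sigmoid's gradient at the center of its range with its gradient far from zero. The helper names `sigmoid` and `sigmoid_grad` are just illustrative.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# At x = 0 the gradient takes its maximum value, 0.25.
grad_center = sigmoid_grad(0.0)

# At x = 10 the sigmoid is saturated and the gradient is about 4.5e-5,
# so almost no learning signal flows back through this unit.
grad_saturated = sigmoid_grad(10.0)

print(grad_center, grad_saturated)
```

Once a layer's inputs drift into this region, updates to the layers below become very slow, which is why the distribution of the inputs matters.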

Problem:

Internal covariate shift: layers need to continuously adapt to the new distributions of their inputs.

BN effects

BN takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates training.

It benefits gradient flow through the network by reducing the dependence of gradients on the scale of the parameters and on their initial values.

It regularizes the model and reduces the need for Dropout.

It makes it possible to use saturating nonlinearities, by preventing the network from getting stuck in the saturated modes.

With BN, the paper matches the performance of the best-performing ImageNet classification network using only 7% of the training steps.
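One way to see the reduced dependence on parameter scale: normalizing over the batch cancels any scalar factor `a` applied to the weights. The sketch below checks this numerically. The function `batch_normalize`, the weights `W`, the inputs `u`, and the factor `a` are all made up for illustration, and `eps` is set to 0 so that the invariance is exact rather than approximate.

```python
import numpy as np

def batch_normalize(x, eps=0.0):
    """Normalize each column (feature) of x over the batch axis."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(8, 4))   # mini-batch of 8 inputs with 4 features
W = rng.normal(size=(4, 3))   # hypothetical layer weights
a = 10.0                      # arbitrary positive scale factor

out = batch_normalize(u @ W)
out_scaled = batch_normalize(u @ (a * W))

# The normalized outputs agree: the scale of W has been divided out.
scale_invariant = np.allclose(out, out_scaled)
print(scale_invariant)
```

With a small positive `eps`, as used in practice for numerical stability, the two outputs would agree only approximately.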

Main points

BN is a transformation applied to the layer inputs x that normalizes their distribution over a mini-batch.

It has long been known (LeCun et al., 1998b; Wiesler & Ney, 2011) that a network converges faster if its inputs are whitened, that is, transformed to have zero mean and unit variance, and decorrelated. But full whitening is costly and not everywhere differentiable, so we make two simplifications.

First, normalize each scalar feature independently, rather than whitening the layer inputs and outputs jointly:

$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$

This speeds up convergence even when the features are not decorrelated. Note that simply normalizing each input of a layer may change what the layer can represent. To address this, we make sure that the transformation inserted in the network can represent the identity transform, by introducing a pair of learnable parameters $\gamma^{(k)}$, $\beta^{(k)}$ for each activation to scale and shift the normalized value:

$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$

Open question: how do we make sure the network will learn to represent the identity transform? (Setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathrm{E}[x^{(k)}]$ recovers the original activations, so the identity is at least representable.)

Second, since we use SGD, the statistics cannot be computed over the whole training set at each step, so the normalization is applied per mini-batch: each mini-batch produces estimates of the mean and variance of each activation.
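The transformation described above can be sketched in a few lines of NumPy. This is a minimal training-time forward pass under my own assumptions, not a full implementation: the function name `batch_norm_forward` is made up, and `eps` is a small constant added for numerical stability that these notes have not introduced. The last part checks that a suitable choice of gamma and beta recovers the original inputs.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Per-feature batch normalization over a mini-batch.

    x:     array of shape (batch_size, num_features)
    gamma: per-feature scale, shape (num_features,)
    beta:  per-feature shift, shape (num_features,)
    """
    mu = x.mean(axis=0)                    # mini-batch mean of each feature
    var = x.var(axis=0)                    # mini-batch variance of each feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized value
    y = gamma * x_hat + beta               # scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 5))  # toy mini-batch

# With gamma = 1 and beta = 0 the output is just the normalized value:
# approximately zero mean and unit variance for every feature.
y, x_hat, mu, var = batch_norm_forward(x, np.ones(5), np.zeros(5))

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization,
# so the layer is able to represent the identity transform.
eps = 1e-5
y_identity, _, _, _ = batch_norm_forward(x, np.sqrt(var + eps), mu, eps)
recovers_input = np.allclose(y_identity, x)
print(recovers_input)
```

Whether training actually drives gamma and beta to these values is left to the optimizer; the sketch only shows that the identity is representable.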

Let x be a layer input, treated as a vector, and X be the set of these inputs over the training dataset. The normalization can then be written as a transformation $\hat{x} = \mathrm{Norm}(x, X)$, that is

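$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}}$ (element-wise, with the mean and variance taken over X), which depends not only on x itself but also on all training examples in X, each of which depends on the parameters θ.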
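A small numerical sketch of this dependence: the same input x gets a different normalized value when the other examples in the set change. The function `norm` and the two toy sets `X_a` and `X_b` are made up for illustration, and the normalization is the simple per-feature one (no decorrelation).

```python
import numpy as np

def norm(x, X):
    """Normalize x using the per-feature mean and variance of the set X."""
    return (x - X.mean(axis=0)) / np.sqrt(X.var(axis=0))

x = np.array([1.0, 2.0])

# Two sets that both contain x but differ in their last example.
X_a = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]])
X_b = np.array([[1.0, 2.0], [3.0, 6.0], [11.0, 10.0]])

x_hat_a = norm(x, X_a)
x_hat_b = norm(x, X_b)

# The normalized value of x changes even though x itself did not.
depends_on_others = not np.allclose(x_hat_a, x_hat_b)
print(x_hat_a, x_hat_b, depends_on_others)
```

This is why the gradient computation has to account for the statistics as functions of the whole set, and not treat them as constants.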