Trial and Error + Intuition
But gradient descent never guarantees a global minimum
Deep → Modularization
Each basic classifier can have sufficient training examples
Each is shared by the following classifiers as a module
The modularization is automatically learned from data
Do not always blame overfitting
Reason for overfitting: training data and testing data can be different
Panacea for overfitting: have more training data, or create more training data
Different approaches for different problems
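As one illustration of "creating more training data", a numpy sketch that augments an image set with flips and small shifts; the specific transforms are assumptions, not a prescription:

```python
import numpy as np

def augment(images):
    """Create extra training examples from existing ones (illustrative transforms)."""
    augmented = [images]
    # Horizontal flip: mirror each image along its width axis.
    augmented.append(images[:, :, ::-1])
    # Small shift: roll each image two pixels to the right (with wraparound).
    augmented.append(np.roll(images, shift=2, axis=2))
    return np.concatenate(augmented, axis=0)

# images: (N, height, width) array; the augmented set is 3x larger.
images = np.random.rand(100, 28, 28)
print(augment(images).shape)  # (300, 28, 28)
```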
Choosing a proper loss
Square error (mse): $\sum_i (y_i - \hat{y}_i)^2$
Cross entropy (categorical_crossentropy): $-\sum_i \hat{y}_i \ln y_i$
When using a softmax output layer, choose cross entropy
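A minimal sketch, assuming a Keras-style API (the layer sizes are illustrative): with a softmax output layer, compile with cross entropy rather than square error.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(500, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),  # softmax output layer
])
# Far from the target, cross entropy gives much larger gradients than
# square error, so training with a softmax output converges more reliably.
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# model.compile(loss='mse', optimizer='sgd')  # square error: usually trains poorly here
```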
Mini-batch
Mini-batch is faster
1. Randomly initialize network parameters
2. Pick the 1st batch, update parameters once
3. Pick the 2nd batch, update parameters once
4. …
5. Until all mini-batches have been picked (one epoch finished)
Repeat the above process (steps 2-5)
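A numpy sketch of this loop, using a plain linear model with square error so the example is self-contained; the model, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

def train_minibatch(x, y, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent for a linear model y ~ x @ w (illustrative)."""
    n, d = x.shape
    w = np.random.randn(d) * 0.01              # randomly initialize parameters
    for epoch in range(epochs):
        order = np.random.permutation(n)       # reshuffle so batches differ per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]       # pick the next mini-batch
            xb, yb = x[idx], y[idx]
            grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # gradient of square error
            w -= lr * grad                              # update parameters once
        # all mini-batches picked -> one epoch finished; repeat
    return w
```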
New activation function
Vanishing gradient problem: layers near the input receive much smaller gradients and learn very slowly
RBM pre-training (an earlier remedy)
Rectified Linear Unit (ReLU):
Fast to compute
Biological reason
Equivalent to infinitely many sigmoids with different biases
The active units form a thinner linear network
A special case of Maxout
Handles the vanishing gradient problem
ReLU variants (e.g. leaky ReLU; see the sketch below)
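A numpy sketch of ReLU and two of the ideas above (a leaky-ReLU variant and maxout); the 0.01 slope and group size of 2 are assumptions:

```python
import numpy as np

def relu(z):
    # Output is z for z > 0 and 0 otherwise: fast to compute, and active
    # units pass gradients through unchanged, so they do not vanish.
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # ReLU variant: a small slope for z < 0 keeps a nonzero gradient everywhere.
    return np.where(z > 0, z, slope * z)

def maxout(z, group_size=2):
    # Maxout over one example's pre-activations: take the max within each
    # group of linear units. ReLU is the special case where one unit in
    # each pair is fixed at zero.
    return z.reshape(-1, group_size).max(axis=1)
```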
Adaptive Learning Rate
Popular & simple idea: reduce the learning rate by some factor every few epochs
$\eta^t = \frac{\eta}{\sqrt{t+1}}$
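A one-line sketch of this schedule:

```python
def decayed_learning_rate(eta, t):
    # Learning rate at update step t: large at the start, shrinking over time.
    return eta / (t + 1) ** 0.5

# eta = 0.1 -> 0.100, 0.071, 0.058, ... for t = 0, 1, 2, ...
```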
Adagrad
Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$, where $\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$
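A numpy sketch of the Adagrad update above; the small epsilon term is an added assumption for numerical stability:

```python
import numpy as np

def adagrad_update(w, grad, accum, eta=0.01, eps=1e-8):
    """One Adagrad step: divide the learning rate by the root of the
    accumulated squared gradients, so parameters with a history of large
    gradients get smaller effective learning rates."""
    accum = accum + grad ** 2                    # running sum of (g^i)^2 up to step t
    w = w - eta / (np.sqrt(accum) + eps) * grad  # w <- w - eta_w * dL/dw
    return w, accum
```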
Momentum
Movement = negative of $\partial L/\partial w$ + momentum (the movement of the previous step)
Adam = RMSProp (advanced Adagrad) + Momentum
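A sketch of one momentum update, where the movement is an exponentially decaying sum of past gradients; the coefficient 0.9 is an assumed common default. Adam combines this idea with an RMSProp-style adaptive learning rate.

```python
def momentum_update(w, grad, velocity, eta=0.01, beta=0.9):
    # Movement = beta * previous movement - eta * current gradient,
    # so past directions keep contributing (helps roll past plateaus).
    velocity = beta * velocity - eta * grad
    w = w + velocity
    return w, velocity
```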
Early Stopping
Stop training when the error on the validation set stops decreasing
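A minimal sketch using the Keras EarlyStopping callback; the patience value and the commented fit call are illustrative:

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch the validation loss
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best validation epoch
)
# model.fit(x_train, y_train, validation_split=0.1, epochs=100,
#           callbacks=[early_stop])
```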
Weight Decay
Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
Weight decay: $w \leftarrow 0.99\,w - \eta \frac{\partial L}{\partial w}$
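A one-function sketch of the update above: multiplying by 0.99 each step shrinks weights that the gradient does not actively maintain (the 0.99 factor comes from the notes).

```python
def weight_decay_update(w, grad, eta=0.01, decay=0.99):
    # Shrink the weight slightly, then take the usual gradient step;
    # weights that are never pushed up by the gradient decay toward zero.
    return decay * w - eta * grad
```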
Dropout
Training:
Each neuron has a p% chance to be dropped out
The structure of the network is changed
Use the new, thinner network for training
Testing:
No neurons are dropped; if the dropout rate at training was p%, multiply all the weights by (1-p)%
Dropout is a kind of ensemble (see the sketch below)
Worry about training first!
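A numpy sketch of the training/testing behaviour described above; the dropout rate of 0.5 is an assumption:

```python
import numpy as np

def dropout_forward(a, p=0.5, training=True):
    """Apply dropout to a layer's activations `a` with dropout rate p."""
    if training:
        mask = (np.random.rand(*a.shape) >= p)  # each neuron dropped with prob p
        return a * mask                         # a thinner network for this update
    # Testing: nothing is dropped, but activations are scaled by (1 - p)
    # (equivalent to scaling the weights), so the expected input to the
    # next layer matches what it saw during training.
    return a * (1.0 - p)
```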
These ultra deep networks have a special structure
An ultra deep network is an ensemble of many networks with different depths
Ensemble: 6 layers, 4 layers, or 2 layers (see the residual-block sketch after the list below)
FractalNet
Residual Network
Highway Network
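A minimal Keras-style sketch of one residual block, showing the identity shortcut that gives the network its many short paths of different depths; the layer width is an assumption:

```python
from tensorflow import keras

def residual_block(x, units=64):
    """y = x + F(x): the identity shortcut gives every block a short path,
    so the whole network behaves like an ensemble of many depths."""
    h = keras.layers.Dense(units, activation='relu')(x)
    h = keras.layers.Dense(units)(h)
    return keras.layers.Activation('relu')(keras.layers.Add()([x, h]))

inputs = keras.Input(shape=(64,))
outputs = residual_block(residual_block(inputs))  # stack two blocks
model = keras.Model(inputs, outputs)
```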
Attention-based Model
Attention-based Model v2