Trial and Error + Intuition
But gradient descent never guarantees a global minimum
Deep → Modularization
Each basic classifier can have sufficient training examples
Each is shared by the following classifiers as a module
The modularization is automatically learned from data
Do not always blame overfitting
Reason for overfitting: training data and testing data can be different
Panacea for overfitting: have more training data, or create more training data
Different approaches for different problems
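As one illustration of "creating more training data", a numpy sketch that augments an image set with flips and small shifts; the specific transforms are assumptions, not a prescription:

```python
import numpy as np

def augment(images):
    """Create extra training examples from existing ones (illustrative transforms)."""
    augmented = [images]
    # Horizontal flip: mirror each image along its width axis.
    augmented.append(images[:, :, ::-1])
    # Small shift: roll each image two pixels to the right (with wraparound).
    augmented.append(np.roll(images, shift=2, axis=2))
    return np.concatenate(augmented, axis=0)

# images: (N, height, width) array; the augmented set is 3x larger.
images = np.random.rand(100, 28, 28)
print(augment(images).shape)  # (300, 28, 28)
```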
Choosing a proper loss
Square error (mse): $\sum_i (y_i - \hat{y}_i)^2$
Cross entropy (categorical_crossentropy): $-\sum_i \hat{y}_i \ln y_i$
When using a softmax output layer, choose cross entropy
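A minimal sketch, assuming a Keras-style API (the layer sizes are illustrative): with a softmax output layer, compile with cross entropy rather than square error.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(500, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),  # softmax output layer
])
# Far from the target, cross entropy gives much larger gradients than
# square error, so training with a softmax output converges more reliably.
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# model.compile(loss='mse', optimizer='sgd')  # square error: usually trains poorly here
```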
Mini-batch
Mini-batch is faster
1. Randomly initialize network parameters
2. Pick the 1st batch, update parameters once
3. Pick the 2nd batch, update parameters once
4. …
5. Until all mini-batches have been picked (one epoch finished)
Repeat the above process (steps 2-5)
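A numpy sketch of this loop, using a plain linear model with square error so the example is self-contained; the model, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

def train_minibatch(x, y, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent for a linear model y ~ x @ w (illustrative)."""
    n, d = x.shape
    w = np.random.randn(d) * 0.01              # randomly initialize parameters
    for epoch in range(epochs):
        order = np.random.permutation(n)       # reshuffle so batches differ per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]       # pick the next mini-batch
            xb, yb = x[idx], y[idx]
            grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # gradient of square error
            w -= lr * grad                              # update parameters once
        # all mini-batches picked -> one epoch finished; repeat
    return w
```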
New activation function
Vanishing gradient problem: layers near the input receive much smaller gradients and learn very slowly
RBM pre-training (an earlier remedy)
Rectified Linear Unit (ReLU):
Fast to compute
Biological reason
Equivalent to infinitely many sigmoids with different biases
The active units form a thinner linear network
A special case of Maxout
Handles the vanishing gradient problem
ReLU variants (e.g. leaky ReLU; see the sketch below)
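A numpy sketch of ReLU and two of the ideas above (a leaky-ReLU variant and maxout); the 0.01 slope and group size of 2 are assumptions:

```python
import numpy as np

def relu(z):
    # Output is z for z > 0 and 0 otherwise: fast to compute, and active
    # units pass gradients through unchanged, so they do not vanish.
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # ReLU variant: a small slope for z < 0 keeps a nonzero gradient everywhere.
    return np.where(z > 0, z, slope * z)

def maxout(z, group_size=2):
    # Maxout over one example's pre-activations: take the max within each
    # group of linear units. ReLU is the special case where one unit in
    # each pair is fixed at zero.
    return z.reshape(-1, group_size).max(axis=1)
```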
Adaptive Learning Rate
Popular & simple idea: reduce the learning rate by some factor every few epochs
$\eta^t = \frac{\eta}{\sqrt{t+1}}$
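A one-line sketch of this schedule:

```python
def decayed_learning_rate(eta, t):
    # Learning rate at update step t: large at the start, shrinking over time.
    return eta / (t + 1) ** 0.5

# eta = 0.1 -> 0.100, 0.071, 0.058, ... for t = 0, 1, 2, ...
```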
Adagrad
Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$, where $\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$
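A numpy sketch of the Adagrad update above; the small epsilon term is an added assumption for numerical stability:

```python
import numpy as np

def adagrad_update(w, grad, accum, eta=0.01, eps=1e-8):
    """One Adagrad step: divide the learning rate by the root of the
    accumulated squared gradients, so parameters with a history of large
    gradients get smaller effective learning rates."""
    accum = accum + grad ** 2                    # running sum of (g^i)^2 up to step t
    w = w - eta / (np.sqrt(accum) + eps) * grad  # w <- w - eta_w * dL/dw
    return w, accum
```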
Momentum
Movement = negative of $\partial L/\partial w$ + momentum (the movement of the previous step)
Adam = RMSProp (advanced Adagrad) + Momentum
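A sketch of one momentum update, where the movement is an exponentially decaying sum of past gradients; the coefficient 0.9 is an assumed common default. Adam combines this idea with an RMSProp-style adaptive learning rate.

```python
def momentum_update(w, grad, velocity, eta=0.01, beta=0.9):
    # Movement = beta * previous movement - eta * current gradient,
    # so past directions keep contributing (helps roll past plateaus).
    velocity = beta * velocity - eta * grad
    w = w + velocity
    return w, velocity
```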
Early Stopping
Stop training when the error on the validation set stops decreasing
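A minimal sketch using the Keras EarlyStopping callback; the patience value and the commented fit call are illustrative:

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch the validation loss
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best validation epoch
)
# model.fit(x_train, y_train, validation_split=0.1, epochs=100,
#           callbacks=[early_stop])
```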
Weight Decay
Original: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
Weight decay: $w \leftarrow 0.99\,w - \eta \frac{\partial L}{\partial w}$
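A one-function sketch of the update above: multiplying by 0.99 each step shrinks weights that the gradient does not actively maintain (the 0.99 factor comes from the notes).

```python
def weight_decay_update(w, grad, eta=0.01, decay=0.99):
    # Shrink the weight slightly, then take the usual gradient step;
    # weights that are never pushed up by the gradient decay toward zero.
    return decay * w - eta * grad
```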
Dropout
Training:
Each neuron has a p% chance to be dropped out
The structure of the network is changed
Use the new, thinner network for training
Testing:
No neurons are dropped; if the dropout rate at training was p%, multiply all the weights by (1-p)%
Dropout is a kind of ensemble (see the sketch below)
Worry about training first!
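A numpy sketch of the training/testing behaviour described above; the dropout rate of 0.5 is an assumption:

```python
import numpy as np

def dropout_forward(a, p=0.5, training=True):
    """Apply dropout to a layer's activations `a` with dropout rate p."""
    if training:
        mask = (np.random.rand(*a.shape) >= p)  # each neuron dropped with prob p
        return a * mask                         # a thinner network for this update
    # Testing: nothing is dropped, but activations are scaled by (1 - p)
    # (equivalent to scaling the weights), so the expected input to the
    # next layer matches what it saw during training.
    return a * (1.0 - p)
```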
These ultra deep networks have a special structure
An ultra deep network is an ensemble of many networks with different depths
Ensemble: 6 layers, 4 layers, or 2 layers (see the residual-block sketch after the list below)
FractalNet
Residual Network
Highway Network
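A minimal Keras-style sketch of one residual block, showing the identity shortcut that gives the network its many short paths of different depths; the layer width is an assumption:

```python
from tensorflow import keras

def residual_block(x, units=64):
    """y = x + F(x): the identity shortcut gives every block a short path,
    so the whole network behaves like an ensemble of many depths."""
    h = keras.layers.Dense(units, activation='relu')(x)
    h = keras.layers.Dense(units)(h)
    return keras.layers.Activation('relu')(keras.layers.Add()([x, h]))

inputs = keras.Input(shape=(64,))
outputs = residual_block(residual_block(inputs))  # stack two blocks
model = keras.Model(inputs, outputs)
```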
Attention-based Model
Attention-based Model v2