Suppose the random variable $X$ follows a Gaussian distribution, written $X \sim N(\mu, \sigma^2)$. Its probability density function is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
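As a quick numerical sanity check of this density, the following sketch (a minimal NumPy implementation; the function name `gaussian_pdf` is my own) evaluates $f(x)$ and verifies that it integrates to approximately 1:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# A standard Gaussian has essentially all of its mass inside [-10, 10],
# so a fine Riemann sum over that interval should come out very close to 1.
xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]
area = np.sum(gaussian_pdf(xs, mu=0.0, sigma2=1.0)) * dx
```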
To recover the same linear regression algorithm derived earlier, define $p(y \mid x) = N(y;\, \hat{y}(x; w),\, \sigma^2)$, where the function $\hat{y}(x; w)$ predicts the mean of the Gaussian. The maximum likelihood estimator is

$$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta)$$

Expanding the sum:

$$\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y^{(i)} - \hat{y}^{(i)})^2}{2\sigma^2}}$$

$$= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi\sigma^2}} + \sum_{i=1}^{m} -\frac{(y^{(i)} - \hat{y}^{(i)})^2}{2\sigma^2}$$

$$= -\sum_{i=1}^{m} \log \sqrt{2\pi\sigma^2} - \sum_{i=1}^{m} \frac{\|y^{(i)} - \hat{y}^{(i)}\|^2}{2\sigma^2}$$

$$= -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\|y^{(i)} - \hat{y}^{(i)}\|^2}{2\sigma^2}$$
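The algebraic decomposition above can be verified numerically: evaluating the per-example Gaussian log-densities directly and comparing against the expanded three-term form (variable names and the synthetic data below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 1.5
y_hat = rng.normal(size=m)                       # hypothetical predicted means
y = y_hat + rng.normal(scale=sigma, size=m)      # noisy targets around them

# Left-hand side: sum of per-example Gaussian log-densities log N(y; y_hat, sigma^2).
lhs = np.sum(
    -0.5 * np.log(2 * np.pi * sigma**2) - (y - y_hat) ** 2 / (2 * sigma**2)
)

# Right-hand side: the expanded form  -m log(sigma) - (m/2) log(2 pi) - sum (...)/(2 sigma^2).
rhs = (
    -m * np.log(sigma)
    - (m / 2) * np.log(2 * np.pi)
    - np.sum((y - y_hat) ** 2) / (2 * sigma**2)
)
```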
From the last line, the first two terms do not depend on $w$, so maximizing $\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta)$ with respect to $w$ is equivalent to minimizing $\sum_{i=1}^{m} \|y^{(i)} - \hat{y}^{(i)}\|^2$. Since

$$MSE_{train} = \frac{1}{m} \sum_{i=1}^{m} \|y^{(i)} - \hat{y}^{(i)}\|^2$$

differs from that sum only by the positive constant $1/m$, maximizing the likelihood yields the same estimate of $w$ as minimizing the mean squared error. Maximum likelihood thus recovers linear regression's minimum mean squared error criterion.
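The equivalence can be illustrated with a small experiment (a sketch on synthetic data; the design matrix, noise scale, and helper `neg_log_lik` are all assumptions of the example): the least-squares solution from the normal equations also minimizes the Gaussian negative log-likelihood, so perturbing it in any direction can only increase it.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # design matrix with bias column
w_true = np.array([2.0, -3.0])
y = X @ w_true + rng.normal(scale=0.5, size=m)

# Minimum-MSE (least-squares) solution via the normal equations.
w_mse = np.linalg.solve(X.T @ X, X.T @ y)

def neg_log_lik(w, sigma2=0.25):
    """Negative Gaussian log-likelihood of the data under mean X @ w."""
    r = y - X @ w
    return 0.5 * m * np.log(2 * np.pi * sigma2) + np.sum(r**2) / (2 * sigma2)

# Because the residual sum of squares is minimized exactly at w_mse,
# neg_log_lik(w_mse) is a global minimum for fixed sigma2.
```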