利用梯度下降法实现简单的线性回归

xiaoxiao2021-02-28 88

最近做了好多个数据挖掘的小项目，使用并比较了N多算法，了解了很多机器学习的工具，如R语言、Spark机器学习库、Python、Tensorflow和RapidMiner等等。但是我感觉到自己没能深入下去，充其量也只是把别人的工具拿来玩玩而已。对算法本身的优劣及适用范围不甚了了，更谈不上改进优化算法了。

本着甘当小学生的精神，我最近在网上参加了机器学习牛人Andrew Ng在Coursera上主讲的《机器学习》课程（https://www.coursera.org/learn/machine-learning/home/welcome）。这门课程是我见到的最好的入门课程，由浅人深，循序渐进，不需要很深的数学基础，符合自然的学习规律。刚刚学习了三周，感觉受益颇多。下面以最简单的一元线性回归为例，简单介绍一下机器学习的思想及其在Matlab中的实现。

1. 基本概念假设函数假设模型 (也称为假设函数，Hypothesis function)，就是根据特征变量（feature或variable）拟合出目标变量(target variable)的公式或函数。例如在下面的表格中，根据假设函数，即公式 h θ (x) = θ0+ θ 1*x ，我们可以从房屋的面积估算出房屋的总价，这个公式即称之为假设函数，其中的θ0和θ1称为参数（parameter）。

选择不同的参数，一元线性回归的假设模型也会不同。下图展示了三个比较简单的一元线性回归模型。

代价函数代价函数或损失函数（Cost function或Loss function）是用来评价假设模型拟合的精确度。在训练数据集上，模型的拟合越好，代价函数就越小。在机器学习中，训练的目的就是要选择合适的参数（如上图中的θ0和θ1），让代价函数达到最小。例如，在线性回归中，代价函数J(θ0, θ1)的一般定义如下图所示，可理解为样本真实输出值和假设函数估算值之差平方和的平均值。

梯度下降梯度下降（Gradient descent）法是使得代价函数达到最小的经典方法之一。以前我老是用确定性的最小二乘法来求假设模型h θ (x) = θ0+ θ 1*x中的参数θ0和θ1，让训练集上的代价函数值最小。其实，代价函数J(θ0, θ1)可以看成两个变量θ0和θ1组成的函数，将这个函数可视化的结果如下图所示。我们学习的目的就是在图中找到合适的θ0和θ1，让J(θ0, θ1)接近并达到局部或全局最低点（下图中的红色圈中的点）。

对上图中梯度下降的一个直观的解释是：“比如我们在一座大山上的某处位置，由于我们不知道怎么下山，于是决定走一步算一步，也就是在每走到一个位置的时候，求解当前位置的梯度，沿着梯度的负方向，也就是当前最陡峭的位置向下走一步，然后继续求解当前位置梯度，向这一步所在位置沿着最陡峭最易下山的位置走一步。这样一步步的走下去，一直走到觉得我们已经到了山脚。当然这样走下去，有可能我们不能走到山脚，而是到了某一个局部的山峰低处。”(http://www.cnblogs.com/pinard/p/5970503.html). 梯度下降法不但适合线性回归，还可以用在逻辑回归以及其它算法上。

从几何上来说，在代价函数J(θ0, θ1)的曲面上，任意给定一个起始点（θ0, θ1），我们可沿着梯度下降的方向，即对代价函数J(θ0, θ1) 中的θ0和θ1分别求偏导数的方向，按照一定的步长（a, learning rate，学习效率）进行不断迭代，直到θ0, θ1收敛，也就是达到上图中的局部或全局最低点。对一个简单的线性回归，其算法描述如下：

2. 简单实现在Matlab或Octave中，我们可以通过损失函数和梯度下降的原理来简单实现一元线性回归。

假设原始数据如下, x表示城市的人口，y为城市的利润。我们可以通过线性回归，用某个城市的人口来预测该城市的利润。详细代码如下：

%% 在Matlab或者在Octave中实现梯度下降法，进行一元线性回归 %% 初始化 clear ; close all; clc; %%加载数据 % x表示城市的人口数量 x = [6.1101; 5.5277; 8.5186; 7.0032; 5.8598; 8.3829; 7.4764; 8.5781; 6.4862; 5.0546; 5.7107; 14.1640; 5.7340; 8.4084; 5.6407; 5.3794; 6.3654; 5.1301; 6.4296; 7.0708; 6.1891; 20.2700; 5.4901; 6.3261; 5.5649; 18.9450; 12.8280; 10.9570; 13.1760; 22.2030; 5.2524; 6.5894; 9.2482; 5.8918; 8.2111; 7.9334; 8.0959; 5.6063; 12.8360; 6.3534; 5.4069; 6.8825; 11.7080; 5.7737; 7.8247; 7.0931; 5.0702; 5.8014; 11.7000; 5.5416; 7.5402; 5.3077; 7.4239; 7.6031; 6.3328; 6.3589; 6.2742; 5.6397; 9.3102; 9.4536; 8.8254; 5.1793; 21.2790; 14.9080; 18.9590; 7.2182; 8.2951; 10.2360; 5.4994; 20.3410; 10.1360; 7.3345; 6.0062; 7.2259; 5.0269; 6.5479; 7.5386; 5.0365; 10.2740; 5.1077; 5.7292; 5.1884; 6.3557; 9.7687; 6.5159; 8.5172; 9.1802; 6.0020; 5.5204; 5.0594; 5.7077; 7.6366; 5.8707; 5.3054; 8.2934; 13.3940; 5.4369]; % y代表城市的利润 y = [ 17.5920;9.1302;13.6620;11.8540;6.8233;11.8860;4.3483;12.0000;6.5987;3.8166;3.2522;15.5050;3.1551;7.2258;0.7162;3.5129;5.3048;0.5608;3.6518;5.3893;3.1386;21.7670;4.2630;5.1875;3.0825;22.6380;13.5010;7.0467;14.6920;24.1470;-.2200;5.9966;12.1340;1.8495;6.5426;4.5623;4.1164;3.3928;10.1170;5.4974;0.5566;3.9115;5.3854;2.4406;6.7318;1.0463;5.1337;1.8440;8.0043;1.0179;6.7504;1.8396;4.2885;4.9981;1.4233;-1.4211;2.4756;4.6042;3.9624;5.4141;5.1694;-0.7428;17.9290;12.0540;17.0540;4.8852;5.7442;7.7754;1.0173;20.9920;6.6799;4.0259;1.2784;3.3411;-.6807;0.2968;3.8845;5.7014;6.7526;2.0576;0.4795;0.2042;0.6786;7.5435;5.3436;4.2415;6.7981;0.9270;0.1520;2.8214;1.8451;4.2959;7.2029;1.9869;0.1445;9.0551;0.6170]; % m是训练样本的个数 m = length(y); %% 对x, y进行作图 % 数据点以一个十字交叉的符号表示，大小设置为10 plot(x, y, 'rx', 'MarkerSize', 10); % 设置X轴 xlabel('人口'); % 设置Y轴 ylabel('利润');

然后，我们可在当前工作目录下新建一个computeCost.m文件，其中只包含一个computeCost函数，来计算代价函数的值，代码如下：

function J = computeCost(X, y, theta) %COMPUTECOST Compute cost for linear regression % J = COMPUTECOST(X, y, theta) computes the cost of using theta as the % parameter for linear regression to fit the data points in X and y % Initialize some useful values m = length(y); % number of training examples % You need to return the following variables correctly J = 0; hypothesis = X * theta; J = sum((hypothesis - y) .^ 2); J = J/(2*m); % ========================================================================= end

同样的，我们可以可在当前工作目录下新建另外一个gradientDescent.m文件，其中只包含一个gradientDescent函数，按照迭代次数和设定的alpha（学习效率）来进行梯度下降迭代。最后，我们还可视化了线性回归的结果。具体代码如下：

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters) %GRADIENTDESCENT Performs gradient descent to learn theta % theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by % taking num_iters gradient steps with learning rate alpha % Initialize some useful values m = length(y); % number of training examples J_history = zeros(num_iters, 1); temp = zeros(2, 1); for iter = 1:num_iters hypothesis = X * theta - y; n = length(theta); for j = 1:n theta(j) = theta(j) - alpha/m * (hypothesis' * X(:,j)); end % ============================================================ % Save the cost J in every iteration J_history(iter) = computeCost(X, y, theta); %fprintf('the cost is %f\n', J_history(iter)); end end

写完这两个函数后，我们就可以计算代价函数的值，并用梯度下降迭代法进行线性回归了。具体见代码：

%% =================== 计算代价函数并进行梯度下降迭代 =================== % 在x向量前面加上一列由m个1组成的向量，构成一个X矩阵 X = [ones(m, 1), x]; % Add a column of ones to x % 初始化theta参数 theta = zeros(2, 1); % initialize fitting parameters % 梯度下降的迭代次数为3000次，学习率步长alpha为0.01；如果不收敛，可减少步长。 iterations = 3000; alpha = 0.01; fprintf('\n测试代价函数 ...\n') % 计算初始代价函数的值 J = computeCost(X, y, theta); fprintf('With theta = [0 ; 0]\nCost computed = %f\n', J); fprintf('Expected cost value (approx) 32.07\n'); fprintf('\n开始梯度下降迭代 ...\n') % run gradient descent theta = gradientDescent(X, y, theta, alpha, iterations); % print theta to screen fprintf('Theta found by gradient descent:\n'); fprintf('%f\n', theta); fprintf('Expected theta values (approx)\n'); fprintf(' -3.6303\n 1.1664\n\n'); % Plot the linear fit hold on; % keep previous plot visible plot(X(:,2), X*theta, '-') legend('Training data', 'Linear regression') hold off % don't overlay any more plots on this figure % 根据拟合的线性回归曲线，对人口为35,000和70,000的两个城市的利润进行预测。 predict1 = [1, 3.5] *theta; fprintf('For population = 35,000, we predict a profit of %f\n',... predict1*10000); predict2 = [1, 7] * theta; fprintf('For population = 70,000, we predict a profit of %f\n',... predict2*10000); fprintf('Program paused. Press enter to continue.\n'); %% ============= 可视化 J(theta_0, theta_1) ============= fprintf('Visualizing J(theta_0, theta_1) ...\n') % Grid over which we will calculate J theta0_vals = linspace(-10, 10, 100); theta1_vals = linspace(-1, 4, 100); % initialize J_vals to a matrix of 0's J_vals = zeros(length(theta0_vals), length(theta1_vals)); % Fill out J_vals for i = 1:length(theta0_vals) for j = 1:length(theta1_vals) t = [theta0_vals(i); theta1_vals(j)]; J_vals(i,j) = computeCost(X, y, t); end end % Because of the way meshgrids work in the surf command, we need to % transpose J_vals before calling surf, or else the axes will be flipped J_vals = J_vals'; % Surface plot figure; surf(theta0_vals, theta1_vals, J_vals) xlabel('\theta_0'); ylabel('\theta_1'); % Contour plot figure; % Plot J_vals as 15 contours spaced logarithmically between 0.01 and 100 contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20)) xlabel('\theta_0'); ylabel('\theta_1'); hold on; plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);

可视化的结果显示如下：

Matlab控制台的输出结果为：测试代价函数 … With theta = [0 ; 0] Cost computed = 32.030658 Expected cost value (approx) 32.07

开始梯度下降迭代 … Theta found by gradient descent: -3.795429 1.184909 Expected theta values (approx) -3.6303 1.1664

迭代3000次后的代价函数值=4.366376 For population = 35,000, we predict a profit of 3517.530646 For population = 70,000, we predict a profit of 44989.347534 Program paused. Press enter to continue. Visualizing J(theta_0, theta_1) …

根据控制台结果，我们可以知道，在迭代3000次后，θ0和θ1的值大约为-3.6303和1.1664。用最小二乘法，我们会得到θ0和θ1的精确结果为3.8128和1.1867，可见要得到更精确的结果，我们还需要更多次的迭代。

参考文献 Coursera Machine Learning, Programming Ex.1, https://www.coursera.org/learn/machine-learning/resources/O756o Coursera Machine Learning, Week 1 Lecture notes, https://www.coursera.org/learn/machine-learning/resources/JXWWS 刘建平, 2016, 梯度下降（Gradient Descent）小结, http://www.cnblogs.com/pinard/p/5970503.html

转载请注明原文地址: https://www.6miu.com/read-56217.html

技术

最新回复(0)