This post works through the programming assignments from Andrew Ng's course.
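The way the weight parameters are initialized affects the model. A well-chosen initialization can:
speed up the convergence of gradient descent
increase the odds that gradient descent converges to a point with a lower test error
The code for the custom helper library is in the appendix at the end of the post.
The network used to test the initializations is a three-layer model with layer sizes (X.shape[0], 10, 5, 1). The first two layers use the ReLU activation and the last layer uses the sigmoid activation: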

    import numpy as np
    import matplotlib.pyplot as plt

    # forward_propagation, compute_loss, backward_propagation, update_parameters
    # and predict come from the custom helper library (see the appendix).

    def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
        """
        Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

        Arguments:
        initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

        Returns:
        parameters -- parameters learnt by the model
        """
        grads = {}
        costs = []                            # to keep track of the loss
        m = X.shape[1]                        # number of examples
        layers_dims = [X.shape[0], 10, 5, 1]

        # Initialize the parameters with the chosen method
        if initialization == "zeros":
            parameters = initialize_parameters_zeros(layers_dims)
        elif initialization == "random":
            parameters = initialize_parameters_random(layers_dims)
        elif initialization == "he":
            parameters = initialize_parameters_he(layers_dims)

        # Gradient descent loop
        for i in range(0, num_iterations):
            a3, cache = forward_propagation(X, parameters)
            cost = compute_loss(a3, Y)
            grads = backward_propagation(X, Y, cache)
            parameters = update_parameters(parameters, grads, learning_rate)

            # Print and record the loss every 1000 iterations
            if print_cost and i % 1000 == 0:
                print("Cost after iteration {}: {}".format(i, cost))
                costs.append(cost)

        # Plot the loss
        plt.plot(costs)
        plt.ylabel('cost')
        plt.xlabel('iterations (per thousands)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        return parameters

Zero initialization sets every W[i] and b[i] to 0:

    def initialize_parameters_zeros(layers_dims):
        parameters = {}
        L = len(layers_dims)            # number of layers in the network

        for l in range(1, L):
            parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        return parameters

Model performance
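Train the model:

    parameters = model(train_X, train_Y, initialization = "zeros")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

The training run shows that a model whose weights are initialized to 0 cannot learn. Every hidden activation is 0, so all weight gradients are 0 and the weights never move away from 0; the network outputs the same value (initially sigmoid(0) = 0.5) for every example.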
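A quick way to see why zero initialization stalls learning is to run a single forward and backward pass by hand. The sketch below is self-contained and does not use the helper library: the function name `zero_init_gradients` and the randomly generated inputs are assumptions made for illustration, and the backward formulas are the same ones used in the backpropagation code elsewhere in this post.

```python
import numpy as np

def zero_init_gradients(X, Y, layers_dims):
    """One forward/backward pass of LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID
    with every parameter set to zero. Returns the output A3 and some gradients."""
    m = X.shape[1]
    W1, W2, W3 = (np.zeros((layers_dims[l], layers_dims[l - 1])) for l in (1, 2, 3))
    b1, b2, b3 = (np.zeros((layers_dims[l], 1)) for l in (1, 2, 3))

    # Forward pass
    A1 = np.maximum(0, W1 @ X + b1)
    A2 = np.maximum(0, W2 @ A1 + b2)
    A3 = 1 / (1 + np.exp(-(W3 @ A2 + b3)))

    # Backward pass
    dZ3 = A3 - Y
    dW3 = dZ3 @ A2.T / m
    db3 = np.sum(dZ3, axis=1, keepdims=True) / m
    dZ2 = (W3.T @ dZ3) * (A2 > 0)
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (A1 > 0)
    dW1 = dZ1 @ X.T / m
    return A3, {"dW1": dW1, "dW2": dW2, "dW3": dW3, "db3": db3}

# Made-up data: 8 examples with 2 features and random binary labels
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))
Y = (rng.random((1, 8)) > 0.5).astype(float)

A3, grads = zero_init_gradients(X, Y, [2, 10, 5, 1])
print("outputs:", A3)                      # sigmoid(0) for every example
for name in ("dW1", "dW2", "dW3"):
    print(name, "all zero:", not grads[name].any())
```

Because A1 and A2 are zero, every dW is zero as well, so gradient descent never changes the weights. Only b3 can receive a non-zero gradient, which shifts the single shared output value but cannot separate the two classes.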
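Random initialization is the method used in the earlier exercises, except that here the random initial values are scaled up by a factor of 10: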

    def initialize_parameters_random(layers_dims):
        np.random.seed(3)               # This seed makes sure your "random" numbers will be the same as ours
        parameters = {}
        L = len(layers_dims)            # integer representing the number of layers

        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        return parameters

Model performance
Train the model:

    parameters = model(train_X, train_Y, initialization = "random")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Cost curve and accuracy:
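The cost starts out very high. Large initial weights push the output of the sigmoid very close to 0 or 1, so a wrong prediction on an example produces a very large loss. In particular, when the output is close to 0 for a positive example (or close to 1 for a negative one), the log term makes the loss approach infinity. Poor initialization can also lead to vanishing or exploding gradients, which slows down optimization. Resulting fit: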
He initialization was proposed by He et al. in 2015. It multiplies the random initial values by $\sqrt{\frac{2}{\text{layers\_dims}[l-1]}}$. Implementation:

    def initialize_parameters_he(layers_dims):
        np.random.seed(3)
        parameters = {}
        L = len(layers_dims) - 1        # integer representing the number of layers

        for l in range(1, L + 1):
            # Scale standard-normal weights by sqrt(2 / fan_in)
            parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        return parameters

Model performance
Train the model:

    parameters = model(train_X, train_Y, initialization = "he")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Cost curve and accuracy:
Resulting fit:
The model with He initialization separates the two classes well.
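One way to build intuition for the sqrt(2 / n) factor is to push random inputs through a stack of ReLU layers and watch how the size of the activations evolves. The sketch below is only an illustration under assumptions that are not part of the assignment: the helper name `activation_scale`, a layer width of 100, a depth of 10, and standard-normal inputs are all made up for the demo.

```python
import numpy as np

def activation_scale(scale_fn, width=100, depth=10, m=1000, seed=0):
    """Return the RMS activation after each of `depth` ReLU layers.

    scale_fn(n_prev) is the factor applied to standard-normal weights,
    where n_prev is the number of units in the previous layer.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, m))     # made-up inputs
    rms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        A = np.maximum(0, W @ A)            # LINEAR -> RELU, biases are zero
        rms.append(float(np.sqrt(np.mean(A ** 2))))
    return rms

he_rms = activation_scale(lambda n: np.sqrt(2 / n))   # He scaling
big_rms = activation_scale(lambda n: 10.0)            # "random * 10" scaling

print("He scaling, RMS per layer: ", ["%.2f" % v for v in he_rms])
print("x10 scaling, RMS per layer:", ["%.2e" % v for v in big_rms])
```

With He scaling the activations stay on roughly the same scale from layer to layer, while with the factor of 10 they grow by a large multiple at every layer, which is what saturates the final sigmoid in the random-initialization experiment.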
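Regularization is a technique for reducing the risk of overfitting.
The code for the custom helper library is in the appendix at the end of the post.
The data is slightly noisy (a few red points in the blue region and a few blue points in the red region):
The model function gets additional parameters so that it can apply different regularization methods. lambd is the hyperparameter of L2 regularization, and keep_prob is the hyperparameter of dropout regularization.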

    import numpy as np
    import matplotlib.pyplot as plt

    def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
        """
        Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

        Arguments:
        lambd -- regularization hyperparameter, scalar
        keep_prob -- probability of keeping a neuron active during drop-out, scalar.
        """
        grads = {}
        costs = []                            # to keep track of the cost
        m = X.shape[1]
        layers_dims = [X.shape[0], 20, 3, 1]

        parameters = initialize_parameters(layers_dims)

        for i in range(0, num_iterations):
            # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
            if keep_prob == 1:
                a3, cache = forward_propagation(X, parameters)
            elif keep_prob < 1:
                a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

            # Cost function
            if lambd == 0:
                cost = compute_cost(a3, Y)
            else:
                cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

            # Backward propagation.
            # It is possible to use both L2 regularization and dropout,
            # but this assignment will only explore one at a time
            assert(lambd == 0 or keep_prob == 1)

            if lambd == 0 and keep_prob == 1:
                grads = backward_propagation(X, Y, cache)
            elif lambd != 0:
                grads = backward_propagation_with_regularization(X, Y, cache, lambd)
            elif keep_prob < 1:
                grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

            # Update parameters.
            parameters = update_parameters(parameters, grads, learning_rate)

            # Print the loss every 10000 iterations
            if print_cost and i % 10000 == 0:
                print("Cost after iteration {}: {}".format(i, cost))
            # Record the loss every 1000 iterations
            if print_cost and i % 1000 == 0:
                costs.append(cost)

        # Plot the cost
        plt.plot(costs)
        plt.ylabel('cost')
        plt.xlabel('iterations (x1,000)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        return parameters

Model performance
Train the model:

    parameters = model(train_X, train_Y)
    print("On the training set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Cost curve and accuracy:
Resulting fit:
On the noisy data, the model without regularization clearly overfits.
With L2 regularization, the cost function adds the sum of the squared L2 (Frobenius) norms of the weight matrices W of all layers:
$$J(W,b)=\frac{1}{m}\sum_{i=1}^{m}L\left(a^{[l]}_i, y\right)+\frac{\lambda}{2m}\sum_{i=1}^{l}\lVert W^{[i]}\rVert_F^2$$

For the weight matrix of a single layer, the squared norm can be computed with numpy.sum(numpy.square(Wl)). Because the cost function now contains the regularization term

$$\text{L2\_regularization\_cost}=\frac{\lambda}{2m}\sum_{i=1}^{l}\lVert W^{[i]}\rVert_F^2$$

each dWl computed during backpropagation needs the extra term $\frac{\lambda}{m}$ Wl:

    def backward_propagation_with_regularization(X, Y, cache, lambd):
        """
        Implements the backward propagation of our baseline model to which we added an L2 regularization.
        """
        m = X.shape[1]
        (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

        dZ3 = A3 - Y
        dW3 = 1./m * np.dot(dZ3, A2.T) + lambd/m * W3     # extra L2 term
        db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

        dA2 = np.dot(W3.T, dZ3)
        dZ2 = np.multiply(dA2, np.int64(A2 > 0))
        dW2 = 1./m * np.dot(dZ2, A1.T) + lambd/m * W2     # extra L2 term
        db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

        dA1 = np.dot(W2.T, dZ2)
        dZ1 = np.multiply(dA1, np.int64(A1 > 0))
        dW1 = 1./m * np.dot(dZ1, X.T) + lambd/m * W1      # extra L2 term
        db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

        gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                     "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                     "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
        return gradients

Train the model:

    parameters = model(train_X, train_Y, lambd = 0.5)
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Cost curve and accuracy:
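Resulting fit:
L2 regularization makes the decision boundary smoother, but if the hyperparameter λ is too large the boundary becomes oversmoothed and the model's accuracy drops. The figures below show the decision boundaries for λ=0.1 and λ=1: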
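The model function in this section calls compute_cost_with_regularization, whose code is not shown in the post. The following is a minimal sketch of what such a function could look like, written directly from the regularized cost formula in this section. It is an assumption about the implementation, not the helper library's actual code, and the tiny hand-made parameters in the usage lines are invented so the result can be checked by hand.

```python
import numpy as np

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 penalty lambd/(2m) * sum of ||W||_F^2.

    Sketch only: assumes the weight matrices are stored under keys
    starting with "W" (W1, W2, W3), as elsewhere in this post.
    """
    m = Y.shape[1]

    # Cross-entropy part, averaged over the m examples
    cross_entropy_cost = -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m

    # L2 part: sum of squared entries of every weight matrix
    squared_norms = sum(np.sum(np.square(W))
                        for name, W in parameters.items()
                        if name.startswith("W"))
    l2_regularization_cost = lambd / (2 * m) * squared_norms

    return cross_entropy_cost + l2_regularization_cost

# Hand-checkable example with invented values
A3 = np.array([[0.5, 0.5]])
Y = np.array([[1.0, 0.0]])
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.zeros((1, 1)),
              "W2": np.array([[3.0]]),      "b2": np.zeros((1, 1)),
              "W3": np.array([[0.0]]),      "b3": np.zeros((1, 1))}

cost = compute_cost_with_regularization(A3, Y, parameters, lambd=0.5)
print(cost)   # equals ln(2) + 0.5 / (2 * 2) * (1 + 4 + 9 + 0)
```

Note that the biases are left out of the penalty, matching the formula, which sums only over the weight matrices.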
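Dropout regularization randomly masks nodes during training. Its hyperparameter is keep_prob (the keep probability). Masked nodes take no part in forward or backward propagation, so the main idea is that each iteration trains only a sub-network of the full network.
To mask nodes at random, create a random matrix, turn its entries into 0s and 1s according to the keep probability, and multiply this mask elementwise with the layer's output. Dropout is not applied to the input and output layers.
Note that, on average, only n[l] * keep_prob nodes of layer l are used in each iteration. To leave the expected value of the activations, and therefore of the cost, unchanged, the output of every layer that has gone through dropout is divided by keep_prob. In addition, backpropagation has to mask the same nodes as the forward pass, so the mask matrices are added to the cache: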

    def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
        """
        Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

        Arguments:
        keep_prob -- probability of keeping a neuron active during drop-out, scalar
        """
        np.random.seed(1)

        W1 = parameters["W1"]
        b1 = parameters["b1"]
        W2 = parameters["W2"]
        b2 = parameters["b2"]
        W3 = parameters["W3"]
        b3 = parameters["b3"]

        # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
        Z1 = np.dot(W1, X) + b1
        A1 = relu(Z1)
        D1 = np.random.rand(A1.shape[0], A1.shape[1])    # random matrix with values in [0, 1)
        D1 = (D1 < keep_prob)                            # mask: True = keep the node
        A1 = A1 * D1                                     # shut down the masked nodes
        A1 = A1 / keep_prob                              # rescale to keep the expected value

        Z2 = np.dot(W2, A1) + b2
        A2 = relu(Z2)
        D2 = np.random.rand(A2.shape[0], A2.shape[1])
        D2 = (D2 < keep_prob)
        A2 = A2 * D2
        A2 = A2 / keep_prob

        Z3 = np.dot(W3, A2) + b3
        A3 = sigmoid(Z3)

        cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
        return A3, cache

During backpropagation the mask is applied to the gradient of the same layer. Because the forward activations were divided by keep_prob, the gradients are divided by keep_prob as well:

    def backward_propagation_with_dropout(X, Y, cache, keep_prob):
        """
        Implements the backward propagation of our baseline model to which we added dropout.
        """
        m = X.shape[1]
        (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

        dZ3 = A3 - Y
        dW3 = 1./m * np.dot(dZ3, A2.T)
        db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

        dA2 = np.dot(W3.T, dZ3)
        dA2 = dA2 * D2                  # mask the same nodes as in the forward pass
        dA2 = dA2 / keep_prob           # same scaling as in the forward pass
        dZ2 = np.multiply(dA2, np.int64(A2 > 0))
        dW2 = 1./m * np.dot(dZ2, A1.T)
        db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

        dA1 = np.dot(W2.T, dZ2)
        dA1 = dA1 * D1
        dA1 = dA1 / keep_prob
        dZ1 = np.multiply(dA1, np.int64(A1 > 0))
        dW1 = 1./m * np.dot(dZ1, X.T)
        db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

        gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                     "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                     "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
        return gradients

Train the model:

    parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Cost curve and accuracy:
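Resulting fit:
The model with dropout regularization also separates the data well.
Dropout is a very common regularization method. Take care to mask the same nodes in forward and backward propagation.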
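The effect of dividing by keep_prob can be checked numerically. The sketch below is a standalone illustration, not part of the assignment code: the helper name `inverted_dropout`, the all-ones activation matrix, and keep_prob = 0.8 are assumptions chosen so that the expected values are easy to read off.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply dropout with rescaling to activations A.

    Returns the masked and rescaled activations together with the mask,
    so that the same mask can be reused in the backward pass.
    """
    D = rng.random(A.shape) < keep_prob     # True = keep, with probability keep_prob
    return A * D / keep_prob, D

rng = np.random.default_rng(1)
keep_prob = 0.8
A = np.ones((50, 10000))                    # made-up activations, all equal to 1

A_drop, D = inverted_dropout(A, keep_prob, rng)
print("fraction of nodes kept:", D.mean())          # close to keep_prob
print("mean activation after dropout:", A_drop.mean())  # close to the original mean of 1

# Backward pass: reuse the SAME mask and the same scaling on the gradient
dA = np.ones_like(A)                        # stand-in for an upstream gradient
dA_drop = dA * D / keep_prob
print("gradient is zero exactly where the node was dropped:",
      np.array_equal(dA_drop == 0, ~D))
```

Without the division by keep_prob the mean activation would shrink to about keep_prob times its original value, and the next layer would see inputs on a different scale during training than at test time, when no nodes are dropped.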
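Regularization lowers the model's performance on the training set, because it limits the network's ability to overfit the training data, but it can improve test accuracy.
The code for the custom helper library is in the appendix at the end of the post.
The earlier derivation of the backpropagation formulas already showed that backpropagation is more involved than forward propagation, so it is easy to make a mistake when coding it. Gradient checking is a way to detect errors in the backpropagation code.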
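Before applying the idea to the network, here is a minimal sketch of gradient checking on a toy function. Everything in it is an assumption made for illustration: the function J(θ) = Σ θᵢ², the helper names `numerical_gradient` and `relative_difference`, and the deliberately broken gradient. It compares an analytic gradient with a two-sided finite-difference estimate using the relative difference ‖g − g̃‖ / (‖g‖ + ‖g̃‖).

```python
import numpy as np

def numerical_gradient(f, theta, epsilon=1e-7):
    """Two-sided finite-difference estimate of the gradient of f at theta."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        plus[i] += epsilon
        minus = theta.copy()
        minus[i] -= epsilon
        approx[i] = (f(plus) - f(minus)) / (2 * epsilon)
    return approx

def relative_difference(grad, gradapprox):
    """||grad - gradapprox|| / (||grad|| + ||gradapprox||)."""
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

def J(theta):
    return np.sum(theta ** 2)           # toy cost; its true gradient is 2 * theta

theta = np.array([1.0, -2.0, 3.0])
gradapprox = numerical_gradient(J, theta)

correct_grad = 2 * theta
buggy_grad = correct_grad.copy()
buggy_grad[1] *= 2                      # deliberate bug: one component doubled

diff_correct = relative_difference(correct_grad, gradapprox)
diff_buggy = relative_difference(buggy_grad, gradapprox)
print("correct gradient:", diff_correct)   # tiny, limited only by rounding error
print("buggy gradient:  ", diff_buggy)     # orders of magnitude larger
```

The loop perturbs one parameter at a time and needs two evaluations of the cost per parameter, which is why the same procedure becomes slow on a real network.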
Forward propagation:

    def forward_propagation_n(X, Y, parameters):
        """
        Implements the forward propagation (and computes the cost) presented in Figure 3.
        """
        # retrieve parameters
        m = X.shape[1]
        W1 = parameters["W1"]
        b1 = parameters["b1"]
        W2 = parameters["W2"]
        b2 = parameters["b2"]
        W3 = parameters["W3"]
        b3 = parameters["b3"]

        # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
        Z1 = np.dot(W1, X) + b1
        A1 = relu(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = relu(Z2)
        Z3 = np.dot(W3, A2) + b3
        A3 = sigmoid(Z3)

        # Cost
        logprobs = np.multiply(-np.log(A3), Y) + np.multiply(-np.log(1 - A3), 1 - Y)
        cost = 1./m * np.sum(logprobs)

        cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
        return cost, cache

Backward propagation:

    def backward_propagation_n(X, Y, cache):
        """
        Implement the backward propagation presented in figure 2.

        Arguments:
        X -- input datapoint, of shape (input size, 1)
        Y -- true "label"
        cache -- cache output from forward_propagation_n()

        Returns:
        gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
        """
        m = X.shape[1]
        (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

        dZ3 = A3 - Y
        dW3 = 1./m * np.dot(dZ3, A2.T)
        db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

        dA2 = np.dot(W3.T, dZ3)
        dZ2 = np.multiply(dA2, np.int64(A2 > 0))
        dW2 = 1./m * np.dot(dZ2, A1.T) * 2
        db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

        dA1 = np.dot(W2.T, dZ2)
        dZ1 = np.multiply(dA1, np.int64(A1 > 0))
        dW1 = 1./m * np.dot(dZ1, X.T)
        db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)

        gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                     "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                     "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
        return gradients

By the definition of the derivative,
$$f'(x)=\lim_{\epsilon\to 0}\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}$$

Gradient checking compares the value computed from this definition with the value computed by the backpropagation code. When the difference between the two is very small, the backpropagation code can be considered correct:

    def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
        """
        Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

        Arguments:
        gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
        epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

        Returns:
        difference -- difference (2) between the approximated gradient and the backward propagation gradient
        """
        # dictionary_to_vector, gradients_to_vector and vector_to_dictionary
        # come from the custom helper library (see the appendix).
        parameters_values, _ = dictionary_to_vector(parameters)
        grad = gradients_to_vector(gradients)
        num_parameters = parameters_values.shape[0]
        J_plus = np.zeros((num_parameters, 1))
        J_minus = np.zeros((num_parameters, 1))
        gradapprox = np.zeros((num_parameters, 1))

        # Compute gradapprox
        for i in range(num_parameters):
            # forward_propagation_n returns two values, but only the cost is needed, hence the "_"
            thetaplus = np.copy(parameters_values)
            thetaplus[i][0] = thetaplus[i][0] + epsilon
            J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))

            thetaminus = np.copy(parameters_values)
            thetaminus[i][0] = thetaminus[i][0] - epsilon
            J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))

            # Compute gradapprox[i]
            gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)

        # Compare gradapprox to backward propagation gradients by computing difference.
        numerator = np.linalg.norm(grad - gradapprox)
        denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
        difference = numerator / denominator

        if difference > 1e-7:
            print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
        else:
            print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

        return difference

Use the gradient-checking function to verify the backpropagation code:
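
    X, Y, parameters = gradient_check_n_test_case()

    cost, cache = forward_propagation_n(X, Y, parameters)
    gradients = backward_propagation_n(X, Y, cache)
    difference = gradient_check_n(parameters, gradients, X, Y)

Result: the gradients from backpropagation differ substantially from the values computed from the definition of the derivative, which means the backpropagation code contains a mistake. Going back over the given backpropagation code shows that the lines computing dW2 and db1 are wrong: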
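
    dW2 = 1./m * np.dot(dZ2, A1.T) * 2                     # wrong: extra factor of 2
    ...
    db1 = 4./m * np.sum(dZ1, axis=1, keepdims = True)      # wrong: 4./m should be 1./m

After correcting the backpropagation code and running the gradient check again, the reported difference is very small: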
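Gradient checking verifies that the gradients are computed correctly, but it is slow. As the code above shows, it is not vectorized: it runs one loop iteration, with two forward passes, for every parameter. For that reason it is not run at every training iteration.
Gradient checking is not compatible with dropout regularization. To use dropout, run the gradient check first with dropout turned off, then enable dropout.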