@Perfect-Demo · 2018-05-01

deep_learning_month2_week2_Optimization_methods

Tags: machine learning, deep learning

The code has been uploaded to GitHub:
https://github.com/PerfectDemoT/my_deeplearning_homework


This assignment is about optimization algorithms: the momentum method and the RMSprop method, and finally the Adam method (Adam is essentially a combination of the previous two).

One important point is the initialization and iterative update of $v$ and $s$, as well as the choice of $\beta_1$ and $\beta_2$ (although, as we will see, these two values usually do not need to be tuned: there are two "universal" defaults, $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which are almost always used as-is).

Finally, we compare three approaches: plain mini-batch gradient descent, mini-batch gradient descent with momentum, and mini-batch gradient descent with Adam. (The comparison here focuses mainly on the effect on final prediction accuracy, although in practice these methods are usually compared in terms of training speed.)

Let's walk through the implementation:


1. First, import the packages

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets
from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *
```

When implementing this yourself, I recommend reading through the pre-written helper functions; some of them actually need small changes (a few will throw errors if used as-is, due to mismatched matrix dimensions and similar issues — you are bound to run into them if you implement this yourself, so I won't go into detail here).

2. As mentioned above, we will ultimately compare the training results of the three methods. First, let's implement the plain gradient-descent parameter update.
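In equation form, the update implemented below is just gradient descent with learning rate $\alpha$, for each layer $l$:

$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$$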

```python
# First, the function that updates the parameters
# GRADED FUNCTION: update_parameters_with_gd

def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
        ### END CODE HERE ###

    return parameters
```

Let's test the output:

```python
# Print the result
parameters, grads, learning_rate = update_parameters_with_gd_test_case()
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("=====================================")
```

The result:

```
W1 = [[ 1.63535156 -0.62320365 -0.53718766]
 [-1.07799357  0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
 [-0.75184921]]
W2 = [[ 0.32171798 -0.25467393  1.46902454]
 [-2.05617317 -0.31554548 -0.3756023 ]
 [ 1.1404819  -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
 [ 0.02561572]
 [ 0.57539477]]
```

3. Next we implement mini-batch gradient descent. Before doing so, it helps to keep the following in mind (loosely speaking, batch gradient descent and stochastic gradient descent are just special cases of mini-batch gradient descent: a mini-batch size of $m$ gives batch gradient descent, and a mini-batch size of 1 gives SGD):

The pseudocode for batch gradient descent and stochastic gradient descent (i.e. mini-batch gradient descent with a batch size of 1) is given in the .ipynb file:

- **(Batch) Gradient Descent**

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
```

- **Stochastic Gradient Descent**

```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:, j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
```

Now let's look at the mini-batch code:

```python
# Now we implement mini-batches
# GRADED FUNCTION: random_mini_batches

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)  # To make your "random" minibatches the same as ours
    m = X.shape[1]        # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:, (k * mini_batch_size):((k + 1) * mini_batch_size)]
        mini_batch_Y = shuffled_Y[:, (k * mini_batch_size):((k + 1) * mini_batch_size)]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:, (num_complete_minibatches * mini_batch_size):]
        mini_batch_Y = shuffled_Y[:, (num_complete_minibatches * mini_batch_size):]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
```

A brief explanation of the algorithm above:

The idea is to first shuffle the X and Y datasets, and then, based on the mini-batch size, split the $m$ examples into $\lfloor m / \text{mini\_batch\_size} \rfloor$ full blocks of size mini_batch_size (plus one smaller block for the remainder), using column slicing of the matrices. Each pair is packed into a tuple (mini_batch_X, mini_batch_Y), and mini_batches.append(mini_batch) collects all of them into a single list mini_batches.
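As a quick illustration (a minimal sketch with made-up toy data, not part of the assignment): with $m = 10$ examples and a mini-batch size of 4, the function should return two full batches and one remainder batch of size 2.

```python
import numpy as np

# Toy data: 10 examples with 3 features each, and binary labels
X_toy = np.random.randn(3, 10)
Y_toy = (np.random.rand(1, 10) > 0.5).astype(int)

# Partition into mini-batches of size 4 -> expect X shapes (3, 4), (3, 4), (3, 2)
for i, (bx, by) in enumerate(random_mini_batches(X_toy, Y_toy, mini_batch_size=4, seed=1)):
    print("batch", i, "X:", bx.shape, "Y:", by.shape)
```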

Now let's test it:

```python
# Print the result to check
X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)

print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))
print("=================================================")
```

The result:

```
shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]
```

4. Here I change the order of the original code (note: this block should really go last, but it must come before the prediction calls).

1. Here is the model function, the top-level function that the final prediction code relies on. Let's look at the code first, then go over the points worth noting.

```python
# The model function
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(layers_dims)  # number of layers in the neural networks
    costs = []            # to keep track of the cost
    t = 0                 # initializing the counter required for Adam update
    seed = 10             # For grading purposes, so that your "random" minibatches are the same as ours

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass  # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch
            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)
            # Compute cost
            cost = compute_cost(a3, minibatch_Y)
            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)
            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epochs
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
```

That's right, this function is what actually updates the parameters. Since it supports all three methods, it contains branching logic (the if/elif blocks on `optimizer`, both when initializing the optimizer and inside the update step).
Also worth noting is the mini-batch forward and backward propagation inside the optimization loop: at the start of each epoch the data is re-partitioned with random_mini_batches, and we then loop over every mini-batch to update the parameters.

2. Remember to load the data before running the model function:

```python
# Run the model function
# First, load the dataset
train_X, train_Y = load_dataset()
```

3. Now let's look at plain mini-batch gradient descent (again, I have reordered things here).

This is really just a call to the model function written above. Here is the code (remember that the data must be loaded first; the loading code appears right after the model function above):

```python
# Plain mini-batch gradient descent
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
print("=================================")
```

As you can see, we call model to get the parameters and then call predict to get an easy-to-read result:

```
Cost after epoch 0: 0.690736
Cost after epoch 1000: 0.685273
Cost after epoch 2000: 0.647072
Cost after epoch 3000: 0.619525
Cost after epoch 4000: 0.576584
Cost after epoch 5000: 0.607243
Cost after epoch 6000: 0.529403
Cost after epoch 7000: 0.460768
Cost after epoch 8000: 0.465586
Cost after epoch 9000: 0.464518
Accuracy: 0.796666666667
```

The cost curve:
(figure: mini-batch gradient descent cost curve)

Now let's look at the decision boundary:
(figure: decision boundary with gradient descent)

5. Now let's implement the momentum algorithm (gradient descent with momentum)

1. First, initialize the velocities

```python
# GRADED FUNCTION: initialize_velocity

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        ### END CODE HERE ###

    return v
```

Nothing surprising here: v["dW"] and v["db"] are all initialized to zero (I won't paste the output, it's just a pile of zeros). If you want to print it, the code looks like this:

```python
# Print the result to check
parameters = initialize_velocity_test_case()
v = initialize_velocity(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("==================================")
```

2. With the velocities initialized, we can now update the parameters W and b using the momentum algorithm (note that v is also updated here).
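Written out, the momentum update that the code below implements is, for each layer $l$:

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}, & W^{[l]} &= W^{[l]} - \alpha\, v_{dW^{[l]}} \\
v_{db^{[l]}} &= \beta\, v_{db^{[l]}} + (1-\beta)\, db^{[l]}, & b^{[l]} &= b^{[l]} - \alpha\, v_{db^{[l]}}
\end{aligned}
$$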

```python
# GRADED FUNCTION: update_parameters_with_momentum

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]
        # update parameters
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
        ### END CODE HERE ###

    return parameters, v
```

Now let's see it in action:

```python
# Check the result
parameters, grads, v = update_parameters_with_momentum_test_case()
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
# v["dW"] etc. are updated in each loop iteration; they only need to be initialized to zero,
# and from then on they are updated together with the W and b entries in parameters
print("================================")
```

The output looks like this:

```
W1 = [[ 1.62544598 -0.61290114 -0.52907334]
 [-1.07347112  0.86450677 -2.30085497]]
b1 = [[ 1.74493465]
 [-0.76027113]]
W2 = [[ 0.31930698 -0.24990073  1.4627996 ]
 [-2.05974396 -0.32173003 -0.38320915]
 [ 1.13444069 -1.0998786  -0.1713109 ]]
b2 = [[-0.87809283]
 [ 0.04055394]
 [ 0.58207317]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[ 0.02344157]
 [ 0.16598022]
 [ 0.07420442]]
```

3. Now let's run mini-batch gradient descent with momentum.

The code:

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
print("======================================")
```

The output:

```
Cost after epoch 0: 0.690741
Cost after epoch 1000: 0.685341
Cost after epoch 2000: 0.647145
Cost after epoch 3000: 0.619594
Cost after epoch 4000: 0.576665
Cost after epoch 5000: 0.607324
Cost after epoch 6000: 0.529476
Cost after epoch 7000: 0.460936
Cost after epoch 8000: 0.465780
Cost after epoch 9000: 0.464740
Accuracy: 0.796666666667
```

The cost curve:
(figure: momentum cost curve)

Now let's look at the decision boundary:
(figure: decision boundary with momentum)

6. Next up is the Adam optimization method (this method is like a combination of momentum and RMSprop, which is why RMSprop is not covered separately here).
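Since RMSprop does not get its own section, here is its update rule for comparison (this is the standard RMSprop formula from the course, not code from this assignment); Adam combines this squared-gradient average $s$ with the momentum average $v$ above:

$$s_{dW^{[l]}} = \beta_2\, s_{dW^{[l]}} + (1-\beta_2)\,(dW^{[l]})^2, \qquad W^{[l]} = W^{[l]} - \alpha\, \frac{dW^{[l]}}{\sqrt{s_{dW^{[l]}}} + \varepsilon}$$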

1. As before, first initialize the parameters (all zeros):

```python
# GRADED FUNCTION: initialize_adam

def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl

    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}
    s = {}

    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        ### END CODE HERE ###

    return v, s
```

Similar to before, the code to print it is (output omitted here, it's all zeros):

```python
# The above initializes v["dW"], v["db"], s["dW"], s["db"] for the Adam algorithm
# Check the result
parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
print("==============================")
```

2. Next, the Adam parameter update.

The main thing to pay attention to is the update of s; the update of v is similar to what we did before.
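For reference, the full Adam update as presented in the course is, for each layer $l$ (with $t$ the Adam step counter):

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta_1\, v_{dW^{[l]}} + (1-\beta_1)\, dW^{[l]}, & v^{corrected}_{dW^{[l]}} &= \frac{v_{dW^{[l]}}}{1-\beta_1^t} \\
s_{dW^{[l]}} &= \beta_2\, s_{dW^{[l]}} + (1-\beta_2)\,(dW^{[l]})^2, & s^{corrected}_{dW^{[l]}} &= \frac{s_{dW^{[l]}}}{1-\beta_2^t} \\
W^{[l]} &= W^{[l]} - \alpha\, \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{aligned}
$$

(and analogously for $b^{[l]}$). Note that the code below divides by $1-\beta_1$ and $1-\beta_2$ rather than $1-\beta_1^t$ and $1-\beta_2^t$; see the remark after the code.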

```python
# GRADED FUNCTION: update_parameters_with_adam

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v_corrected = {}          # Initializing first moment estimate, python dictionary
    s_corrected = {}          # Initializing second moment estimate, python dictionary

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]
        ### END CODE HERE ###

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        # Bias correction below
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - beta1)
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - beta1)
        # Question: this does not quite match the lectures -- shouldn't beta1 be raised to the power t here?
        ### END CODE HERE ###

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        ### START CODE HERE ### (approx. 2 lines)
        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * grads["dW" + str(l + 1)]**2
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * grads["db" + str(l + 1)]**2
        ### END CODE HERE ###

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - beta2)
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - beta2)
        # Question: same as above -- shouldn't beta2 be raised to the power t here?
        ### END CODE HERE ###

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        ### START CODE HERE ### (approx. 2 lines)
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / (np.sqrt(s_corrected["dW" + str(l + 1)]) + epsilon))
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / (np.sqrt(s_corrected["db" + str(l + 1)]) + epsilon))
        ### END CODE HERE ###

    return parameters, v, s
```

One more note: the lines that compute `v_corrected` and `s_corrected` perform the bias correction.
Also, as the comments in the code say, the course formula raises $\beta_1$ and $\beta_2$ to the power $t$ in the denominators, but that power is absent here, and I haven't found out why. Since the course also mentions that bias correction is often skipped without much impact on the final result, this doesn't seem to cause any visible problem here.
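As a quick arithmetic check of the difference (simple numbers, not from the assignment): with $\beta_1 = 0.9$ and $t = 2$, the course formula divides by $1-\beta_1^t = 1 - 0.81 = 0.19$, whereas this code always divides by $1-\beta_1 = 0.1$; and as $t$ grows, $1-\beta_1^t \to 1$ while the code keeps dividing by $0.1$. So strictly speaking this is not bias-corrected Adam, but since the same constant factors appear at every step, the net effect is roughly a fixed rescaling of the update rather than a growing error.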

Now let's look at the test code:

```python
# Test it
parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=2)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
print("=============================")
```

The test result:

```
W1 = [[ 1.63434536 -0.62175641 -0.53817175]
 [-1.08296862  0.85540763 -2.2915387 ]]
b1 = [[ 1.75481176]
 [-0.7512069 ]]
W2 = [[ 0.3290391  -0.25937038  1.47210794]
 [-2.05014071 -0.3124172  -0.37405435]
 [ 1.14376944 -1.08989128 -0.16242821]]
b2 = [[-0.88785842]
 [ 0.03221375]
 [ 0.57281521]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[ 0.02344157]
 [ 0.16598022]
 [ 0.07420442]]
s["dW1"] = [[ 0.00121136  0.00131039  0.00081287]
 [ 0.0002525   0.00081154  0.00046748]]
s["db1"] = [[ 1.51020075e-05]
 [ 8.75664434e-04]]
s["dW2"] = [[ 7.17640232e-05  2.81276921e-04  4.78394595e-04]
 [ 1.57413361e-04  4.72206320e-04  7.14372576e-04]
 [ 4.50571368e-04  1.60392066e-07  1.24838242e-03]]
s["db2"] = [[ 5.49507194e-05]
 [ 2.75494327e-03]
 [ 5.50629536e-04]]
```

3. Now for the big one: mini-batch gradient descent with Adam (again, don't forget to load the data).

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```

The result:

```
Cost after epoch 0: 0.690468
Cost after epoch 1000: 0.325328
Cost after epoch 2000: 0.223535
Cost after epoch 3000: 0.109833
Cost after epoch 4000: 0.140489
Cost after epoch 5000: 0.111570
Cost after epoch 6000: 0.128548
Cost after epoch 7000: 0.036306
Cost after epoch 8000: 0.128252
Cost after epoch 9000: 0.211592
Accuracy: 0.943333333333
```

The cost curve:
(figure: Adam cost curve)

The decision boundary:
(figure: decision boundary with Adam)
It is worth pausing to appreciate this: the optimization method improves the classification result by far more than a little.

7. To summarize:

Comparing the three methods on classification accuracy, Adam clearly wins. Plain mini-batch gradient descent and momentum both end up at:
0.796666666667
while Adam jumps straight to:
0.943333333333
So optimization algorithms not only greatly improve training speed; they can also have a considerable effect on classification performance.
