# Deep Learning (2) - Regularization


In machine learning, two problems come up constantly:

- bias (also called high bias, or underfitting)
- variance (also called high variance, or overfitting)

The former usually means the model is not complex enough to learn the correct mapping from the available data; the training error is large. One fix is to train a bigger network.
The latter means the model depends too heavily on the training data and performs poorly on unseen data; the training error is small, but the dev error is noticeably larger. The remedy is regularization.

## Regularization

The cost function with a regularization term added is:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathscr L(\hat y^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|_2^2 \tag 1$$

> Equation (1) does not regularize b, i.e. there is no $\frac{\lambda}{2m}b^2$ term. The reason is that b is just a scalar, while w is usually a large matrix with far more parameters, and is therefore where overfitting comes from. Regularizing b as well is fine, but it usually has little effect on the result.

The regularization term in (1), $\frac{\lambda}{2m} \|w\|_2^2$, is called ***L2 regularization***. The common variants are:

- L2 regularization (a.k.a. weight decay): $\frac{\lambda}{2m} \|w\|_2^2 = \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_j^2 = \frac{\lambda}{2m} w^Tw$
- L1 regularization: $\frac{\lambda}{2m} \|w\|_1 = \frac{\lambda}{2m} \sum_{j=1}^{n_x} |w_j|$
- Frobenius regularization (the matrix version used for layer weights): $\frac{\lambda}{2m} \|w^{[l]}\|_F^2 = \frac{\lambda}{2m} \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} (w_{ij}^{[l]})^2$
- Dropout regularization: covered in its own section below.

> **L2 regularization = weight decay**
> Taking the derivative of the regularized cost function gives:
> $dw^{[l]} = \frac{\partial J}{\partial w^{[l]}} = (\text{from backprop}) + \frac{\lambda}{m} w^{[l]}$
> $w^{[l]} := w^{[l]} - \alpha \, dw^{[l]} = \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha \, (\text{from backprop})$
> My understanding: the factor $(1 - \frac{\alpha\lambda}{m}) < 1$ multiplies $w^{[l]}$ at every update, so the weights keep shrinking; hence "weight decay".
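
As a minimal NumPy sketch of the two formulas above (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def l2_cost(base_cost, weights, lam, m):
    """Add the (lambda / 2m) * sum-of-squared-weights penalty from equation (1)."""
    penalty = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return base_cost + penalty

def update_with_weight_decay(W, dW_backprop, lam, m, alpha):
    """One gradient step with the extra (lambda / m) * W term."""
    dW = dW_backprop + (lam / m) * W  # regularized gradient
    return W - alpha * dW             # = (1 - alpha*lam/m) * W - alpha * dW_backprop
```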

## Dropout regularization

Dropout regularizes the network by randomly dropping some hidden units on every training iteration.

![Alt text](./1534490558916.png)

### Implementation Example (Inverted Dropout)

This is the most common way to implement dropout, called "Inverted Dropout". For layer 3, the computation is:

```python
import numpy as np

keep_prob = 0.8                                            # probability of keeping a unit
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep mask
a3 = np.multiply(a3, d3)                                   # zero out dropped units
a3 /= keep_prob                                            # rescale so E[a3] is unchanged
```

A few notes:

1. `keep_prob` is the probability of keeping a hidden unit. With `keep_prob = 0.8`, 80% of the units are kept and 20% are dropped.
2. The last line re-scales because, with some units zeroed out, the expected value of `a3` is only `keep_prob` times its original value. Re-scaling restores the expectation, so there is no scaling mismatch when evaluating on the test set.
3. Note that the first line uses `np.random.rand`, not `np.random.randn`. `randn` draws from a normal distribution, so the probability of a draw falling below `keep_prob` would not equal `keep_prob`; `rand` draws from a uniform distribution on [0, 1), which guarantees it.

### Make Prediction at Test Time

Do not use dropout when making predictions. Dropout injects random noise, which would make the prediction $\hat y$ unstable. And because inverted dropout already re-scaled the activations during training, no extra scaling is needed at test time either.

![Alt text|350x0](./1534496834899.png)
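
Putting the two modes together, a hedged sketch (the helper name `apply_dropout` is mine, not from the course):

```python
import numpy as np

def apply_dropout(a, keep_prob, train=True):
    """Inverted dropout: active during training, identity at test time."""
    if not train:
        return a                              # no dropout, no rescaling at test time
    d = np.random.rand(*a.shape) < keep_prob  # keep each unit with prob keep_prob
    return a * d / keep_prob                  # drop units, then restore the expectation
```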

## Other Regularization

- Augment the data set (see the first sketch after this list).
  - Transform the existing images to get a larger training set and reduce variance, e.g. by rotating or flipping them.
- Early Stopping (see the second sketch after this list).
  - Stop while the variance problem is still mild: in the figure below, stop iterating around the middle point, where the dev error starts to rise.

![Alt text|500x0](./1534497836025.png)
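
A minimal augmentation sketch with NumPy (assuming images as arrays of shape (height, width, channels)):

```python
import numpy as np

def augment(image):
    """Return simple variants of one image: a horizontal flip and three rotations."""
    variants = [np.fliplr(image)]                        # mirror left-right
    variants += [np.rot90(image, k) for k in (1, 2, 3)]  # rotate 90/180/270 degrees
    return variants
```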
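
And a hedged early-stopping loop (`train_one_epoch` and `dev_error` are placeholders for whatever training code is in use):

```python
def train_with_early_stopping(model, patience=3):
    """Stop once dev error has not improved for `patience` consecutive epochs."""
    best_error, epochs_without_improvement = float("inf"), 0
    while epochs_without_improvement < patience:
        train_one_epoch(model)    # placeholder: one pass over the training set
        error = dev_error(model)  # placeholder: error on the dev set
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
    return model
```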

## Vanishing/Exploding Gradients

Vanishing and exploding gradients are a major reason why neural networks could not be made very deep.

![Alt text](./1534498747912.png)

Take a network like the one above as an example, with a linear activation function, i.e. $g(z) = z$. Then

$$\hat y = w^{[L]} w^{[L-1]} \dots w^{[1]} x$$

$\frac{\partial J}{\partial w^{[l]}}$ is proportional to $w^{[l-1]} \dots w^{[1]} x$. If the entries of the $w$ matrices are below 1, multiplying them $n$ times drives the product toward 0: vanishing gradients. If they are above 1, the product blows up: exploding gradients. There is still no way to eliminate these two problems entirely, but they can be mitigated by how the weight matrices are randomly initialized.
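
A quick numeric illustration of the effect (a toy 100-layer linear network, nothing course-specific):

```python
import numpy as np

x = np.ones(4)
for scale in (0.5, 1.5):             # weight entries slightly below / above 1
    a = x.copy()
    W = scale * np.eye(4)            # one layer's weight matrix
    for _ in range(100):             # 100-layer linear network: a = W a
        a = W @ a
    print(scale, np.linalg.norm(a))  # 0.5 vanishes to ~1e-30; 1.5 explodes to ~1e18
```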

![Alt text|250x0](./1534546536710.png)

For a single-layer network like this with a linear activation, $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$: the larger $n$ is, the smaller we want each term to be, so we aim for $Var(w_i) = \frac{1}{n}$. The weights can therefore be initialized as:

$$w^{[l]} = \mathtt{np.random.randn(shape)} \ast \mathtt{np.sqrt}\left(\frac{1}{n^{[l-1]}}\right) \tag 2$$

With a ReLU activation, the preferred variance is $Var(w_i) = \frac{2}{n}$; simply change the constant in (2) to 2:

$$w^{[l]} = \mathtt{np.random.randn(shape)} \ast \mathtt{np.sqrt}\left(\frac{2}{n^{[l-1]}}\right) \tag 3$$

With a tanh activation, formula (2) applies.
The square-root factor in (2) is also called Xavier Initialization: $\sqrt{\frac{1}{n^{[l-1]}}}$
The square-root factor in (3) is called He Initialization: $\sqrt{\frac{2}{n^{[l-1]}}}$
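
A sketch of both schemes in one helper (the function name and parameter layout are illustrative):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """He init (sqrt(2/n)) for ReLU layers, Xavier init (sqrt(1/n)) for tanh."""
    c = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = np.random.randn(n, n_prev) * np.sqrt(c / n_prev)
        params[f"b{l}"] = np.zeros((n, 1))  # biases can safely start at zero
    return params
```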

> **Conclusion for random weight initialization**

- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- He initialization works well for networks with ReLU activations.
