Deep Learning (3) - Optimization Algorithms


Batch vs. Mini-Batch

Machine learning algorithms usually have a huge number of training samples to process, often millions or more, which makes each full-matrix iteration very slow. Once the sample count gets large enough, the usual approach is to split the whole training set into small subsets, called mini-batches, and run the update iteratively over them. This turns one huge matrix operation into a sequence of cheaper ones, so each step runs much faster. The drawback is that convergence becomes somewhat harder. See the figure below:

![Alt text|300x0](./1534979648469.png)

The blue path is batch gradient descent, the purple path is stochastic gradient descent (mini-batch size = 1), and the green path is mini-batch gradient descent (1 $\lt$ mini-batch size $\lt$ m). The blue path converges smoothly to the minimum; the purple path oscillates heavily around it; the green path sits in between, wandering slightly near the minimum. Dynamically decaying the learning rate helps it converge the rest of the way.

How to choose mini-batch size?

  • If the training set is small (m $\le$ 2000), use batch gradient descent.
  • Typical mini-batch sizes: 64, 128, 256, 512, $\dots$, i.e. $2^6$ ~ $2^9$.
  • Make sure the mini-batch size fits in your CPU/GPU memory.
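As a concrete sketch, the split into mini-batches might look like the helper below (the function name is mine, not from the course; it assumes the column-per-example layout where X is (n_x, m) and Y is (1, m)):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m training examples and cut them into mini-batches.

    Assumes column-per-example layout: X is (n_x, m), Y is (1, m).
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                 # shuffle so batches are representative
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    # Slice off consecutive chunks; the last one may be smaller than batch_size.
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```

Each epoch then loops over the returned list, running one gradient step per mini-batch.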

Learning Rate Decay

There are many algorithms for dynamically adjusting the learning rate; the following one is quite common:

$$\alpha = \frac {1}{1+decay\_rate * epoch\_num} \alpha_0$$

epoch_num = 1, 2, 3, $\dots$
As epoch_num grows larger, $\alpha$ gets smaller and smaller.

Other learning rate decay methods:

  • $\alpha = 0.95^{epoch\_num}\alpha_0$
  • $\alpha = \frac {k}{\sqrt {epoch\_num}} \alpha_0$
  • discrete staircase
  • manual decay
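The schedules above are one-liners; a quick sketch (function names are mine):

```python
def inverse_decay(alpha0, epoch_num, decay_rate):
    # alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1.0 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num):
    # alpha = 0.95**epoch_num * alpha0
    return 0.95 ** epoch_num * alpha0

def sqrt_decay(alpha0, epoch_num, k):
    # alpha = k / sqrt(epoch_num) * alpha0
    return k / epoch_num ** 0.5 * alpha0
```

For example, with $\alpha_0 = 0.2$ and decay_rate = 1, the first schedule gives 0.1 at epoch 1 and 0.05 at epoch 3.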

Algorithms Related to the Exponential Weighted Average

Momentum, RMSprop, and Adam are all related to the exponential weighted average. Each is introduced below:

Exponential Weighted Average

The formula is:

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$

The idea is to average roughly the past $\frac {1}{1-\beta}$ data points, weighting the accumulated history by $\beta$ and the current value by $1-\beta$. For example, with $\beta = 0.9$ it effectively averages the previous 10 points and adds the current one with weight 0.1. The effect looks like this:

![Alt text|450x0](./1534980671272.png)

  • The blue dots are the raw data
  • The red line uses $\beta = 0.9$, accumulating the influence of roughly the previous 10 points
  • The green line uses $\beta = 0.98$, accumulating the influence of roughly the previous 50 points
  • The yellow line uses $\beta = 0.5$, accumulating the influence of roughly the previous 2 points
    As you can see, the exponential weighted average smooths the data: the larger $\beta$ is, the smoother the curve, and the smaller $\beta$ is, the closer it tracks the raw data.

How does it work?

Its working principle is illustrated below:

![Alt text](./1534981358549.png)

Since $(1-\epsilon)^{\frac{1}{\epsilon}} \approx 0.35$ with $\epsilon = 1-\beta$, the weights on data more than $\frac {1}{1-\beta}$ steps old are already very small. For example, with $\beta = 0.9$ the weight on the point t = 10 steps back is about $(1-\beta)\beta^t = 0.1*0.9^{10} \approx 0.035$; with $\beta = 0.98$ the weight at t = 50 steps back is $(1-\beta)\beta^t = 0.02*0.98^{50} \approx 0.007$. That is why we usually only count the influence of the most recent $\frac {1}{1-\beta}$ data points.
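These weights are easy to check directly; expanding the recursion assigns weight $(1-\beta)\beta^k$ to the data point k steps in the past:

```python
# Weight the EWA assigns to the point k steps back: (1 - beta) * beta**k
beta = 0.9
w_at_10 = (1 - beta) * beta ** 10   # about 0.035, already small

beta = 0.98
w_at_50 = (1 - beta) * beta ** 50   # about 0.007
```

Summing $(1-\beta)\beta^k$ over all k gives 1, so the EWA really is a weighted average whose mass sits on the most recent $\frac{1}{1-\beta}$ points.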

Bias Correction for the Exponential Weighted Average

The reason for bias correction is that the exponential weighted average starts from an initial value of 0, so the first stretch of iterations deviates well below the true values, like the green and purple lines below.

![Alt text](./1535013368575.png)

The correction is:

$$v_t^{corrected} = \frac {v_t}{1-\beta^t}$$

In the early phase, when t is small, $1-\beta^t$ is also small, so dividing by it scales $v_t$ up so that it is no longer too small, which corrects the bias. As t grows, $\beta^t \to 0$ and the correction fades away.
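The effect of the correction is easiest to see on a constant signal (a minimal sketch; `ewa` is my own helper name):

```python
def ewa(data, beta, correct_bias=False):
    """Exponential weighted average v_t = beta*v_{t-1} + (1-beta)*theta_t,
    optionally divided by (1 - beta**t) for bias correction."""
    v, out = 0.0, []
    for t, theta in enumerate(data, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if correct_bias else v)
    return out

raw = ewa([10.0] * 5, beta=0.9)                         # starts near 1, far below 10
fixed = ewa([10.0] * 5, beta=0.9, correct_bias=True)    # tracks 10 from t = 1
```

Without correction the first value is $(1-\beta)\cdot 10 = 1$; with correction it is $1/(1-0.9) = 10$, matching the signal immediately.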

Momentum

Momentum is an application of the exponential weighted average.

![Alt text](./1535016214569.png)

We usually want to speed up the horizontal movement toward the minimum while damping the vertical oscillation, and that is exactly what an exponential weighted average of the gradients provides.

![Alt text|450x0](./1535013718754.png)

Momentum is usually used without bias correction, because the bias only affects the first few iterations and does not affect convergence later on.
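In code form, one momentum step looks like this (the standard course formulation with the $(1-\beta)$ factor; the function name is mine):

```python
def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum:
    v_dw = beta*v_dw + (1-beta)*dw,  then  w := w - alpha*v_dw."""
    v_dw = beta * v_dw + (1 - beta) * dw
    return w - alpha * v_dw, v_dw
```

The velocity `v_dw` is carried across iterations, so oscillating gradient components average toward zero while consistent components accumulate.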

RMSprop (Root Mean Square prop)

RMSprop is an algorithm derived from the same idea as Momentum. It works as follows:

On iteration t:

  • Compute dw and db on the current mini-batch
  • $S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2$
  • $S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$
  • $w := w-\alpha \frac{dw}{\sqrt {S_{dw}} + \epsilon}$
  • $b := b-\alpha \frac{db}{\sqrt {S_{db}} + \epsilon}$
  • $\epsilon = 10^{-8}$

To distinguish it from the hyperparameter $\beta$ used in Momentum and the exponential weighted average, $\beta_2$ is used here.
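The same update as a small sketch (name is mine): dividing by the root of the running mean square shrinks steps along directions with large, oscillating gradients.

```python
def rmsprop_step(w, dw, s_dw, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop step: keep an EWA of the squared gradient and
    divide the raw gradient by its square root."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    return w - alpha * dw / (s_dw ** 0.5 + eps), s_dw
```

`eps` only guards against division by zero when `s_dw` is still tiny.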

Adam optimization algorithm

Adam = Adaptive Moment Estimation
Adam = Momentum + Bias Correction + RMSprop
The full algorithm is as follows:

$v_{dw} = 0, S_{dw} = 0, v_{db} = 0, S_{db} = 0$

On iteration t:

  • Compute dw and db on the current mini-batch
  • $v_{dw} = \beta_1 v_{dw} + (1-\beta_1)dw$, $S_{dw} = \beta_2 S_{dw} + (1-\beta_2)dw^2$
  • $v_{db} = \beta_1 v_{db} + (1-\beta_1)db$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$
  • $v_{dw}^{corrected} = \frac {v_{dw}}{1-\beta_1^t}$, $v_{db}^{corrected} = \frac {v_{db}}{1-\beta_1^t}$
  • $S_{dw}^{corrected} = \frac {S_{dw}}{1-\beta_2^t}$, $S_{db}^{corrected} = \frac {S_{db}}{1-\beta_2^t}$
  • $w := w-\alpha \frac{v_{dw}^{corrected}}{\sqrt {S_{dw}^{corrected}} + \epsilon}$
  • $b := b-\alpha \frac{v_{db}^{corrected}}{\sqrt {S_{db}^{corrected}} + \epsilon}$

Hyper parameters (usually set to standard empirical values; the course recommends $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, while $\alpha$ still needs tuning):

![Alt text|300x0](./1535017116124.png)
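Putting the update above into code (a sketch; note that thanks to bias correction, the very first step moves w by roughly $\alpha$ regardless of the gradient's scale):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum term v, RMSprop term s,
    both bias-corrected by (1 - beta**t)."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)       # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)       # bias-corrected second moment
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s
```

On iteration t = 1, v_hat = dw and s_hat = dw**2, so the update reduces to roughly `w - alpha * sign(dw)`.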

Local Optima Problem

Andrew explained that in modern machine learning problems, local optima are unlikely to appear.

![Alt text](./1535017928729.png)

The surface above looks like it has many local optima and one global optimum, but that is not what actually happens: the picture is only three-dimensional, so it can show the cost J against just two of its dimensions. In reality J depends on w and b, and w is a matrix with many elements; a local optimum requires the partial derivative of every element to satisfy the extremum condition simultaneously, which is very unlikely when w is high-dimensional. So there is no need to worry about the local optima problem.
