# Deep Learning (5) - Batch Normalization


When building a machine-learning model, it is common to normalize the input features $X$ first:

$\mu = \frac{1}{m}\Sigma_i x^{(i)}$
$X = X - \mu$
$\sigma^2 = \frac{1}{m}\Sigma_i (x^{(i)} - \mu)^2$
$X = \frac{X}{\sigma}$
The benefit of normalizing is faster convergence. In the figure below, normalization turns the cost contours from ellipses into circles: wherever the starting point falls, gradient descent converges to the optimum at the center. On the unnormalized (left) contours, a little random noise can push the update direction off course and make the result diverge, so the learning rate must be chosen very carefully.

![Alt text|300x0](./1535670653390.png)
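As a minimal sketch of the four steps above (plain numpy; the function name and example data are my own):

```python
import numpy as np

def normalize_inputs(X):
    """Zero-center and unit-scale every feature (column) of X, shape (m, n)."""
    mu = X.mean(axis=0)             # mu = (1/m) * sum_i x^(i)
    X = X - mu                      # X = X - mu
    sigma2 = (X ** 2).mean(axis=0)  # sigma^2 = (1/m) * sum_i (x^(i) - mu)^2
    return X / np.sqrt(sigma2)      # X = X / sigma

# Features on wildly different scales become directly comparable:
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
print(normalize_inputs(X))  # each column now has mean 0 and variance 1
```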

## Implement

Batch Norm applies the same kind of normalization to the intermediate variable $Z^{[l]}$ in every layer of the network (a minimal code sketch follows the list):

- $\mu = \frac{1}{m}\Sigma_i Z^{(i)}$
- $\sigma^2 = \frac{1}{m}\Sigma_i (Z^{(i)} - \mu)^2$
- $Z_{norm}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- $\tilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta$
- The parameters $\gamma$ and $\beta$ control the mean and variance of $\tilde{Z}$; like $W$ and $b$, they are learnable parameters solved for during training.
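Here is a minimal numpy sketch of this forward transform for one layer, assuming $Z$ holds one example per column (the function and argument names are my own):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch, then rescale with learnable gamma, beta.

    Z:           (n_units, m) pre-activations, one column per example
    gamma, beta: (n_units, 1) learnable scale and shift
    """
    mu = Z.mean(axis=1, keepdims=True)          # per-unit batch mean
    sigma2 = Z.var(axis=1, keepdims=True)       # per-unit batch variance
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)   # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta             # learnable mean beta, scale gamma
    return Z_tilde, Z_norm, mu, sigma2          # cache Z_norm etc. for backprop
```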

> Why have $\gamma$ and $\beta$?
> If each unit were always normalized to mean 0 and variance 1, then with a sigmoid-like activation function the unit's input would be concentrated in the central, nearly linear region of the curve. The unit would degenerate into a linear activation, every node would become linear, and the whole network would collapse into logistic regression. To preserve non-linearity, $\gamma$ and $\beta$ let each unit adjust the distribution of its pre-activation.
> ![Alt text](./1535683830394.png)
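A quick numeric check of that claim (just an illustration, my own script): within one standard deviation of a zero-mean, unit-variance input, sigmoid is almost identical to its tangent line at zero, $0.5 + z/4$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs within +/- 1 std of a normalized pre-activation.
z = np.linspace(-1.0, 1.0, 201)
linear = 0.5 + z / 4  # first-order Taylor expansion of sigmoid at 0
print(np.abs(sigmoid(z) - linear).max())  # ~0.02: effectively linear here
```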

## Batch Norm in Neural Network

for $t = 1 \ldots$ num of mini-batches
   compute the forward pass on $X^{\{t\}}$
       In each hidden layer, use BN to replace $Z^{[l]}$ with $\tilde{Z}^{[l]}$
   Use backprop to compute $dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$
   Update the parameters:
       $W^{[l]} := W^{[l]} - \alpha \cdot dW^{[l]}$
       $\beta^{[l]} := \beta^{[l]} - \alpha \cdot d\beta^{[l]}$
       $\gamma^{[l]} := \gamma^{[l]} - \alpha \cdot d\gamma^{[l]}$

This works with momentum, RMSprop, and Adam as well.

Note that $db^{[l]}$ is omitted here. $b^{[l]}$ is a constant that does not depend on the input, so whatever it adds to $Z^{[l]}$ is subtracted right back out when the batch mean is removed during normalization; its role as a per-unit shift is absorbed into $\beta^{[l]}$.
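As a rough sketch of one update step for the BN parameters (my own minimal implementation; `dZ_tilde` stands for the upstream gradient of the cost with respect to this layer's BN output, and `Z_norm` is the value cached in the forward pass):

```python
import numpy as np

def batchnorm_param_step(dZ_tilde, Z_norm, gamma, beta, alpha):
    """Gradients for gamma and beta from Z_tilde = gamma * Z_norm + beta,
    followed by a plain gradient-descent update."""
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)  # dJ/dgamma
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)            # dJ/dbeta
    # No db anywhere: the mean subtraction inside BN cancels b, and beta
    # takes over as the per-unit shift.
    gamma -= alpha * dgamma
    beta -= alpha * dbeta
    return gamma, beta
```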

## Batch Norm at test time

At test time we do not compute Batch Norm statistics on the test set itself, because its distribution may differ from the training set's (and predictions are often made one example at a time, leaving no batch to average over). The forward pass still needs $\mu$ and $\sigma^2$, though, since all the hidden-unit parameters were trained under Batch Norm. The compromise is to keep an exponentially weighted average of the training set's $\mu$ and $\sigma^2$ and use those values at test time. Concretely:

- For each mini-batch $t$, record every layer's $\mu^{\{t\}[l]}$ and $\sigma^{2\{t\}[l]}$
- Update the running $\mu$ and $\sigma^2$ with an exponentially weighted average across mini-batches
- When training ends, keep the final $\mu$ and $\sigma^2$ and use them on the test set (a minimal sketch follows)
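A minimal sketch of that bookkeeping (the helper names and the decay rate 0.9 are my own arbitrary choices):

```python
import numpy as np

def update_running_stats(run_mu, run_sigma2, mu, sigma2, decay=0.9):
    """Exponentially weighted average of batch statistics across mini-batches."""
    run_mu = decay * run_mu + (1 - decay) * mu
    run_sigma2 = decay * run_sigma2 + (1 - decay) * sigma2
    return run_mu, run_sigma2

def batchnorm_at_test(z, run_mu, run_sigma2, gamma, beta, eps=1e-8):
    """Normalize a test example with the stored training statistics
    instead of statistics computed on the test data itself."""
    z_norm = (z - run_mu) / np.sqrt(run_sigma2 + eps)
    return gamma * z_norm + beta
```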