Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Motivation
The distribution of each layer's inputs changes during training (covariate shift) -> the learning rate cannot be too large and parameter initialization must be done carefully. Using BN -> alleviates these problems and also reduces the reliance on Dropout. The paper defines internal covariate shift as
the change in the distribution of network activations due to the change in network parameters during training.
Covariate shift can also be mitigated with domain adaptation techniques. Consider an example: suppose some layer of the network is a sigmoid layer, with
\[z=g(Wu+b)\]where \(g(x)=\frac{1}{1+\exp(-x)}\). Since \(g'(x)\) shrinks as \(|x|\) grows, if a layer's outputs push \(x\) into a region of large magnitude, the gradient saturates, and this effect is easily amplified as the network gets deeper. This particular problem can also be addressed by using ReLU.
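A quick numerical check of the saturation effect (a plain NumPy sketch, not from the paper; the helper names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x)) for the logistic sigmoid
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and vanishes as |x| grows.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  g'(x) = {sigmoid_grad(x):.6f}")
# x =   0.0  g'(x) = 0.250000
# x =   2.0  g'(x) = 0.104994
# x =   5.0  g'(x) = 0.006648
# x =  10.0  g'(x) = 0.000045
```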
Batch Normalization
The BN transform is as follows:
Input: Values of \(x\) over a mini-batch: \(\mathcal{B}=\{x_{1...m}\}\); Parameters to be learned: \(\gamma\), \(\beta\);
Output: \(\{y_i=BN_{\gamma,\beta}(x_i)\}\)
\[\begin{aligned} & \mu_{\mathcal{B}}=\frac{1}{m}\sum_{i=1}^{m}x_i \\ & \sigma_{\mathcal{B}}^2=\frac{1}{m}\sum_{i=1}^m(x_i-\mu_{\mathcal{B}})^2 \\ & \hat{x}_{i}=\frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} \\ & y_i=\gamma\hat{x}_i+\beta \equiv BN_{\gamma,\beta}(x_i) \end{aligned}\]The transform itself is simple. The normalized variable \(\hat{x}_i\) (which we can view as an intermediate output of the layer) has zero mean and unit variance over the mini-batch, which is what makes the scheme work. The learned parameters \(\gamma\) and \(\beta\) then scale and shift \(\hat{x}_i\), preserving the representational capacity of the network.
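A minimal NumPy sketch of the training-time transform above (function and parameter names such as `batch_norm_train` and `eps` are mine, not from the paper):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN transform over a mini-batch x of shape (m, d),
    normalizing each of the d features independently."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch (biased, 1/m) variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance
    y = gamma * x_hat + beta               # learned scale and shift
    return y, mu, var

# usage sketch: a mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))       # roughly 0 and 1 per feature
```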
Inference Network
The mini-batch statistics are only needed to accelerate training; at inference (or generation) we want the output to depend deterministically on the input, so they are replaced by population estimates: \(E[x] \leftarrow E_{\mathcal{B}}[\mu_{\mathcal{B}}]\) and \(Var[x] \leftarrow \frac{m}{m-1}E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]\), i.e. the averages of the mini-batch statistics seen during training, with the unbiased \(\frac{m}{m-1}\) correction on the variance.
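A sketch of the inference-time transform under these estimates (the recorded batch statistics below are illustrative values, and in practice frameworks usually keep a running average instead of storing every batch):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # Deterministic transform: fixed population statistics replace the mini-batch ones.
    return gamma * (x - pop_mean) / np.sqrt(pop_var + eps) + beta

# Suppose we recorded the mini-batch statistics of one feature
# over several training batches of size m (numbers are made up):
m = 32
batch_means = np.array([4.9, 5.1, 5.0])      # mu_B per batch
batch_vars  = np.array([8.7, 9.2, 9.0])      # sigma_B^2 per batch

pop_mean = batch_means.mean()                     # E[x]   <- E_B[mu_B]
pop_var  = (m / (m - 1)) * batch_vars.mean()      # Var[x] <- m/(m-1) * E_B[sigma_B^2]

y = batch_norm_inference(np.array([5.0]), gamma=1.0, beta=0.0,
                         pop_mean=pop_mean, pop_var=pop_var)
print(y)
```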
Convolutional Networks Case
When using batchnorm in convolutional networks, we should note:
- different feature maps are batch-normalized separately and have their own parameters (\(\gamma^{(k)}, \beta^{(k)}\));
- different locations within a single feature map are normalized jointly (see the per-channel sketch below).
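A per-channel NumPy sketch of the convolutional case, assuming NCHW layout (the function name and shapes are mine): the statistics are computed per feature map over the batch and all spatial positions, and \(\gamma, \beta\) have one entry per channel.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for a conv feature map x of shape (N, C, H, W): one mean/variance
    per channel, shared across the batch and all spatial locations."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 28, 28)            # N=8, C=16 feature maps of 28x28
gamma, beta = np.ones(16), np.zeros(16)
y = batch_norm_conv(x, gamma, beta)
print(y.mean(axis=(0, 2, 3))[:3], y.std(axis=(0, 2, 3))[:3])  # ~0 and ~1 per channel
```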
Something More
In the original paper, the authors insert batchnorm before the activation, and their analysis is based on that placement. However, some later experiments suggest that applying batchnorm after the activation can perform better.
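A PyTorch-style illustration of the two placements (layer sizes are arbitrary; this only shows the ordering, not an endorsement of either):

```python
import torch
import torch.nn as nn

# Ordering used in the paper: normalize the pre-activation, then apply the nonlinearity.
bn_before_act = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias is redundant here; BN's beta plays its role
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# Alternative ordering some experiments favor: activate first, then normalize.
bn_after_act = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)

x = torch.randn(32, 128)
y1, y2 = bn_before_act(x), bn_after_act(x)
```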
Theoretical Basis
LeCun showed that a network converges faster if its inputs are whitened (i.e., zero mean, unit variance, and decorrelated).