Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Motivation
The distribution of each layer's inputs changes during training (covariate shift) -> the learning rate cannot be too large and parameter initialization must be done carefully. Using BN -> alleviates these problems and also reduces the reliance on Dropout. The paper defines internal covariate shift as
the change in the distribution of network activations due to the change in network parameters during training.
Covariate shift can also be mitigated with domain adaptation techniques. Consider an example: suppose some layer of the network is a sigmoid layer, with
\[z=g(Wu+b)\]where \(g(x)=\frac{1}{1+\exp(-x)}\). Since \(g'(x)\) shrinks as \(|x|\) grows, if a layer's outputs push \(x\) into a region of large magnitude, the gradient saturates, and this effect is easily amplified as the network gets deeper. This particular problem can also be addressed by using ReLU.
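A quick numerical check of the saturation effect (a plain NumPy sketch, not from the paper; the helper names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x)) for the logistic sigmoid
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and vanishes as |x| grows.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  g'(x) = {sigmoid_grad(x):.6f}")
# x =   0.0  g'(x) = 0.250000
# x =   2.0  g'(x) = 0.104994
# x =   5.0  g'(x) = 0.006648
# x =  10.0  g'(x) = 0.000045
```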
Batch Normalization
The BN transform is as follows:
Input: Values of \(x\) over a mini-batch: \(\mathcal{B}=\{x_{1...m}\}\); Parameters to be learned: \(\gamma\), \(\beta\);
Output: \(\{y_i=BN_{\gamma,\beta}(x_i)\}\)
\[\begin{aligned} & \mu_{\mathcal{B}}=\frac{1}{m}\sum_{i=1}^{m}x_i \\ & \sigma_{\mathcal{B}}^2=\frac{1}{m}\sum_{i=1}^m(x_i-\mu_{\mathcal{B}})^2 \\ & \hat{x}_{i}=\frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} \\ & y_i=\gamma\hat{x}_i+\beta \equiv BN_{\gamma,\beta}(x_i) \end{aligned}\]The transform itself is simple. The normalized variable \(\hat{x}_i\) (which we can view as an intermediate output of the layer) has zero mean and unit variance over the mini-batch, which is what makes the scheme work. The learned parameters \(\gamma\) and \(\beta\) then scale and shift \(\hat{x}_i\), preserving the representational capacity of the network.
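A minimal NumPy sketch of the training-time transform above (function and parameter names such as `batch_norm_train` and `eps` are mine, not from the paper):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN transform over a mini-batch x of shape (m, d),
    normalizing each of the d features independently."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch (biased, 1/m) variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance
    y = gamma * x_hat + beta               # learned scale and shift
    return y, mu, var

# usage sketch: a mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))       # roughly 0 and 1 per feature
```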
Inference Network
The mini-batch statistics are only needed to accelerate training; at inference (or generation) we want the output to depend deterministically on the input, so they are replaced by population estimates: \(E[x] \leftarrow E_{\mathcal{B}}[\mu_{\mathcal{B}}]\) and \(Var[x] \leftarrow \frac{m}{m-1}E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]\), i.e. the averages of the mini-batch statistics seen during training, with the unbiased \(\frac{m}{m-1}\) correction on the variance.
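A sketch of the inference-time transform under these estimates (the recorded batch statistics below are illustrative values, and in practice frameworks usually keep a running average instead of storing every batch):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # Deterministic transform: fixed population statistics replace the mini-batch ones.
    return gamma * (x - pop_mean) / np.sqrt(pop_var + eps) + beta

# Suppose we recorded the mini-batch statistics of one feature
# over several training batches of size m (numbers are made up):
m = 32
batch_means = np.array([4.9, 5.1, 5.0])      # mu_B per batch
batch_vars  = np.array([8.7, 9.2, 9.0])      # sigma_B^2 per batch

pop_mean = batch_means.mean()                     # E[x]   <- E_B[mu_B]
pop_var  = (m / (m - 1)) * batch_vars.mean()      # Var[x] <- m/(m-1) * E_B[sigma_B^2]

y = batch_norm_inference(np.array([5.0]), gamma=1.0, beta=0.0,
                         pop_mean=pop_mean, pop_var=pop_var)
print(y)
```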
Convolutional Networks Case
When using batchnorm in convolutional networks, we should note:
- different feature maps are batch-normalized separately and have their own parameters (\(\gamma^{(k)}, \beta^{(k)}\));
- different locations within a single feature map are normalized jointly (see the per-channel sketch below).
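A per-channel NumPy sketch of the convolutional case, assuming NCHW layout (the function name and shapes are mine): the statistics are computed per feature map over the batch and all spatial positions, and \(\gamma, \beta\) have one entry per channel.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for a conv feature map x of shape (N, C, H, W): one mean/variance
    per channel, shared across the batch and all spatial locations."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 28, 28)            # N=8, C=16 feature maps of 28x28
gamma, beta = np.ones(16), np.zeros(16)
y = batch_norm_conv(x, gamma, beta)
print(y.mean(axis=(0, 2, 3))[:3], y.std(axis=(0, 2, 3))[:3])  # ~0 and ~1 per channel
```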
Something More
In the original paper, the authors insert batchnorm before the activation, and their analysis is based on that placement. However, some later experiments suggest that applying batchnorm after the activation can perform better.
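A PyTorch-style illustration of the two placements (layer sizes are arbitrary; this only shows the ordering, not an endorsement of either):

```python
import torch
import torch.nn as nn

# Ordering used in the paper: normalize the pre-activation, then apply the nonlinearity.
bn_before_act = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias is redundant here; BN's beta plays its role
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# Alternative ordering some experiments favor: activate first, then normalize.
bn_after_act = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)

x = torch.randn(32, 128)
y1, y2 = bn_before_act(x), bn_after_act(x)
```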
Theoretical Basis
LeCun showed that a network converges faster if its inputs are whitened (i.e., zero mean, unit variance, and decorrelated).