Scaling up Neural Network Training with Large Batch Size

Data parallelism is one of the most widely used ways to accelerate neural network training. It distributes gradient computation across multiple computational nodes to reduce the time spent per epoch. If we assign a fixed number of training samples, called a mini-batch, to each node in every step, the global batch size of the model grows linearly with the number of nodes involved. However, it is widely observed that training with a large batch size often leads to lower model accuracy. Training neural networks efficiently with large batch sizes is therefore crucial for scalability.
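The following toy sketch (using a simple least-squares model, not any particular framework) illustrates the point above: with K workers each computing the gradient on its own mini-batch of size b, averaging the per-worker gradients is equivalent to taking one gradient over a global batch of size K * b.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b, K = 5, 32, 4                      # feature dim, per-worker batch size, number of workers
w = rng.normal(size=d)                  # current model parameters
X = rng.normal(size=(K * b, d))         # global batch of K * b samples
y = rng.normal(size=K * b)

def grad(Xb, yb, w):
    """Mean-squared-error gradient over one mini-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Each worker computes a gradient on its own shard of the data ...
worker_grads = [grad(X[k*b:(k+1)*b], y[k*b:(k+1)*b], w) for k in range(K)]
# ... and the gradients are averaged (via all-reduce in a real system).
g_parallel = np.mean(worker_grads, axis=0)

# The result matches the gradient of the full K * b batch,
# so adding workers effectively enlarges the batch size.
g_large_batch = grad(X, y, w)
assert np.allclose(g_parallel, g_large_batch)
print("effective batch size:", K * b)
```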

Current efforts on this problem mainly focus on adapting the learning rate, e.g., scaling the learning rate linearly with the batch size [1], warming up the learning rate gradually [2], and layer-wise adaptive rate scaling [3]. In this project, we aim to understand the large-batch training problem thoroughly and propose more general and effective approaches to address it.
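A minimal sketch of the three learning-rate heuristics cited above is given below. The function names, default values, and hyper-parameters are illustrative assumptions, not taken from the papers' reference implementations.

```python
def linear_scaling(base_lr, batch_size, base_batch=256):
    """Linear scaling rule [1, 2]: multiply the learning rate by the
    factor by which the batch size grew relative to a reference batch."""
    return base_lr * batch_size / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    """Gradual warmup [2]: ramp the learning rate linearly from near zero
    to the target value over the first warmup_steps iterations."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

def lars_local_lr(weight_norm, grad_norm, global_lr, trust_coeff=0.001, eps=1e-9):
    """Layer-wise adaptive rate scaling (LARS) [3]: scale each layer's update
    by the ratio of its weight norm to its gradient norm (the 'trust ratio')."""
    trust_ratio = trust_coeff * weight_norm / (grad_norm + eps)
    return global_lr * trust_ratio

# Example: a base learning rate of 0.1 tuned for batch size 256,
# scaled up for a global batch of 8192 and warmed up over 2000 steps.
lr = linear_scaling(0.1, batch_size=8192)          # -> 3.2
print(warmup_lr(lr, step=0, warmup_steps=2000))    # small LR at the start
print(warmup_lr(lr, step=5000, warmup_steps=2000)) # full scaled LR afterwards
```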

Supervised by Shaoduo Gan

sgan@inf.ethz.ch

References

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
[2] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[3] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. arXiv preprint arXiv:1709.05011, 2017.