

Paper Review: A Disciplined Approach To Neural Network Hyper-Parameters: Part I - Learning Rate, Batch Size, Momentum, And Weight Decay
Jithin James
January 16th, 2019
jithinjk.github.io/blog
Contents

Abstract
Introduction
The Unreasonable Effectiveness Of Validation/Test Loss
  3.1  The Underfitting And Overfitting Trade-off
  3.2  Underfitting
  3.3  Overfitting
Cyclical Learning Rates, Batch Sizes, Cyclical Momentum, And Weight Decay
  4.1  Cyclical Learning Rates And Super-convergence
  4.2  Batch Size
  4.3  Cyclical Momentum
  4.4  Weight Decay
Recipe For Finding A Good Set Of Hyper-Parameters With A Given Dataset And Architecture
Wrapping Up
Bibliography

In this paper review, we will dive into Leslie N. Smith's A Disciplined Approach To Neural Network Hyper-Parameters: Part I - Learning Rate, Batch Size, Momentum, And Weight Decay. In it, the author discusses how various neural network hyper-parameters can be set to significantly reduce training time and improve performance.

I'll try to summarize important points from this paper.

Code to replicate the results: https://github.com/lnsmith54/hyperParam1

   

Abstract

   

Introduction

   

The Unreasonable Effectiveness Of Validation/Test Loss

 Figure 1: Comparing the training loss, validation accuracy, validation loss, and generalization error

Remark 1. The test/validation loss is a good indicator of the network’s convergence

   

The Underfitting And Overfitting Trade-off

 Figure 2: Tradeoff between underfitting and overfitting

Remark 2. Achieving the horizontal part of the test loss is the goal of hyperparameter tuning

   

Underfitting

 Figure 3: Continuously decreasing test loss, rather than a horizontal plateau, is a characteristic of underfitting.

   

Overfitting

 Figure 4: Increasing validation/test loss is a characteristic of overfitting

The art of setting the network’s hyper-parameters amounts to ending up at the balance point between underfitting and overfitting — Leslie N. Smith

   

Cyclical Learning Rates, Batch Sizes, Cyclical Momentum, And Weight Decay

Hyper-parameters are tightly coupled with each other, with the data, and with the architecture.

   

Cyclical Learning Rates And Super-convergence

We've seen that a learning rate (LR) that is too small can cause overfitting. Large LRs help regularize training, but if the LR is too large, training will diverge. The author proposes Cyclical Learning Rates (CLR) and the learning rate range test (LR range test) [Smith15] for choosing the learning rate.

A brief review of how to use CLR:

 Figure 5: Faster training by allowing the LR to become large and by reducing other regularization methods.
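To make the LR range test concrete, here is a minimal sketch of the idea in PyTorch. This is not the author's code (see the repository linked above for that); `model`, `train_loader`, and `criterion` are assumed placeholders, and the ramp bounds and divergence check are illustrative choices.

```python
# Minimal sketch of an LR range test: linearly increase the LR each
# iteration and record the loss. Assumes a PyTorch `model`, `train_loader`,
# and `criterion` are already defined; PyTorch itself is an assumption here.
import math
import torch

def lr_range_test(model, train_loader, criterion,
                  min_lr=1e-5, max_lr=1.0, num_iters=100):
    """Return (lrs, losses); the largest LR reached before the loss diverges
    is a reasonable maximum LR for a CLR/1cycle schedule."""
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr, momentum=0.9)
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for i in range(num_iters):
        lr = min_lr + (max_lr - min_lr) * i / (num_iters - 1)  # linear ramp
        for group in optimizer.param_groups:
            group["lr"] = lr
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        if math.isnan(losses[-1]) or losses[-1] > 4 * losses[0]:
            break  # training has diverged; stop the test
    return lrs, losses
```

Plotting the recorded loss against the LR then shows where training starts to diverge, which is the point the paper uses to pick the maximum LR.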

Remark 3. The amount of regularization must be balanced for each dataset and architecture

   

Batch Size

 Figure 6: The effects of total batch size (TBS) on validation accuracy/loss

Remark 4. Goal: Obtain highest performance while minimizing computational time

   

Cyclical Momentum

 Figure 7: Cyclical momentum tests

For plain SGD, the weight update is

$$\theta_{iter+1} = \theta_{iter} - \epsilon\, \delta L(F(x; \theta), \theta)$$

where $\theta$ represents all the network parameters, $\epsilon$ is the learning rate, and $\delta L(F(x; \theta), \theta)$ is the gradient.

With momentum, the update becomes

$$v_{iter+1} = \alpha\, v_{iter} - \epsilon\, \delta L(F(x; \theta), \theta)$$
$$\theta_{iter+1} = \theta_{iter} + v$$

where $v$ is the velocity and $\alpha$ is the momentum coefficient.
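As a quick illustration, the two update rules can be written out in a few lines of NumPy. `grad_fn` is a hypothetical function returning the gradient for the current parameters, and the toy quadratic loss is only for demonstration.

```python
# Plain SGD vs. SGD with momentum, following the update rules above.
import numpy as np

def sgd_step(theta, grad_fn, lr):
    # theta_{iter+1} = theta_iter - eps * grad
    return theta - lr * grad_fn(theta)

def momentum_step(theta, v, grad_fn, lr, momentum):
    # v_{iter+1} = alpha * v_iter - eps * grad;  theta_{iter+1} = theta_iter + v
    v = momentum * v - lr * grad_fn(theta)
    return theta + v, v

# Toy example on L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(10):
    theta, v = momentum_step(theta, v, grad_fn=lambda t: t, lr=0.1, momentum=0.9)
```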

- Is cyclical momentum useful, and if so, when?

Remark 5. Optimal momentum value(s) will improve network training

   

Weight Decay

 Figure 8: Weight decay search using a 3-layer network on the CIFAR-10 dataset
- For WD, the best value should remain constant through the training. Test with $10^{-3}$, $10^{-4}$, $10^{-5}$, and 0 (a short sketch of such a search follows this list).
- Smaller datasets and architectures seem to require larger values for weight decay, while larger datasets and deeper architectures seem to require smaller values.
- The author supposes that complex data provides its own regularization, so other regularization should be reduced.
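As a rough illustration of such a search, the sketch below runs a few short trials over the suggested values. It is not the author's procedure verbatim; `model_fn`, `train_loader`, `val_loader`, and `criterion` are assumed placeholders, and the run length and optimizer settings are arbitrary.

```python
# Coarse weight-decay grid search via short training runs (PyTorch assumed).
import torch

def short_run_val_loss(weight_decay, epochs=3, lr=0.1, momentum=0.9):
    model = model_fn()  # hypothetical factory returning a fresh model per trial
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=momentum, weight_decay=weight_decay)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        losses = [criterion(model(x), y).item() for x, y in val_loader]
    return sum(losses) / len(losses)

# Coarse grid from the suggestion above; refine around the best value afterwards.
for wd in (1e-3, 1e-4, 1e-5, 0.0):
    print(wd, short_run_val_loss(wd))
```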

Remark 6. The value of WD is a key knob to turn for tuning regularization against the regularization from an increasing LR.

- Can the values for weight decay, learning rate, and momentum all be determined simultaneously?

   

Recipe For Finding A Good Set Of Hyper-Parameters With A Given Dataset And Architecture

 Figure 9: Training ResNet and Inception architectures on the ImageNet dataset with the standard LR policy (blue curve) versus a 1cycle policy that displays super-convergence

  1. Learning rate (LR): Perform an LR range test up to a "large" LR; the maximum LR depends on the architecture. Using the 1cycle LR policy with a maximum LR determined from the LR range test, a minimum LR set to a tenth of the maximum appears to work well, but other factors, such as the rate of LR increase, are also relevant. (A sketch of the 1cycle schedule follows this list.)
  2. Total batch size (TBS): A large batch size works well, but the magnitude is typically constrained by GPU memory. In addition, small batch sizes add regularization while large batch sizes add less, so it is better to use a larger batch size so that a larger LR can be used.
  3. Momentum: Short runs with momentum values of 0.99, 0.97, 0.95, and 0.9 will quickly show the best value for momentum. If using the 1cycle LR schedule, it is better to use a cyclical momentum (CM) that starts at this maximum momentum value and decreases with increasing LR to a value of 0.8 or 0.85. Using cyclical momentum along with the LR range test stabilizes convergence when using large LR values more than a constant momentum does.
  4. Weight decay (WD): Requires a grid search to determine the proper magnitude, but usually does not require more than one significant figure of accuracy. Use your knowledge of the dataset and architecture to decide which values to test. For example, a more complex dataset requires less regularization, so test smaller weight decay values such as $10^{-4}$, $10^{-5}$, $10^{-6}$, and 0. A shallow architecture requires more regularization, so test larger weight decay values such as $10^{-2}$, $10^{-3}$, and $10^{-4}$.
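To tie items 1 and 3 together, here is a minimal sketch of a 1cycle-style schedule with cyclical momentum. The specific bounds (`max_lr`, the 0.95/0.85 momentum range) are illustrative defaults drawn from the discussion above, not prescriptions for any particular dataset, and the optimizer usage shown in the comment assumes PyTorch.

```python
# 1cycle-style schedule: LR ramps up then back down over one cycle, while
# momentum moves in the opposite direction.
def one_cycle(iteration, total_iters, max_lr,
              min_lr=None, max_mom=0.95, min_mom=0.85):
    """Return (lr, momentum) for a given iteration of a 1cycle run."""
    min_lr = max_lr / 10 if min_lr is None else min_lr
    half = total_iters / 2
    if iteration <= half:                 # first half: LR up, momentum down
        t = iteration / half
    else:                                 # second half: LR down, momentum up
        t = (total_iters - iteration) / half
    lr = min_lr + t * (max_lr - min_lr)
    momentum = max_mom - t * (max_mom - min_mom)
    return lr, momentum

# Example usage with a PyTorch optimizer (assumed, not the paper's code):
# for i in range(total_iters):
#     lr, mom = one_cycle(i, total_iters, max_lr=0.1)
#     for g in optimizer.param_groups:
#         g["lr"], g["momentum"] = lr, mom
#     ...one training step...
```

The paper also suggests letting the LR decay well below the minimum for the last few iterations of the run; the sketch above omits that final phase for brevity.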

   

Wrapping Up

I hope that this paper review was helpful for you. If you think something's missing, feel free to refer to the original paper to clarify your doubts.

More blog posts can be found at https://jithinjk.github.io/blog

   

Bibliography

[Aarts88] Emile Aarts and Jan Korst. Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley & Sons, Inc., New York, NY, USA, 1989.
[Bengio12] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pp. 437–478. Springer, 2012.
[Bengio09] Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.
[Bergstra12] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
[Goodfellow16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[He16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[Kingma14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
[Kukacka17] Jan Kukacka, Vladimir Golkov, and Daniel Cremers. Regularization for deep learning: A taxonomy. arXiv preprint, 2017.
[Orr03] Genevieve B. Orr and Klaus-Robert Müller. Neural Networks: Tricks of the Trade. Springer, 2003.
[Smith15] Leslie N. Smith. No more pesky learning rate guessing games. arXiv preprint, 2015.
[Smith17] Leslie N. Smith. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 464–472. IEEE, 2017.
[SmithTopin17] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates. arXiv preprint, 2017.
[SmithLe17] Samuel L. Smith and Quoc V. Le. Understanding generalization and stochastic gradient descent. arXiv preprint, 2017.
[SmithKindermansLe17] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint, 2017.
[Srivastava14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Szegedy17] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, pp. 12, 2017.
[Wilson03] D. Randall Wilson and Tony R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
[Xing18] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint, 2018.
