The Unreasonable Effectiveness Of Early Discarding After One Epoch In Neural Network Hyperparameter Optimization
- URL: http://arxiv.org/abs/2404.04111v1
- Date: Fri, 5 Apr 2024 14:08:57 GMT
- Title: The Unreasonable Effectiveness Of Early Discarding After One Epoch In Neural Network Hyperparameter Optimization
- Authors: Romain Egele, Felix Mohr, Tom Viering, Prasanna Balaprakash,
- Abstract summary: We study the trade-off between the aggressiveness of discarding and the loss of predictive performance.
We call this approach i-Epoch (i being the constant number of epochs with which neural networks are trained) and suggest to assess the quality of early discarding techniques.
- Score: 10.93405937763835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To reach high performance with deep learning, hyperparameter optimization (HPO) is essential. This process is usually time-consuming due to costly evaluations of neural networks. Early discarding techniques limit the resources granted to unpromising candidates by observing the empirical learning curves and canceling neural network training as soon as the lack of competitiveness of a candidate becomes evident. Despite two decades of research, little is understood about the trade-off between the aggressiveness of discarding and the loss of predictive performance. Our paper studies this trade-off for several commonly used discarding techniques such as successive halving and learning curve extrapolation. Our surprising finding is that these commonly used techniques offer minimal to no added value compared to the simple strategy of discarding after a constant number of epochs of training. The chosen number of epochs depends mostly on the available compute budget. We call this approach i-Epoch (i being the constant number of epochs with which neural networks are trained) and suggest to assess the quality of early discarding techniques by comparing how their Pareto-Front (in consumed training epochs and predictive performance) complement the Pareto-Front of i-Epoch.
Related papers
- Learning Rate Optimization for Deep Neural Networks Using Lipschitz Bandits [9.361762652324968]
A properly tuned learning rate leads to faster training and higher test accuracy.
We propose a Lipschitz bandit-driven approach for tuning the learning rate of neural networks.
arXiv Detail & Related papers (2024-09-15T16:21:55Z) - Approximated Likelihood Ratio: A Forward-Only and Parallel Framework for Boosting Neural Network Training [30.452060061499523]
We introduce an approximation technique for the likelihood ratio (LR) method to alleviate computational and memory demands in gradient estimation.
Experiments demonstrate the effectiveness of the approximation technique in neural network training.
arXiv Detail & Related papers (2024-03-18T23:23:50Z) - Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking [81.57031092474625]
Recent work by Power et al. highlighted a surprising "grokking" phenomenon in learning arithmetic tasks.
A neural net first "memorizes" the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy.
This paper studies the grokking phenomenon in theoretical setups and shows that it can be induced by a dichotomy of early and late phase implicit biases.
arXiv Detail & Related papers (2023-11-30T18:55:38Z) - Analytically Tractable Inference in Deep Neural Networks [0.0]
Tractable Approximate Inference (TAGI) algorithm was shown to be a viable and scalable alternative to backpropagation for shallow fully-connected neural networks.
We are demonstrating how TAGI matches or exceeds the performance of backpropagation, for training classic deep neural network architectures.
arXiv Detail & Related papers (2021-03-09T14:51:34Z) - Statistically Significant Stopping of Neural Network Training [0.0]
We introduce a statistical significance test to determine if a neural network has stopped learning.
We use this as the basis of a new learning rate scheduler.
arXiv Detail & Related papers (2021-03-01T18:51:16Z) - Gradient Descent on Neural Networks Typically Occurs at the Edge of
Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / text(step size)$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales.
arXiv Detail & Related papers (2021-02-26T22:08:19Z) - Sparse approximation in learning via neural ODEs [0.0]
We study the impact of the final time horizon $T$ in training.
In practical terms, a shorter time-horizon in the training problem can be interpreted as considering a shallower residual neural network.
arXiv Detail & Related papers (2021-02-26T16:23:02Z) - A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via
Adversarial Fine-tuning [90.44219200633286]
We propose a simple yet very effective adversarial fine-tuning approach based on a $textitslow start, fast decay$ learning rate scheduling strategy.
Experimental results show that the proposed adversarial fine-tuning approach outperforms the state-of-the-art methods on CIFAR-10, CIFAR-100 and ImageNet datasets.
arXiv Detail & Related papers (2020-12-25T20:50:15Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Robust Pruning at Initialization [61.30574156442608]
A growing need for smaller, energy-efficient, neural networks to be able to use machine learning applications on devices with limited computational resources.
For Deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned.
arXiv Detail & Related papers (2020-02-19T17:09:50Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.