GOALS: Gradient-Only Approximations for Line Searches Towards Robust and
Consistent Training of Deep Neural Networks
- URL: http://arxiv.org/abs/2105.10915v1
- Date: Sun, 23 May 2021 11:21:01 GMT
- Title: GOALS: Gradient-Only Approximations for Line Searches Towards Robust and
Consistent Training of Deep Neural Networks
- Authors: Younghwan Chae, Daniel N. Wilke, Dominic Kafka
- Abstract summary: Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost.
We propose a gradient-only approximation line search (GOALS) with strong convergence characteristics and a defined optimality criterion.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mini-batch sub-sampling (MBSS) is favored in deep neural network training to
reduce the computational cost. Still, it introduces an inherent sampling error,
making the selection of appropriate learning rates challenging. The sampling
errors can manifest as either bias or variance in a line search. Dynamic
MBSS re-samples a mini-batch at every function evaluation. Hence, dynamic MBSS
results in point-wise discontinuous loss functions with smaller bias but larger
variance than static sampled loss functions. However, dynamic MBSS has the
advantage of having larger data throughput during training but requires the
complexity regarding discontinuities to be resolved. This study extends the
gradient-only surrogate (GOS), a line search method using quadratic
approximation models built with only directional derivative information, for
dynamic MBSS loss functions. We propose a gradient-only approximation line
search (GOALS) with strong convergence characteristics and a defined optimality
criterion. We investigate the performance of GOALS by applying it to various
optimizers, including SGD, RMSprop and Adam, on ResNet-18 and EfficientNetB0.
We also compare GOALS against other existing learning rate methods. We
quantify both the best performing and most robust algorithms. For the latter,
we introduce a relative robust criterion that allows us to quantify the
difference between an algorithm and the best performing algorithm for a given
problem. The results show that training a model with the recommended learning
rate for a class of search directions helps to reduce the model errors in
multimodal cases.
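As a rough illustration of the idea described in the abstract, the sketch below fits a quadratic model along a search direction using only two directional derivatives and returns the step at which the model's derivative vanishes. It is not the authors' GOALS or GOS implementation: the function names (`quadratic_surrogate_step`, `dynamic_mbss_derivative`), the 1-D toy loss, the batch size, and the trial step are all assumptions chosen for illustration. The second function re-samples a mini-batch at every derivative evaluation, which mimics dynamic MBSS.

```python
import random


def quadratic_surrogate_step(dd0, dd1, alpha1):
    """Minimizer of a quadratic model built from two directional derivatives.

    dd0 is the directional derivative at step 0, dd1 the one at step alpha1.
    The model's derivative is linear in alpha: dd0 + curvature * alpha.
    Returns None when the direction is not a descent direction or the
    model has non-positive curvature (no minimizer along the direction).
    """
    if alpha1 <= 0:
        raise ValueError("alpha1 must be positive")
    curvature = (dd1 - dd0) / alpha1
    if dd0 >= 0 or curvature <= 0:
        return None
    return -dd0 / curvature


def dynamic_mbss_derivative(data, w0, direction, batch_size, rng):
    """Directional derivative of the toy loss mean(0.5 * (w - x)^2).

    A fresh mini-batch is drawn on every call (dynamic MBSS), so repeated
    calls at the same alpha generally return different values.
    """
    def dd(alpha):
        batch = rng.sample(data, batch_size)
        w = w0 + alpha * direction
        grad = w - sum(batch) / batch_size
        return grad * direction
    return dd


# Hypothetical toy data: 1000 samples around 3.0 (illustrative only).
rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
dd = dynamic_mbss_derivative(data, w0=0.0, direction=1.0, batch_size=32, rng=rng)

# Two noisy derivative evaluations, each on its own mini-batch.
step = quadratic_surrogate_step(dd(0.0), dd(1.0), alpha1=1.0)
print("estimated step:", step)
```

Because each derivative is evaluated on a different mini-batch, the estimated step varies from run to run unless the random seed is fixed. With exact derivatives of a true quadratic, the same formula recovers the minimizer in a single step.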
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
- Bayes-optimal learning of an extensive-width neural network from quadratically many samples [28.315491743569897]
We consider the problem of learning a target function corresponding to a single hidden layer neural network.
We consider the limit where the input dimension and the network width are proportionally large.
arXiv Detail & Related papers (2024-08-07T12:41:56Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Training Artificial Neural Networks by Coordinate Search Algorithm [0.20971479389679332]
We propose an efficient version of the gradient-free Coordinate Search (CS) algorithm for training neural networks.
The proposed algorithm can be used with non-differentiable activation functions and tailored to multi-objective/multi-loss problems.
Finding the optimal values for weights of ANNs is a large-scale optimization problem.
arXiv Detail & Related papers (2024-02-20T01:47:25Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM) as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
- Distributionally Robust Weighted $k$-Nearest Neighbors [21.537952410507483]
Learning a robust classifier from a few samples remains a key challenge in machine learning.
In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors.
We develop an algorithm, Dr.k-NN, that efficiently solves this functional optimization problem.
arXiv Detail & Related papers (2020-06-07T00:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.