Combining resampling and reweighting for faithful stochastic
optimization
- URL: http://arxiv.org/abs/2105.14694v1
- Date: Mon, 31 May 2021 04:21:25 GMT
- Title: Combining resampling and reweighting for faithful stochastic
optimization
- Authors: Jing An, Lexing Ying
- Abstract summary: When the loss function is a sum of multiple terms, a popular method is stochastic gradient descent.
We show that the difference in the Lipschitz constants of the multiple terms in the loss function causes stochastic gradient descent to experience different variances at different minimums.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine learning and data science tasks require solving non-convex
optimization problems. When the loss function is a sum of multiple terms, a
popular method is stochastic gradient descent. Viewed as a process for sampling
the loss function landscape, stochastic gradient descent is known to prefer
flat local minimums. Though this is desired for certain optimization problems
such as in deep learning, it causes issues when the goal is to find the global
minimum, especially if the global minimum resides in a sharp valley.
Illustrated with a simple motivating example, we show that the fundamental
reason is that the difference in the Lipschitz constants of multiple terms in
the loss function causes stochastic gradient descent to experience different
variances at different minimums. In order to mitigate this effect and perform
faithful optimization, we propose a combined resampling-reweighting scheme to
balance the variance at local minimums and extend to general loss functions. We
also explain from the stochastic asymptotics perspective how the proposed
scheme is more likely to select the true global minimum when compared with the
vanilla stochastic gradient descent. Experiments from robust statistics,
computational chemistry, and neural network training are provided to
demonstrate the theoretical findings.
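As a rough illustration of the mechanism (a minimal sketch, not the authors' exact scheme: the sampling distribution `p` below is an assumption standing in for the paper's variance-balancing choice), an unbiased resampling-reweighting variant of SGD can be written as:

```python
import numpy as np

def resample_reweight_sgd(grads, x0, p, lr=0.01, steps=2000, seed=0):
    """SGD where term i is drawn with probability p[i] and its gradient
    is reweighted by 1/(n*p[i]), so the expected update equals the full
    gradient of (1/n) * sum_i f_i(x) for any choice of p."""
    rng = np.random.default_rng(seed)
    n, x = len(grads), float(x0)
    for _ in range(steps):
        i = rng.choice(n, p=p)
        x -= lr * grads[i](x) / (n * p[i])  # reweighting keeps it unbiased
    return x

# Toy loss with one sharp and one flat term (different Lipschitz constants).
grads = [lambda x: 100.0 * (x - 1.0),  # sharp term, minimum at x = 1
         lambda x: 0.1 * (x + 1.0)]    # flat term, minimum at x = -1
p = np.array([0.9, 0.1])               # oversample the sharp term
print(resample_reweight_sgd(grads, x0=0.0, p=p))  # close to 1: the sharp term dominates the averaged loss
```

Since the reweighting cancels the sampling bias, changing `p` leaves the expected update untouched and only reshapes the gradient noise, which is exactly the variance-balancing knob the abstract describes.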
Related papers
- An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can be expressed as the composition of a square and its real-valued square root.
We show how to apply the Gauss-Newton method or the Levenberg-Marquardt method to minimize the average of smooth but possibly non-convex functions.
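For background, the classical Levenberg-Marquardt step these adaptive stepsizes build on, for a sum-of-squares loss f(x) = 0.5*||r(x)||^2, can be sketched as follows (a generic textbook version, not the paper's stepsize rule):

```python
import numpy as np

def levenberg_marquardt_step(r, J, x, lam=1e-3):
    """One damped Gauss-Newton (Levenberg-Marquardt) step for
    f(x) = 0.5 * ||r(x)||^2; lam -> 0 recovers pure Gauss-Newton,
    large lam behaves like a small gradient-descent step."""
    Jx, rx = J(x), r(x)
    A = Jx.T @ Jx + lam * np.eye(x.size)
    return x - np.linalg.solve(A, Jx.T @ rx)

# Fit y = a * exp(b * t) to synthetic data.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * t)
r = lambda x: x[0] * np.exp(x[1] * t) - y                      # residuals
J = lambda x: np.stack([np.exp(x[1] * t),                      # d r / d a
                        x[0] * t * np.exp(x[1] * t)], axis=1)  # d r / d b
x = np.array([1.0, 1.0])
for _ in range(50):
    x = levenberg_marquardt_step(r, J, x)
print(x)  # approaches the true parameters [2.0, 1.5]
```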
arXiv Detail & Related papers (2024-07-05T08:53:06Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
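For orientation, the simultaneous gradient descent-ascent update analyzed in such settings looks as follows on a finite-dimensional toy saddle problem (the paper's mean-field, overparameterized setting is not reproduced here):

```python
import numpy as np

def gradient_descent_ascent(grad_x, grad_y, x, y, lr=0.1, steps=200):
    """Simultaneous GDA for min_x max_y f(x, y): descend in x, ascend in y."""
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - lr * gx, y + lr * gy
    return x, y

# f(x, y) = 0.5*x^2 + x*y - 0.5*y^2 has its unique saddle point at (0, 0).
grad_x = lambda x, y: x + y   # df/dx
grad_y = lambda x, y: x - y   # df/dy
print(gradient_descent_ascent(grad_x, grad_y, 1.0, 1.0))  # -> near (0, 0)
```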
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z)
- Training trajectories, mini-batch losses and the curious role of the learning rate [13.848916053916618]
Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning.
We propose a simple model and a geometric interpretation that allow us to analyze the relationship between the gradients of mini-batches and the full batch.
In particular, a very low loss value can be reached in just one step of gradient descent with a large enough learning rate.
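A quick numerical probe of that relationship (a toy linear-regression experiment of our own, not code from the paper) is to track how well mini-batch gradients align with the full-batch gradient along the trajectory:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 10))                  # 256 examples, 10 features
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
w = np.zeros(10)

full_grad = lambda w: A.T @ (A @ w - b) / len(b)
for step in range(5):
    idx = rng.choice(len(b), size=16, replace=False)        # mini-batch
    g_mb = A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)      # batch gradient
    print(step, "cos(mini, full) =", round(cosine(g_mb, full_grad(w)), 3))
    w -= 0.05 * g_mb                                        # SGD step
```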
arXiv Detail & Related papers (2023-01-05T21:58:46Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
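The weight-space forward-gradient estimator this line of work starts from can be sketched as below; a central difference stands in for the forward-mode Jacobian-vector product, and the paper's activation perturbations and local losses are not reproduced:

```python
import numpy as np

def forward_gradient(f, w, rng, eps=1e-5):
    """Forward gradient g = (grad f . v) * v with v ~ N(0, I); it is an
    unbiased estimate of grad f because E[v v^T] = I. The directional
    derivative is approximated here by a central difference."""
    v = rng.normal(size=w.shape)
    dfv = (f(w + eps * v) - f(w - eps * v)) / (2 * eps)
    return dfv * v

# Sanity check: averaged estimates approach the true gradient.
f = lambda w: 0.5 * np.sum(w ** 2)        # true gradient is w itself
w = np.array([1.0, -2.0, 3.0])
rng = np.random.default_rng(0)
est = np.mean([forward_gradient(f, w, rng) for _ in range(20000)], axis=0)
print(est)  # close to [1, -2, 3]
```

The estimator's variance grows with the dimension of the perturbed space, which is why perturbing (lower-dimensional) activations instead of weights can reduce it.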
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Gradient flows and randomised thresholding: sparse inversion and classification [0.0]
Sparse inversion and classification problems are ubiquitous in modern data science and imaging.
In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg-Landau energy.
Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems.
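For contrast, the standard proximal alternative to plain (sub)gradient descent on an l1-regularized least-squares problem is iterative soft thresholding (ISTA); the paper's randomised thresholding is a different construction, so this is only the familiar baseline:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, b, lam, steps=2000):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by proximal gradient."""
    lr = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = soft_threshold(x - lr * A.T @ (A @ x - b), lr * lam)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))
x_true = np.zeros(100)
x_true[:5] = 3.0                               # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=50)
x_hat = ista(A, b, lam=0.5)
print(np.sort(np.argsort(-np.abs(x_hat))[:5]))  # expect [0 1 2 3 4]
```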
arXiv Detail & Related papers (2022-03-22T09:21:14Z)
- Questions for Flat-Minima Optimization of Modern Neural Networks [28.12506392321345]
Two methods for finding flat minima stand out: 1. averaging methods (i.e., Stochastic Weight Averaging, SWA) and 2. minimax methods (i.e., Sharpness-Aware Minimization, SAM).
We investigate the loss surfaces through a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks.
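As a concrete reference for the minimax family, one SAM step (the generic two-stage update, not the benchmarking code of the paper) ascends to a nearby worst-case point before descending:

```python
import numpy as np

def sam_step(loss_grad, w, lr=0.1, rho=0.05):
    """Sharpness-Aware Minimization: take the gradient at the worst-case
    point within an L2 ball of radius rho, then descend with it."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # inner ascent step
    return w - lr * loss_grad(w + eps)            # outer descent step

# Toy 1D loss with a sharp minimum at w = -1 and a flat one at w = +1.
loss_grad = lambda w: np.where(w < 0, 50.0 * (w + 1.0), 0.5 * (w - 1.0))
w = np.array([-1.2])
for _ in range(100):
    w = sam_step(loss_grad, w)
print(w)  # settles near the flat minimum at w = 1
```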
arXiv Detail & Related papers (2022-02-01T18:56:15Z)
- Why Do Local Methods Solve Nonconvex Problems? [54.284687261929115]
Non-convex optimization is ubiquitous in modern machine learning.
We rigorously formalize this phenomenon for concrete instances of machine learning problems.
We hypothesize a unified explanation for it.
arXiv Detail & Related papers (2021-03-24T19:34:11Z)
- Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
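The basic two-point estimator behind such ZO methods (without the paper's coordinate importance sampling) builds a gradient surrogate from function queries alone; each estimate below costs 2*num_dirs evaluations of f:

```python
import numpy as np

def zo_gradient(f, x, rng, mu=1e-3, num_dirs=10):
    """Two-point zeroth-order gradient estimate: average of
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over Gaussian directions u."""
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

# ZO gradient descent on a quadratic, using no first-order information.
f = lambda x: np.sum((x - 1.0) ** 2)
x, rng = np.zeros(5), np.random.default_rng(0)
for _ in range(300):
    x -= 0.05 * zo_gradient(f, x, rng)
print(x)  # approaches the minimizer (1, ..., 1)
```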
arXiv Detail & Related papers (2020-12-21T17:29:58Z)
- Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence [120.9336529957224]
In this paper, we study the non-convex strongly-concave setting of gradient-free minimax optimization.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for non-convex non-concave min-max problems.
Our experiments show that the predicted advantage of adaptive gradient algorithms over their non-adaptive counterparts in GAN training can be observed empirically.
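The optimism trick itself, shown here as plain optimistic gradient descent on a bilinear game rather than the Adagrad variant analyzed in the paper, extrapolates with the previous gradient:

```python
import numpy as np

def optimistic_gd(grad, z, lr=0.2, steps=500):
    """Optimistic update z <- z - lr*(2*g_t - g_{t-1}): the previous
    gradient serves as a cheap prediction of the next one."""
    g_prev = grad(z)
    for _ in range(steps):
        g = grad(z)
        z = z - lr * (2 * g - g_prev)
        g_prev = g
    return z

# Bilinear game f(x, y) = x*y: the simultaneous-play vector field is
# (y, -x). Plain simultaneous GDA spirals away from the equilibrium
# (0, 0); the optimistic update converges to it.
grad = lambda z: np.array([z[1], -z[0]])
print(optimistic_gd(grad, np.array([1.0, 1.0])))  # -> near (0, 0)
```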
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.