Gradient Equilibrium in Online Learning: Theory and Applications
- URL: http://arxiv.org/abs/2501.08330v3
- Date: Tue, 18 Feb 2025 16:39:54 GMT
- Title: Gradient Equilibrium in Online Learning: Theory and Applications
- Authors: Anastasios N. Angelopoulos, Michael I. Jordan, Ryan J. Tibshirani
- Abstract summary: Gradient equilibrium is achieved by standard online learning methods.
Gradient equilibrium translates into an interpretable and meaningful property in online prediction problems.
We show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions.
- Score: 56.02856551198923
- Abstract: We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
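The abstract describes a debiasing scheme built on post hoc online gradient updates with a constant step size: tracking a scalar offset to a black-box predictor so that the running average of loss gradients, and hence the average residual, tends to zero. The following is a minimal sketch of that idea, not the authors' code; the bias magnitude, step size, and noise model are illustrative assumptions.

```python
import random

def debias_online(preds, targets, step=0.1):
    """Post hoc debiasing sketch: online gradient descent on a scalar offset
    theta with a CONSTANT step size (no decay), using the squared loss
    l_t(theta) = (y_t - (f_t + theta))^2 / 2. Gradient equilibrium means the
    average gradient -> 0, which here means the average residual -> 0, i.e.
    the debiased predictions are unbiased on average."""
    theta = 0.0
    debiased = []
    grad_sum = 0.0
    for f, y in zip(preds, targets):
        debiased.append(f + theta)          # predict before seeing y_t
        grad = -(y - (f + theta))           # d/dtheta of the squared loss
        grad_sum += grad
        theta -= step * grad                # constant step size
    return debiased, grad_sum / len(preds)

random.seed(0)
# Hypothetical black-box predictor with a systematic bias of +2 plus noise.
targets = [random.gauss(0.0, 1.0) for _ in range(5000)]
preds = [y + 2.0 + random.gauss(0.0, 0.5) for y in targets]

debiased, avg_grad = debias_online(preds, targets)
avg_resid = sum(y - d for y, d in zip(targets, debiased)) / len(targets)
print(avg_grad, avg_resid)  # both close to zero despite the +2 bias
```

Because the update telescopes (theta changes by step times the residual each round), the average gradient is exactly (theta_0 - theta_final) / (step * n), so it vanishes as long as theta stays bounded, with no decaying step size needed. This mirrors the paper's point that gradient equilibrium holds under constant step sizes where no-regret guarantees would typically require decay.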
Related papers
- Parallel Momentum Methods Under Biased Gradient Estimations [11.074080383657453]
Parallel gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes.
However, obtaining unbiased gradient estimates, which have been the focus of most theoretical research, is challenging in many machine learning applications.
In this paper we work out the implications for special cases where gradient estimates are biased, e.g. in meta-learning and when gradients are compressed or clipped.
arXiv Detail & Related papers (2024-02-29T18:03:03Z) - Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training [35.090598013305275]
Binarization of neural networks is a dominant paradigm in neural network compression.
We propose the Rectified Straight Through Estimator (ReSTE) to balance the estimating error and the gradient stability.
ReSTE has excellent performance and surpasses state-of-the-art methods without any auxiliary modules or losses.
arXiv Detail & Related papers (2023-08-13T05:38:47Z) - The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z) - The Equalization Losses: Gradient-Driven Training for Long-tailed Object Recognition [84.51875325962061]
We propose a gradient-driven training mechanism to tackle the long-tail problem.
We introduce a new family of gradient-driven loss functions, namely equalization losses.
Our method consistently outperforms the baseline models.
arXiv Detail & Related papers (2022-10-11T16:00:36Z) - On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation [0.0]
We propose a new rounding scheme that trades the zero bias property with a larger probability to preserve small gradients.
Our method yields constant rounding bias that, at each iteration, lies in a descent direction.
arXiv Detail & Related papers (2022-02-24T18:18:20Z) - Coupled Gradient Estimators for Discrete Latent Variables [41.428359609999326]
Training models with discrete latent variables is challenging due to the high variance of unbiased gradient estimators.
We introduce a novel derivation of their estimator based on importance sampling and statistical couplings.
We show that our proposed categorical gradient estimators provide state-of-the-art performance.
arXiv Detail & Related papers (2021-06-15T11:28:44Z) - Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization.
arXiv Detail & Related papers (2020-09-23T14:17:53Z) - Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z) - A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.