Better scalability under potentially heavy-tailed gradients
- URL: http://arxiv.org/abs/2006.00784v2
- Date: Tue, 15 Dec 2020 04:45:58 GMT
- Title: Better scalability under potentially heavy-tailed gradients
- Authors: Matthew J. Holland
- Abstract summary: We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed.
The core technique is simple: instead of trying to robustly aggregate gradients at each step, we choose a candidate which does not diverge too far from the majority of cheap sub-processes run for a single pass over partitioned data.
- Score: 9.36599317326032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a scalable alternative to robust gradient descent (RGD) techniques
that can be used when the gradients can be heavy-tailed, though this will be
unknown to the learner. The core technique is simple: instead of trying to
robustly aggregate gradients at each step, which is costly and leads to
sub-optimal dimension dependence in risk bounds, we choose a candidate which
does not diverge too far from the majority of cheap stochastic sub-processes
run for a single pass over partitioned data. In addition to formal guarantees,
we also provide empirical analysis of robustness to perturbations to
experimental conditions, under both sub-Gaussian and heavy-tailed data. The
result is a procedure that is simple to implement, trivial to parallelize,
which keeps the formal strength of RGD methods but scales much better to large
learning problems.
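As a rough illustration of the core technique described in the abstract, here is a minimal sketch (Python, assuming NumPy): the data is partitioned, a cheap single-pass SGD sub-process is run on each chunk (these runs are trivially parallelizable), and the returned candidate is the one that does not diverge far from the majority. The function names and the specific selection rule (smallest median distance to the other candidates) are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def sgd_single_pass(grad_fn, chunk, w0, lr=0.01):
    # Cheap sub-process: plain SGD, a single pass over one data chunk.
    w = w0.copy()
    for z in chunk:
        w -= lr * grad_fn(w, z)
    return w

def robust_select(candidates):
    # Pick the candidate that stays close to the majority: here, the one
    # with the smallest median distance to the other candidates
    # (one common way to formalize "does not diverge too far").
    W = np.stack(candidates)
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    return candidates[int(np.argmin(np.median(dists, axis=1)))]

def divide_and_select(grad_fn, dataset, w0, k=8, lr=0.01):
    # Partition the data, run k independent single-pass sub-processes
    # (embarrassingly parallel), then robustly select a single candidate.
    chunks = np.array_split(dataset, k)
    candidates = [sgd_single_pass(grad_fn, c, w0, lr) for c in chunks]
    return robust_select(candidates)
```

Because each sub-process needs only one pass over its own chunk and the selection step compares just k candidate vectors, the per-step cost of robust gradient aggregation is avoided entirely.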
Related papers
- Dealing with unbounded gradients in stochastic saddle-point optimization [9.983014605039658]
We study the performance of first-order methods for finding saddle points of convex-concave functions.
A notorious challenge is that the gradients can grow arbitrarily large during optimization.
We propose a simple and effective regularization technique that stabilizes the iterates and yields meaningful performance guarantees.
arXiv Detail & Related papers (2024-02-21T16:13:49Z) - Riemannian stochastic optimization methods avoid strict saddle points [68.80251170757647]
We show that the policies under study avoid strict saddle points / submanifolds with probability 1.
This result provides an important sanity check as it shows that, almost always, the limit state of an algorithm can only be a local minimizer.
arXiv Detail & Related papers (2023-11-04T11:12:24Z) - Implicit Manifold Gaussian Process Regression [49.0787777751317]
Gaussian process regression is widely used to provide well-calibrated uncertainty estimates.
It struggles with high-dimensional data, even though such data often lies on an implicit low-dimensional manifold.
In this paper we propose a technique capable of inferring implicit structure directly from data (labeled and unlabeled) in a fully differentiable way.
arXiv Detail & Related papers (2023-10-30T09:52:48Z) - Zeroth-Order Hard-Thresholding: Gradient Error vs. Expansivity [34.84170466506512]
We propose a new stochastic zeroth-order hard-thresholding (SZOHT) algorithm with a general ZO gradient estimator powered by a novel random sampling (a generic sketch of this style of method appears after this list).
We find that the query complexity of SZOHT is independent or weakly dependent on the dimensionality under different settings.
arXiv Detail & Related papers (2022-10-11T09:23:53Z) - Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Better scalability under potentially heavy-tailed feedback [6.903929927172917]
We study scalable alternatives to robust gradient descent (RGD) techniques that can be used when the losses and/or gradients can be heavy-tailed.
We focus computational effort on robustly choosing a strong candidate based on a collection of cheap sub-processes which can be run in parallel.
The exact selection process depends on the convexity of the underlying objective, but in all cases, our selection technique amounts to a robust form of boosting the confidence of weak learners.
arXiv Detail & Related papers (2020-12-14T08:56:04Z) - SSGD: A safe and efficient method of gradient descent [0.5099811144731619]
The gradient descent method plays an important role in solving various optimization problems.
The super gradient descent approach updates parameters by concealing the length of the gradient.
Our algorithm can defend against attacks on the gradient.
arXiv Detail & Related papers (2020-12-03T17:09:20Z) - A Bregman Method for Structure Learning on Sparse Directed Acyclic Graphs [84.7328507118758]
We develop a Bregman proximal gradient method for structure learning.
We measure the impact of curvature against a highly nonlinear iteration.
We test our method on various synthetic and real data sets.
arXiv Detail & Related papers (2020-11-05T11:37:44Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - Improved scalability under heavy tails, without strong convexity [9.36599317326032]
We study a simple algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed.
We show that under heavy-tailed losses, the proposed procedure cannot simply be replaced with naive cross-validation.
We have a scalable method with transparent guarantees, which performs well without prior knowledge of how "convenient" the feedback it receives will be.
arXiv Detail & Related papers (2020-06-02T03:12:17Z)
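For the zeroth-order hard-thresholding entry above (SZOHT), a minimal generic sketch of this style of method is given below, assuming a standard random-direction finite-difference gradient estimator rather than the paper's specific sampling scheme; all names, step sizes, and defaults are illustrative.

```python
import numpy as np

def zo_gradient(f, x, num_dirs=20, mu=1e-3, rng=None):
    # Generic zeroth-order gradient estimate from random unit directions
    # (an assumed estimator; the SZOHT paper uses its own random sampling).
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (f(x + mu * u) - f(x)) / mu * u
    return (d / num_dirs) * g

def hard_threshold(x, k):
    # Keep the k largest-magnitude coordinates, zero out the rest.
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def zo_hard_thresholding(f, x0, k, steps=200, lr=0.1, num_dirs=20, seed=0):
    # Alternate a zeroth-order gradient step with hard thresholding,
    # maintaining a k-sparse iterate throughout.
    rng = np.random.default_rng(seed)
    x = hard_threshold(np.asarray(x0, dtype=float), k)
    for _ in range(steps):
        g = zo_gradient(f, x, num_dirs=num_dirs, rng=rng)
        x = hard_threshold(x - lr * g, k)
    return x
```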
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.