Adaptive Sketches for Robust Regression with Importance Sampling
- URL: http://arxiv.org/abs/2207.07822v1
- Date: Sat, 16 Jul 2022 03:09:30 GMT
- Title: Adaptive Sketches for Robust Regression with Importance Sampling
- Authors: Sepideh Mahabadi, David P. Woodruff, Samson Zhou
- Abstract summary: We introduce data structures for solving robust regression through stochastic gradient descent (SGD)
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
- Score: 64.75899469557272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce data structures for solving robust regression through stochastic
gradient descent (SGD) by sampling gradients with probability proportional to
their norm, i.e., importance sampling. Although SGD is widely used for large
scale machine learning, it is well-known for possibly experiencing slow
convergence rates due to the high variance from uniform sampling. On the other
hand, importance sampling can significantly decrease the variance but is
usually difficult to implement because computing the sampling probabilities
requires additional passes over the data, in which case standard gradient
descent (GD) could be used instead. In this paper, we introduce an algorithm
that approximately samples $T$ gradients of dimension $d$ from nearly the
optimal importance sampling distribution for a robust regression problem over
$n$ rows. Thus our algorithm effectively runs $T$ steps of SGD with importance
sampling while using sublinear space and just making a single pass over the
data. Our techniques also extend to performing importance sampling for
second-order optimization.
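For intuition, the following is a minimal sketch of the idealized importance-sampling SGD loop that the paper's data structures are designed to emulate, written here for a Huber-type robust regression objective. The loss choice, function name, and hyperparameters are illustrative assumptions, and unlike the paper's single-pass, sublinear-space algorithm, this version recomputes every per-row gradient norm at each step, so it requires linear space and repeated passes over the data.

```python
import numpy as np

def importance_sampling_sgd(A, b, T, step_size=0.1, delta=1.0, seed=None):
    """Idealized importance-sampling SGD for robust (Huber) regression.

    At each step, row i is sampled with probability proportional to the
    norm of its current gradient, and the update is reweighted by
    1 / (n * p_i) so that it stays an unbiased estimate of the full
    gradient. Unlike the paper's data structures, this recomputes all n
    gradient norms every step (linear space, many passes over the data).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        r = A @ x - b                       # residuals of all n rows
        psi = np.clip(r, -delta, delta)     # derivative of the Huber loss
        grads = psi[:, None] * A            # per-row gradients, shape (n, d)
        norms = np.linalg.norm(grads, axis=1)
        if norms.sum() == 0:                # already at a stationary point
            break
        p = norms / norms.sum()             # norm-proportional distribution
        i = rng.choice(n, p=p)
        x -= step_size * grads[i] / (n * p[i])
    return x
```

The reweighting by $1/(n p_i)$ keeps each sampled gradient an unbiased estimate of the average gradient, which is what allows norm-proportional (importance) sampling to reduce variance relative to uniform sampling.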
Related papers
- Efficient Gradient Estimation via Adaptive Sampling and Importance
Sampling [34.50693643119071]
Adaptive or importance sampling reduces noise in gradient estimation.
We present an algorithm that can incorporate existing importance functions into our framework.
We observe improved convergence in classification and regression tasks with minimal computational overhead.
arXiv Detail & Related papers (2023-11-24T13:21:35Z) - Preferential Subsampling for Stochastic Gradient Langevin Dynamics [3.158346511479111]
Stochastic gradient MCMC offers an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data.
The resulting gradient estimator may exhibit a high variance and impact sampler performance.
We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.
arXiv Detail & Related papers (2022-10-28T14:56:18Z) - SIMPLE: A Gradient Estimator for $k$-Subset Sampling [42.38652558807518]
In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
arXiv Detail & Related papers (2022-10-04T22:33:16Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped stochastic gradient descent algorithm and provide an improved analysis under a more nuanced condition on the noise of the stochastic gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Towards Sample-Optimal Compressive Phase Retrieval with Sparse and
Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function.
We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch (see the sketch after this list).
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z) - Online stochastic gradient descent on non-convex losses from
high-dimensional inference [2.2344764434954256]
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks.
In this setting, the goal is to produce an estimator that has non-trivial correlation with an unknown parameter from the data.
We illustrate our approach by applying it to a set of tasks such as phase retrieval and parameter estimation for generalized linear models.
arXiv Detail & Related papers (2020-03-23T17:34:06Z) - Choosing the Sample with Lowest Loss makes SGD Robust [19.08973384659313]
We propose a simple variant of the vanilla stochastic gradient descent (SGD) method: in each step, first choose a set of $k$ samples, then perform an SGD-like update using the one with the smallest current loss.
Vanilla SGD corresponds to $k=1$; choosing among $k \geq 2$ samples represents a new algorithm that is, however, effectively minimizing a non-convex surrogate loss.
Our theoretical analysis of this idea for ML problems is backed up with small-scale neural network experiments.
arXiv Detail & Related papers (2020-01-10T05:39:17Z)
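As referenced in the ABSGD entry above, here is a minimal sketch of per-sample importance weighting inside a mini-batch, assuming the weights are a softmax of the scaled per-sample losses; the temperature parameter, function name, and the plain (momentum-free) update are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def weighted_minibatch_step(x, batch_grads, batch_losses, lr=0.01, temp=1.0):
    """One SGD-style step with per-sample importance weights in the batch.

    Each sample's gradient is weighted by a softmax of its scaled loss:
    with temp > 0 high-loss samples get more influence, and with
    temp < 0 they are suppressed. This is a sketch of the weighting
    idea only, not the exact ABSGD update.
    """
    z = np.asarray(batch_losses, dtype=float) / temp
    grads = np.asarray(batch_grads, dtype=float)    # shape (batch, dim)
    w = np.exp(z - z.max())                         # numerically stable softmax
    w /= w.sum()
    g = (w[:, None] * grads).sum(axis=0)            # weighted batch gradient
    return x - lr * g
```

In this sketch the sign of the temperature decides whether high-loss samples are emphasized (as in the imbalanced-data setting) or down-weighted (as in the label-noise setting).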