Adaptive Sketches for Robust Regression with Importance Sampling
- URL: http://arxiv.org/abs/2207.07822v1
- Date: Sat, 16 Jul 2022 03:09:30 GMT
- Title: Adaptive Sketches for Robust Regression with Importance Sampling
- Authors: Sepideh Mahabadi, David P. Woodruff, Samson Zhou
- Abstract summary: We introduce data structures for solving robust regression through stochastic gradient descent (SGD)
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
- Score: 64.75899469557272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce data structures for solving robust regression through stochastic
gradient descent (SGD) by sampling gradients with probability proportional to
their norm, i.e., importance sampling. Although SGD is widely used for large
scale machine learning, it is well-known for possibly experiencing slow
convergence rates due to the high variance from uniform sampling. On the other
hand, importance sampling can significantly decrease the variance but is
usually difficult to implement because computing the sampling probabilities
requires additional passes over the data, in which case standard gradient
descent (GD) could be used instead. In this paper, we introduce an algorithm
that approximately samples $T$ gradients of dimension $d$ from nearly the
optimal importance sampling distribution for a robust regression problem over
$n$ rows. Thus our algorithm effectively runs $T$ steps of SGD with importance
sampling while using sublinear space and just making a single pass over the
data. Our techniques also extend to performing importance sampling for
second-order optimization.
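For intuition, the following is a minimal sketch of the idealized importance-sampling SGD loop that the paper's data structures are designed to emulate, written here for a Huber-type robust regression objective. The loss choice, function name, and hyperparameters are illustrative assumptions, and unlike the paper's single-pass, sublinear-space algorithm, this version recomputes every per-row gradient norm at each step, so it requires linear space and repeated passes over the data.

```python
import numpy as np

def importance_sampling_sgd(A, b, T, step_size=0.1, delta=1.0, seed=None):
    """Idealized importance-sampling SGD for robust (Huber) regression.

    At each step, row i is sampled with probability proportional to the
    norm of its current gradient, and the update is reweighted by
    1 / (n * p_i) so that it stays an unbiased estimate of the full
    gradient. Unlike the paper's data structures, this recomputes all n
    gradient norms every step (linear space, many passes over the data).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        r = A @ x - b                       # residuals of all n rows
        psi = np.clip(r, -delta, delta)     # derivative of the Huber loss
        grads = psi[:, None] * A            # per-row gradients, shape (n, d)
        norms = np.linalg.norm(grads, axis=1)
        if norms.sum() == 0:                # already at a stationary point
            break
        p = norms / norms.sum()             # norm-proportional distribution
        i = rng.choice(n, p=p)
        x -= step_size * grads[i] / (n * p[i])
    return x
```

The reweighting by $1/(n p_i)$ keeps each sampled gradient an unbiased estimate of the average gradient, which is what allows norm-proportional (importance) sampling to reduce variance relative to uniform sampling.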
Related papers
- Efficient Gradient Estimation via Adaptive Sampling and Importance
Sampling [34.50693643119071]
Adaptive or importance sampling reduces noise in gradient estimation.
We present an algorithm that can incorporate existing importance functions into our framework.
We observe improved convergence in classification and regression tasks with minimal computational overhead.
arXiv Detail & Related papers (2023-11-24T13:21:35Z) - Preferential Subsampling for Stochastic Gradient Langevin Dynamics [3.158346511479111]
Stochastic gradient MCMC offers an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data.
The resulting gradient estimator may exhibit a high variance and impact sampler performance.
We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.
arXiv Detail & Related papers (2022-10-28T14:56:18Z) - SIMPLE: A Gradient Estimator for $k$-Subset Sampling [42.38652558807518]
In this work, we fall back to discrete $k$-subset sampling on the forward pass.
We show that our gradient estimator, SIMPLE, exhibits lower bias and variance compared to state-of-the-art estimators.
Empirical results show improved performance on learning to explain and sparse linear regression.
arXiv Detail & Related papers (2022-10-04T22:33:16Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped stochastic gradient descent algorithm and provide an improved analysis under a more nuanced condition on the noise of the stochastic gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Towards Sample-Optimal Compressive Phase Retrieval with Sparse and
Generative Priors [59.33977545294148]
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function.
We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
arXiv Detail & Related papers (2021-06-29T12:49:54Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch (see the sketch after this list).
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z) - Online stochastic gradient descent on non-convex losses from
high-dimensional inference [2.2344764434954256]
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks.
In this setting, the goal is to produce an estimator that has non-trivial correlation with an unknown parameter from the data.
We illustrate our approach by applying it to a set of tasks such as phase retrieval and parameter estimation for generalized linear models.
arXiv Detail & Related papers (2020-03-23T17:34:06Z) - Choosing the Sample with Lowest Loss makes SGD Robust [19.08973384659313]
We propose a simple variant of the vanilla stochastic gradient descent (SGD) method: in each step, first choose a set of $k$ samples, then perform an SGD-like update using the one with the smallest current loss.
Vanilla SGD corresponds to $k=1$; choosing among $k \geq 2$ samples represents a new algorithm that is, however, effectively minimizing a non-convex surrogate loss.
Our theoretical analysis of this idea for ML problems is backed up with small-scale neural network experiments.
arXiv Detail & Related papers (2020-01-10T05:39:17Z)
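As referenced in the ABSGD entry above, here is a minimal sketch of per-sample importance weighting inside a mini-batch, assuming the weights are a softmax of the scaled per-sample losses; the temperature parameter, function name, and the plain (momentum-free) update are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def weighted_minibatch_step(x, batch_grads, batch_losses, lr=0.01, temp=1.0):
    """One SGD-style step with per-sample importance weights in the batch.

    Each sample's gradient is weighted by a softmax of its scaled loss:
    with temp > 0 high-loss samples get more influence, and with
    temp < 0 they are suppressed. This is a sketch of the weighting
    idea only, not the exact ABSGD update.
    """
    z = np.asarray(batch_losses, dtype=float) / temp
    grads = np.asarray(batch_grads, dtype=float)    # shape (batch, dim)
    w = np.exp(z - z.max())                         # numerically stable softmax
    w /= w.sum()
    g = (w[:, None] * grads).sum(axis=0)            # weighted batch gradient
    return x - lr * g
```

In this sketch the sign of the temperature decides whether high-loss samples are emphasized (as in the imbalanced-data setting) or down-weighted (as in the label-noise setting).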