Choosing the Sample with Lowest Loss makes SGD Robust
- URL: http://arxiv.org/abs/2001.03316v1
- Date: Fri, 10 Jan 2020 05:39:17 GMT
- Title: Choosing the Sample with Lowest Loss makes SGD Robust
- Authors: Vatsal Shah, Xiaoxia Wu, Sujay Sanghavi
- Abstract summary: We propose a simple variant of the stochastic gradient descent (SGD) method: in each step, first choose a set of k samples, then do an SGD-like update with the one among them that has the smallest current loss.
Vanilla SGD corresponds to k = 1; k >= 2 is a new algorithm that is effectively minimizing a non-convex surrogate loss.
Our theoretical analysis of the robustness of this idea for ML problems that are sums of convex losses is backed up with linear regression and small-scale neural network experiments.
- Score: 19.08973384659313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The presence of outliers can potentially significantly skew the parameters of
machine learning models trained via stochastic gradient descent (SGD). In this
paper we propose a simple variant of the simple SGD method: in each step, first
choose a set of k samples, then from these choose the one with the smallest
current loss, and do an SGD-like update with this chosen sample. Vanilla SGD
corresponds to k = 1, i.e. no choice; k >= 2 represents a new algorithm that is
however effectively minimizing a non-convex surrogate loss. Our main
contribution is a theoretical analysis of the robustness properties of this
idea for ML problems which are sums of convex losses; these are backed up with
linear regression and small-scale neural network experiments.
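The selection rule described in the abstract can be sketched in a few lines. A linear model with squared loss is assumed here purely for illustration; the paper's analysis covers general sums of convex losses:

```python
import numpy as np

def min_k_loss_sgd_step(w, X, y, k, lr, rng):
    """One step of the proposed variant: draw k samples, keep the one
    with the smallest current loss, and take an SGD step on it alone.
    k = 1 recovers vanilla SGD."""
    idx = rng.choice(len(X), size=k, replace=False)
    residuals = X[idx] @ w - y[idx]
    losses = 0.5 * residuals ** 2       # current loss of each candidate
    j = int(np.argmin(losses))          # index of the min-loss candidate
    grad = residuals[j] * X[idx[j]]     # gradient of 0.5 * (x @ w - y)^2
    return w - lr * grad
```

On data containing outliers, the min-loss choice tends to skip the high-loss outlier samples, which is the robustness effect the paper analyzes.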
Related papers
- Escaping mediocrity: how two-layer networks learn hard generalized
linear models with SGD [29.162265194920522]
This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under stochastic gradient descent (SGD).
We show that overparameterization can only enhance convergence by a constant factor within this problem class.
Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
arXiv Detail & Related papers (2023-05-29T14:40:56Z) - Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
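The importance-sampling component can be sketched independently of the paper's sketch-based data structures. The least-squares objective and the norm-based sampling distribution below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def importance_sampled_sgd_step(w, X, y, probs, lr, rng):
    """One SGD step with importance sampling for least squares:
    sample index i with probability probs[i], then reweight its
    gradient by 1 / (n * probs[i]) so the step is an unbiased
    estimate of the full-batch gradient step."""
    n = len(X)
    i = rng.choice(n, p=probs)
    grad_i = (X[i] @ w - y[i]) * X[i]   # per-sample gradient
    return w - lr * grad_i / (n * probs[i])
```

The 1/(n * probs[i]) correction keeps the update unbiased: summing probs[i] * grad_i / (n * probs[i]) over i recovers the average gradient.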
arXiv Detail & Related papers (2022-07-16T03:09:30Z) - A Conditional Randomization Test for Sparse Logistic Regression in
High-Dimension [36.00360315353985]
CRT-logit is an algorithm that combines a variable-distillation step and a decorrelation step.
We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
arXiv Detail & Related papers (2022-05-29T09:37:16Z) - KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of reinforcement learning with a generative model.
Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
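One way to realize "an individual importance weight to each sample in the mini-batch" is a softmax of the per-sample losses. The temperature parameter `lam` and this exact softmax form are our reading of the abstract, not necessarily the paper's recipe:

```python
import numpy as np

def loss_softmax_weights(losses, lam):
    """Hypothetical per-sample mini-batch weights: a softmax of the
    losses with temperature lam. Large |lam| approaches uniform
    weights (plain mini-batch SGD); small positive lam concentrates
    weight on high-loss samples (e.g. rare classes), while negative
    lam down-weights high-loss samples (e.g. noisy labels)."""
    z = np.asarray(losses, dtype=float) / lam
    z -= z.max()              # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

The weighted mini-batch gradient is then the weighted sum of per-sample gradients, which is why the scheme composes with other robust losses at no extra cost.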
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Stochastic Gradient Descent Meets Distribution Regression [0.0]
Stochastic gradient descent (SGD) provides a simple and efficient way to solve a broad range of machine learning problems.
We focus on distribution regression (DR), involving two stages of sampling: Firstly, we regress from probability measures to real-valued responses.
arXiv Detail & Related papers (2020-10-24T09:03:00Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
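The replay idea (store past samples and update on a uniformly drawn stored sample, breaking the Markov-chain correlation of consecutive points) can be sketched as follows. This is an illustrative least-squares sketch, not the paper's exact algorithm:

```python
import numpy as np

def replay_sgd(stream, dim, buffer_size, lr, rng):
    """Least-squares SGD on a correlated (Markovian) data stream with a
    simple experience-replay buffer: store incoming (x, y) pairs and
    update on a uniformly sampled stored pair instead of the newest,
    correlated one."""
    buffer = []
    w = np.zeros(dim)
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) > buffer_size:
            buffer.pop(0)                      # evict the oldest pair
        xs, ys = buffer[rng.integers(len(buffer))]
        w -= lr * (xs @ w - ys) * xs           # SGD step on the replayed pair
    return w
```

Sampling from the buffer decorrelates consecutive updates, which is what lets replay-based SGD beat naive SGD-on-the-stream under slow mixing.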
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.