Learning with Gradient Descent and Weakly Convex Losses
- URL: http://arxiv.org/abs/2101.04968v1
- Date: Wed, 13 Jan 2021 09:58:06 GMT
- Title: Learning with Gradient Descent and Weakly Convex Losses
- Authors: Dominic Richards, Mike Rabbat
- Abstract summary: We study the learning performance of gradient descent when the empirical risk is weakly convex.
In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity.
- Score: 14.145079120746614
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the learning performance of gradient descent when the empirical risk
is weakly convex, namely, the smallest negative eigenvalue of the empirical
risk's Hessian is bounded in magnitude. By showing that this eigenvalue can
control the stability of gradient descent, generalisation error bounds are
proven that hold under a wider range of step sizes compared to previous work.
Out-of-sample guarantees are then achieved by decomposing the test error into
generalisation, optimisation and approximation errors, each of which can be
bounded and traded off with respect to algorithmic parameters, sample size and
magnitude of this eigenvalue. In the case of a two layer neural network, we
demonstrate that the empirical risk can satisfy a notion of local weak
convexity, specifically, the Hessian's smallest eigenvalue during training can
be controlled by the normalisation of the layers, i.e., network scaling. This
allows test error guarantees to then be achieved when the population risk
minimiser satisfies a complexity assumption. By trading off the network
complexity and scaling, insights are gained into the implicit bias of neural
network scaling, which are further supported by experimental findings.
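To fix ideas, here is one standard way to write the two ingredients the abstract describes; the notation ($\widehat{R}$ for the empirical risk, $R$ for the population risk, $w^\star$ for a comparator) is an assumption of this sketch, not taken verbatim from the paper.

```latex
% Weak convexity: the most negative Hessian eigenvalue is bounded in magnitude.
\lambda_{\min}\!\big(\nabla^{2}\widehat{R}(w)\big) \;\ge\; -\epsilon
\qquad \text{for all } w .

% A standard test-error decomposition around a comparator w^*,
% using E[ \widehat{R}(w^*) ] = R(w^*):
\mathbb{E}\, R(\widehat{w})
  = \underbrace{\mathbb{E}\big[R(\widehat{w}) - \widehat{R}(\widehat{w})\big]}_{\text{generalisation}}
  + \underbrace{\mathbb{E}\big[\widehat{R}(\widehat{w}) - \widehat{R}(w^\star)\big]}_{\text{optimisation}}
  + \underbrace{R(w^\star)}_{\text{approximation}}
```

And a minimal, hypothetical JAX sketch (not the authors' code; the 1/m scaling, tanh activation, synthetic data, and step size are illustrative assumptions) of checking the weak convexity parameter, i.e. the smallest Hessian eigenvalue of the empirical risk, after gradient descent on a scaled two-layer network:

```python
# Hypothetical sketch, not the paper's code: train a scaled two-layer network
# with plain gradient descent and check the smallest Hessian eigenvalue of the
# empirical risk (local weak convexity) at the final iterate.
import jax
import jax.numpy as jnp

d, m, n = 3, 8, 20                    # input dim, width, sample size (toy values)
scale = 1.0 / m                       # network scaling: larger width => smaller scale

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
X = jax.random.normal(k1, (n, d))     # synthetic inputs
y = jax.random.normal(k2, (n,))       # synthetic targets
theta = jnp.concatenate([jax.random.normal(k3, (m * d,)),
                         jax.random.normal(k4, (m,))])

def predict(theta, x):
    W1 = theta[:m * d].reshape(m, d)          # first layer
    w2 = theta[m * d:]                        # second layer
    return scale * w2 @ jnp.tanh(W1 @ x)      # smooth activation, so the Hessian exists

def risk(theta):                              # empirical risk (squared loss)
    preds = jax.vmap(lambda x: predict(theta, x))(X)
    return 0.5 * jnp.mean((preds - y) ** 2)

grad_risk = jax.jit(jax.grad(risk))
for _ in range(200):                          # plain gradient descent
    theta = theta - 0.1 * grad_risk(theta)

# Weak convexity at theta: lambda_min(Hessian) >= -eps for a small eps.
lam_min = jnp.linalg.eigvalsh(jax.hessian(risk)(theta))[0]
print("smallest Hessian eigenvalue:", float(lam_min))
```

Increasing m (and thus shrinking the 1/m scale) in this toy should push the smallest eigenvalue towards zero, which is the qualitative effect of network scaling the abstract points to.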
Related papers
- The Surprising Harmfulness of Benign Overfitting for Adversarial
Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, benign overfitting can still be harmful when the out-of-sample data are subject to adversarial manipulation.
Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., a human) is robust against adversarial attack, while benignly overfitted neural networks yield models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - Fine-grained analysis of non-parametric estimation for pairwise learning [9.676007573960383]
We are concerned with the generalization performance of non-parametric estimation for pairwise learning.
Our results can be used to handle a wide range of pairwise learning problems including ranking, AUC, pairwise regression and metric and similarity learning.
arXiv Detail & Related papers (2023-05-31T08:13:14Z) - On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural
Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training in the presence of uncertainty in the gradient estimate.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z) - Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural
Networks [89.28881869440433]
This paper provides the first theoretical characterization of joint edge-model sparse learning for graph neural networks (GNNs).
It proves analytically that both sampling important nodes and pruning the lowest-magnitude neurons can reduce the sample complexity and improve convergence without compromising the test accuracy.
arXiv Detail & Related papers (2023-02-06T16:54:20Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions.
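A minimal, hypothetical JAX sketch of that claim (the toy ReLU network and random data are assumptions, not the paper's setup): points sharing an activation pattern belong to one linear subfunction, and averaging per-subfunction errors weighted by how often each pattern occurs recovers the full empirical error.

```python
# Hypothetical illustration: the empirical error of a ReLU network is an
# expectation over its subfunctions (regions sharing one activation pattern).
import jax
import jax.numpy as jnp

d, m, n = 2, 4, 200
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
W1 = jax.random.normal(k1, (m, d))              # hidden layer
w2 = jax.random.normal(k2, (m,))                # output layer
X = jax.random.normal(k3, (n, d))               # toy inputs
y = jax.random.normal(k4, (n,))                 # toy targets

pre = X @ W1.T                                  # pre-activations, shape (n, m)
pattern = pre > 0                               # each row: a point's activation pattern
err = (jax.nn.relu(pre) @ w2 - y) ** 2          # per-point squared error

# Encode each pattern as an integer so points can be grouped by subfunction.
codes = pattern.astype(jnp.int32) @ (2 ** jnp.arange(m))
total = 0.0
for c in jnp.unique(codes):
    mask = codes == c
    # P(subfunction) * (that subfunction's own empirical error)
    total += mask.mean() * err[mask].mean()

print(float(total), float(err.mean()))          # agree up to floating point
```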
arXiv Detail & Related papers (2021-06-15T18:34:41Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Asymptotic Risk of Overparameterized Likelihood Models: Double Descent
Theory for Deep Neural Networks [12.132641563193582]
We investigate the risk of a general class of overparameterized likelihood models, including deep models.
We demonstrate that several explicit models, such as parallel deep neural networks and ensemble learning, are in agreement with our theory.
arXiv Detail & Related papers (2021-02-28T13:02:08Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with
Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization than the unbiasedness of the risk estimator.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory
Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal descent and adaptive training regimens for smooth, non-convex deep neural networks.
We validate our claims using the VGG and ResNet architectures on the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z) - Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable
Neural Distribution Alignment [52.02794488304448]
We propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows.
We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
arXiv Detail & Related papers (2020-03-26T22:10:04Z)