Learning with Gradient Descent and Weakly Convex Losses
- URL: http://arxiv.org/abs/2101.04968v1
- Date: Wed, 13 Jan 2021 09:58:06 GMT
- Title: Learning with Gradient Descent and Weakly Convex Losses
- Authors: Dominic Richards, Mike Rabbat
- Abstract summary: We study the learning performance of gradient descent when the empirical risk is weakly convex.
In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity.
- Score: 14.145079120746614
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the learning performance of gradient descent when the empirical risk
is weakly convex, namely, the smallest negative eigenvalue of the empirical
risk's Hessian is bounded in magnitude. By showing that this eigenvalue can
control the stability of gradient descent, generalisation error bounds are
proven that hold under a wider range of step sizes compared to previous work.
Out-of-sample guarantees are then achieved by decomposing the test error into
generalisation, optimisation and approximation errors, each of which can be
bounded and traded off with respect to algorithmic parameters, sample size and
magnitude of this eigenvalue. In the case of a two layer neural network, we
demonstrate that the empirical risk can satisfy a notion of local weak
convexity, specifically, the Hessian's smallest eigenvalue during training can
be controlled by the normalisation of the layers, i.e., network scaling. This
allows test error guarantees to then be achieved when the population risk
minimiser satisfies a complexity assumption. By trading off the network
complexity and scaling, insights are gained into the implicit bias of neural
network scaling, which are further supported by experimental findings.
Related papers
- Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints [7.373617024876726]
We show that applying an eventual decay to the learning rate in empirical risk minimization does not hinder the minimization of the empirical risk.
We observe that networks trained with constant-step-size gradient descent (GD) exhibit similar learning properties to those trained with a decaying learning rate.
This suggests that neural networks trained with standard GD may already be highly regular learners.
arXiv Detail & Related papers (2025-02-06T05:43:04Z) - A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs [52.55025869932486]
This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees.
We propose a novel Lipschitz-constrained Functional Gradient GANs learning (Li-CFG) method to stabilize the training of GANs.
We demonstrate that the neighborhood size of the latent vector can be reduced by increasing the norm of the discriminator gradient.
arXiv Detail & Related papers (2025-01-20T02:48:07Z) - The Surprising Harmfulness of Benign Overfitting for Adversarial
Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, this benign overfitting can still be harmful for adversarial robustness.
Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attacks, while benignly overfitted neural networks lead to models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - Fine-grained analysis of non-parametric estimation for pairwise learning [9.676007573960383]
We are concerned with the generalization performance of non-parametric estimation for pairwise learning.
Our results can be used to handle a wide range of pairwise learning problems including ranking, AUC, pairwise regression and metric and similarity learning.
arXiv Detail & Related papers (2023-05-31T08:13:14Z) - On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural
Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty in the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions.
arXiv Detail & Related papers (2021-06-15T18:34:41Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Asymptotic Risk of Overparameterized Likelihood Models: Double Descent
Theory for Deep Neural Networks [12.132641563193582]
We investigate the risk of a general class of overparameterized likelihood models, including deep models.
We demonstrate that several explicit models, such as parallel deep neural networks and ensemble learning, are in agreement with our theory.
arXiv Detail & Related papers (2021-02-28T13:02:08Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with
Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias with reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z) - Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable
Neural Distribution Alignment [52.02794488304448]
We propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows.
We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
arXiv Detail & Related papers (2020-03-26T22:10:04Z)