Proxy Convexity: A Unified Framework for the Analysis of Neural Networks
Trained by Gradient Descent
- URL: http://arxiv.org/abs/2106.13792v1
- Date: Fri, 25 Jun 2021 17:45:00 GMT
- Title: Proxy Convexity: A Unified Framework for the Analysis of Neural Networks
Trained by Gradient Descent
- Authors: Spencer Frei and Quanquan Gu
- Abstract summary: We propose a unified non-convex optimization framework for the analysis of neural network training.
We show that many existing guarantees for neural networks trained by gradient descent can be unified through proxy convexity and proxy Polyak-Lojasiewicz (PL) inequalities.
- Score: 95.94432031144716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the optimization objectives for learning neural networks are highly
non-convex, gradient-based methods have been wildly successful at learning
neural networks in practice. This juxtaposition has led to a number of recent
studies on provable guarantees for neural networks trained by gradient descent.
Unfortunately, the techniques in these works are often highly specific to the
problem studied in each setting, relying on different assumptions on the
distribution, optimization parameters, and network architectures, making it
difficult to generalize across different settings. In this work, we propose a
unified non-convex optimization framework for the analysis of neural network
training. We introduce the notions of proxy convexity and proxy
Polyak-Lojasiewicz (PL) inequalities, which are satisfied if the original
objective function induces a proxy objective function that is implicitly
minimized when using gradient methods. We show that stochastic gradient descent
(SGD) on objectives satisfying proxy convexity or the proxy PL inequality leads
to efficient guarantees for proxy objective functions. We further show that
many existing guarantees for neural networks trained by gradient descent can be
unified through proxy convexity and proxy PL inequalities.
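As a rough schematic of the two conditions named in the abstract (illustrative only; the paper's precise definitions may differ), the standard convexity and PL inequalities for an objective f can be written as below, with the proxy variants replacing the objective on the right-hand side by a proxy objective g (and possibly another function h) that gradient methods on f implicitly minimize:
```latex
% Schematic comparison only; the paper's exact definitions may differ.
% Standard conditions for an objective f:
\begin{align*}
  \langle \nabla f(w),\, w - v \rangle &\ge f(w) - f(v)        && \text{(convexity)}\\
  \tfrac{1}{2}\,\|\nabla f(w)\|^{2}    &\ge \mu\,(f(w) - f^{*}) && \text{(PL inequality)}
\end{align*}
% Proxy variants: the right-hand side involves a proxy objective g
% that gradient methods on f implicitly drive down:
\begin{align*}
  \langle \nabla f(w),\, w - v \rangle &\ge g(w) - h(v)         && \text{(proxy convexity, schematic)}\\
  \tfrac{1}{2}\,\|\nabla f(w)\|^{2}    &\ge \mu\,(g(w) - g^{*}) && \text{(proxy PL inequality, schematic)}
\end{align*}
```
A minimal numerical sketch of the underlying idea, namely that gradient steps on a non-convex objective can implicitly minimize a better-behaved proxy, is given below; the objective f and proxy g are toy choices for illustration and are not taken from the paper.
```python
# Toy illustration (not the paper's setting): gradient steps on a non-convex f
# implicitly decrease a convex proxy g, up to a small neighborhood of its minimum.
import math

def f_grad(w):
    # Non-convex objective f(w) = w^2 + 0.1*sin(5w); its gradient is
    # dominated by the proxy's gradient 2w once |w| is not too small.
    return 2.0 * w + 0.5 * math.cos(5.0 * w)

def proxy(w):
    # Convex proxy objective g(w) = w^2 that the gradient steps on f track.
    return w * w

w, lr = 3.0, 0.05
for step in range(200):
    w -= lr * f_grad(w)  # gradient step on the non-convex objective f
    if step % 50 == 0:
        print(f"step {step:3d}  w = {w:+.4f}  proxy g(w) = {proxy(w):.6f}")
# The proxy value decreases until w reaches a small neighborhood of g's minimizer,
# mirroring the flavor of the proxy-objective guarantees described in the abstract.
```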
Related papers
- Provable Guarantees for Neural Networks via Gradient Feature Learning [15.413985018920018]
This work proposes a unified analysis framework for two-layer networks trained by gradient descent.
The framework is centered around the principle of feature learning from prototypical gradients, and its effectiveness is demonstrated by applications to several problems.
arXiv Detail & Related papers (2023-10-19T01:45:37Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalizations of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective at solving forward and inverse differential equation problems.
However, PINNs can suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ an implicit stochastic gradient descent (ISGD) method to train PINNs and improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - GradDiv: Adversarial Robustness of Randomized Neural Networks via
Gradient Diversity Regularization [3.9157051137215504]
We investigate the effect of adversarial attacks using proxy gradients on randomized neural networks.
We show that proxy gradients are less effective when the gradients are more scattered.
We propose Gradient Diversity (GradDiv) regularizations that minimize the concentration of the gradients to build a robust neural network.
arXiv Detail & Related papers (2021-07-06T06:57:40Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural
Networks: an Exact Characterization of the Optimal Solutions [51.60996023961886]
We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints.
Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces.
arXiv Detail & Related papers (2020-06-10T15:38:30Z) - Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z) - The duality structure gradient descent algorithm: analysis and applications to neural networks [0.0]
We propose an algorithm named duality structure gradient descent (DSGD) that is amenable to non-asymptotic performance analysis.
We empirically demonstrate the behavior of DSGD in several neural network training scenarios.
arXiv Detail & Related papers (2017-08-01T21:24:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.