Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods
- URL: http://arxiv.org/abs/2205.14818v1
- Date: Mon, 30 May 2022 02:51:36 GMT
- Title: Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods
- Authors: Shunta Akiyama, Taiji Suzuki
- Abstract summary: We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any linear estimator, including kernel methods.
- Score: 58.44819696433327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep learning has outperformed other methods for various tasks,
theoretical frameworks that explain this success have not been fully established.
To address this issue, we investigate the excess risk of two-layer ReLU neural
networks in a teacher-student regression model, in which a student network
learns an unknown teacher network through its outputs. In particular, we consider
a student network that has the same width as the teacher network and is
trained in two phases: first by noisy gradient descent and then by vanilla
gradient descent. Our result shows that the student network provably reaches a
near-global optimal solution and outperforms any kernel method estimator (more
generally, any linear estimator), including the neural tangent kernel approach, the
random feature model, and other kernel methods, in the sense of the minimax optimal
rate. The key property behind this superiority is the non-convexity of the
neural network models. Even though the loss landscape is highly non-convex, the
student network adaptively learns the teacher neurons.
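As a rough illustration of the procedure described in the abstract, here is a minimal sketch (not the authors' code) of the teacher-student setup with two-phase training: a student two-layer ReLU network of the same width as the teacher is first trained by gradient descent with Gaussian noise added to the gradients, then by vanilla gradient descent. The widths, sample size, step sizes, noise scale, and which parameters are updated in each phase are illustrative assumptions.

```python
# Hypothetical sketch of the two-phase teacher-student setup described in the
# abstract; widths, step sizes, noise scale, and sample size are illustrative.
import torch

torch.manual_seed(0)
d, m, n = 10, 5, 2000          # input dim, width (teacher == student), samples

# Fixed teacher: f*(x) = a* . relu(W* x)
W_t, a_t = torch.randn(m, d), torch.randn(m)
X = torch.randn(n, d)
y = torch.relu(X @ W_t.T) @ a_t + 0.1 * torch.randn(n)   # noisy teacher outputs

# Student with the same width m as the teacher
W = torch.randn(m, d, requires_grad=True)
a = torch.randn(m, requires_grad=True)

def loss_fn():
    pred = torch.relu(X @ W.T) @ a
    return ((pred - y) ** 2).mean()

# Phase 1: noisy gradient descent (Gaussian noise added to each gradient step)
lr, noise = 1e-2, 1e-2
for _ in range(2000):
    gW, ga = torch.autograd.grad(loss_fn(), (W, a))
    with torch.no_grad():
        W -= lr * (gW + noise * torch.randn_like(W))
        a -= lr * (ga + noise * torch.randn_like(a))

# Phase 2: vanilla gradient descent from the phase-1 iterate
for _ in range(2000):
    gW, ga = torch.autograd.grad(loss_fn(), (W, a))
    with torch.no_grad():
        W -= lr * gW
        a -= lr * ga

print("final training loss:", loss_fn().item())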
Related papers
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural networks and their training, applicable to neural networks of arbitrary width, depth, and topology.
We also present a novel, exact representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z)
- A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models [13.283281356356161]
We review the literature on statistical theories of neural networks from three perspectives.
Results on excess risks for neural networks are reviewed.
Papers that attempt to answer "how the neural network finds the solution that can generalize well on unseen data" are reviewed.
arXiv Detail & Related papers (2024-01-14T02:30:19Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
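A toy sketch of the training setup in this entry, not the paper's construction: a two-layer ReLU convolutional network trained by full-batch gradient descent on binary labels, a fraction of which are flipped. The data model, the trainable linear readout, and all hyperparameters are assumptions.

```python
# Illustrative setup only (not the paper's construction): a two-layer ReLU CNN
# trained by gradient descent on binary labels, a fraction of which are flipped.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, flip_prob = 512, 32, 0.1

X = torch.randn(n, 1, d)                       # toy 1-D "images", one channel
y_clean = torch.sign(X.mean(dim=(1, 2)))       # toy ground-truth labels in {-1, +1}
flip = (torch.rand(n) < flip_prob).float()
y = y_clean * (1 - 2 * flip)                   # label-flipping noise

model = nn.Sequential(                          # two-layer ReLU CNN
    nn.Conv1d(1, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1),
)

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(500):                            # full-batch gradient descent
    opt.zero_grad()
    margin = y * model(X).squeeze(1)
    loss = torch.log1p(torch.exp(-margin)).mean()   # logistic loss
    loss.backward()
    opt.step()

noisy_err = (model(X).squeeze(1).sign() != y).float().mean()
clean_err = (model(X).squeeze(1).sign() != y_clean).float().mean()
print(f"error on noisy labels: {noisy_err:.3f}, error on clean labels: {clean_err:.3f}")
```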
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Multi-Grade Deep Learning [3.0069322256338906]
The current deep learning model is a single-grade neural network.
We propose a multi-grade learning model that enables us to learn deep neural networks much more effectively and efficiently.
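One plausible reading of the multi-grade idea, sketched below with details that are assumptions rather than taken from the paper: each grade trains a shallow block on top of the frozen features produced by earlier grades, fitting the residual that remains.

```python
# Hypothetical grade-by-grade training (the paper's exact composition rule may
# differ): each grade trains a shallow block on frozen earlier features and
# fits the residual left by the previous grades.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 8)
y = torch.sin(X.sum(dim=1, keepdim=True))      # toy regression target

feat, residual = X, y.clone()
heads = []
for grade in range(3):
    block = nn.Sequential(nn.Linear(feat.shape[1], 32), nn.ReLU())
    head = nn.Linear(32, 1)
    opt = torch.optim.Adam(list(block.parameters()) + list(head.parameters()), lr=1e-2)
    for _ in range(300):
        loss = ((head(block(feat)) - residual) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                      # freeze this grade's features
        feat = block(feat)
        residual = residual - head(feat)
    heads.append(head)
    print(f"grade {grade}: remaining residual MSE {(residual ** 2).mean().item():.4f}")
# The overall predictor is the sum of the grade heads applied to successive features.
```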
arXiv Detail & Related papers (2023-02-01T00:09:56Z)
- NFT-K: Non-Fungible Tangent Kernels [23.93508901712177]
We develop a new network as a combination of multiple neural tangent kernels, one to model each layer of the deep neural network individually.
We demonstrate the interpretability of this model on two datasets, showing that the multiple kernels model elucidates the interplay between the layers and predictions.
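A hedged sketch of the general per-layer tangent-kernel idea behind this entry: the empirical NTK of a finite network is a sum over layers of Gram matrices of per-layer parameter gradients, and these layer kernels can be weighted separately. The combination weights and the structure below are assumptions, not the NFT-K definition from the paper.

```python
# Sketch of per-layer empirical tangent kernels; the combination weights and
# the exact NFT-K construction are assumptions, not taken from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                    nn.Linear(16, 16), nn.ReLU(),
                    nn.Linear(16, 1))
X = torch.randn(8, 4)

# Per-sample gradients of the scalar output w.r.t. each layer's parameters
layers = [m for m in net if isinstance(m, nn.Linear)]
per_layer_grads = [[] for _ in layers]
for x in X:
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, [p for l in layers for p in l.parameters()])
    idx = 0
    for li, l in enumerate(layers):
        n_params = len(list(l.parameters()))
        flat = torch.cat([g.reshape(-1) for g in grads[idx:idx + n_params]])
        per_layer_grads[li].append(flat)
        idx += n_params

# K_l[i, j] = <grad_l f(x_i), grad_l f(x_j)>; the full empirical NTK is sum_l K_l.
layer_kernels = [torch.stack(g) @ torch.stack(g).T for g in per_layer_grads]
weights = torch.ones(len(layer_kernels))        # hypothetical combination weights
K = sum(w * Kl for w, Kl in zip(weights, layer_kernels))
print([Kl.shape for Kl in layer_kernels], K.shape)
```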
arXiv Detail & Related papers (2021-10-11T00:35:47Z)
- On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting [41.60125423028092]
We study two-layer ReLU networks in a teacher-student regression model.
We show that, with a specific regularization and sufficient over-parameterization, a student network can identify the parameters of the teacher network via a gradient method.
We analyze the global minima through a sparsity property in the measure space.
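For reference, the generic two-layer ReLU teacher-student regression model underlying this entry and the main paper above can be written as follows; the notation is the standard one for this setting rather than copied from either paper.

```latex
% Generic two-layer ReLU teacher-student regression model (standard notation,
% not copied from the papers): the student fits noisy outputs of a teacher
% network f_{\Theta^\ast} of the same form.
\begin{aligned}
  f_{\Theta}(x) &= \sum_{j=1}^{m} a_j\,\sigma\bigl(\langle w_j, x\rangle\bigr),
  \qquad \sigma(u) = \max(u, 0), \qquad \Theta = (a_j, w_j)_{j=1}^{m},\\
  y_i &= f_{\Theta^{\ast}}(x_i) + \varepsilon_i, \qquad i = 1, \dots, n,\\
  \text{excess risk of } \hat f &= \mathbb{E}_{x}\bigl[(\hat f(x) - f_{\Theta^{\ast}}(x))^{2}\bigr].
\end{aligned}
```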
arXiv Detail & Related papers (2021-06-11T09:05:41Z)
- Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed [27.38015169185521]
We show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning.
We show how over-parametrising the neural network leads to faster convergence, but does not improve its final performance.
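A toy version of the comparison in this entry, with mixture parameters and model sizes that are illustrative rather than the paper's: a two-layer ReLU network with only a few hidden neurons against an RBF-kernel SVM on a high-dimensional XOR-like Gaussian mixture.

```python
# Toy version of the comparison above: a small two-layer ReLU network versus an
# RBF-kernel SVM on a high-dimensional XOR-like Gaussian mixture. Dimensions,
# noise level, and model sizes are illustrative, not the paper's setup.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, n = 200, 4000
s = rng.integers(0, 2, size=(n, 2)) * 2 - 1          # sign pattern of the cluster centre
y = s[:, 0] * s[:, 1]                                # XOR-style labels in {-1, +1}
X = rng.normal(scale=0.7, size=(n, d))
X[:, :2] += s                                        # centres at (+-1, +-1); rest is noise
Xtr, ytr, Xte, yte = X[:2000], y[:2000], X[2000:], y[2000:]

svm = SVC(kernel="rbf").fit(Xtr, ytr)                # kernel baseline
print("RBF-SVM test accuracy:", svm.score(Xte, yte))

net = nn.Sequential(nn.Linear(d, 4), nn.ReLU(), nn.Linear(4, 1))   # few hidden neurons
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
Xt = torch.tensor(Xtr, dtype=torch.float32)
yt = torch.tensor(ytr, dtype=torch.float32)
for _ in range(1000):
    opt.zero_grad()
    loss = torch.log1p(torch.exp(-yt * net(Xt).squeeze(1))).mean()  # logistic loss
    loss.backward()
    opt.step()
with torch.no_grad():
    pred = net(torch.tensor(Xte, dtype=torch.float32)).squeeze(1).sign().numpy()
print("2LNN test accuracy:", (pred == yte).mean())
```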
arXiv Detail & Related papers (2021-02-23T15:10:15Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
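A minimal sketch of the simplest case mentioned in this entry, learning a line of networks: two endpoint weight vectors are trained jointly, and at each step the loss is evaluated at a randomly interpolated point on the segment. Curves, simplexes, and the regularization used in the paper are omitted; the sketch relies on torch.func.functional_call (PyTorch 2.x).

```python
# Minimal sketch of learning a *line* of networks (the simplest subspace case);
# the paper also learns curves and simplexes, which are omitted here.
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)

def make_net():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

end0, end1 = make_net(), make_net()      # the two endpoints of the line of networks
template = make_net()                    # architecture used for functional evaluation
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()                 # toy classification task

def forward_at(alpha, x):
    # Network whose weights are the convex combination (1 - alpha) * end0 + alpha * end1
    mixed = {name: (1 - alpha) * p0 + alpha * p1
             for (name, p0), (_, p1) in zip(end0.named_parameters(),
                                            end1.named_parameters())}
    return functional_call(template, mixed, (x,))

opt = torch.optim.Adam(list(end0.parameters()) + list(end1.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(300):
    alpha = torch.rand(()).item()        # sample a point on the segment each step
    loss = loss_fn(forward_at(alpha, X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

for a in (0.0, 0.5, 1.0):                # every point on the line should be accurate
    acc = (forward_at(a, X).argmax(1) == y).float().mean().item()
    print(f"alpha={a}: train accuracy {acc:.3f}")
```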
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
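An illustration of what "centering" a finite-width network typically means in this line of work (the paper's full protocol and ensembling are not reproduced here): predictions are taken relative to the frozen function at initialization, so the network starts at exactly zero.

```python
# Illustration of "centering" a finite-width network: predictions are taken
# relative to the frozen function at initialization. (The paper's full
# experimental protocol and ensembling are not reproduced here.)
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
net0 = copy.deepcopy(net)                 # frozen copy of the initialization
for p in net0.parameters():
    p.requires_grad_(False)

def centered(x):
    return net(x) - net0(x)               # f_t(x) - f_0(x): zero at initialization

X = torch.randn(256, 10)
y = torch.sin(X[:, :1])                   # toy regression target
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
for _ in range(200):
    loss = ((centered(X) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("train MSE of the centered network:", loss.item())
```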
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle that we call Feature Purification: one of the causes of the existence of adversarial examples is the accumulation of certain small, dense mixtures in the hidden weights during the training process of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
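A generic FGSM-style adversarial-training loop for a two-layer ReLU network, included only to illustrate the procedure this entry refers to; the data, perturbation radius, and architecture are assumptions and not the paper's setting.

```python
# Generic FGSM-style adversarial training loop for a two-layer ReLU network;
# data, radius, and architecture are illustrative, not the paper's setting.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 2))
X = torch.randn(1024, 20)
y = (X[:, 0] * X[:, 1] > 0).long()        # toy labels
eps = 0.1                                  # perturbation radius
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(300):
    # Build adversarial examples with one signed-gradient (FGSM) step
    X_adv = X.clone().requires_grad_(True)
    loss = loss_fn(net(X_adv), y)
    grad, = torch.autograd.grad(loss, X_adv)
    X_adv = (X + eps * grad.sign()).detach()

    # Train on the perturbed inputs
    opt.zero_grad()
    loss_fn(net(X_adv), y).backward()
    opt.step()

clean_acc = (net(X).argmax(1) == y).float().mean().item()
print(f"clean train accuracy after adversarial training: {clean_acc:.3f}")
```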
arXiv Detail & Related papers (2020-05-20T16:56:08Z)