On the Convergence of Shallow Neural Network Training with Randomly
Masked Neurons
- URL: http://arxiv.org/abs/2112.02668v1
- Date: Sun, 5 Dec 2021 19:51:14 GMT
- Title: On the Convergence of Shallow Neural Network Training with Randomly
Masked Neurons
- Authors: Fangshuo Liao, Anastasios Kyrillidis
- Abstract summary: Given a dense shallow neural network, we focus on creating, training, and combining randomly selected subnetworks (surrogate functions).
By analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error.
For fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps.
- Score: 11.119895959906085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given a dense shallow neural network, we focus on iteratively creating,
training, and combining randomly selected subnetworks (surrogate functions),
towards training the full model. By carefully analyzing $i)$ the subnetworks'
neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how
we sample and combine the surrogate functions, we prove a linear convergence rate
of the training error -- within an error region -- for an overparameterized
single-hidden layer perceptron with ReLU activations for a regression task. Our
result implies that, for fixed neuron selection probability, the error term
decreases as we increase the number of surrogate models, and increases as we
increase the number of local training steps for each selected subnetwork. The
considered framework generalizes and provides new insights on dropout training,
multi-sample dropout training, as well as Independent Subnet Training; for each
case, we provide corresponding convergence results, as corollaries of our main
theorem.
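To make the framework concrete, below is a minimal NumPy sketch (not the authors' code) of one round of the scheme the abstract describes: sample several random neuron masks, train each resulting surrogate network locally for a few gradient steps on a squared regression loss, and fold the surrogates' updates back into the full single-hidden-layer ReLU model. The hyperparameter names (`p_select`, `n_surrogates`, `local_steps`, `lr`), the fixed random output weights, and the simple averaging used as the combination rule are illustrative assumptions; the paper's exact aggregation and step sizes may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 64, 10, 256                          # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((width, d)) / np.sqrt(d)   # trainable hidden-layer weights
a = rng.choice([-1.0, 1.0], size=width)            # fixed output weights (NTK-style setup)

def predict(W_h, mask):
    """Surrogate forward pass: only the unmasked hidden neurons contribute."""
    return (np.maximum(X @ W_h.T, 0.0) * mask) @ a / np.sqrt(width)

p_select, n_surrogates, local_steps, lr = 0.5, 4, 5, 0.1   # illustrative values
updates = []
for _ in range(n_surrogates):
    mask = (rng.random(width) < p_select).astype(float)    # random neuron selection
    W_s = W.copy()
    for _ in range(local_steps):                           # local training of the surrogate
        residual = predict(W_s, mask) - y
        act = ((X @ W_s.T) > 0) * mask * a / np.sqrt(width)   # d(prediction)/d(pre-activation)
        W_s -= lr * (act * residual[:, None]).T @ X / n       # gradient step on 0.5 * MSE
    updates.append(W_s - W)

W += sum(updates) / n_surrogates                   # combine surrogates into the full model
print("training MSE:", np.mean((predict(W, np.ones(width)) - y) ** 2))
```

Roughly speaking, a single surrogate per round with one local step resembles a dropout-style update, several masks per round resembles multi-sample dropout, and partitioning the neurons across surrogates with multiple local steps resembles Independent Subnet Training; the paper's corollaries specialize the main theorem to these cases.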
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-square regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization result that requires no assumptions on the underlying regression function or the noise, other than that they are bounded.
arXiv Detail & Related papers (2024-10-08T16:54:23Z)
- Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias for Correlated Inputs [5.7166378791349315]
We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden layer ReLU network converges to zero loss.
We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm.
arXiv Detail & Related papers (2023-06-10T16:36:22Z)
- Bayesian Federated Neural Matching that Completes Full Information [2.6566593102111473]
Federated learning is a machine learning paradigm where locally trained models are distilled into a global model.
We propose a novel approach that overcomes a flaw of existing federated neural matching methods by introducing a Kullback-Leibler divergence penalty at each iteration.
arXiv Detail & Related papers (2022-11-15T09:47:56Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- An alternative approach to train neural networks using monotone variational inequality [22.320632565424745]
We propose an alternative approach to neural network training using the monotone vector field.
Our approach can be used for more efficient fine-tuning of a pre-trained neural network.
arXiv Detail & Related papers (2022-02-17T19:24:20Z)
- How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis for the known iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments from shallow neural networks to deep neural networks are also provided to justify the correctness of our established theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Local Critic Training for Model-Parallel Learning of Deep Neural Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We also show that trained networks by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L^1$ and $L^2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)