Spatially heterogeneous learning by a deep student machine
- URL: http://arxiv.org/abs/2302.07419v4
- Date: Mon, 10 Jul 2023 05:38:58 GMT
- Title: Spatially heterogeneous learning by a deep student machine
- Authors: Hajime Yoshino
- Abstract summary: Deep neural networks (DNN) with a huge number of adjustable parameters remain largely black boxes.
We study supervised learning by a DNN of width $N$ and depth $L$ consisting of $NL$ perceptrons with $c$ inputs by a statistical mechanics approach called the teacher-student setting.
We show that the problem becomes exactly solvable in what we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$, using the replica method developed in (H. Yoshino, (2020)).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNN) with a huge number of adjustable parameters remain
largely black boxes. To shed light on the hidden layers of DNN, we study
supervised learning by a DNN of width $N$ and depth $L$ consisting of $NL$
perceptrons with $c$ inputs by a statistical mechanics approach called the
teacher-student setting. We consider an ensemble of student machines that
exactly reproduce $M$ sets of $N$ dimensional input/output relations provided
by a teacher machine. We show that the problem becomes exactly solvable in what
we call the 'dense limit': $N \gg c \gg 1$ and $M \gg 1$ with fixed $\alpha=M/c$,
using the replica method developed in (H. Yoshino, (2020)). We also study the
model numerically by performing simple greedy Monte Carlo (MC) simulations. Simulations reveal
that learning by the DNN is quite heterogeneous in the network space:
configurations of the teacher and the student machines are more correlated
within the layers closer to the input/output boundaries, while the central
region remains much less correlated due to the over-parametrization, in
qualitative agreement with the theoretical prediction. We evaluate the
generalization error of the DNN for various depths $L$, both theoretically and
numerically. Remarkably, both the theory and the simulation suggest that the
generalization ability of the student machines, which are only weakly
correlated with the teacher in the center, does not vanish even in the deep
limit $L \gg 1$, where the system becomes heavily over-parametrized. We also
consider the impact of the effective dimension $D(\leq N)$ of the data by incorporating
the hidden manifold model (S. Goldt et al., (2020)) into our model. The theory
implies that the loop corrections to the dense limit become enhanced by either
decreasing the width $N$ or decreasing the effective dimension $D$ of the data.
Simulations suggest that both lead to significant improvements in
generalization ability.
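
To make the setup above concrete, here is a minimal, heavily simplified sketch of the teacher-student experiment with a greedy Monte Carlo student. The $\pm 1$ weights, sign activations, dense (all-to-all) connectivity instead of $c$-sparse inputs per perceptron, and the tiny values of $N$, $L$, $M$ are illustrative assumptions, not the paper's exact model.

```python
# Toy teacher-student experiment: a deep network of sign-activation perceptrons,
# with the student trained by greedy (zero-temperature) Monte Carlo to reproduce
# the teacher's outputs. The +/-1 weights, dense connectivity, and tiny sizes
# are illustrative assumptions, not the paper's exact model.
import numpy as np

rng = np.random.default_rng(0)
N, L, M = 16, 6, 100                # width, depth, number of training patterns

def forward(weights, X):
    """Propagate +/-1 input patterns (rows of X) through L sign-perceptron layers."""
    A = X
    for W in weights:               # each W has shape (N, N)
        A = np.sign(A @ W.T / np.sqrt(N))
        A[A == 0] = 1.0             # break ties so outputs stay in {-1, +1}
    return A

teacher = [rng.choice([-1.0, 1.0], size=(N, N)) for _ in range(L)]
student = [rng.choice([-1.0, 1.0], size=(N, N)) for _ in range(L)]

X = rng.choice([-1.0, 1.0], size=(M, N))    # training inputs
Y = forward(teacher, X)                     # teacher-provided outputs

def train_error(weights):
    return np.mean(forward(weights, X) != Y)

err = train_error(student)
for step in range(20000):
    # propose flipping a single student weight; keep it only if the
    # training error does not increase (greedy MC)
    l, i, j = rng.integers(L), rng.integers(N), rng.integers(N)
    student[l][i, j] *= -1.0
    new_err = train_error(student)
    if new_err <= err:
        err = new_err
    else:
        student[l][i, j] *= -1.0            # reject the move
    if err == 0.0:
        break

# Layer-wise teacher-student weight overlap: the paper's simulations find it is
# larger near the input/output boundaries than in the central layers.
overlaps = [float(np.mean(teacher[l] * student[l])) for l in range(L)]
print("training error:", err)
print("overlap per layer:", np.round(overlaps, 3))
```

The overlap printed per layer plays the role of the teacher-student correlation discussed in the abstract; in this toy the sizes are far too small for the dense-limit theory to apply, so it only illustrates the protocol.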
Related papers
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- the effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets, on the order of $P^* \sim \sqrt{N}$, for regression with ReLU networks.
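
As a rough numerical reading of that scaling (our own back-of-the-envelope illustration, not a computation from that paper), $P^* \sim \sqrt{N}$ corresponds to surprisingly small datasets even for wide networks:

```python
# Back-of-the-envelope illustration of the quoted P* ~ sqrt(N) scaling.
import math

for N in (10**2, 10**4, 10**6):
    print(f"width N = {N:>9,d}  ->  P* ~ sqrt(N) ~ {math.isqrt(N):,d} samples")
```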
arXiv Detail & Related papers (2022-12-23T04:48:04Z)
- Neural Networks Efficiently Learn Low-Dimensional Representations with SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction.
We also provide compression guarantees for NNs using the approximate low-rank structure produced by SGD.
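
To illustrate the single-index setting in that summary, here is a hedged toy experiment (our own construction, not the paper's): a two-layer ReLU network trained by plain SGD on $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle)+\epsilon$ with $f=\tanh$, after which we check how well the first-layer weights align with the hidden direction $\boldsymbol{u}$. The network size, step size, and choice of $f$ are arbitrary.

```python
# Toy single-index experiment: y = f(<u, x>) + noise, learned by a small
# two-layer ReLU network trained with plain SGD (manual gradients).
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 50, 64, 2000               # input dim, hidden width, sample size

u = rng.normal(size=d)
u /= np.linalg.norm(u)               # hidden direction
X = rng.normal(size=(n, d))
y = np.tanh(X @ u) + 0.05 * rng.normal(size=n)

W = rng.normal(size=(m, d)) / np.sqrt(d)   # first layer
a = rng.normal(size=m) / np.sqrt(m)        # second layer
lr = 0.02

for epoch in range(15):
    for idx in rng.permutation(n):
        x, t = X[idx], y[idx]
        pre = W @ x                        # pre-activations
        h = np.maximum(pre, 0.0)           # ReLU features
        pred = a @ h
        g = pred - t                       # dLoss/dpred for squared loss
        a -= lr * g * h
        W -= lr * np.outer(g * a * (pre > 0.0), x)

# How much of the first layer is explained by the hidden direction u?
# Compare the top right-singular vector of W with u.
_, _, Vt = np.linalg.svd(W, full_matrices=False)
print("alignment |<v1, u>| =", round(float(abs(Vt[0] @ u)), 3))
```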
arXiv Detail & Related papers (2022-09-29T15:29:10Z)
- Understanding Deep Neural Function Approximation in Reinforcement Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL).
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed with Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it into a generalization problem under the non-i.i.d. setting.
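
For reference, the $\epsilon$-greedy rule itself is simple to state in code; the sketch below uses a toy linear action-value model as a stand-in (the paper's analysis concerns deep and two-layer networks in Besov/Barron spaces, which this does not reproduce).

```python
# Minimal epsilon-greedy action selection on top of a learned action-value
# function Q(s, a). The tiny linear "network" and the sizes are placeholders.
import numpy as np

rng = np.random.default_rng(2)
state_dim, n_actions = 8, 4
theta = rng.normal(size=(n_actions, state_dim)) * 0.1   # one weight row per action

def q_values(state):
    """Action values for a single state under the current parameters."""
    return theta @ state

def epsilon_greedy(state, epsilon):
    """With prob. epsilon explore uniformly, otherwise exploit argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(state)))

state = rng.normal(size=state_dim)
actions = [epsilon_greedy(state, epsilon=0.1) for _ in range(1000)]
print("greedy action:", int(np.argmax(q_values(state))))
print("empirical action frequencies:", np.bincount(actions, minlength=n_actions) / 1000)
```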
arXiv Detail & Related papers (2022-09-15T15:42:47Z)
- Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity [9.077741848403791]
We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_\ell$ of the training set.
This reformulation reveals the dynamics behind feature learning.
arXiv Detail & Related papers (2022-05-31T14:10:15Z)
- Lessons from $O(N)$ models in one dimension [0.0]
Various topics related to the $O(N)$ model in one spacetime dimension (i.e. ordinary quantum mechanics) are considered.
The focus is on a pedagogical presentation of quantum field theory methods in a simpler context.
arXiv Detail & Related papers (2021-09-14T11:36:30Z)
- Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
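
The procedure mentioned in the last sentence, kernel regression with a ridge that decreases with the training-set size, can be sketched directly. The RBF kernel, the toy target, and the decay rule $\lambda(P)=\lambda_0/P$ below are illustrative assumptions, not choices taken from that paper.

```python
# Kernel ridge regression with a ridge that shrinks as the training-set size P grows.
import numpy as np

rng = np.random.default_rng(3)
d, n_test = 5, 500
w_target = rng.standard_normal(d) / np.sqrt(d)
target = lambda X: np.cos(X @ w_target)             # toy teacher function

def rbf_kernel(A, B, gamma=0.1):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

Xte = rng.standard_normal((n_test, d))
yte = target(Xte)

lambda0 = 1.0
for P in (50, 200, 800):
    Xtr = rng.standard_normal((P, d))
    ytr = target(Xtr)
    lam = lambda0 / P                               # ridge decreases with P
    alpha = np.linalg.solve(rbf_kernel(Xtr, Xtr) + lam * np.eye(P), ytr)
    pred = rbf_kernel(Xte, Xtr) @ alpha
    mse = float(np.mean((pred - yte) ** 2))
    print(f"P = {P:4d}, ridge = {lam:.4f}, test MSE = {mse:.4f}")
```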
arXiv Detail & Related papers (2021-06-16T08:27:31Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Hardness of Learning Halfspaces with Massart Noise [56.98280399449707]
We study the complexity of PAC learning halfspaces in the presence of Massart (bounded) noise.
We show that there is an exponential gap between the information-theoretically optimal error and the best error that can be achieved by an SQ algorithm.
arXiv Detail & Related papers (2020-12-17T16:43:11Z)
- Towards Deep Learning Models Resistant to Large Perturbations [0.0]
Adversarial robustness has proven to be a required property of machine learning algorithms.
We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude.
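
For context on what the "adversarial training" loop looks like operationally, here is a minimal inner-maximization/outer-minimization sketch with FGSM-style perturbations of magnitude eps on a logistic-regression model; it is only a hedged illustration of the scheme, not the deep-network setting or the large perturbation budgets that the paper studies.

```python
# Minimal adversarial-training loop (FGSM-style inner maximization) on a
# logistic-regression model; the data, eps, and step size are illustrative.
import numpy as np

rng = np.random.default_rng(4)
d, n = 20, 1000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)                  # labels in {0, 1}

w = np.zeros(d)
eps, lr = 0.1, 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    # Inner maximization: for a linear model the loss-maximizing L_inf
    # perturbation is the FGSM step x_adv = x + eps * sign(dLoss/dx).
    p = sigmoid(X @ w)
    grad_x = np.outer(p - y, w)                     # dLoss/dx for logistic loss
    X_adv = X + eps * np.sign(grad_x)
    # Outer minimization: gradient step on the adversarial examples.
    p_adv = sigmoid(X_adv @ w)
    w -= lr * X_adv.T @ (p_adv - y) / n

clean_acc = np.mean((X @ w > 0) == (y > 0.5))
adv_acc = np.mean((X_adv @ w > 0) == (y > 0.5))     # most recent adversarial batch
print(f"clean accuracy: {clean_acc:.3f}, adversarial accuracy (eps={eps}): {adv_acc:.3f}")
```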
arXiv Detail & Related papers (2020-03-30T12:03:09Z)