On the emergence of tetrahedral symmetry in the final and penultimate
layers of neural network classifiers
- URL: http://arxiv.org/abs/2012.05420v2
- Date: Sat, 19 Dec 2020 17:22:05 GMT
- Title: On the emergence of tetrahedral symmetry in the final and penultimate
layers of neural network classifiers
- Authors: Weinan E and Stephan Wojtowytsch
- Abstract summary: We show that even the final output of the classifier $h$ is not uniform over data samples from a class $C_i$ if $h$ is a shallow network.
We explain this observation analytically in toy models for highly expressive deep neural networks.
- Score: 9.975163460952045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent numerical study observed that neural network classifiers enjoy a
large degree of symmetry in the penultimate layer. Namely, if $h(x) = Af(x) +b$
where $A$ is a linear map and $f$ is the output of the penultimate layer of the
network (after activation), then all data points $x_{i, 1}, \dots, x_{i, N_i}$
in a class $C_i$ are mapped to a single point $y_i$ by $f$ and the points $y_i$
are located at the vertices of a regular $(k-1)$-dimensional tetrahedron in a
high-dimensional Euclidean space.
We explain this observation analytically in toy models for highly expressive
deep neural networks. In complementary examples, we demonstrate rigorously that
even the final output of the classifier $h$ is not uniform over data samples
from a class $C_i$ if $h$ is a shallow network (or if the deeper layers do not
bring the data samples into a convenient geometric configuration).
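The symmetric configuration described in the abstract can be made concrete. Below is a minimal numpy sketch (not code from the paper; the construction $y_i \propto e_i - \tfrac{1}{k}\mathbf{1}$ is one standard way to realize a regular simplex) that builds $k$ class points, checks that they are pairwise equidistant, and checks that they are equiangular about their centroid with cosine $-1/(k-1)$, which is the geometry the penultimate-layer class points $y_i$ are observed to form.

```python
import numpy as np

def regular_simplex_vertices(k: int) -> np.ndarray:
    """Return k unit vectors in R^k whose endpoints form a regular
    (k-1)-dimensional simplex centered at the origin.
    (Hypothetical helper for illustration, not code from the paper.)"""
    centered = np.eye(k) - np.ones((k, k)) / k      # y_i = e_i - (1/k) * 1
    return centered / np.linalg.norm(centered[0])   # rescale to unit norm

k = 10                                              # number of classes
Y = regular_simplex_vertices(k)

# All pairwise distances between distinct vertices coincide.
dists = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
off_diag = ~np.eye(k, dtype=bool)
print("distance spread:", dists[off_diag].max() - dists[off_diag].min())  # ~0

# All pairwise angles have cosine -1/(k-1), the maximally separated value.
cosines = (Y @ Y.T)[off_diag]
print("mean cosine:", cosines.mean(), "expected:", -1 / (k - 1))
```

For a trained classifier one would instead compute the class means of the penultimate-layer features $f(x_{i,j})$ and compare their pairwise distances and angles against these reference values; the abstract's point is that highly expressive deep networks reach this configuration, whereas for a shallow network even the final output $h$ need not be constant on a class.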
Related papers
- Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m \geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d\,\mathrm{polylog}\,d$. A minimal, illustrative training sketch for this setting appears after this list.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y = f(\langle \boldsymbol{u}, \boldsymbol{x} \rangle) + \epsilon$ by recovering the principal direction.
We also provide compression guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed
Number of Neurons [5.37133760455631]
We develop feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons.
We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error.
arXiv Detail & Related papers (2021-07-06T05:24:30Z) - A deep network construction that adapts to intrinsic dimensionality
beyond the domain [79.23797234241471]
We study the approximation of two-layer compositions $f(x) = g(\phi(x))$ via deep networks with ReLU activation.
We focus on two intuitive and practically relevant choices for $\phi$: the projection onto a low-dimensional embedded submanifold and a distance to a collection of low-dimensional sets.
arXiv Detail & Related papers (2020-08-06T09:50:29Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network trained by gradient descent can provably learn the ground truth with small population loss using polynomially many samples, beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Sharp Representation Theorems for ReLU Networks with Precise Dependence
on Depth [26.87238691716307]
We prove sharp, dimension-free representation results for neural networks with $D$ ReLU layers under square loss.
Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions.
arXiv Detail & Related papers (2020-06-07T05:25:06Z)
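As a concrete illustration of the single-index setting referenced in the entry "Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit" (the sketch promised there), the following PyTorch snippet trains a two-layer ReLU network with online SGD on labels $y = \sigma_*(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle)$ generated from isotropic Gaussian inputs. The link function, width, step size, and sample budget here are illustrative assumptions, not the settings analyzed in that paper.

```python
import torch

d, width, batch, steps, lr = 64, 256, 128, 20_000, 0.01
theta = torch.randn(d)
theta /= theta.norm()                              # unknown index direction
sigma_star = lambda z: z ** 3 - 3 * z              # illustrative polynomial link (He_3)

model = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=lr)

for _ in range(steps):
    x = torch.randn(batch, d)                      # fresh isotropic Gaussian samples
    y = sigma_star(x @ theta).unsqueeze(1)         # noiseless single-index labels
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# How well do first-layer weights align with the hidden direction theta?
with torch.no_grad():
    W = model[0].weight                            # shape (width, d)
    align = (W @ theta).abs() / W.norm(dim=1)
    print(f"final batch loss {loss.item():.3f}, "
          f"mean |cos(w_j, theta)| {align.mean().item():.3f}")
```

Drawing a fresh batch at every step mirrors the accounting in which sample size and runtime scale together ($n \asymp T$); recovery of the direction $\boldsymbol{\theta}$ by the first layer is the low-dimensional structure such results concern.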
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.