On the emergence of tetrahedral symmetry in the final and penultimate
layers of neural network classifiers
- URL: http://arxiv.org/abs/2012.05420v2
- Date: Sat, 19 Dec 2020 17:22:05 GMT
- Title: On the emergence of tetrahedral symmetry in the final and penultimate
layers of neural network classifiers
- Authors: Weinan E and Stephan Wojtowytsch
- Abstract summary: We show that even the final output of the classifier $h$ is not uniform over data samples from a class $C_i$ if $h$ is a shallow network.
We explain this observation analytically in toy models for highly expressive deep neural networks.
- Score: 9.975163460952045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A recent numerical study observed that neural network classifiers enjoy a
large degree of symmetry in the penultimate layer. Namely, if $h(x) = Af(x) +b$
where $A$ is a linear map and $f$ is the output of the penultimate layer of the
network (after activation), then all data points $x_{i, 1}, \dots, x_{i, N_i}$
in a class $C_i$ are mapped to a single point $y_i$ by $f$ and the points $y_i$
are located at the vertices of a regular $(k-1)$-dimensional tetrahedron in a
high-dimensional Euclidean space.
We explain this observation analytically in toy models for highly expressive
deep neural networks. In complementary examples, we demonstrate rigorously that
even the final output of the classifier $h$ is not uniform over data samples
from a class $C_i$ if $h$ is a shallow network (or if the deeper layers do not
bring the data samples into a convenient geometric configuration).
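The symmetric configuration described in the abstract can be made concrete. Below is a minimal numpy sketch (not code from the paper; the construction $y_i \propto e_i - \tfrac{1}{k}\mathbf{1}$ is one standard way to realize a regular simplex) that builds $k$ class points, checks that they are pairwise equidistant, and checks that they are equiangular about their centroid with cosine $-1/(k-1)$, which is the geometry the penultimate-layer class points $y_i$ are observed to form.

```python
import numpy as np

def regular_simplex_vertices(k: int) -> np.ndarray:
    """Return k unit vectors in R^k whose endpoints form a regular
    (k-1)-dimensional simplex centered at the origin.
    (Hypothetical helper for illustration, not code from the paper.)"""
    centered = np.eye(k) - np.ones((k, k)) / k      # y_i = e_i - (1/k) * 1
    return centered / np.linalg.norm(centered[0])   # rescale to unit norm

k = 10                                              # number of classes
Y = regular_simplex_vertices(k)

# All pairwise distances between distinct vertices coincide.
dists = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
off_diag = ~np.eye(k, dtype=bool)
print("distance spread:", dists[off_diag].max() - dists[off_diag].min())  # ~0

# All pairwise angles have cosine -1/(k-1), the maximally separated value.
cosines = (Y @ Y.T)[off_diag]
print("mean cosine:", cosines.mean(), "expected:", -1 / (k - 1))
```

For a trained classifier one would instead compute the class means of the penultimate-layer features $f(x_{i,j})$ and compare their pairwise distances and angles against these reference values; the abstract's point is that highly expressive deep networks reach this configuration, whereas for a shallow network even the final output $h$ need not be constant on a class.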
Related papers
- Deep Neural Networks: Multi-Classification and Universal Approximation [0.0]
We demonstrate that a ReLU deep neural network with a width of $2$ and a depth of $2N+4M-1$ layers can achieve finite sample memorization for any dataset comprising $N$ elements.
We also provide depth estimates for approximating $W^{1,p}$ functions and width estimates for approximating $L^p(\Omega;\mathbb{R}^m)$ for $m \geq 1$.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with an arbitrary link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d\,\mathrm{polylog}\,d$. A minimal, illustrative training sketch for this setting appears after this list.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks.
For a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error.
This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z) - Effective Minkowski Dimension of Deep Nonparametric Regression: Function
Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y = f(\langle \boldsymbol{u}, \boldsymbol{x} \rangle) + \epsilon$ by recovering the principal direction.
We also provide compression guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed
Number of Neurons [5.37133760455631]
We develop feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons.
We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error.
arXiv Detail & Related papers (2021-07-06T05:24:30Z) - A deep network construction that adapts to intrinsic dimensionality
beyond the domain [79.23797234241471]
We study the approximation of two-layer compositions $f(x) = g(\phi(x))$ via deep networks with ReLU activation.
We focus on two intuitive and practically relevant choices for $\phi$: the projection onto a low-dimensional embedded submanifold and a distance to a collection of low-dimensional sets.
arXiv Detail & Related papers (2020-08-06T09:50:29Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network trained by gradient descent can provably learn the ground truth with small population loss using polynomially many samples, beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Sharp Representation Theorems for ReLU Networks with Precise Dependence
on Depth [26.87238691716307]
We prove sharp, dimension-free representation results for neural networks with $D$ ReLU layers under square loss.
Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions.
arXiv Detail & Related papers (2020-06-07T05:25:06Z)
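As a concrete illustration of the single-index setting referenced in the entry "Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit" (the sketch promised there), the following PyTorch snippet trains a two-layer ReLU network with online SGD on labels $y = \sigma_*(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle)$ generated from isotropic Gaussian inputs. The link function, width, step size, and sample budget here are illustrative assumptions, not the settings analyzed in that paper.

```python
import torch

d, width, batch, steps, lr = 64, 256, 128, 20_000, 0.01
theta = torch.randn(d)
theta /= theta.norm()                              # unknown index direction
sigma_star = lambda z: z ** 3 - 3 * z              # illustrative polynomial link (He_3)

model = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=lr)

for _ in range(steps):
    x = torch.randn(batch, d)                      # fresh isotropic Gaussian samples
    y = sigma_star(x @ theta).unsqueeze(1)         # noiseless single-index labels
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# How well do first-layer weights align with the hidden direction theta?
with torch.no_grad():
    W = model[0].weight                            # shape (width, d)
    align = (W @ theta).abs() / W.norm(dim=1)
    print(f"final batch loss {loss.item():.3f}, "
          f"mean |cos(w_j, theta)| {align.mean().item():.3f}")
```

Drawing a fresh batch at every step mirrors the accounting in which sample size and runtime scale together ($n \asymp T$); recovery of the direction $\boldsymbol{\theta}$ by the first layer is the low-dimensional structure such results concern.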
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.