Computing the Information Content of Trained Neural Networks
- URL: http://arxiv.org/abs/2103.01045v1
- Date: Mon, 1 Mar 2021 14:38:25 GMT
- Title: Computing the Information Content of Trained Neural Networks
- Authors: Jeremy Bernstein and Yisong Yue
- Abstract summary: How can neural networks with vastly more weights than training data still generalise?
This paper derives both a consistent estimator and a closed-form upper bound on the information content of infinitely wide neural networks.
- Score: 46.34988166338264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much information does a learning algorithm extract from the training data
and store in a neural network's weights? Too much, and the network would
overfit to the training data. Too little, and the network would not fit to
anything at all. Naïvely, the amount of information the network stores should
scale in proportion to the number of trainable weights. This raises the
question: how can neural networks with vastly more weights than training data
still generalise? A simple resolution to this conundrum is that the number of
weights is usually a bad proxy for the actual amount of information stored. For
instance, typical weight vectors may be highly compressible. Then another
question occurs: is it possible to compute the actual amount of information
stored? This paper derives both a consistent estimator and a closed-form upper
bound on the information content of infinitely wide neural networks. The
derivation is based on an identification between neural information content and
the negative log probability of a Gaussian orthant. This identification yields
bounds that analytically control the generalisation behaviour of the entire
solution space of infinitely wide networks. The bounds have a simple dependence
on both the network architecture and the training data. Corroborating the
findings of Valle-Pérez et al. (2019), who conducted a similar analysis using
approximate Gaussian integration techniques, the bounds are found to be both
non-vacuous and correlated with the empirical generalisation behaviour at
finite width.
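To make the orthant identification concrete, below is a minimal Monte Carlo sketch: it estimates the probability that a zero-mean Gaussian vector lands in the orthant picked out by the training labels, and reports the negative log of that probability as an information content in bits. The RBF kernel, toy dataset, and sampling estimator are illustrative assumptions standing in for the paper's infinite-width network kernel and exact construction.

```python
# Toy Monte Carlo sketch: information content as the negative log probability
# of a Gaussian orthant.  The RBF kernel below is only a stand-in for the
# infinite-width network kernel used in the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification training set.
n = 10                                    # number of training points
X = rng.normal(size=(n, 2))               # inputs
y = np.sign(rng.normal(size=n))           # +/-1 labels

# Kernel (covariance) matrix over the training inputs.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 1e-6 * np.eye(n)      # jitter for stability

# Estimate P(sign(f) == y) for f ~ N(0, K): the Gaussian orthant probability.
n_samples = 200_000
L = np.linalg.cholesky(K)
f = rng.normal(size=(n_samples, n)) @ L.T
hits = np.all(np.sign(f) == y, axis=1)
p_orthant = max(hits.mean(), 1.0 / n_samples)       # crude floor if no hits

# Information content (in bits) of the set of functions fitting the labels.
info_bits = -np.log2(p_orthant)
print(f"orthant probability ~ {p_orthant:.4g}")
print(f"estimated information content ~ {info_bits:.2f} bits "
      f"(vs {n} bits for memorising {n} random labels)")
```

The paper's closed-form upper bound replaces this sampling step with an analytic expression for the orthant probability; the sketch only shows which quantity is being estimated and bounded.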
Related papers
- Fundamental limits of overparametrized shallow neural networks for supervised learning [11.136777922498355]
We study a two-layer neural network trained from input-output pairs generated by a teacher network with matching architecture.
Our results take the form of bounds involving i) the mutual information between the training data and the network weights, and ii) the Bayes-optimal generalization error.
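Below is a minimal sketch of the teacher-student setting described in this summary, assuming a tanh two-layer teacher and a matching student trained by full-batch gradient descent; the paper's information-theoretic bounds are not reproduced here.

```python
# Sketch of a teacher-student setup: labels come from a random two-layer
# "teacher", and a student with matching architecture is trained on the pairs.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_train, n_test = 20, 8, 500, 2000

def two_layer(X, W, a):
    """Two-layer tanh network: a . tanh(W x)."""
    return np.tanh(X @ W.T) @ a

# Teacher with fixed random weights generates the input-output pairs.
W_t = rng.normal(size=(k, d)) / np.sqrt(d)
a_t = rng.normal(size=k)
X = rng.normal(size=(n_train + n_test, d))
y = two_layer(X, W_t, a_t)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Student with matching architecture, trained by full-batch gradient descent.
W_s = rng.normal(size=(k, d)) / np.sqrt(d)
a_s = rng.normal(size=k)
lr = 0.1
for step in range(20_000):
    h = np.tanh(X_train @ W_s.T)                      # hidden activations
    g = 2.0 * (h @ a_s - y_train) / n_train           # d loss / d prediction
    grad_a = h.T @ g
    grad_W = (np.outer(g, a_s) * (1.0 - h ** 2)).T @ X_train
    a_s -= lr * grad_a
    W_s -= lr * grad_W

test_mse = np.mean((two_layer(X_test, W_s, a_s) - y_test) ** 2)
print(f"student generalisation error (test MSE): {test_mse:.4f}")
```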
arXiv Detail & Related papers (2023-07-11T08:30:50Z)
- Dive into Layers: Neural Network Capacity Bounding using Algebraic Geometry [55.57953219617467]
We show that the learnability of a neural network is directly related to its size.
We use Betti numbers to measure the topological geometric complexity of input data and the neural network.
We perform the experiments on a real-world dataset MNIST and the results verify our analysis and conclusion.
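As a rough illustration of Betti numbers as a complexity measure, the sketch below builds an ε-neighbourhood graph on a toy point cloud and reports b0 (connected components) together with the cycle rank of the graph; this 1-complex proxy over-counts b1 relative to a full homology computation and is not the paper's construction.

```python
# Crude topological complexity of a point cloud via its epsilon-graph:
# b0 = connected components, cycle rank = b1 of the 1-skeleton.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)

# Toy point cloud: a densely sampled noisy circle (graph should be connected).
n_points = 200
theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(n_points, 2))

eps = 0.3                                                  # neighbourhood radius
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
adj = (dists < eps) & ~np.eye(n_points, dtype=bool)        # epsilon-graph adjacency

n_edges = int(adj.sum()) // 2
b0, _ = connected_components(csr_matrix(adj), directed=False)
cycle_rank = n_edges - n_points + b0                       # b1 of the 1-skeleton
print(f"b0 = {b0}, cycle rank of the epsilon-graph = {cycle_rank}")
```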
arXiv Detail & Related papers (2021-09-03T11:45:51Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can, with high probability, make two well-separated classes linearly separable.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
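A toy numerical check of this idea appears in the sketch below: random ReLU features with standard Gaussian weights and uniform biases map two concentric, non-linearly-separable classes to a space where a simple linear readout classifies them. The dataset, width, and least-squares readout are illustrative choices, not the paper's probabilistic argument.

```python
# Random two-layer ReLU features (Gaussian weights, uniform biases) followed
# by a least-squares linear readout on a dataset of two concentric rings.
import numpy as np

rng = np.random.default_rng(0)

# Two classes that are not linearly separable in the input space.
n = 400
r = np.where(np.arange(n) < n // 2, 1.0, 3.0)           # radius per class
y = np.where(np.arange(n) < n // 2, -1.0, 1.0)          # labels
angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.c_[r * np.cos(angle), r * np.sin(angle)] + 0.1 * rng.normal(size=(n, 2))

# Random ReLU feature map: standard Gaussian weights, uniform biases.
width = 500
W = rng.normal(size=(width, 2))
b = rng.uniform(-3.0, 3.0, size=width)
features = np.maximum(X @ W.T + b, 0.0)

# Linear readout on the random features.
readout, *_ = np.linalg.lstsq(features, y, rcond=None)
acc = np.mean(np.sign(features @ readout) == y)
print(f"training accuracy of linear readout on random features: {acc:.3f}")
```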
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where our task is not purely opaque.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z) - Slope and generalization properties of neural networks [0.0]
We show that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network.
The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples.
We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
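Reading the slope as the norm of the network's input gradient, the sketch below computes its distribution for untrained random two-layer ReLU networks at two widths under 1/sqrt(width) output scaling; the paper studies trained classifiers, so this only illustrates the quantity and the width comparison, not the paper's experiments.

```python
# "Slope" sketch: norm of the input gradient of random two-layer ReLU
# networks, compared across two widths.
import numpy as np

rng = np.random.default_rng(0)
d = 10                                    # input dimension
X = rng.normal(size=(2000, d))            # evaluation points

def slope_samples(width):
    """Input-gradient norm of f(x) = a . relu(W x + b) / sqrt(width)."""
    W = rng.normal(size=(width, d))
    b = rng.normal(size=width)
    a = rng.normal(size=width)
    active = (X @ W.T + b > 0).astype(float)         # ReLU gate per point
    grads = (active * a) @ W / np.sqrt(width)        # d f / d x at each point
    return np.linalg.norm(grads, axis=1)

for width in (100, 10_000):
    s = slope_samples(width)
    print(f"width {width:>6}: mean slope {s.mean():.3f}, std {s.std():.3f}")
```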
arXiv Detail & Related papers (2021-07-03T17:54:27Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
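The sketch below mimics that structure with synthetic activations: groups of neurons share a common signal plus independent noise, and the groups are recovered by thresholding the neuron-neuron correlation matrix. It illustrates the described phenomenon, not the paper's analysis of real convolutional networks.

```python
# Synthetic wide-layer activations with redundant neuron groups, recovered by
# thresholding the correlation matrix and taking connected components.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
n_inputs, n_groups, group_size = 1000, 4, 16

signals = rng.normal(size=(n_inputs, n_groups))          # shared group signals
acts = np.repeat(signals, group_size, axis=1)            # each neuron copies one signal
acts += 0.3 * rng.normal(size=acts.shape)                # independent per-neuron noise

corr = np.corrcoef(acts, rowvar=False)                   # neuron-neuron correlations
adj = np.abs(corr) > 0.7                                 # "same group" if highly correlated
n_found, labels = connected_components(csr_matrix(adj), directed=False)
print(f"recovered {n_found} groups (ground truth: {n_groups})")
```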
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking at only a fraction of their entries.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
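As a sketch of reconstructing from a fraction of the entries, the example below performs classical matrix cross-approximation (a few full rows and columns plus the inverse of their intersection block); the paper's method is a differentiable cross-approximation for higher-order tensors, which this toy 2D version does not capture.

```python
# Classical matrix cross-approximation: A ~ C @ pinv(A[I, J]) @ R, built from
# a handful of rows I and columns J of the full matrix.
import numpy as np

rng = np.random.default_rng(0)

# A low-rank matrix (rank 5) standing in for gridded visual data.
m, n, rank = 1000, 800, 5
A = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

# Pick a few rows I and columns J (here at random; practical cross
# approximation chooses them adaptively, e.g. by maximum-volume pivoting).
I = rng.choice(m, size=rank, replace=False)
J = rng.choice(n, size=rank, replace=False)

C = A[:, J]                              # a few full columns
R = A[I, :]                              # a few full rows
U = np.linalg.pinv(A[np.ix_(I, J)])      # pseudo-inverse of the intersection block

A_hat = C @ U @ R
err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
entries_seen = (m + n - rank) * rank
print(f"relative error {err:.2e} using {entries_seen} of {m * n} entries")
```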
arXiv Detail & Related papers (2021-05-29T08:39:57Z)
- Tensor networks and efficient descriptions of classical data [0.9176056742068814]
We study how the mutual information between a subregion and its complement scales with the subsystem size $L$.
We find that for text, the mutual information scales as a power law $L^{\nu}$ with an exponent $\nu$ close to a volume law.
For images, the scaling is close to an area law, hinting that 2D tensor networks such as PEPS could have adequate expressibility.
arXiv Detail & Related papers (2021-03-11T18:57:16Z)
- A Law of Robustness for Weight-bounded Neural Networks [37.54604146791085]
Recently, Bubeck et al. (2020) conjectured that when using two-layer networks with $k$ neurons to fit a generic dataset of $n$ points, the smallest Lipschitz constant is $\Omega(\sqrt{\frac{n}{k}})$.
In this work we derive a lower bound on the Lipschitz constant for any arbitrary model class with bounded Rademacher complexity.
Our result coincides with that conjectured in (Bubeck et al., 2020) for two-layer networks under the assumption of bounded weights.
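A toy probe of the conjectured scaling is sketched below: train a width-$k$ two-layer ReLU network on $n$ randomly labelled points, lower-bound its Lipschitz constant by the largest pairwise slope on the training set, and compare it with $\sqrt{n/k}$. The architecture, optimiser, and sample sizes are illustrative assumptions, not part of the paper's proof, and the quality of the fit is reported rather than guaranteed.

```python
# Empirical probe: Lipschitz lower bound of a small two-layer ReLU network
# fitted to random labels, compared with sqrt(n/k).
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 100, 25, 10
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

W = 0.5 * rng.normal(size=(k, d))
b = np.zeros(k)
a = 0.5 * rng.normal(size=k)
lr = 0.02

for _ in range(40_000):                          # full-batch gradient descent
    pre = X @ W.T + b
    h = np.maximum(pre, 0.0)
    g = 2.0 * (h @ a - y) / n                    # d loss / d prediction
    grad_a = h.T @ g
    dpre = np.outer(g, a) * (pre > 0)
    W -= lr * (dpre.T @ X)
    b -= lr * dpre.sum(axis=0)
    a -= lr * grad_a

pred = np.maximum(X @ W.T + b, 0.0) @ a
mse = np.mean((pred - y) ** 2)

# Lower-bound the Lipschitz constant by the largest pairwise slope.
diff_f = np.abs(pred[:, None] - pred[None, :])
diff_x = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) + np.eye(n)
lip_lower = (diff_f / diff_x).max()

print(f"training MSE {mse:.3f}; empirical Lipschitz lower bound {lip_lower:.2f} "
      f"vs sqrt(n/k) = {np.sqrt(n / k):.2f}")
```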
arXiv Detail & Related papers (2021-02-16T11:28:59Z)