Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural
Network Representations Vary with Width and Depth
- URL: http://arxiv.org/abs/2010.15327v2
- Date: Sat, 10 Apr 2021 01:44:17 GMT
- Title: Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural
Network Representations Vary with Width and Depth
- Authors: Thao Nguyen, Maithra Raghu, Simon Kornblith
- Abstract summary: We investigate how varying depth and width affects model hidden representations.
We find a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models.
This discovery has important ramifications for features learned by different models.
- Score: 32.757486048358416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key factor in the success of deep neural networks is the ability to scale
models to improve performance by varying the architecture depth and width. This
simple property of neural network design has resulted in highly effective
architectures for a variety of tasks. Nevertheless, there is limited
understanding of the effects of depth and width on the learned representations. In
this paper, we study this fundamental question. We begin by investigating how
varying depth and width affects model hidden representations, finding a
characteristic block structure in the hidden representations of larger capacity
(wider or deeper) models. We demonstrate that this block structure arises when
model capacity is large relative to the size of the training set, and is
indicative of the underlying layers preserving and propagating the dominant
principal component of their representations. This discovery has important
ramifications for features learned by different models, namely, representations
outside the block structure are often similar across architectures with varying
widths and depths, but the block structure is unique to each model. We analyze
the output predictions of different model architectures, finding that even when
the overall accuracy is similar, wide and deep models exhibit distinctive error
patterns and variations across classes.
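The layer-by-layer comparisons behind the abstract's "block structure" finding are built on centered kernel alignment (CKA) between hidden representations. Below is a minimal NumPy sketch, not the authors' released code: `linear_cka` follows the standard linear-CKA formula for two activation matrices over the same examples, and `top_pc_variance_fraction` is an illustrative check of whether a layer's representation is dominated by its first principal component, which the abstract identifies as the signature of the block structure. The variable `acts` in the usage comment is hypothetical.

```python
# Minimal sketch of linear CKA and a top-principal-component check,
# assuming x and y hold activations (n_examples x n_features) for the
# same minibatch of inputs. Illustrative only, not the paper's code.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices with matched examples:
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering columns."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def top_pc_variance_fraction(x: np.ndarray) -> float:
    """Fraction of variance explained by the first principal component.
    Values near 1 across a run of consecutive layers indicate those layers
    are preserving and propagating a single dominant component."""
    x = x - x.mean(axis=0, keepdims=True)
    s = np.linalg.svd(x, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Usage sketch: `acts` is a hypothetical list of per-layer activation
# matrices for one minibatch. A bright square block in the resulting
# similarity matrix marks layers with nearly identical representations.
# cka_matrix = np.array([[linear_cka(a, b) for b in acts] for a in acts])
```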
Related papers
- Step by Step Network [56.413861208019576]
Scaling up network depth is a fundamental pursuit in neural architecture design. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. We propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance.
arXiv Detail & Related papers (2025-11-18T10:35:49Z)
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- When Representations Align: Universality in Representation Learning Dynamics [8.188549368578704]
We derive an effective theory of representation learning under the assumption that the encoding map from input to hidden representation and the decoding map from representation to output are arbitrary smooth functions.
We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures.
arXiv Detail & Related papers (2024-02-14T12:48:17Z)
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- Contrasting random and learned features in deep Bayesian linear regression [12.234742322758418]
We study how the ability to learn affects the generalization performance of a simple class of models.
By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch.
arXiv Detail & Related papers (2022-03-01T15:51:29Z)
- Redefining Neural Architecture Search of Heterogeneous Multi-Network Models by Characterizing Variation Operators and Model Components [71.03032589756434]
We investigate the effect of different variation operators in a complex domain, that of multi-network heterogeneous neural models.
We characterize both the variation operators, according to their effect on the complexity and performance of the model, and the models themselves, relying on diverse metrics that estimate the quality of their constituent parts.
arXiv Detail & Related papers (2021-06-16T17:12:26Z)
- Polynomial Networks in Deep Classifiers [55.90321402256631]
We cast the study of deep neural networks under a unifying framework.
Our framework provides insights on the inductive biases of each model.
The efficacy of the proposed models is evaluated on standard image and audio classification benchmarks.
arXiv Detail & Related papers (2021-04-16T06:41:20Z)
- Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z)
- Data-driven effective model shows a liquid-like deep learning [2.0711789781518752]
It remains unknown what the landscape looks like for deep networks of binary synapses.
We propose a statistical mechanics framework by directly building a least structured model of the high-dimensional weight space.
Our data-driven model thus provides statistical-mechanics insight into why deep learning is unreasonably effective, viewed through the high-dimensional weight space.
arXiv Detail & Related papers (2020-07-16T04:02:48Z)
- The Heterogeneity Hypothesis: Finding Layer-Wise Differentiated Network Architectures [179.66117325866585]
We investigate a design space that is usually overlooked, i.e. adjusting the channel configurations of predefined networks.
We find that this adjustment can be achieved by shrinking widened baseline networks and leads to superior performance.
Experiments are conducted on various networks and datasets for image classification, visual tracking and image restoration.
arXiv Detail & Related papers (2020-06-29T17:59:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.