Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural
Network Representations Vary with Width and Depth
- URL: http://arxiv.org/abs/2010.15327v2
- Date: Sat, 10 Apr 2021 01:44:17 GMT
- Title: Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural
Network Representations Vary with Width and Depth
- Authors: Thao Nguyen, Maithra Raghu, Simon Kornblith
- Abstract summary: We investigate how varying depth and width affects model hidden representations.
We find a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models.
This discovery has important ramifications for features learned by different models.
- Score: 32.757486048358416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key factor in the success of deep neural networks is the ability to scale
models to improve performance by varying the architecture depth and width. This
simple property of neural network design has resulted in highly effective
architectures for a variety of tasks. Nevertheless, there is limited
understanding of the effects of depth and width on the learned representations. In
this paper, we study this fundamental question. We begin by investigating how
varying depth and width affects model hidden representations, finding a
characteristic block structure in the hidden representations of larger capacity
(wider or deeper) models. We demonstrate that this block structure arises when
model capacity is large relative to the size of the training set, and is
indicative of the underlying layers preserving and propagating the dominant
principal component of their representations. This discovery has important
ramifications for features learned by different models, namely, representations
outside the block structure are often similar across architectures with varying
widths and depths, but the block structure is unique to each model. We analyze
the output predictions of different model architectures, finding that even when
the overall accuracy is similar, wide and deep models exhibit distinctive error
patterns and variations across classes.
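The layer-by-layer comparisons behind the abstract's "block structure" finding are built on centered kernel alignment (CKA) between hidden representations. Below is a minimal NumPy sketch, not the authors' released code: `linear_cka` follows the standard linear-CKA formula for two activation matrices over the same examples, and `top_pc_variance_fraction` is an illustrative check of whether a layer's representation is dominated by its first principal component, which the abstract identifies as the signature of the block structure. The variable `acts` in the usage comment is hypothetical.

```python
# Minimal sketch of linear CKA and a top-principal-component check,
# assuming x and y hold activations (n_examples x n_features) for the
# same minibatch of inputs. Illustrative only, not the paper's code.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices with matched examples:
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering columns."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def top_pc_variance_fraction(x: np.ndarray) -> float:
    """Fraction of variance explained by the first principal component.
    Values near 1 across a run of consecutive layers indicate those layers
    are preserving and propagating a single dominant component."""
    x = x - x.mean(axis=0, keepdims=True)
    s = np.linalg.svd(x, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))

# Usage sketch: `acts` is a hypothetical list of per-layer activation
# matrices for one minibatch. A bright square block in the resulting
# similarity matrix marks layers with nearly identical representations.
# cka_matrix = np.array([[linear_cka(a, b) for b in acts] for a in acts])
```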
Related papers
- Step by Step Network [56.413861208019576]
Scaling up network depth is a fundamental pursuit in neural architecture design. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. We propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance.
arXiv Detail & Related papers (2025-11-18T10:35:49Z)
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer [72.54450821671624]
We study the landscape through the lens of the loss Hessian.
We find that certain spectral properties under $\mu$P are largely independent of the size of the network.
We show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- When Representations Align: Universality in Representation Learning Dynamics [8.188549368578704]
We derive an effective theory of representation learning under the assumption that the encoding map from input to hidden representation and the decoding map from representation to output are arbitrary smooth functions.
We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures.
arXiv Detail & Related papers (2024-02-14T12:48:17Z)
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales [72.27228085606147]
We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets.
Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training.
We observe, however, that ensembles of narrower networks perform worse than a single wide network.
arXiv Detail & Related papers (2023-05-28T17:09:32Z)
- Contrasting random and learned features in deep Bayesian linear regression [12.234742322758418]
We study how the ability to learn affects the generalization performance of a simple class of models.
By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch.
arXiv Detail & Related papers (2022-03-01T15:51:29Z)
- Redefining Neural Architecture Search of Heterogeneous Multi-Network Models by Characterizing Variation Operators and Model Components [71.03032589756434]
We investigate the effect of different variation operators in a complex domain, that of multi-network heterogeneous neural models.
We characterize both the variation operators, according to their effect on the complexity and performance of the model, and the models themselves, relying on diverse metrics that estimate the quality of their constituent parts.
arXiv Detail & Related papers (2021-06-16T17:12:26Z)
- Polynomial Networks in Deep Classifiers [55.90321402256631]
We cast the study of deep neural networks under a unifying framework.
Our framework provides insights on the inductive biases of each model.
The efficacy of the proposed models is evaluated on standard image and audio classification benchmarks.
arXiv Detail & Related papers (2021-04-16T06:41:20Z)
- Asymptotics of Wide Convolutional Neural Networks [18.198962344790377]
We study scaling laws for wide CNNs and networks with skip connections.
We find that the difference in performance between finite and infinite width models vanishes at a definite rate with respect to model width.
arXiv Detail & Related papers (2020-08-19T21:22:19Z)
- Data-driven effective model shows a liquid-like deep learning [2.0711789781518752]
It remains unknown what the landscape looks like for deep networks of binary synapses.
We propose a statistical mechanics framework by directly building a least structured model of the high-dimensional weight space.
Our data-driven model thus provides statistical-mechanics insight into why deep learning is unreasonably effective, viewed through the high-dimensional weight space.
arXiv Detail & Related papers (2020-07-16T04:02:48Z)
- The Heterogeneity Hypothesis: Finding Layer-Wise Differentiated Network Architectures [179.66117325866585]
We investigate a design space that is usually overlooked, i.e. adjusting the channel configurations of predefined networks.
We find that this adjustment can be achieved by shrinking widened baseline networks and leads to superior performance.
Experiments are conducted on various networks and datasets for image classification, visual tracking and image restoration.
arXiv Detail & Related papers (2020-06-29T17:59:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.