Diffused Redundancy in Pre-trained Representations
- URL: http://arxiv.org/abs/2306.00183v3
- Date: Tue, 14 Nov 2023 17:00:52 GMT
- Title: Diffused Redundancy in Pre-trained Representations
- Authors: Vedant Nanda, Till Speicher, John P. Dickerson, Soheil Feizi, Krishna
P. Gummadi, Adrian Weller
- Abstract summary: We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
- Score: 98.55546694886819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Representations learned by pre-training a neural network on a large dataset
are increasingly used successfully to perform a variety of downstream tasks. In
this work, we take a closer look at how features are encoded in such
pre-trained representations. We find that learned representations in a given
layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of
neurons in the layer that is larger than a threshold size shares a large degree
of similarity with the full layer and is able to perform similarly as the whole
layer on a variety of downstream tasks. For example, a linear probe trained on
$20\%$ of randomly picked neurons from the penultimate layer of a ResNet50
pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe
trained on the full layer of neurons for downstream CIFAR10 classification. We
conduct experiments on different neural architectures (including CNNs and
Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a
variety of downstream tasks taken from the VTAB benchmark. We find that the
loss and dataset used during pre-training largely govern the degree of diffuse
redundancy and the "critical mass" of neurons needed often depends on the
downstream task, suggesting that there is a task-inherent
redundancy-performance Pareto frontier. Our findings shed light on the nature
of representations learned by pre-trained deep neural networks and suggest that
entire layers might not be necessary to perform many downstream tasks. We
investigate the potential for exploiting this redundancy to achieve efficient
generalization on downstream tasks, and also caution against certain possible
unintended consequences. Our code is available at
\url{https://github.com/nvedant07/diffused-redundancy}.
Related papers
- Fully Spiking Actor Network with Intra-layer Connections for
Reinforcement Learning [51.386945803485084]
We focus on the task where the agent needs to learn multi-dimensional deterministic policies for control.
Most existing spike-based RL methods take the firing rate as the output of SNNs, and convert it to represent continuous action space (i.e., the deterministic policy) through a fully-connected layer.
To develop a fully spiking actor network without any floating-point matrix operations, we draw inspiration from the non-spiking interneurons found in insects.
arXiv Detail & Related papers (2024-01-09T07:31:34Z) - Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z) - Hidden Classification Layers: Enhancing linear separability between
classes in neural networks layers [0.0]
We investigate the impact of a training approach on deep network performance.
We propose a neural network architecture which induces an error function involving the outputs of all the network layers.
arXiv Detail & Related papers (2023-06-09T10:52:49Z) - ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models [9.96121040675476]
This manuscript explores how properties of functions learned by neural networks of depth greater than two layers affect predictions.
Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs.
arXiv Detail & Related papers (2023-05-24T22:10:12Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Neural Capacitance: A New Perspective of Neural Network Selection via
Edge Dynamics [85.31710759801705]
Current practice incurs expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z) - Targeted Gradient Descent: A Novel Method for Convolutional Neural
Networks Fine-tuning and Online-learning [9.011106198253053]
A convolutional neural network (ConvNet) is usually trained and then tested using images drawn from the same distribution.
To generalize a ConvNet to various tasks often requires a complete training dataset that consists of images drawn from different tasks.
We present Targeted Gradient Descent (TGD), a novel fine-tuning method that can extend a pre-trained network to a new task without revisiting data from the previous task.
arXiv Detail & Related papers (2021-09-29T21:22:09Z) - Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
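The splitting of a wide layer into groups of duplicated neurons can be illustrated with a toy example. The synthetic activations (a few underlying signals, each repeated with independent noise) and the 0.9 correlation threshold are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, g, copies = 500, 4, 3          # 4 underlying signals, 3 noisy copies each
base = rng.normal(size=(n, g))    # the shared information per group
noise = 0.1 * rng.normal(size=(n, g * copies))
acts = np.repeat(base, copies, axis=1) + noise  # wide layer with duplicated neurons

corr = np.corrcoef(acts.T)        # neuron-by-neuron correlation matrix
# Neurons whose activations correlate near 1 carry (almost) identical information.
groups = np.abs(corr) > 0.9
print(groups.sum(axis=1))         # group size seen by each neuron
```

Each neuron ends up grouped only with the other noisy copies of its own underlying signal, mirroring the paper's observation that wide layers split into redundant groups differing only by independent noise.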
arXiv Detail & Related papers (2021-06-07T10:18:54Z) - Representation Learning Beyond Linear Prediction Functions [33.94130046391917]
We show that diversity can be achieved when source tasks and the target task use different prediction function spaces beyond linear functions.
For a general function class, we find that the eluder dimension gives a lower bound on the number of tasks required for diversity.
arXiv Detail & Related papers (2021-05-31T14:21:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.