Principal Component Networks: Parameter Reduction Early in Training
- URL: http://arxiv.org/abs/2006.13347v1
- Date: Tue, 23 Jun 2020 21:40:24 GMT
- Title: Principal Component Networks: Parameter Reduction Early in Training
- Authors: Roger Waleffe and Theodoros Rekatsinas
- Abstract summary: We show how to find small networks that exhibit the same performance as their over parameterized counterparts.
We use PCA to find a basis of high variance for layer inputs and represent layer weights using these directions.
We also show that ResNet-20 PCNs outperform deep ResNet-110 networks while training faster.
- Score: 10.14522349959932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works show that overparameterized networks contain small subnetworks
that exhibit comparable accuracy to the full model when trained in isolation.
These results highlight the potential to reduce training costs of deep neural
networks without sacrificing generalization performance. However, existing
approaches for finding these small networks rely on expensive multi-round
train-and-prune procedures and are non-practical for large data sets and
models. In this paper, we show how to find small networks that exhibit the same
performance as their overparameterized counterparts after only a few training
epochs. We find that hidden layer activations in overparameterized networks
exist primarily in subspaces smaller than the actual model width. Building on
this observation, we use PCA to find a basis of high variance for layer inputs
and represent layer weights using these directions. We eliminate all weights
not relevant to the found PCA basis and term these network architectures
Principal Component Networks. On CIFAR-10 and ImageNet, we show that PCNs train
faster and use less energy than overparameterized models, without accuracy
loss. We find that our transformation leads to networks with up to 23.8x fewer
parameters, with equal or higher end-model accuracy---in some cases we observe
improvements up to 3%. We also show that ResNet-20 PCNs outperform deep
ResNet-110 networks while training faster.
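The abstract only describes the transformation at a high level; below is a minimal NumPy illustration of the general idea for a single fully-connected layer. The function name, the variance-retention threshold, and the use of plain SVD-based PCA are assumptions for illustration, not the authors' exact procedure or code.

```python
# Illustrative sketch (not the authors' implementation): find a high-variance PCA
# basis for a layer's inputs and re-express the layer's weights in that basis.
import numpy as np

def pca_compress_dense_layer(X, W, b, var_kept=0.99):
    """X: (n_samples, d_in) activations feeding the layer, collected after a few
    training epochs.  W: (d_in, d_out) weights, b: (d_out,) bias.
    Returns (mu, P, W_pc, b_pc) such that
        ((x - mu) @ P) @ W_pc + b_pc  ~=  x @ W + b,
    with P of shape (d_in, k) and k << d_in when inputs lie in a small subspace."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # PCA via SVD of the centered activations; rows of Vt are principal directions.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(var_ratio), var_kept)) + 1
    k = min(k, Vt.shape[0])      # guard against floating-point round-off
    P = Vt[:k].T                 # (d_in, k) basis of high-variance input directions
    W_pc = P.T @ W               # weights represented in the PCA basis: (k, d_out)
    b_pc = b + mu @ W            # fold the mean shift into the bias
    return mu, P, W_pc, b_pc
```

In the paper the transformation is applied after only a few epochs of full-model training and training then continues in the compressed parameterization; the convolutional case and the exact rule for choosing the retained width follow the paper rather than this sketch.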
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions reachable via our training procedure, including the gradient-based optimizer and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- LilNetX: Lightweight Networks with EXtreme Model Compression and Structured Sparsification [36.651329027209634]
LilNetX is an end-to-end trainable technique for neural networks.
It enables learning models with a specified accuracy-rate-computation trade-off.
arXiv Detail & Related papers (2022-04-06T17:59:10Z)
- An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network [0.0]
In recent years, deep neural networks have seen wide success in various application domains.
Deep neural networks usually involve a large number of parameters, which correspond to the weights of the network.
Pruning methods attempt to reduce the size of this parameter set by identifying and removing irrelevant weights (a generic pruning sketch follows this list).
arXiv Detail & Related papers (2021-12-15T16:02:15Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Channel Planting for Deep Neural Networks using Knowledge Distillation [3.0165431987188245]
We present a novel incremental training algorithm for deep neural networks called planting.
Our planting approach can search for an optimal network architecture with a smaller number of parameters that improves network performance.
We evaluate the effectiveness of the proposed method on different datasets such as CIFAR-10/100 and STL-10.
arXiv Detail & Related papers (2020-11-04T16:29:59Z)
- HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty, the Hierarchical Adaptive Lasso (HALO), which learns to adaptively sparsify the weights of a given network via trainable parameters.
arXiv Detail & Related papers (2020-08-24T04:08:48Z)
- Go Wide, Then Narrow: Efficient Training of Deep Thin Networks [62.26044348366186]
We propose an efficient method to train a deep thin network with a theoretical guarantee.
By training with our method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with BERT Large.
arXiv Detail & Related papers (2020-07-01T23:34:35Z)
- Adjoined Networks: A Training Paradigm with Applications to Network Compression [3.995047443480282]
We introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together.
Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset.
We propose Differentiable Adjoined Networks (DAN), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights for each layer of the smaller network.
arXiv Detail & Related papers (2020-06-10T02:48:16Z)
- Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neurobiologically plausible alternative to backpropagation that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
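Several of the entries above (the pre-training-and-pruning study and HALO in particular) revolve around removing or sparsifying weights. For reference, here is a minimal sketch of generic one-shot magnitude pruning; the sparsity level and per-layer scheme are illustrative assumptions, not any listed paper's exact method.

```python
# Generic one-shot magnitude pruning (illustrative): zero out the smallest-magnitude
# weights in each layer and return a mask that can be reused during fine-tuning.
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """weights: dict mapping layer name -> weight array.
    Returns pruned weights and boolean masks (True = kept) per layer."""
    pruned, masks = {}, {}
    for name, W in weights.items():
        flat = np.abs(W).ravel()
        n_drop = int(sparsity * flat.size)                       # weights to remove
        thresh = np.partition(flat, n_drop)[n_drop] if n_drop > 0 else -np.inf
        mask = np.abs(W) >= thresh                               # keep large-magnitude weights
        pruned[name] = W * mask
        masks[name] = mask
    return pruned, masks
```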