What's Hidden in a One-layer Randomly Weighted Transformer?
- URL: http://arxiv.org/abs/2109.03939v1
- Date: Wed, 8 Sep 2021 21:22:52 GMT
- Title: What's Hidden in a One-layer Randomly Weighted Transformer?
- Authors: Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer and Michael W.
Mahoney
- Abstract summary: Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
- Score: 100.98342094831334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We demonstrate that, hidden within one-layer randomly weighted neural
networks, there exist subnetworks that can achieve impressive performance,
without ever modifying the weight initializations, on machine translation
tasks. To find subnetworks for one-layer randomly weighted neural networks, we
apply different binary masks to the same weight matrix to generate different
layers. Hidden within a one-layer randomly weighted Transformer, we find
subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed
pre-trained embedding layer, the previously found subnetworks are smaller than,
but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained
Transformer small/base on IWSLT14/WMT14. Furthermore, we demonstrate the
effectiveness of larger and deeper transformers in this setting, as well as the
impact of different initialization methods. We released the source code at
https://github.com/sIncerass/one_layer_lottery_ticket.
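The masking mechanism described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' released implementation: it assumes an edge-popup-style search in which each logical layer holds its own learnable score matrix over one shared, frozen random weight matrix, and a straight-through top-k mask selects the active weights. The names TopKMask, MaskedSharedLinear, and keep_ratio are illustrative, and the actual model would apply such masks inside a Transformer's attention and feed-forward projections rather than a plain linear stack.
```python
# Minimal sketch (assumptions noted above, not the paper's exact code):
# one frozen random weight matrix is reused by every "layer"; each layer
# learns only a score tensor, and a top-k straight-through mask over the
# scores decides which shared weights that layer actually uses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Binarize scores by keeping the top-k fraction; pass gradients straight through."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = int(keep_ratio * scores.numel())
        mask = torch.zeros_like(scores)
        _, idx = scores.flatten().topk(k)
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: the gradient flows to the scores unchanged.
        return grad_output, None


class MaskedSharedLinear(nn.Module):
    """One logical layer: frozen shared weights plus its own learnable mask scores."""

    def __init__(self, shared_weight, keep_ratio=0.5):
        super().__init__()
        self.weight = shared_weight                      # shared, never updated
        self.scores = nn.Parameter(torch.empty_like(shared_weight))
        nn.init.kaiming_uniform_(self.scores, a=5 ** 0.5)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.keep_ratio)
        return F.linear(x, self.weight * mask)


# A single random weight matrix generates every "layer"; only masks are trained.
d_model = 512
shared = nn.Parameter(torch.empty(d_model, d_model), requires_grad=False)
nn.init.xavier_uniform_(shared)

layers = nn.ModuleList(MaskedSharedLinear(shared) for _ in range(6))
x = torch.randn(8, d_model)
for layer in layers:
    x = torch.relu(layer(x))   # different masks give each layer different behavior
```
Training only the score tensors while keeping the shared weights frozen mirrors the paper's constraint of never modifying the weight initializations; the choice of keep_ratio and score initialization here is arbitrary.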
Related papers
- Neural Metamorphosis [72.88137795439407]
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks.
NeuMeta directly learns the continuous weight manifold of neural networks.
It sustains full-size performance even at a 75% compression rate.
arXiv Detail & Related papers (2024-10-10T14:49:58Z) - Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks [31.962161747846114]
Foldable SuperNet Merge (FS-Merge) is simple, data-efficient, and capable of merging models of varying widths.
FS-Merge consistently outperforms existing methods, achieving SOTA results, particularly in limited data scenarios.
arXiv Detail & Related papers (2024-10-02T12:34:32Z) - Transformers are Multi-State RNNs [25.99353771107789]
We show that decoder-only transformers can be conceptualized as unbounded multi-state RNNs.
Transformers can be converted into bounded multi-state RNNs by fixing the size of their hidden state.
We introduce a novel, training-free compression policy, Token Omission Via Attention (TOVA).
arXiv Detail & Related papers (2024-01-11T18:35:26Z) - Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
Vision Transformer (ViT) can learn global dependencies better than a CNN, yet a CNN's inherent locality can serve as a substitute for expensive training resources.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM) in which each mask has only two learnable parameters; it can be conveniently used in any ViT variant whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z) - Spike-driven Transformer [31.931401322707995]
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm.
In this paper, we incorporate the spike-driven paradigm into the Transformer via the proposed Spike-driven Transformer, which has four unique properties.
It is shown that the Spike-driven Transformer can achieve 77.1% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field.
arXiv Detail & Related papers (2023-07-04T13:00:18Z) - Random Weights Networks Work as Loss Prior Constraint for Image
Restoration [50.80507007507757]
We present our belief that "Random Weights Networks can act as a Loss Prior Constraint for Image Restoration".
This prior can be directly inserted into existing networks without any additional training or testing computational cost.
To emphasize, our main focus is to spark research in the realm of loss functions and rescue them from their currently neglected status.
arXiv Detail & Related papers (2023-03-29T03:43:51Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective.
Customized algorithms (e.g., GreenMIM) should be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE designed for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - Parameter-Efficient Masking Networks [61.43995077575439]
Advanced network designs often contain a large number of repetitive structures (e.g., the Transformer).
In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning masks.
This leads to a new model-compression paradigm for reducing model size.
arXiv Detail & Related papers (2022-10-13T03:39:03Z) - Incremental Task Learning with Incremental Rank Updates [20.725181015069435]
We propose a new incremental task learning framework based on low-rank factorization.
We show that our approach performs better than the current state-of-the-art methods in terms of accuracy and forgetting.
arXiv Detail & Related papers (2022-07-19T05:21:14Z) - Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.