What's Hidden in a One-layer Randomly Weighted Transformer?
- URL: http://arxiv.org/abs/2109.03939v1
- Date: Wed, 8 Sep 2021 21:22:52 GMT
- Title: What's Hidden in a One-layer Randomly Weighted Transformer?
- Authors: Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer and Michael W.
Mahoney
- Abstract summary: Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
- Score: 100.98342094831334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We demonstrate that, hidden within one-layer randomly weighted neural
networks, there exist subnetworks that can achieve impressive performance,
without ever modifying the weight initializations, on machine translation
tasks. To find subnetworks for one-layer randomly weighted neural networks, we
apply different binary masks to the same weight matrix to generate different
layers. Hidden within a one-layer randomly weighted Transformer, we find
subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed
pre-trained embedding layer, the previously found subnetworks are smaller than,
but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained
Transformer small/base on IWSLT14/WMT14. Furthermore, we demonstrate the
effectiveness of larger and deeper transformers in this setting, as well as the
impact of different initialization methods. We released the source code at
https://github.com/sIncerass/one_layer_lottery_ticket.
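The masking mechanism described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' released implementation: it assumes an edge-popup-style search in which each logical layer holds its own learnable score matrix over one shared, frozen random weight matrix, and a straight-through top-k mask selects the active weights. The names TopKMask, MaskedSharedLinear, and keep_ratio are illustrative, and the actual model would apply such masks inside a Transformer's attention and feed-forward projections rather than a plain linear stack.
```python
# Minimal sketch (assumptions noted above, not the paper's exact code):
# one frozen random weight matrix is reused by every "layer"; each layer
# learns only a score tensor, and a top-k straight-through mask over the
# scores decides which shared weights that layer actually uses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """Binarize scores by keeping the top-k fraction; pass gradients straight through."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = int(keep_ratio * scores.numel())
        mask = torch.zeros_like(scores)
        _, idx = scores.flatten().topk(k)
        mask.view(-1)[idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: the gradient flows to the scores unchanged.
        return grad_output, None


class MaskedSharedLinear(nn.Module):
    """One logical layer: frozen shared weights plus its own learnable mask scores."""

    def __init__(self, shared_weight, keep_ratio=0.5):
        super().__init__()
        self.weight = shared_weight                      # shared, never updated
        self.scores = nn.Parameter(torch.empty_like(shared_weight))
        nn.init.kaiming_uniform_(self.scores, a=5 ** 0.5)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.keep_ratio)
        return F.linear(x, self.weight * mask)


# A single random weight matrix generates every "layer"; only masks are trained.
d_model = 512
shared = nn.Parameter(torch.empty(d_model, d_model), requires_grad=False)
nn.init.xavier_uniform_(shared)

layers = nn.ModuleList(MaskedSharedLinear(shared) for _ in range(6))
x = torch.randn(8, d_model)
for layer in layers:
    x = torch.relu(layer(x))   # different masks give each layer different behavior
```
Training only the score tensors while keeping the shared weights frozen mirrors the paper's constraint of never modifying the weight initializations; the choice of keep_ratio and score initialization here is arbitrary.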
Related papers
- Neural Metamorphosis [72.88137795439407]
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks.
NeuMeta directly learns the continuous weight manifold of neural networks.
It sustains full-size performance even at a 75% compression rate.
arXiv Detail & Related papers (2024-10-10T14:49:58Z) - Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks [31.962161747846114]
Foldable SuperNet Merge (FS-Merge) is simple, data-efficient, and capable of merging models of varying widths.
FS-Merge consistently outperforms existing methods, achieving SOTA results, particularly in limited data scenarios.
arXiv Detail & Related papers (2024-10-02T12:34:32Z) - Transformers are Multi-State RNNs [25.99353771107789]
We show that decoder-only transformers can be conceptualized as unbounded multi-state RNNs.
Transformers can be converted into bounded multi-state RNNs by fixing the size of their hidden state.
We introduce a novel, training-free compression policy, Token Omission Via Attention (TOVA).
arXiv Detail & Related papers (2024-01-11T18:35:26Z) - Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
Vision Transformer (ViT) can learn global dependencies better than a CNN, yet a CNN's inherent locality can serve as a substitute for expensive training resources.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM) in which each mask has only two learnable parameters; it can be conveniently used in any ViT variant whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z) - Spike-driven Transformer [31.931401322707995]
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm.
In this paper, we incorporate the spike-driven paradigm into the Transformer via the proposed Spike-driven Transformer, which has four unique properties.
It is shown that the Spike-driven Transformer can achieve 77.1% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field.
arXiv Detail & Related papers (2023-07-04T13:00:18Z) - Random Weights Networks Work as Loss Prior Constraint for Image
Restoration [50.80507007507757]
We present our belief that "Random Weights Networks can act as a Loss Prior Constraint for Image Restoration".
This prior can be directly inserted into existing networks without any additional training or testing computational cost.
To emphasize, our main focus is to spark research in the realm of loss functions and rescue them from their currently neglected status.
arXiv Detail & Related papers (2023-03-29T03:43:51Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective.
Customized algorithms (e.g., GreenMIM) should be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE designed for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - Parameter-Efficient Masking Networks [61.43995077575439]
Advanced network designs often contain a large number of repetitive structures (e.g., the Transformer).
In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning masks.
This leads to a new model-compression paradigm for reducing model size.
arXiv Detail & Related papers (2022-10-13T03:39:03Z) - Incremental Task Learning with Incremental Rank Updates [20.725181015069435]
We propose a new incremental task learning framework based on low-rank factorization.
We show that our approach performs better than the current state-of-the-art methods in terms of accuracy and forgetting.
arXiv Detail & Related papers (2022-07-19T05:21:14Z) - Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.