Dissecting Lottery Ticket Transformers: Structural and Behavioral Study
of Sparse Neural Machine Translation
- URL: http://arxiv.org/abs/2009.13270v2
- Date: Mon, 12 Oct 2020 18:55:22 GMT
- Title: Dissecting Lottery Ticket Transformers: Structural and Behavioral Study
of Sparse Neural Machine Translation
- Authors: Rajiv Movva, Jason Y. Zhao
- Abstract summary: Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU.
By probing Transformers with more and more low-magnitude weights pruned away, we find that complex semantic information is first to be degraded.
Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent work on the lottery ticket hypothesis has produced highly sparse
Transformers for NMT while maintaining BLEU. However, it is unclear how such
pruning techniques affect a model's learned representations. By probing
Transformers with more and more low-magnitude weights pruned away, we find that
complex semantic information is first to be degraded. Analysis of internal
activations reveals that higher layers diverge most over the course of pruning,
gradually becoming less complex than their dense counterparts. Meanwhile, early
layers of sparse models begin to perform more encoding. Attention mechanisms
remain remarkably consistent as sparsity increases.
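As a rough illustration of the setup (not the authors' exact pipeline), the sketch below applies one-shot global magnitude pruning to a small PyTorch Transformer; the 60% sparsity level and the model size are arbitrary example values, and the paper's models are lottery-ticket NMT Transformers pruned to high sparsity over the course of training rather than in a single cut.

```python
# Minimal sketch of global magnitude pruning on a Transformer (PyTorch).
# Illustrative only: sparsity level and model size are arbitrary.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=3, num_decoder_layers=3)

sparsity = 0.6  # fraction of weights to zero out (arbitrary example value)

# Global threshold over all weight matrices (biases and norms left untouched).
all_weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
threshold = torch.quantile(all_weights, sparsity)

# Apply binary masks that remove low-magnitude weights.
with torch.no_grad():
    for p in model.parameters():
        if p.dim() > 1:
            p.mul_((p.abs() > threshold).float())

# The paper then probes such sparse models (e.g., on semantic tasks) and compares
# their activations and attention maps to the dense baseline as sparsity grows.
```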
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the regression loss despite the non-convexity of the optimization landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Transformer Layers as Painters [16.43731831488477]
We show that the lower and final layers of pretrained transformers differ from the middle layers, but that the middle layers show a surprising degree of uniformity.
We also show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
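A small, self-contained sketch of the mechanics being described (skipping or reordering encoder layers at inference time); it uses a randomly initialized PyTorch encoder rather than a large pretrained model, so it only shows the plumbing, not the accuracy/latency trade-off.

```python
# Sketch: running a frozen Transformer encoder with some layers skipped or reordered.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=8).eval()
x = torch.randn(2, 16, 128)  # (batch, tokens, d_model)

def run_layers(encoder, x, layer_ids):
    """Apply only the selected encoder layers, in the given order."""
    h = x
    for i in layer_ids:
        h = encoder.layers[i](h)
    return h

with torch.no_grad():
    full      = run_layers(encoder, x, range(8))                  # normal depth
    skipped   = run_layers(encoder, x, [0, 2, 4, 6, 7])           # skip some middle layers
    reordered = run_layers(encoder, x, [0, 4, 3, 2, 1, 5, 6, 7])  # permute middle layers
```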
arXiv Detail & Related papers (2024-07-12T14:31:05Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show why efficient training and inference algorithms based on low-rank computation achieve impressive performance for Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation efficiency.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
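For intuition about the two structures mentioned above, here is a hedged sketch of the operations themselves (truncated-SVD low-rank approximation and magnitude-based sparsification of a single weight matrix); the matrix is random and the rank and sparsity values are arbitrary, so it illustrates the computation rather than the paper's analysis.

```python
# Sketch: low-rank approximation and magnitude-based sparsification of one matrix.
import torch

W = torch.randn(512, 512)                     # stand-in for a learned weight (update)

# Rank-r approximation via truncated SVD.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 32                                        # arbitrary target rank
W_lowrank = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
rel_err = torch.linalg.norm(W - W_lowrank) / torch.linalg.norm(W)
print(f"rank-{r} relative error: {rel_err:.3f}")

# Magnitude-based sparsification: keep only the largest-magnitude 10% of entries.
k = int(0.9 * W.numel())
threshold = W.abs().flatten().kthvalue(k).values
W_sparse = W * (W.abs() > threshold)
print(f"kept fraction: {(W_sparse != 0).float().mean():.2f}")
```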
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Why "classic" Transformers are shallow and how to make them go deep [4.520356456308492]
A key innovation in the Transformer is the self-attention (SA) mechanism, designed to capture contextual information.
However, extending the original Transformer design to models of greater depth has proven exceedingly challenging.
We propose a new strategy of surgically removing excessive similarity, in contrast to existing approaches that diminish the SA mechanism explicitly or implicitly.
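The "excessive similarity" referred to here is commonly measured as token representations collapsing toward one another with depth; a minimal sketch of that measurement (average pairwise cosine similarity per layer, on a randomly initialized encoder) follows. The paper's removal strategy itself is not implemented.

```python
# Sketch: tracking how similar token representations become as depth increases.
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12).eval()
x = torch.randn(1, 32, 64)  # one sequence of 32 token vectors

def mean_pairwise_cosine(h):
    """Average cosine similarity over all pairs of distinct tokens."""
    t = F.normalize(h.squeeze(0), dim=-1)            # (tokens, d_model)
    sim = t @ t.T
    n = sim.size(0)
    return ((sim.sum() - n) / (n * (n - 1))).item()  # drop the diagonal of ones

h = x
with torch.no_grad():
    for i, block in enumerate(encoder.layers):
        h = block(h)
        print(f"layer {i:2d}: mean token cosine similarity = {mean_pairwise_cosine(h):.3f}")
```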
arXiv Detail & Related papers (2023-12-11T07:49:16Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
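A worked toy example of the claim, under the standard in-context linear-regression setup used in this line of work: one gradient-descent step from zero on the context examples produces exactly the prediction of an unnormalized linear-attention readout. The dimensions and step size below are arbitrary.

```python
# Worked toy example: one GD step on in-context linear regression equals an
# unnormalized linear-attention readout over the context examples.
import torch

torch.manual_seed(0)
d, n = 8, 64
w_true = torch.randn(d)

X = torch.randn(n, d)            # in-context inputs
y = X @ w_true                   # in-context targets
x_q = torch.randn(d)             # query input
lr = 0.1                         # arbitrary step size

# One gradient step from w = 0 on (1/2n) * sum_i (w @ x_i - y_i)^2.
w_gd = (lr / n) * (y @ X)        # the gradient at w = 0 is -(1/n) * X^T y
pred_gd = w_gd @ x_q

# The same prediction as linear attention: the query attends to the context
# inputs (keys) with the targets as values, using unnormalized dot products.
scores = X @ x_q                 # (n,) attention scores
pred_attn = (lr / n) * (scores @ y)

print(pred_gd.item(), pred_attn.item())  # identical up to floating point
```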
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
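For context, the sketch below computes a plain gradient-times-input attribution on a toy Transformer classifier, i.e. the purely local, gradient-based explanation the paper argues is unreliable; the paper's conservative LRP-style propagation rules are not reproduced here.

```python
# Sketch: gradient-times-input attribution on a toy Transformer classifier.
# This is the local, gradient-based baseline, not the paper's conservative LRP.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(32, 3)                         # 3-way toy classifier

x = torch.randn(1, 10, 32, requires_grad=True)  # embeddings for 10 tokens

logits = head(encoder(x).mean(dim=1))           # mean-pool tokens, then classify
predicted = logits[0].argmax()
logits[0, predicted].backward()

# Per-token relevance of the predicted class.
relevance = (x.grad * x).sum(dim=-1).squeeze(0)
print(relevance)
# These scores need not sum to the explained output (they are not "conservative"),
# which is one motivation for the propagation rules proposed in the paper.
```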
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)