Dissecting Lottery Ticket Transformers: Structural and Behavioral Study
of Sparse Neural Machine Translation
- URL: http://arxiv.org/abs/2009.13270v2
- Date: Mon, 12 Oct 2020 18:55:22 GMT
- Title: Dissecting Lottery Ticket Transformers: Structural and Behavioral Study
of Sparse Neural Machine Translation
- Authors: Rajiv Movva, Jason Y. Zhao
- Abstract summary: Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU.
By probing Transformers with more and more low-magnitude weights pruned away, we find that complex semantic information is first to be degraded.
Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent work on the lottery ticket hypothesis has produced highly sparse
Transformers for NMT while maintaining BLEU. However, it is unclear how such
pruning techniques affect a model's learned representations. By probing
Transformers with more and more low-magnitude weights pruned away, we find that
complex semantic information is first to be degraded. Analysis of internal
activations reveals that higher layers diverge most over the course of pruning,
gradually becoming less complex than their dense counterparts. Meanwhile, early
layers of sparse models begin to perform more encoding. Attention mechanisms
remain remarkably consistent as sparsity increases.
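The pruning setup described in the abstract can be illustrated with a minimal sketch: the snippet below applies global magnitude pruning to a Transformer's linear layers, zeroing the lowest-magnitude weights at a chosen sparsity level, which is the kind of sparse model the paper probes. This is an assumption-laden illustration using `torch.nn.utils.prune` and a placeholder `nn.Transformer` model, not the paper's exact pipeline (the lottery-ticket procedure also involves iterative pruning with weight rewinding and retraining, which are omitted here).
```python
# Minimal sketch (not the paper's exact pipeline): global magnitude pruning
# of a Transformer's linear layers using torch.nn.utils.prune.
# The model and the 60% sparsity level are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Transformer(d_model=512, nhead=8)  # placeholder NMT-style model

# Collect (module, parameter_name) pairs for every linear weight matrix.
params_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, nn.Linear)
]

# Zero the 60% lowest-magnitude weights across all collected matrices.
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.6,
)

# Report the resulting global sparsity over the pruned matrices.
zeros = sum(int((m.weight == 0).sum()) for m, _ in params_to_prune)
total = sum(m.weight.numel() for m, _ in params_to_prune)
print(f"global sparsity: {zeros / total:.2%}")
```
The probing analyses described above would then be run on this sparse model's internal activations at several sparsity levels and compared against the dense baseline.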
Related papers
- Probing the Embedding Space of Transformers via Minimal Token Perturbations [40.292373831893705]
We study the effects of minimal token perturbations on the embedding space. We also study how perturbations propagate across layers, demonstrating that input information is increasingly intermixed in deeper layers. This work introduces the combination of token perturbations and shifts on the embedding space as a powerful tool for model interpretability.
arXiv Detail & Related papers (2025-06-22T12:22:56Z)
- What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers [9.575216516290237]
This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. We validate that these identified phenomena (repetition bias and representation collapse) are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.
arXiv Detail & Related papers (2025-06-16T16:51:18Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Transformer Layers as Painters [16.43731831488477]
We show that lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity.
We also show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
arXiv Detail & Related papers (2024-07-12T14:31:05Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Why "classic" Transformers are shallow and how to make them go deep [4.520356456308492]
A key innovation in the Transformer is the self-attention (SA) mechanism designed to capture contextual information.
Extending the original Transformer design to models of greater depth has proven exceedingly challenging.
We propose a new strategy of surgically removing excessive similarity, in contrast to the existing approaches that diminish the SA mechanism explicitly or implicitly.
arXiv Detail & Related papers (2023-12-11T07:49:16Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)