Transformer Layers as Painters
- URL: http://arxiv.org/abs/2407.09298v2
- Date: Mon, 5 Aug 2024 15:10:25 GMT
- Title: Transformer Layers as Painters
- Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones
- Abstract summary: We show that the lower and final layers of pretrained transformers differ from the middle layers, but that the middle layers have a surprising amount of uniformity.
We also show that some classes of problems are robust to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
- Score: 16.43731831488477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could both yield better usage of existing models as well as to make architectural improvements to produce new variants. We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
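The layer-manipulation settings described in the abstract are straightforward to prototype. Below is a minimal PyTorch sketch, assuming a toy stack of pre-norm encoder layers as a stand-in for a frozen pretrained model's blocks; the function name run_layers, the mode names, and the choice of which layers count as "middle" are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch of the layer-execution variants studied in the paper: skipping
# the middle layers, running them in a shuffled order, or running them "in
# parallel" (each middle layer sees the same input and the outputs are averaged).
# All names here are illustrative, not from the paper's codebase.
import torch
import torch.nn as nn

def run_layers(layers: nn.ModuleList, h: torch.Tensor, mode: str = "normal",
               middle: range = None) -> torch.Tensor:
    """Run a frozen layer stack with the middle layers skipped, shuffled, or parallelized."""
    middle = middle if middle is not None else range(1, len(layers) - 1)
    for i in range(len(layers)):
        if i not in middle:
            h = layers[i](h)                 # first/last layers always run in order
        elif i == middle.start:              # handle the whole middle block once
            if mode == "skip":
                pass                         # drop the middle layers entirely
            elif mode == "shuffle":          # run the middle layers in a random order
                for j in torch.randperm(len(middle)).tolist():
                    h = layers[middle.start + j](h)
            elif mode == "parallel":         # every middle layer sees the same input,
                h = torch.stack([layers[k](h) for k in middle]).mean(dim=0)  # outputs averaged
            else:                            # "normal": the usual sequential pass
                for k in middle:
                    h = layers[k](h)
    return h

# Toy frozen stack: 8 pre-norm encoder layers standing in for a pretrained model's blocks.
layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True, norm_first=True)
    for _ in range(8)
]).eval()

with torch.no_grad():
    x = torch.randn(2, 16, 64)               # (batch, sequence, hidden)
    for mode in ("normal", "skip", "shuffle", "parallel"):
        print(mode, run_layers(layers, x, mode=mode).shape)
```

The "parallel" branch mirrors the accuracy-for-latency trade suggested in the abstract: the middle layers no longer depend on one another, so they could in principle be evaluated concurrently, at the cost of whatever accuracy the sequential ordering was providing.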
Related papers
- How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [19.64743851296488]
In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning.
We experimentally discover that the utilization of multi-heads exhibits different patterns across layers.
We demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms.
arXiv Detail & Related papers (2024-08-08T15:33:02Z) - Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers [2.1572258716881905]
We explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns.
In particular, we demonstrate that the first and last layer of the network have distinctive and in many ways inverted relationships to sparsity.
We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being driven by the dynamics of training.
arXiv Detail & Related papers (2024-07-10T17:10:10Z) - LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order [10.362659730151591]
We show that vision transformers can adapt to arbitrary layer execution orders at test time.
We also find that our trained models can be randomly merged with each other resulting in functional "Frankenstein" models.
arXiv Detail & Related papers (2024-07-05T13:54:15Z) - What Matters in Transformers? Not All Attention is Needed [7.857824255138334]
Transformer-based large language models (LLMs) have demonstrated promising performance across various tasks.
Their architectures, however, contain redundancy, which poses efficiency challenges for real-world deployment.
We investigate redundancy across different modules within Transformers, including blocks and attention layers, using a similarity-based metric.
arXiv Detail & Related papers (2024-06-22T08:41:48Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
arXiv Detail & Related papers (2024-04-14T18:57:38Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers [95.57829796484472]
We propose a novel zero-shot multi-frame image restoration method for removing unwanted obstruction elements.
It has three stages: transformer pre-training, zero-shot restoration, and hard patch refinement.
For zero-shot image restoration, we design a novel model, termed SiamTrans, which is constructed by Siamese transformers, encoders, and decoders.
arXiv Detail & Related papers (2021-12-17T10:42:39Z) - LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found to be beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z) - IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z) - Reservoir Transformer [89.28052130103345]
Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers.
We show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
arXiv Detail & Related papers (2020-12-30T05:20:16Z)
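As a concrete illustration of the reservoir idea in the last entry above, here is a small, hedged PyTorch sketch: frozen, randomly initialized feed-forward layers interspersed between regular trainable encoder layers. The class name ReservoirFFN and the interleaving pattern are illustrative choices, not the paper's exact architecture.

```python
# Hedged sketch of interspersing frozen, randomly initialized "reservoir" layers
# with regular trainable transformer layers. Names and the interleaving pattern
# are illustrative assumptions, not taken from the Reservoir Transformer paper.
import torch
import torch.nn as nn

class ReservoirFFN(nn.Module):
    """A randomly initialized feed-forward layer whose weights are never trained."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_model))
        for p in self.ff.parameters():
            p.requires_grad_(False)          # frozen: contributes fixed non-linear features

    def forward(self, x):
        return x + self.ff(x)                # residual connection keeps the signal intact

d_model = 64
blocks = nn.ModuleList()
for i in range(6):
    blocks.append(nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True))
    if i % 2 == 1:                           # intersperse a frozen reservoir layer
        blocks.append(ReservoirFFN(d_model, 4 * d_model))

x = torch.randn(2, 16, d_model)
h = x
for block in blocks:
    h = block(h)
print(h.shape)  # torch.Size([2, 16, 64]); only the regular layers receive gradients
```

Because the reservoir layers never receive gradient updates, they add non-linear capacity at essentially no training cost, which is the trade-off behind the wall-clock improvements the entry above reports.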
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.