Transformer Layers as Painters
- URL: http://arxiv.org/abs/2407.09298v1
- Date: Fri, 12 Jul 2024 14:31:05 GMT
- Title: Transformer Layers as Painters
- Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones
- Abstract summary: We show that lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity.
We also show that some classes of problems exhibit robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel.
Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
- Score: 16.43731831488477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could both yield better usage of existing models and suggest architectural improvements that produce new variants. We present a series of empirical studies on frozen models showing that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems exhibit robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
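The execution variants the abstract describes (skipping middle layers, or running them in parallel and merging their outputs) can be illustrated with a minimal sketch. This is an assumed toy model, not the authors' code: each "layer" is a scalar residual function standing in for an attention/MLP block, and the merge strategy (averaging the parallel residual updates) is one plausible choice.

```python
# Toy sketch: a frozen "transformer" modeled as a list of residual layer
# functions. We compare standard sequential execution with skipping the
# middle layers and with running the middle layers in parallel,
# mirroring the execution variants described in the abstract.

def run_sequential(layers, x):
    """Standard execution: apply each residual layer in order."""
    for f in layers:
        x = x + f(x)
    return x

def run_skip_middle(layers, x, keep_first=1, keep_last=1):
    """Skip the middle layers, keeping only the first/last few."""
    kept = layers[:keep_first] + layers[len(layers) - keep_last:]
    return run_sequential(kept, x)

def run_parallel_middle(layers, x, keep_first=1, keep_last=1):
    """Run the middle layers in parallel on the same input, average
    their residual updates, then finish with the last layers."""
    for f in layers[:keep_first]:
        x = x + f(x)
    middle = layers[keep_first:len(layers) - keep_last]
    if middle:
        x = x + sum(f(x) for f in middle) / len(middle)
    for f in layers[len(layers) - keep_last:]:
        x = x + f(x)
    return x

if __name__ == "__main__":
    # Stand-in "layers": simple scalar maps instead of real transformer blocks.
    layers = [lambda v, k=k: 0.1 * k * v for k in range(1, 5)]
    print(run_sequential(layers, 1.0))
    print(run_skip_middle(layers, 1.0))
    print(run_parallel_middle(layers, 1.0))
```

The parallel variant touches each middle layer only once on a shared input, which is what allows the latency/accuracy trade the abstract mentions: wall-clock depth shrinks while the frozen weights are untouched.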
Related papers
- Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers [2.1572258716881905]
We explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns.
In particular, we demonstrate that the first and last layer of the network have distinctive and in many ways inverted relationships to sparsity.
We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being driven by the dynamics of training.
arXiv Detail & Related papers (2024-07-10T17:10:10Z) - LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order [10.362659730151591]
We show that vision transformers can adapt to arbitrary layer execution orders at test time.
We also find that our trained models can be randomly merged with each other resulting in functional "Frankenstein" models.
arXiv Detail & Related papers (2024-07-05T13:54:15Z) - What Matters in Transformers? Not All Attention is Needed [7.857824255138334]
Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks.
However, it also introduces redundant structures, posing challenges for real-world deployment.
We investigate the varying redundancy across different modules, including Blocks, Transformers and Attention layers, using a similarity-based metric.
arXiv Detail & Related papers (2024-06-22T08:41:48Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - LaCo: Large Language Model Pruning via Layer Collapse [63.973142426228016]
Large language models (LLMs) based on transformer are witnessing a notable trend of size expansion.
We propose a concise layer-wise pruning method called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer.
Experiments show that our method maintains an average task performance of over 80% at pruning ratios of 25-30%.
arXiv Detail & Related papers (2024-02-17T04:16:30Z) - Not all layers are equally as important: Every Layer Counts BERT [5.121744234312891]
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models.
Our approach allows each transformer layer to select which outputs of previous layers to process.
arXiv Detail & Related papers (2023-11-03T23:08:50Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct reconstruction task only at the top layer of encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems.
We build upon this connection and propose a modification of the internal architecture of a transformer layer.
Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z) - LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z) - IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z) - Reservoir Transformer [89.28052130103345]
Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers.
We show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
arXiv Detail & Related papers (2020-12-30T05:20:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.