Jump to Conclusions: Short-Cutting Transformers With Linear Transformations
- URL: http://arxiv.org/abs/2303.09435v2
- Date: Tue, 18 Jun 2024 19:58:27 GMT
- Title: Jump to Conclusions: Short-Cutting Transformers With Linear Transformations
- Authors: Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva
- Abstract summary: Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for casting hidden representations as final representations, using linear transformations.
- Score: 60.37563766047492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, using linear transformations. This approximation far exceeds the prevailing practice of inspecting hidden representations from all layers directly in the space of the final layer. Moreover, in the context of language modeling, our method produces more accurate predictions from hidden layers, across various model scales, architectures, and data distributions. This allows "peeking" into intermediate representations, showing that GPT-2 and BERT often predict the final output already in early layers. We then demonstrate the practicality of our method for recent early exit strategies, showing that when aiming, for example, at retaining 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% for BERT. Lastly, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change. Our code and learned mappings are publicly available at https://github.com/sashayd/mat.
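To make the approach concrete, below is a minimal Python sketch (not the authors' released implementation; see the repository above for that) of the core idea under simple assumptions: fit an ordinary-least-squares linear map from layer-k hidden states to final-layer hidden states of GPT-2, then decode the mapped states with the LM head to "peek" at an intermediate layer's prediction. The layer index, the toy fitting corpus, and the helper name `collect_hidden_states` are illustrative choices, not details from the paper.

```python
# Minimal sketch (assumptions noted): fit a linear "shortcut" from layer-k
# hidden states to final-layer hidden states, then decode with the LM head.
# This illustrates the idea only; it is not the authors' released code.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def collect_hidden_states(texts, layer_k):
    """Gather (layer-k, final-layer) hidden-state pairs over a small corpus."""
    src, tgt = [], []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            hs = model(ids, output_hidden_states=True).hidden_states
            src.append(hs[layer_k].squeeze(0))  # (seq_len, d), intermediate layer
            tgt.append(hs[-1].squeeze(0))       # (seq_len, d), final representation
    return torch.cat(src), torch.cat(tgt)

# Fit W so that H_k @ W approximates H_final; in practice one would fit on
# many thousands of tokens rather than this toy corpus.
texts = ["The capital of France is Paris.",
         "Transformers build hidden representations layer by layer."]
layer_k = 6                                     # assumption: shortcut from layer 6 of 12
H_k, H_final = collect_hidden_states(texts, layer_k)
W = torch.linalg.lstsq(H_k, H_final).solution   # (d, d) linear mapping

# "Peek" at layer k: cast its hidden state to final-layer space and decode.
with torch.no_grad():
    ids = tokenizer("The Eiffel Tower is located in", return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states
    shortcut = hs[layer_k][0, -1] @ W           # last token, mapped to final space
    next_token = model.lm_head(shortcut).argmax(-1)
    print(tokenizer.decode([int(next_token)]))
```

The same mapping can back an early-exit rule: if the decoded distribution at layer k is already confident enough, the remaining layers can be skipped, which is the setting in which the abstract reports the 7.9% / 5.4% layer savings.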
Related papers
- Post-Processing Temporal Action Detection [134.26292288193298]
Temporal Action Detection (TAD) methods typically take a pre-processing step that converts a varying-length input video into a fixed-length sequence of snippet representations.
This pre-processing temporally downsamples the video, reducing the inference resolution and hampering detection performance at the original temporal resolution.
We introduce a novel model-agnostic post-processing method without model redesign and retraining.
arXiv Detail & Related papers (2022-11-27T19:50:37Z)
- Self-improving Multiplane-to-layer Images for Novel View Synthesis [3.9901365062418312]
We present a new method for lightweight novel-view synthesis that generalizes to an arbitrary forward-facing scene.
We start by representing the scene with a set of fronto-parallel semitransparent planes and afterward convert them to deformable layers in an end-to-end manner.
Our method does not require fine-tuning when a new scene is processed and can handle an arbitrary number of views without restrictions.
arXiv Detail & Related papers (2022-10-04T13:27:14Z)
- FedAvg with Fine Tuning: Local Updates Lead to Representation Learning [54.65133770989836]
The Federated Averaging (FedAvg) algorithm alternates between a few local gradient updates at client nodes and a model averaging update at the server.
We show that the generalizability of FedAvg's output stems from its power in learning the common data representation among the clients' tasks.
We also provide empirical evidence demonstrating FedAvg's representation learning ability in federated image classification with heterogeneous data.
arXiv Detail & Related papers (2022-05-27T00:55:24Z)
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space [49.029910567673824]
Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood.
We make a substantial step towards unveiling this underlying prediction process by reverse-engineering the operation of the feed-forward network (FFN) layers.
arXiv Detail & Related papers (2022-03-28T12:26:00Z)
- Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium Segmentation [0.0]
We present a novel semi-supervised segmentation model based on a parameter decoupling strategy to encourage consistent predictions from diverse views.
Our method achieves a competitive result against state-of-the-art semi-supervised methods on the Atrial Challenge dataset.
arXiv Detail & Related papers (2021-09-20T14:51:42Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, the vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.