You Do Not Fully Utilize Transformer's Representation Capacity
- URL: http://arxiv.org/abs/2502.09245v1
- Date: Thu, 13 Feb 2025 12:00:50 GMT
- Title: You Do Not Fully Utilize Transformer's Representation Capacity
- Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
- Abstract summary: In contrast to RNNs, Transformers can attend to all previous tokens directly.
Standard Transformers only use representations from the immediately preceding layer.
We introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity.
- Score: 4.753535328327317
- License:
- Abstract: In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
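To make the mechanism concrete, here is a minimal sketch of the idea as stated in the abstract: queries come from the current layer as usual, while keys and values are read from a learned mixture of hidden states from earlier layers. The module name, the softmax routing over layers, the `earlier_states` argument, and all dimensions below are illustrative assumptions, not the paper's exact design; the abstract explicitly mentions "different lookup mechanisms", so treat this as the simplest possible variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerIntegratedAttention(nn.Module):
    """Attention whose keys/values read from a learned mixture of earlier-layer states."""

    def __init__(self, d_model: int, n_heads: int, layer_idx: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # One mixing weight per available representation: the embeddings and the
        # outputs of layers 0..layer_idx-1, plus the current input itself.
        self.lime_weights = nn.Parameter(torch.zeros(layer_idx + 1))

    def forward(self, x, earlier_states):
        # x: (batch, seq, d_model); earlier_states: list of same-shaped tensors
        # from the embeddings and all preceding layers.
        B, T, D = x.shape
        q = self.q_proj(x)  # queries come from the current stream, as usual
        # Keys/values are built from a softmax-weighted mixture of earlier states.
        stacked = torch.stack(earlier_states + [x], dim=0)   # (L+1, B, T, D)
        mix = torch.softmax(self.lime_weights, dim=0)
        memory = torch.einsum("l,lbtd->btd", mix, stacked)
        k, v = self.k_proj(memory), self.v_proj(memory)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, D))
```

In this sketch the only new parameters are the `layer_idx + 1` mixing weights, which is consistent with the abstract's claim that the overall memory footprint is preserved while representational capacity grows; richer per-head or input-dependent routings are equally compatible with the description.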
Related papers
- Masked Completion via Structured Diffusion with White-Box Transformers [23.07048591213815]
We provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning.
We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture.
CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets.
arXiv Detail & Related papers (2024-04-03T04:23:01Z)
- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
We show a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable.
Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
arXiv Detail & Related papers (2023-11-22T02:23:32Z)
- Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
arXiv Detail & Related papers (2022-10-03T15:49:48Z)
- Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
- Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
- Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully stochastic layers in vision transformers, without causing any severe drop in performance.
The added stochasticity boosts the robustness of visual features and strengthens privacy.
We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture, which exposes all previous representations to all future representations; a rough sketch of this idea follows the list below.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
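Because the feedback-memory idea in the last entry above is the closest precursor to LIMe, a rough sketch may help: each finished timestep is compressed into a single shared memory vector (a learned mixture over that step's layer outputs), and every layer at later timesteps attends to that shared memory rather than to its own layer's past states. Everything below, including module names, the single-head attention, and the way the current token is exposed to the attention, is an illustrative assumption and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackMemoryDecoder(nn.Module):
    """Toy decoder: all layers at step t attend to one shared memory of past steps."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.q = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.kv = nn.ModuleList(nn.Linear(d_model, 2 * d_model) for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers))
        # Learned mixture over the input and every layer output at a given step.
        self.mem_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):                              # x: (batch, seq, d_model)
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t]                                # current token, (batch, d_model)
            states = [h]
            # Shared memory of earlier steps, plus the current input so step 0 works.
            mem = torch.stack(memory + [h], dim=1)     # (batch, t+1, d_model)
            for q_proj, kv_proj, ffn in zip(self.q, self.kv, self.ffn):
                q = q_proj(h).unsqueeze(1)             # (batch, 1, d_model)
                k, v = kv_proj(mem).chunk(2, dim=-1)
                attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
                h = h + (attn @ v).squeeze(1)          # attention sub-block
                h = h + ffn(h)                         # feed-forward sub-block
                states.append(h)
            # Compress this step's per-layer states into one memory vector that
            # every layer of every later step will see.
            mix = torch.softmax(self.mem_weights, dim=0)
            memory.append(torch.einsum("l,lbd->bd", mix, torch.stack(states)))
            outputs.append(h)
        return torch.stack(outputs, dim=1)
```

Note that, unlike a standard Transformer, the loop over timesteps here is inherently sequential, since a step's memory vector depends on all of its layers; this is the main cost of exposing higher-layer representations to later steps.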
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.