You Do Not Fully Utilize Transformer's Representation Capacity
- URL: http://arxiv.org/abs/2502.09245v1
- Date: Thu, 13 Feb 2025 12:00:50 GMT
- Title: You Do Not Fully Utilize Transformer's Representation Capacity
- Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
- Abstract summary: In contrast to RNNs, Transformers can attend to all previous tokens directly.
Standard Transformers only use representations from the immediately preceding layer.
We introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity.
- Score: 4.753535328327317
- License:
- Abstract: In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
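To make the mechanism concrete, here is a minimal sketch of the idea as stated in the abstract: queries come from the current layer as usual, while keys and values are read from a learned mixture of hidden states from earlier layers. The module name, the softmax routing over layers, the `earlier_states` argument, and all dimensions below are illustrative assumptions, not the paper's exact design; the abstract explicitly mentions "different lookup mechanisms", so treat this as the simplest possible variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerIntegratedAttention(nn.Module):
    """Attention whose keys/values read from a learned mixture of earlier-layer states."""

    def __init__(self, d_model: int, n_heads: int, layer_idx: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # One mixing weight per available representation: the embeddings and the
        # outputs of layers 0..layer_idx-1, plus the current input itself.
        self.lime_weights = nn.Parameter(torch.zeros(layer_idx + 1))

    def forward(self, x, earlier_states):
        # x: (batch, seq, d_model); earlier_states: list of same-shaped tensors
        # from the embeddings and all preceding layers.
        B, T, D = x.shape
        q = self.q_proj(x)  # queries come from the current stream, as usual
        # Keys/values are built from a softmax-weighted mixture of earlier states.
        stacked = torch.stack(earlier_states + [x], dim=0)   # (L+1, B, T, D)
        mix = torch.softmax(self.lime_weights, dim=0)
        memory = torch.einsum("l,lbtd->btd", mix, stacked)
        k, v = self.k_proj(memory), self.v_proj(memory)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, D))
```

In this sketch the only new parameters are the `layer_idx + 1` mixing weights, which is consistent with the abstract's claim that the overall memory footprint is preserved while representational capacity grows; richer per-head or input-dependent routings are equally compatible with the description.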
Related papers
- Masked Completion via Structured Diffusion with White-Box Transformers [23.07048591213815]
We provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning.
We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture.
CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets.
arXiv Detail & Related papers (2024-04-03T04:23:01Z)
- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
We show a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable.
Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
arXiv Detail & Related papers (2023-11-22T02:23:32Z)
- Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
arXiv Detail & Related papers (2022-10-03T15:49:48Z)
- Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens.
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z)
- Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
- Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully stochastic layers in vision transformers, without causing any severe drop in performance.
The added stochasticity boosts the robustness of visual features and strengthens privacy.
We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture, which exposes all previous representations to all future representations; a rough sketch of this idea follows the list below.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
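Because the feedback-memory idea in the last entry above is the closest precursor to LIMe, a rough sketch may help: each finished timestep is compressed into a single shared memory vector (a learned mixture over that step's layer outputs), and every layer at later timesteps attends to that shared memory rather than to its own layer's past states. Everything below, including module names, the single-head attention, and the way the current token is exposed to the attention, is an illustrative assumption and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackMemoryDecoder(nn.Module):
    """Toy decoder: all layers at step t attend to one shared memory of past steps."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.q = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.kv = nn.ModuleList(nn.Linear(d_model, 2 * d_model) for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_layers))
        # Learned mixture over the input and every layer output at a given step.
        self.mem_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):                              # x: (batch, seq, d_model)
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t]                                # current token, (batch, d_model)
            states = [h]
            # Shared memory of earlier steps, plus the current input so step 0 works.
            mem = torch.stack(memory + [h], dim=1)     # (batch, t+1, d_model)
            for q_proj, kv_proj, ffn in zip(self.q, self.kv, self.ffn):
                q = q_proj(h).unsqueeze(1)             # (batch, 1, d_model)
                k, v = kv_proj(mem).chunk(2, dim=-1)
                attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
                h = h + (attn @ v).squeeze(1)          # attention sub-block
                h = h + ffn(h)                         # feed-forward sub-block
                states.append(h)
            # Compress this step's per-layer states into one memory vector that
            # every layer of every later step will see.
            mix = torch.softmax(self.mem_weights, dim=0)
            memory.append(torch.einsum("l,lbd->bd", mix, torch.stack(states)))
            outputs.append(h)
        return torch.stack(outputs, dim=1)
```

Note that, unlike a standard Transformer, the loop over timesteps here is inherently sequential, since a step's memory vector depends on all of its layers; this is the main cost of exposing higher-layer representations to later steps.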
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.