Memory Efficient Transformer Adapter for Dense Predictions
- URL: http://arxiv.org/abs/2502.01962v1
- Date: Tue, 04 Feb 2025 03:19:33 GMT
- Title: Memory Efficient Transformer Adapter for Dense Predictions
- Authors: Dong Zhang, Rui Yan, Pingcheng Dong, Kwang-Ting Cheng
- Abstract summary: We propose META, a memory-efficient ViT adapter that improves the model's memory efficiency and reduces the time spent on memory access operations.
Within the proposed block, cross-shaped self-attention is employed to reduce the model's frequent reshaping operations.
META substantially enhances prediction quality while achieving a new state-of-the-art accuracy-efficiency trade-off.
- Abstract: While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that improves the model's memory efficiency and reduces the time spent on memory access by cutting down inefficient memory access operations. Our method features a memory-efficient adapter block that shares a single layer normalization between the self-attention and feed-forward network layers, thereby reducing the model's reliance on normalization operations. Within the proposed block, cross-shaped self-attention is employed to reduce the model's frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that enhances local inductive biases, which is particularly beneficial for dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances prediction quality while achieving a new state-of-the-art accuracy-efficiency trade-off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.
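As a rough illustration of the block described in the abstract, here is a minimal PyTorch sketch of a META-style adapter block: a single LayerNorm is shared by the attention and feed-forward paths, and a depthwise convolution supplies the local inductive bias. The cross-shaped self-attention and the cascaded multi-head formulation are simplified to standard multi-head attention; all module names, dimensions, and the FFN ratio below are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a META-style adapter block, based only on the abstract: one shared
# LayerNorm feeds both the attention path and the feed-forward path, and a depthwise
# convolution adds local inductive bias. Cross-shaped attention is approximated here by
# standard multi-head attention; names, sizes, and the FFN ratio are illustrative only.
import torch
import torch.nn as nn


class MetaStyleAdapterBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, ffn_ratio: float = 0.25):
        super().__init__()
        self.shared_norm = nn.LayerNorm(dim)  # normalized once, reused by both branches
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        hidden = int(dim * ffn_ratio)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        # Lightweight depthwise conv branch for local inductive bias.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) tokens from the ViT backbone; hw = (H, W) with N == H * W.
        b, n, c = x.shape
        h, w = hw
        z = self.shared_norm(x)                   # single normalization, shared
        attn_out, _ = self.attn(z, z, z)
        ffn_out = self.ffn(z)
        conv_out = self.local(z.transpose(1, 2).reshape(b, c, h, w)).flatten(2).transpose(1, 2)
        return x + attn_out + ffn_out + conv_out  # residual fusion of the three paths


# Several such blocks can be cascaded (e.g., in an nn.ModuleList) to compute diverse
# head features, as the abstract describes.
```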
Related papers
- Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [123.07450481623124]
We propose Skip Tuning as a novel paradigm for adapting vision-language models to downstream tasks.
Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules.
arXiv Detail & Related papers (2024-12-16T07:33:23Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting this decoupled learning scheme and fully exploiting the complementarity across features, our method achieves both high efficiency and high accuracy.
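Reading the abstract, the decoupling can be pictured as two parallel branches, a convolutional one for local inductive bias and a linear-attention one for long-range dependencies, whose outputs then interact. The sketch below is only that interpretation, not the paper's CARE mechanism; every module name and the fusion step are assumed.
```python
# Hedged sketch of an asymmetrically decoupled dual-branch layer: a depthwise-conv path
# for local inductive bias and a simplified linear-attention path for long-range
# dependencies, fused at the end. An interpretation of the abstract only, not CARE itself.
import torch
import torch.nn as nn


class DecoupledDualBranch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)  # interaction between the decoupled features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        local = self.local(x).flatten(2).transpose(1, 2)   # (B, N, C) local branch
        tokens = x.flatten(2).transpose(1, 2)              # (B, N, C)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        # Linear attention: O(N * C^2) rather than O(N^2 * C) for the global branch.
        ctx = torch.einsum('bnc,bnd->bcd', k.softmax(dim=1), v)
        global_branch = torch.einsum('bnc,bcd->bnd', q.softmax(dim=-1), ctx)
        return self.fuse(torch.cat([local, global_branch], dim=-1))
```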
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly
Detection [65.24854366973794]
Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes from graph-structured data in domains such as medicine, social networks, and e-commerce.
We introduce a simple method termed PREprocessing and Matching (PREM for short) to improve the efficiency of GAD.
Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities.
arXiv Detail & Related papers (2023-10-18T02:59:57Z) - SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood
Filling [1.0128808054306186]
We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method.
Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training.
SPION achieves up to a 3.08x speedup over existing state-of-the-art sparse Transformer models.
arXiv Detail & Related papers (2023-09-22T02:14:46Z) - Revisiting the Parameter Efficiency of Adapters from the Perspective of
Precision Redundancy [17.203320079872952]
Current state-of-the-art results in computer vision depend in part on fine-tuning large pre-trained vision models.
With the exponential growth of model sizes, the conventional full fine-tuning leads to increasingly huge storage and transmission overhead.
In this paper, we investigate how to make adapters even more efficient, reaching a new minimum size required to store a task-specific fine-tuned network.
arXiv Detail & Related papers (2023-07-31T17:22:17Z) - Optimizing ViViT Training: Time and Memory Reduction for Action
Recognition [30.431334125903145]
We address the challenges posed by the substantial training time and memory consumption associated with video transformers.
Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training.
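The freezing idea can be expressed in a few lines of PyTorch; the `spatial_transformer` attribute below is a hypothetical placeholder for wherever the spatial encoder lives in a given ViViT implementation.
```python
# Minimal sketch of the freezing idea: keep the spatial transformer's weights fixed so
# no gradients or optimizer state are maintained for it. `spatial_transformer` is a
# hypothetical attribute name, not the actual module name in a ViViT codebase.
import torch


def freeze_spatial(model: torch.nn.Module):
    for p in model.spatial_transformer.parameters():  # assumed submodule
        p.requires_grad_(False)
    model.spatial_transformer.eval()  # keep dropout/norm behavior fixed in the frozen part
    return [p for p in model.parameters() if p.requires_grad]


# trainable = freeze_spatial(video_model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # only the remaining parameters update
```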
arXiv Detail & Related papers (2023-06-07T23:06:53Z) - AdapterBias: Parameter-efficient Token-dependent Representation Shift
for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
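A token-dependent representation shift of this kind can be sketched as one shared bias vector scaled by a per-token weight; the module below is a simplified reading of the abstract, not the released AdapterBias implementation.
```python
# Sketch of a token-dependent representation shift in the spirit of AdapterBias: one
# shared bias vector v, scaled per token by a weight alpha predicted from that token.
# Simplified reading of the abstract, not the official implementation.
import torch
import torch.nn as nn


class TokenDependentShift(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(dim))  # shared shift direction
        self.alpha = nn.Linear(dim, 1)           # per-token scaling weight

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, C) token representations from a frozen transformer layer
        return hidden + self.alpha(hidden) * self.v  # (B, T, 1) * (C,) broadcasts to (B, T, C)
```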
arXiv Detail & Related papers (2022-04-30T16:49:41Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
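One way to picture progressive token reduction is a per-layer keep mask driven by a learned importance score, as in the sketch below; it illustrates only the general idea of dropping tokens at inference time and does not reproduce AdaViT's actual halting mechanism.
```python
# Illustrative sketch of progressive token reduction: a small scoring head ranks tokens
# and only the top fraction is passed on to the next block. General idea only; AdaViT's
# adaptive halting mechanism differs.
import torch
import torch.nn as nn


class TokenPruningBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.7):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-token importance (hypothetical)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C); keep the highest-scoring tokens (class-token handling omitted).
        b, n, c = x.shape
        k = max(1, int(n * self.keep_ratio))
        idx = self.score(x).squeeze(-1).topk(k, dim=1).indices      # (B, k)
        x = torch.gather(x, 1, idx.unsqueeze(-1).expand(b, k, c))   # pruned token set
        z = self.norm(x)
        return x + self.attn(z, z, z)[0]
```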
arXiv Detail & Related papers (2021-12-14T18:56:07Z)