Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
- URL: http://arxiv.org/abs/2405.14700v2
- Date: Thu, 29 Aug 2024 09:44:53 GMT
- Title: Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
- Authors: Ting Liu, Xuyang Liu, Siteng Huang, Liangtao Shi, Zunnan Xu, Yi Xin, Quanjun Yin, Xiaohong Liu
- Abstract summary: Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that our Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
- Score: 14.030836300221756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications. While current PEFT methods have achieved parameter efficiency, they overlook the efficiency of computation and GPU memory during both fine-tuning and inference, falling short of practical requirements. In this paper, we propose Sparse-Tuning, a novel PEFT method that accounts for the information redundancy in images and videos to boost the above efficiency. By sparsely preserving the semantic-relevant tokens and merging irrelevant ones, Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead. To align our token sparsification strategy suitably with fine-tuning purposes, we further design Dense Adapters that establish dense connections from shallow layers to deeper layers. These Dense Adapters integrate multi-level local features to enrich the current tokens, improving both token preservation and model adaptation. Empirical results on VTAB-1K, three image datasets, and two video datasets show that our Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance. Source code is available at https://github.com/liuting20/Sparse-Tuning.
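The abstract describes two mechanisms: token sparsification (keep the semantically relevant tokens, merge the rest) and Dense Adapters that carry multi-level features from shallow layers to deep ones. The snippet below is a minimal PyTorch sketch of those two ideas only; the module names, the CLS-attention scoring rule, the keep ratio, and the way earlier features are fused are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class TokenSparsifier(nn.Module):
    """Keep the top-k most informative patch tokens and merge the rest into a
    single token (illustrative sketch, not the official implementation)."""
    def __init__(self, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor, cls_attn: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, D) with a CLS token at position 0; cls_attn: (B, N)
        cls_tok, patches = x[:, :1], x[:, 1:]
        n_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep_idx = cls_attn.topk(n_keep, dim=1).indices
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        kept = patches.gather(1, gather_idx)
        # Merge the non-kept tokens into one token by attention-weighted averaging.
        mask = torch.ones_like(cls_attn).scatter(1, keep_idx, 0.0)
        w = (cls_attn * mask).unsqueeze(-1)
        merged = (patches * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)
        return torch.cat([cls_tok, kept, merged], dim=1)

class DenseAdapter(nn.Module):
    """Bottleneck adapter that also mixes in features from earlier layers
    (the 'dense connection' idea), sketched here as a simple sum."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, earlier_feats: list) -> torch.Tensor:
        h = x
        for f in earlier_feats:          # fuse multi-level features from shallower layers
            h = h + f[:, : x.size(1)]    # crude alignment: earlier layers hold >= tokens
        return x + self.up(self.act(self.down(h)))

# Toy usage with random tensors standing in for one ViT-B block's output.
B, N, D = 2, 196, 768
x, cls_attn = torch.randn(B, 1 + N, D), torch.rand(B, N)
sparsifier, adapter = TokenSparsifier(0.7), DenseAdapter(D)
x_sparse = sparsifier(x, cls_attn)       # (2, 197, 768) -> (2, 139, 768)
out = adapter(x_sparse, earlier_feats=[x])
print(x.shape, "->", x_sparse.shape, "->", out.shape)
```

In this toy setting roughly 30% of the patch tokens are merged away at each sparsification stage, so the self-attention cost of later layers shrinks quadratically with the kept-token count, which is the efficiency argument the abstract makes.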
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
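As a rough illustration of the idea above, the sketch below adds a Fourier-transformed copy of VPT-style learnable prompts before prepending them to the patch sequence. The class name, the 2-D FFT over the prompt matrix, and taking the real part are assumptions made for a self-contained example; they are not claimed to match VFPT's exact design.

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    """VPT-style learnable prompts, with a Fourier-transformed copy added so the
    prompts carry both spatial- and frequency-domain information (sketch only)."""
    def __init__(self, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) patch embeddings from the frozen backbone
        B = patch_tokens.size(0)
        # 2-D FFT over the (num_prompts, dim) prompt matrix; keep the real part
        # so it can be added back to the spatial-domain prompts.
        freq = torch.fft.fft2(self.prompts).real
        p = (self.prompts + freq).unsqueeze(0).expand(B, -1, -1)
        return torch.cat([p, patch_tokens], dim=1)   # prepend prompts to the sequence

x = torch.randn(2, 196, 768)
print(FourierPrompt()(x).shape)   # torch.Size([2, 206, 768])
```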
- iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation [15.97351561456467]
In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer.
We introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation.
To be specific, iCoN generates channel-wise convolutional kernels for each feature and transforms it with an adaptive convolution process, effectively capturing task-specific and fine-grained details tailored to downstream tasks.
arXiv Detail & Related papers (2024-09-04T16:06:23Z)
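The following sketch illustrates the input-conditioned idea: a small hyper-network predicts per-sample, channel-wise (depthwise) 3x3 kernels from the current tokens and applies them to the patch-token feature map inside an adapter. The module name and the pooling and kernel-generation details are assumptions for illustration, not iConFormer's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputConditionedAdapter(nn.Module):
    """Sketch of an input-conditioned dynamic adapter: a tiny hyper-network
    predicts a depthwise 3x3 kernel per sample and per channel, which is then
    applied to the patch-token feature map (illustrative, not the paper's code)."""
    def __init__(self, dim: int = 768, bottleneck: int = 64, k: int = 3):
        super().__init__()
        self.k = k
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.kernel_gen = nn.Linear(dim, dim * k * k)   # channel-wise kernels

    def forward(self, tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
        B, N, D = tokens.shape
        H, W = hw
        h = self.down(tokens)
        # Predict one depthwise kernel per sample from the mean token feature.
        kernels = self.kernel_gen(tokens.mean(dim=1)).view(B * D, 1, self.k, self.k)
        fmap = tokens.transpose(1, 2).reshape(1, B * D, H, W)
        conv = F.conv2d(fmap, kernels, padding=self.k // 2, groups=B * D)
        conv = conv.reshape(B, D, N).transpose(1, 2)
        return tokens + self.up(F.gelu(h)) + conv       # adapter path + dynamic conv path

x = torch.randn(2, 196, 768)
adapter = InputConditionedAdapter()
print(adapter(x, hw=(14, 14)).shape)   # torch.Size([2, 196, 768])
```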
- Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z)
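A hedged sketch of the layer-selection idea: before each mini-batch, score the encoder blocks, mark only a fixed budget of them as trainable, and freeze the rest. The scoring below is a random placeholder; ALaST derives importance adaptively from the current mini-batch and also adjusts token budgets, which this sketch omits.

```python
import torch
import torch.nn as nn

def set_layer_budget(blocks: nn.ModuleList, scores: torch.Tensor, budget: int) -> None:
    """Before each mini-batch, mark only the `budget` seemingly most important
    blocks as trainable and freeze the rest (importance scores are a placeholder)."""
    trainable = set(scores.topk(budget).indices.tolist())
    for i, blk in enumerate(blocks):
        for p in blk.parameters():
            p.requires_grad_(i in trainable)

# Toy example: a 12-block encoder where only 4 blocks are fine-tuned per batch.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(768, nhead=12, batch_first=True)
                        for _ in range(12)])
x = torch.randn(2, 197, 768)
set_layer_budget(blocks, scores=torch.rand(12), budget=4)
out = x
for blk in blocks:
    out = blk(out)
out.sum().backward()   # gradients accumulate only in the 4 selected blocks
```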
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference [44.77064952091458]
PRANCE is a Vision Transformer compression framework that jointly optimizes the activated channels and reduces the number of tokens based on the characteristics of the input.
We introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a sequential decision process.
Our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and pruning-merging strategies.
arXiv Detail & Related papers (2024-07-06T09:04:27Z)
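As a toy illustration of jointly choosing channels and tokens per input, the sketch below uses a tiny selector head that reads the current tokens and outputs a channel ratio and a token keep ratio, which are then applied to one MLP and to the token sequence. The selector, its training, and the ratios are illustrative assumptions; PRANCE's actual selector is optimized with the Result-to-Go mechanism described above, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerPolicy(nn.Module):
    """Tiny selector that, given the current tokens, predicts how many MLP
    channels to activate and how many tokens to keep at this layer (sketch)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.GELU(), nn.Linear(128, 2))

    def forward(self, tokens: torch.Tensor):
        ratios = torch.sigmoid(self.head(tokens.mean(dim=1))).mean(dim=0)
        return ratios[0].item(), ratios[1].item()   # (channel ratio, token ratio)

def apply_policy(tokens, mlp_weight, channel_ratio, token_ratio, scores):
    # Keep only the highest-scoring tokens ...
    n_keep = max(1, int(tokens.size(1) * token_ratio))
    idx = scores.topk(n_keep, dim=1).indices.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    tokens = tokens.gather(1, idx)
    # ... and activate only a prefix of the MLP's hidden channels.
    c_keep = max(1, int(mlp_weight.size(0) * channel_ratio))
    hidden = F.gelu(tokens @ mlp_weight[:c_keep].T)
    return tokens, hidden

tokens = torch.randn(2, 196, 768)
c_ratio, t_ratio = LayerPolicy()(tokens)
tokens_kept, hidden = apply_policy(tokens, torch.randn(3072, 768),
                                   c_ratio, t_ratio, scores=torch.rand(2, 196))
print(tokens_kept.shape, hidden.shape)
```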
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring pre-trained point cloud models to downstream tasks.
Existing methods for model adaptation usually update all model parameters, which is inefficient because it incurs high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
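A minimal sketch of halting-style token reduction in this spirit: each token accumulates a halting score layer by layer and is masked out once the score passes a threshold. The scoring heads and threshold are stand-ins for AdaViT's learned mechanism, and for simplicity halted tokens are only masked rather than physically removed from the sequence (a real implementation would drop them to save compute).

```python
import torch
import torch.nn as nn

def adaptive_token_forward(blocks: nn.ModuleList, x: torch.Tensor,
                           halt_heads: nn.ModuleList, threshold: float = 1.0):
    """At every layer each token accumulates a halting score; tokens whose score
    crosses the threshold are excluded from later layers (illustrative sketch)."""
    B, N, D = x.shape
    cumulative = torch.zeros(B, N, device=x.device)
    active = torch.ones(B, N, dtype=torch.bool, device=x.device)
    for blk, head in zip(blocks, halt_heads):
        x = blk(x)
        cumulative = cumulative + torch.sigmoid(head(x)).squeeze(-1)
        active = active & (cumulative < threshold)
        x = x * active.unsqueeze(-1)   # halted tokens stop contributing (masked, not dropped)
    return x, active

blocks = nn.ModuleList([nn.TransformerEncoderLayer(768, nhead=12, batch_first=True)
                        for _ in range(4)])
halt_heads = nn.ModuleList([nn.Linear(768, 1) for _ in range(4)])
out, active = adaptive_token_forward(blocks, torch.randn(2, 196, 768), halt_heads)
print(out.shape, active.float().mean().item())   # fraction of tokens still active
```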
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
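The sketch below only gestures at the single-path idea: each block carries a learnable gate that softly chooses between its multi-head self-attention and a cheap depthwise convolution over the patch grid, so attention can be pruned wherever the gate collapses. SPViT's actual formulation derives the convolution from the pre-trained attention weights and searches the choice jointly; none of that is reproduced here.

```python
import torch
import torch.nn as nn

class AttnOrConv(nn.Module):
    """Single-path-style block sketch: a learnable gate decides whether a block
    keeps its self-attention or falls back to a cheap depthwise convolution
    over the patch grid (illustrative only)."""
    def __init__(self, dim: int = 768, heads: int = 12, hw: tuple = (14, 14)):
        super().__init__()
        self.hw = hw
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.gate = nn.Parameter(torch.zeros(1))   # >0 favors attention, <0 favors conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H, W = self.hw
        a = torch.sigmoid(self.gate)               # soft gate during the search phase
        attn_out, _ = self.attn(x, x, x)
        conv_out = self.conv(x.transpose(1, 2).reshape(B, D, H, W)).flatten(2).transpose(1, 2)
        return x + a * attn_out + (1 - a) * conv_out

blk = AttnOrConv()
x = torch.randn(2, 196, 768)
print(blk(x).shape)   # torch.Size([2, 196, 768])
# At deployment, blocks whose gate has collapsed toward 0 can drop attention entirely.
```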