Visual Prompt-Agnostic Evolution
- URL: http://arxiv.org/abs/2601.20232v1
- Date: Wed, 28 Jan 2026 04:06:44 GMT
- Title: Visual Prompt-Agnostic Evolution
- Authors: Junze Wang, Lei Fan, Dezheng Zhang, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong
- Abstract summary: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens. Existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. We propose Prompt-Agnostic Evolution ($\mathtt{PAE}$), which strengthens visual prompt tuning by explicitly modeling prompt dynamics.
- Score: 14.918966632639235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution ($\mathtt{PAE}$), which strengthens vision prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that $\mathtt{PAE}$ accelerates convergence with an average $1.41\times$ speedup and improves accuracy by 1--3% on 25 datasets across multiple downstream tasks. Beyond performance, $\mathtt{PAE}$ is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.
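To make the shared-operator idea concrete, here is a minimal sketch: one prompt block is evolved through all layers by a single learnable linear map (the Koopman operator), and a Lyapunov-style penalty discourages spectral norms above one so errors cannot amplify across depth. Shapes, names, and the exact penalty are illustrative assumptions, not the paper's implementation; the frequency-aware initialization is omitted.

```python
# Hedged sketch of PAE-style prompt evolution: a single shared linear
# operator generates every layer's prompts from one base block, and a
# Lyapunov-style penalty bounds how much the operator can amplify error.
import torch
import torch.nn as nn

class KoopmanPromptEvolution(nn.Module):
    def __init__(self, dim: int, num_prompts: int, num_layers: int):
        super().__init__()
        # Learnable prompts for the first layer only (hypothetical shapes).
        self.base_prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        # One global operator shared by every layer-to-layer transition.
        self.koopman = nn.Parameter(torch.eye(dim))
        self.num_layers = num_layers

    def forward(self):
        # Roll the same linear map forward to get all layers' prompts.
        prompts, per_layer = self.base_prompts, []
        for _ in range(self.num_layers):
            per_layer.append(prompts)
            prompts = prompts @ self.koopman
        return per_layer

    def lyapunov_penalty(self) -> torch.Tensor:
        # Penalize a spectral norm above 1 so perturbations cannot grow
        # as prompts propagate through depth (stability-style constraint).
        sigma = torch.linalg.matrix_norm(self.koopman, ord=2)
        return torch.relu(sigma - 1.0) ** 2
```

The penalty would be added to the task loss with a small weight; the paper's actual operator, initialization, and regularizer may differ.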
Related papers
- LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging [11.135582038431368]
We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule. Layer-wise analysis and corruption tests indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features.
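A minimal sketch of the layer-wise rescaling idea, assuming task-vector merging with a simple linear ramp that damps shallow layers and mildly amplifies deep ones; the schedule, endpoints, and function names here are hypothetical.

```python
# Illustrative LARV-style veneer: scale each layer's merged task vector
# with a deterministic depth-dependent factor (assumed linear ramp).
import torch

def larv_scale(layer_idx: int, num_layers: int,
               lo: float = 0.5, hi: float = 1.2) -> float:
    # Damp shallow layers (factor lo) and amplify deep layers (factor hi).
    t = layer_idx / max(num_layers - 1, 1)
    return lo + (hi - lo) * t

def merge_task_vectors(task_vectors: list, num_layers: int) -> dict:
    # task_vectors: one {layer_idx: delta-weights} dict per fine-tuned model.
    merged = {}
    for layer in range(num_layers):
        deltas = torch.stack([tv[layer] for tv in task_vectors])
        merged[layer] = larv_scale(layer, num_layers) * deltas.mean(dim=0)
    return merged
```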
arXiv Detail & Related papers (2026-02-10T05:10:31Z) - STF: Shallow-Level Temporal Feedback to Enhance Spiking Transformers [29.501367277718046]
Spiking Neural Networks (SNNs) suffer from a large performance gap compared to floating-point Artificial Neural Networks (ANNs). Recent efforts have introduced deep-level feedback loops to transmit high-level semantic information to narrow this gap. We propose Shallow-level Temporal Feedback (STF), a lightweight plug-and-play module for the encoding layer.
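The sketch below shows one plausible reading of a shallow-level temporal feedback loop: the encoding layer's spikes from step t-1 are fed back into step t. The gating, threshold, and absence of surrogate gradients are simplifications, not the paper's STF.

```python
# Hypothetical shallow temporal feedback for an SNN encoding layer:
# the previous timestep's spikes re-enter the current encoding step.
import torch
import torch.nn as nn

class ShallowTemporalFeedback(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.encode = nn.Linear(dim, dim)
        self.feedback = nn.Linear(dim, dim)

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (T, batch, dim) inputs over T simulation timesteps.
        spikes, outs = torch.zeros_like(x_seq[0]), []
        for x_t in x_seq:
            h = self.encode(x_t) + self.feedback(spikes)
            spikes = (h > 1.0).float()  # hard threshold; surrogate grads omitted
            outs.append(spikes)
        return torch.stack(outs)
```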
arXiv Detail & Related papers (2025-08-01T07:30:59Z) - Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
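A minimal sketch of the Fourier-prompt idea: run an FFT over the prompt embeddings and carry both the spatial and frequency views into the token sequence. The split, shapes, and use of the real part are assumptions; the paper's exact transform may differ.

```python
# Illustrative Fourier prompts: concatenate prompts with their
# frequency-domain counterpart (real part of an FFT along embeddings).
import torch

def fourier_prompts(prompts: torch.Tensor) -> torch.Tensor:
    # prompts: (num_prompts, dim)
    freq = torch.fft.fft(prompts, dim=-1).real
    return torch.cat([prompts, freq], dim=0)  # spatial + frequency views
```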
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
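As a rough sketch of the projection step: estimate a basis for the subspace spanned by previous tasks' features and strip that component from the prompt gradient, so updates stay approximately orthogonal to old tasks. The SVD-based basis and the rank cutoff are assumptions.

```python
# Hedged null-space projection: remove the gradient component lying in
# the principal subspace of previous tasks' features.
import torch

def project_to_null_space(grad: torch.Tensor, old_feats: torch.Tensor,
                          rank: int = 32) -> torch.Tensor:
    # old_feats: (n_samples, dim) features from earlier tasks.
    U, _, _ = torch.linalg.svd(old_feats.T, full_matrices=False)
    U = U[:, :rank]                    # (dim, rank) principal directions
    return grad - (grad @ U) @ U.T     # keep only the orthogonal part
```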
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - Distributed Extra-gradient with Optimal Complexity and Communication Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
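Below is a toy extra-gradient step wrapped with a generic unbiased stochastic-rounding compressor, just to show where compression enters; Q-GenX's actual adaptive quantizer and the distributed aggregation are not reproduced here.

```python
# Toy extra-gradient with an unbiased compressor (illustrative only).
import torch

def stochastic_round(v: torch.Tensor, levels: int = 16) -> torch.Tensor:
    # Unbiased rounding onto a uniform grid: E[output] == v.
    s = v.abs().max() + 1e-12
    q = v / s * levels
    low = q.floor()
    q = low + (torch.rand_like(q) < (q - low)).float()
    return q * s / levels

def extragradient_step(x, operator, lr: float = 0.1):
    g1 = stochastic_round(operator(x))        # compressed lookahead
    x_half = x - lr * g1                      # extrapolation point
    g2 = stochastic_round(operator(x_half))   # compressed correction
    return x - lr * g2                        # update from the original x
```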
arXiv Detail & Related papers (2023-08-17T21:15:04Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained on random instances of linear regression problems, their predictions mimic those of ordinary least squares.
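The claim is easy to state concretely: on a fresh linear-regression instance, the prediction an ideal in-context learner should reproduce is the closed-form ordinary-least-squares fit, as in this small check (dimensions chosen arbitrarily).

```python
# Reference OLS prediction that an in-context learner should mimic.
import torch

d, n = 5, 20
w_true = torch.randn(d)
X = torch.randn(n, d)
y = X @ w_true + 0.01 * torch.randn(n)

# Closed-form least squares: w = argmin ||Xw - y||^2
w_ols = torch.linalg.lstsq(X, y.unsqueeze(-1)).solution.squeeze(-1)
x_query = torch.randn(d)
print(float(x_query @ w_ols))  # target prediction for the query point
```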
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
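One way to picture the multiplication-reduced primitive: quantize weights to signed powers of two so each multiply becomes a bit shift (emulated below in floating point). This is a generic shift-only illustration, not the paper's mixture of shift and add experts.

```python
# Float emulation of a shift primitive: power-of-two weights make
# multiplication a bit shift in fixed-point hardware.
import torch

def to_power_of_two(w: torch.Tensor) -> torch.Tensor:
    sign = torch.sign(w)
    exp = torch.round(torch.log2(w.abs().clamp_min(1e-8)))
    return sign * torch.pow(2.0, exp)

def shift_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (batch, d_in), w: (d_out, d_in)
    return x @ to_power_of_two(w).T
```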
arXiv Detail & Related papers (2023-06-10T13:53:41Z) - PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers, including random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
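A minimal sketch of the random-noise variant: perturb the prompt embeddings with Gaussian noise and penalize the resulting loss change, which encourages a smoother landscape. The adversarial variant would instead craft the perturbation with a gradient step (not shown); names and the penalty form are assumptions.

```python
# Illustrative random-noise perturbation regularizer for prompt tuning.
import torch

def noise_regularizer(loss_fn, prompts: torch.Tensor,
                      sigma: float = 0.01) -> torch.Tensor:
    # loss_fn: maps prompt embeddings to a scalar task loss.
    noisy = prompts + sigma * torch.randn_like(prompts)
    return (loss_fn(noisy) - loss_fn(prompts)).abs()  # smoothness penalty
```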
arXiv Detail & Related papers (2023-05-03T20:30:51Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
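The reparameterization itself is compact: each weight matrix is rescaled by a learnable gain over its spectral norm, $\hat{W} = (\gamma / \sigma(W))\, W$. The sketch below computes the spectral norm exactly for clarity; practical implementations typically use power iteration.

```python
# Sketch of sigma-reparameterized linear layer: W_hat = (gamma / sigma(W)) W.
import torch
import torch.nn as nn

class SigmaReparamLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.gamma = nn.Parameter(torch.ones(1))  # learnable spectral gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sigma = torch.linalg.matrix_norm(self.weight, ord=2)  # largest s.v.
        return x @ (self.gamma / sigma * self.weight).T
```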
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Training-Free Acceleration of ViTs with Delayed Spatial Merging [4.523939613157408]
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning.
We improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations.
We build a unified inference framework called DSM: Delayed Spatial Merging.
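For intuition, here is a simplified, training-free merge step: average the most similar adjacent token pairs to shrink the sequence, and "delay" it by only invoking it after the first few blocks. The greedy adjacent matching is a stand-in for the paper's actual merging schedule.

```python
# Simplified token merge: fuse the r most similar adjacent token pairs.
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    # x: (num_tokens, dim); returns at most num_tokens - r tokens.
    x = x.clone()
    sim = F.cosine_similarity(x[:-1], x[1:], dim=-1)
    keep = torch.ones(x.size(0), dtype=torch.bool)
    for i in sim.topk(r).indices.tolist():
        if keep[i] and keep[i + 1]:
            x[i] = 0.5 * (x[i] + x[i + 1])  # merge the pair into one token
            keep[i + 1] = False
    return x[keep]
```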
arXiv Detail & Related papers (2023-03-04T05:34:25Z) - Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
We propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers.
Our method can significantly reduce the computational costs of vision transformers while maintaining comparable performance on image classification.
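The slow-fast split can be sketched as follows: high-scoring tokens take the full (fast) path through a block, while the rest are pooled into one representative token and receive only a coarse (slow) update. The scoring signal and the update rule here are assumptions for illustration.

```python
# Illustrative slow-fast token evolution for one transformer block.
import torch

def slow_fast_update(x, block, scores, keep_ratio: float = 0.5):
    # x: (num_tokens, dim); scores: (num_tokens,) informativeness;
    # block: a callable transformer block mapping (k, dim) -> (k, dim).
    k = max(int(keep_ratio * x.size(0)), 1)
    mask = torch.zeros(x.size(0), dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    rep = x[~mask].mean(dim=0, keepdim=True)        # pool the slow tokens
    fast = block(torch.cat([x[mask], rep], dim=0))  # fast path: full block
    x = x.clone()
    x[mask] = fast[:-1]
    x[~mask] = x[~mask] + (fast[-1] - rep)          # slow path: coarse shift
    return x
```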
arXiv Detail & Related papers (2021-08-03T09:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.