Related papers: GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models

GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models

URL: http://arxiv.org/abs/2511.22125v1
Date: Thu, 27 Nov 2025 05:36:47 GMT
Title: GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models
Authors: Bin Wang, Ruotong Hu, Wenqian Wang, Wentong Li, Mingliang Gao, Runmin Cong, Wei Zhang,
Abstract summary: Visual and textual soft prompt tuning can improve the adaptability of Vision-Language Models (VLMs) in downstream tasks.<n>Existing methods attempt to mitigate this effect by regularizing the gap between hand-crafted prompts and soft prompts.<n>We propose a plug-and-play coupling prompt learning framework to optimize the performance of V-L models in video tasks.
Score: 34.002791706686345
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.

Related papers

AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations.<n>We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features.<n>Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
arXiv Detail & Related papers (2025-09-07T07:07:59Z)
ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks.<n>We propose a task-chaining compositor in the form of image-removal-segmentation.<n>We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z)
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories.<n>During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation [5.296260279593993]
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks.<n>We propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions.<n>Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment.
arXiv Detail & Related papers (2025-03-11T21:38:34Z)
Revisiting Prompt Pretraining of Vision-Language Models [13.888505919946578]
We propose a general framework termed Revisiting Prompt Pretraining (RPP) RPP targets at improving the fitting and generalization ability from two aspects: prompt structure and prompt supervision. We additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model.
arXiv Detail & Related papers (2024-09-10T02:36:13Z)
Can Better Text Semantics in Prompt Tuning Improve VLM Generalization? [28.041879000565874]
We introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods.
arXiv Detail & Related papers (2024-05-13T16:52:17Z)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)
Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC. PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z)
Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation [15.467510304266883]
We study the impact of prompt tuning on weakly supervised semantic segmentation. We introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy. We demonstrate that our simple, yet efficient approach achieves SOTA performance in a well-known WSSS benchmark.
arXiv Detail & Related papers (2023-06-30T19:25:18Z)
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [96.27531485377871]
Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. We introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes.
arXiv Detail & Related papers (2023-03-23T14:04:23Z)
CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.<n>CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending.<n> Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.