Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
- URL: http://arxiv.org/abs/2408.13979v1
- Date: Mon, 26 Aug 2024 02:09:05 GMT
- Title: Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models
- Authors: Shuai Fu, Xiequn Wang, Qiushi Huang, Yu Zhang,
- Abstract summary: We investigate the role of norms of soft-prompt vector in vision-language models (VLMs)
We propose a novel method named textbfNormalizing thtextbfe soft-protextbfmpt vtextbfectors of vitextbfsion-language modeltextbfs (textbfNemesis) to normalize soft-prompt vectors.
- Score: 5.58681637186155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the prevalence of large-scale pretrained vision-language models (VLMs), such as CLIP, soft-prompt tuning has become a popular method for adapting these models to various downstream tasks. However, few works delve into the inherent properties of learnable soft-prompt vectors, specifically the impact of their norms to the performance of VLMs. This motivates us to pose an unexplored research question: ``Do we need to normalize the soft prompts in VLMs?'' To fill this research gap, we first uncover a phenomenon, called the \textbf{Low-Norm Effect} by performing extensive corruption experiments, suggesting that reducing the norms of certain learned prompts occasionally enhances the performance of VLMs, while increasing them often degrades it. To harness this effect, we propose a novel method named \textbf{N}ormalizing th\textbf{e} soft-pro\textbf{m}pt v\textbf{e}ctors of vi\textbf{si}on-language model\textbf{s} (\textbf{Nemesis}) to normalize soft-prompt vectors in VLMs. To the best of our knowledge, our work is the first to systematically investigate the role of norms of soft-prompt vector in VLMs, offering valuable insights for future research in soft-prompt tuning. The code is available at \texttt{\href{https://github.com/ShyFoo/Nemesis}{https://github.com/ShyFoo/Nemesis}}.
Related papers
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses.<n>They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications.<n>We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses.<n>We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks.<n>We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z) - The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation [40.73687553764341]
Text-to-video (T2V) generative models, trained on large-scale datasets, are sensitive to input prompts.
We introduce textbfRAPO, a novel textbfRetrieval-textbfAugmented textbfPrompt textbfOptimization framework.
arXiv Detail & Related papers (2025-04-16T03:33:25Z) - Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored.<n>We provide both theoretical insights and empirical validation across a range of recent models.<n>We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
arXiv Detail & Related papers (2025-02-10T20:09:16Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - Mixture of Prompt Learning for Vision Language Models [12.828490399811376]
We propose a mixture of soft prompt learning method incorporating a routing module.
This module is able to capture a dataset's varied styles and dynamically selects the most suitable prompts for each instance.
We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group.
arXiv Detail & Related papers (2024-09-18T14:25:02Z) - CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering [23.360714576158905]
Large vision-language models (VLMs) have shown significant performance boost in various application domains.
Finetuning a VLM on a task leads to reducing its generalization power and the capacity of learning new tasks.
We propose a novel prompt-based CL method forVLMs, namely $textbfClu$ster-based $textbfMo$dality Fusion Prompt (textbfCluMo)
arXiv Detail & Related papers (2024-08-21T16:07:49Z) - Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL [29.01858866450715]
We present RLPrompt, which aims to find optimal prompt tokens leveraging soft Q-learning.
While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability.
We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration.
arXiv Detail & Related papers (2024-07-20T03:10:19Z) - DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval [73.82017200889906]
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query.
We propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention.
In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts.
arXiv Detail & Related papers (2024-01-19T09:58:06Z) - Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z) - Prompt-based Context- and Domain-aware Pretraining for Vision and
Language Navigation [19.793659852435486]
We propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems.
In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset.
In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction.
arXiv Detail & Related papers (2023-09-07T11:58:34Z) - Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP)
arXiv Detail & Related papers (2023-06-01T07:19:28Z) - Towards Consistent Video Editing with Text-to-Image Diffusion Models [10.340371518799444]
Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner.
These methods might produce results of unsatisfied consistency with text prompt as well as temporal sequence.
We propose a novel EI$2$ model towards textbfEnhancing vtextbfIdeo textbfEditing constextbfIstency of TTI-based frameworks.
arXiv Detail & Related papers (2023-05-27T10:03:36Z) - Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization
for Few-shot Generalization [40.45470744120691]
Self-sUpervised meta-Prompt learning framework with MEta-gradient Regularization for few-shot generalization (SUPMER)
This paper proposes a novel Self-sUpervised meta-Prompt learning framework with MEta-gradient Regularization for few-shot generalization (SUPMER)
arXiv Detail & Related papers (2023-03-22T05:04:21Z) - Visually-augmented pretrained language models for NLP tasks without
images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel textbfVisually-textbfAugmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD)
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z) - Prompt-aligned Gradient for Prompt Tuning [63.346864107288766]
We present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from vision-language models (VLMs)
ProGrad only updates the prompt whose gradient is aligned to the "general direction", which is represented as the gradient of the KL loss of the pre-defined prompt prediction.
Experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods.
arXiv Detail & Related papers (2022-05-30T06:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.