Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
- URL: http://arxiv.org/abs/2503.19647v1
- Date: Tue, 25 Mar 2025 13:36:59 GMT
- Title: Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
- Authors: Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Filip Janicki, Cristiano Malossi, Konrad Schindler, Roy Assaf
- Abstract summary: Large Vision-Language Models can be instructed to solve diverse tasks by prompting, without task-specific training. We evaluate the segmentation performance of several recent models guided by either text or visual prompts. We propose PromptMatcher, a training-free baseline that combines both text and visual prompts.
- Score: 22.057386630831402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results and outperforming the best text-prompted VLM by 2.5% and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
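To make the evaluation protocol concrete, here is a minimal sketch of the per-image analysis described in the abstract: score a text-prompted and a visually-prompted segmenter with Intersection-over-Union and measure the headroom from always picking the better modality (the source of the reported ~11% potential gain). The `segment_with_text_prompt` and `segment_with_visual_prompts` callables are hypothetical placeholders, not the paper's PromptMatcher implementation.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def modality_analysis(images, gts, segment_with_text_prompt, segment_with_visual_prompts):
    """Mean IoU of each prompt modality, plus the 'oracle' that picks the
    better modality per image (an upper bound on combining the two)."""
    text_iou, visual_iou, oracle_iou = [], [], []
    for img, gt in zip(images, gts):
        m_text = segment_with_text_prompt(img)    # placeholder: text-prompted VLM
        m_vis = segment_with_visual_prompts(img)  # placeholder: visually-prompted VLM
        s_text, s_vis = iou(m_text, gt), iou(m_vis, gt)
        text_iou.append(s_text)
        visual_iou.append(s_vis)
        oracle_iou.append(max(s_text, s_vis))     # best modality for this image
    return np.mean(text_iou), np.mean(visual_iou), np.mean(oracle_iou)
```

The gap between the oracle mean and the two single-modality means quantifies how complementary text and visual prompts are, which is the observation that motivates combining them in PromptMatcher.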
Related papers
- Progressive Language-guided Visual Learning for Multi-Task Visual Grounding [21.297317604403652]
We propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding.
arXiv Detail & Related papers (2025-04-22T12:48:12Z) - The Power of One: A Single Example is All it Takes for Segmentation in VLMs [29.735863112700358]
Large-scale vision-language models (VLMs) exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example.
arXiv Detail & Related papers (2025-03-13T18:18:05Z) - DRUM: Learning Demonstration Retriever for Large MUlti-modal Models [10.884258583493175]
We propose a novel framework, Demonstration Retriever for large MUlti-modal Models (DRUM). First, we discuss retrieval strategies for a visual-language task, assuming an embedding model is given, and propose to concatenate the image and text embeddings to enhance retrieval performance (a minimal retrieval sketch follows the list below). Second, we propose to re-rank the demonstrations retrieved by the embedding model via the LVLM's feedback, and to calculate a list-wise ranking loss for training.
arXiv Detail & Related papers (2024-12-10T15:56:12Z) - Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) means understanding a new task from a few demonstrations (the prompt) and predicting new inputs without tuning the model.
This paper shows that prompt selection and prompt fusion are two major factors that directly affect the inference performance of visual in-context learning.
We propose prompt-SelF, a simple framework for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z) - DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
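As a companion to the DRUM entry above, here is a minimal sketch of its first stage as described in that summary: build a retrieval key by concatenating image and text embeddings and fetch the nearest demonstrations. The encoders are not specified in the summary, so random stand-in embeddings are used below; cosine similarity is an assumption, and the second stage (re-ranking with LVLM feedback and a list-wise ranking loss) is not shown.

```python
import numpy as np

def embed_query(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate image and text embeddings into a single retrieval key
    (the first-stage idea mentioned in the DRUM summary)."""
    v = np.concatenate([image_emb, text_emb])
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve_demonstrations(query_key: np.ndarray, demo_keys: np.ndarray, k: int = 4):
    """Return indices of the top-k demonstrations by cosine similarity.
    demo_keys: (N, D) array of pre-computed, L2-normalized keys."""
    sims = demo_keys @ query_key   # cosine similarity (keys are normalized)
    return np.argsort(-sims)[:k]   # indices of the k most similar demos

# Usage with random stand-in embeddings (real ones would come from an
# image encoder and a text encoder, which are assumptions here):
rng = np.random.default_rng(0)
demo_keys = rng.normal(size=(100, 512))
demo_keys /= np.linalg.norm(demo_keys, axis=1, keepdims=True)
query = embed_query(rng.normal(size=256), rng.normal(size=256))
print(retrieve_demonstrations(query, demo_keys, k=4))
```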