Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
- URL: http://arxiv.org/abs/2602.23339v1
- Date: Thu, 26 Feb 2026 18:45:33 GMT
- Title: Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
- Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
- Abstract summary: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction. We introduce a few-shot setting that augments textual prompts with a support set of pixel-annotated images. We propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features.
- Score: 13.02242959998332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
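To make the adapter idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes CLIP-style text embeddings, mask-pooled visual prototypes retrieved from the pixel-annotated support set, and dense patch features, and fits a small fusion module at test time. All shapes, the `FusionAdapter` module name, and the random stand-in tensors are illustrative assumptions.

```python
# Hypothetical sketch of a retrieval-augmented test-time adapter:
# per-class text embeddings and visual support prototypes are fused
# into a per-image classifier for dense patch features.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, D, N = 8, 512, 1024          # classes, embedding dim, patches per image

text_emb   = F.normalize(torch.randn(C, D), dim=-1)  # CLIP prompt embeddings (stand-in)
vis_proto  = F.normalize(torch.randn(C, D), dim=-1)  # mask-pooled support features (stand-in)
query_feat = F.normalize(torch.randn(N, D), dim=-1)  # dense features of the test image

class FusionAdapter(nn.Module):
    """Lightweight learned fusion: predicts a per-class mixing weight
    instead of a fixed, hand-crafted late fusion."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())
    def forward(self, t, v):
        alpha = self.gate(torch.cat([t, v], dim=-1))   # (C, 1) mixing weights
        w = alpha * t + (1 - alpha) * v                # fused classifier weights
        return F.normalize(w, dim=-1)

adapter = FusionAdapter(D)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# Test-time fitting: the support set provides pixel labels, so the fused
# classifier can be supervised on support patch features (stand-ins here).
sup_feat   = F.normalize(torch.randn(N, D), dim=-1)
sup_labels = torch.randint(0, C, (N,))
for _ in range(50):
    logits = sup_feat @ adapter(text_emb, vis_proto).T   # (N, C) cosine logits
    loss = F.cross_entropy(logits / 0.07, sup_labels)    # temperature-scaled CE
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: classify every query patch with the adapted, fused classifier.
pred = (query_feat @ adapter(text_emb, vis_proto).T).argmax(dim=-1)  # (N,)
```

Because the classifier is refit per image from whatever support examples are retrieved, the support set can grow continually without retraining any backbone, which is what makes the setting compatible with personalized, fine-grained segmentation.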
Related papers
- DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation [16.64056234334767]
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP. We propose a dual prompting framework, DPSeg, for this task.
arXiv Detail & Related papers (2025-05-16T20:25:42Z)
- The Power of One: A Single Example is All it Takes for Segmentation in VLMs [29.735863112700358]
Large-scale vision-language models (VLMs) exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation using techniques that rely on text-image attention maps; a generic sketch of this similarity-based zero-shot baseline appears after the list below. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example.
arXiv Detail & Related papers (2025-03-13T18:18:05Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation [56.001484215308075]
We present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP (the same patch-text matching idea sketched after this list). We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
arXiv Detail & Related papers (2024-11-28T19:00:03Z)
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation. We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structural information. InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities; a simplified sketch of this test-time inversion appears after the list below.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown an exceptional capability for producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training (a prototype-style sketch of this idea follows below).
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
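Several of the entries above ("The Power of One", Talk2DINO) refine the same zero-shot baseline: score each patch feature against class text embeddings and take the argmax. The sketch below is a generic, hedged illustration of that baseline; the backbone choice, shapes, and the 14x upsampling factor are assumptions, not any single paper's implementation.

```python
# Generic zero-shot dense-similarity baseline: patch features vs. text
# embeddings. All tensors are random stand-ins for real model outputs.
import torch
import torch.nn.functional as F

H, W, D, C = 32, 32, 512, 5                      # patch grid, feature dim, classes

patch_feat = F.normalize(torch.randn(H * W, D), dim=-1)  # backbone patch features
                                                          # (e.g., DINOv2 tokens projected
                                                          # into the text embedding space)
text_emb = F.normalize(torch.randn(C, D), dim=-1)         # CLIP class-prompt embeddings

sim = patch_feat @ text_emb.T                    # (H*W, C) patch-text cosine similarity
seg = sim.argmax(dim=-1).reshape(H, W)           # per-patch class map
seg_px = F.interpolate(seg[None, None].float(),  # upsample to pixel resolution;
                       scale_factor=14,          # 14 = assumed ViT patch size
                       mode="nearest").long()[0, 0]
```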
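For InvSeg-style test-time prompt inversion, a simplified sketch: a prompt embedding is optimized per image so that the soft mask it induces agrees with patch self-similarity. The pairwise objective here is only a stand-in for the paper's Contrastive Soft Clustering loss, and all names, temperatures, and thresholds are assumptions.

```python
# Hypothetical test-time prompt inversion: optimize one prompt embedding
# per image against a structure-alignment objective.
import torch
import torch.nn.functional as F

N, D = 1024, 512
patch_feat = F.normalize(torch.randn(N, D), dim=-1)   # frozen patch features (stand-in)
text_emb   = F.normalize(torch.randn(D), dim=-1)      # class text embedding (stand-in)

prompt = text_emb.clone().requires_grad_(True)        # prompt initialised from text
opt = torch.optim.Adam([prompt], lr=1e-2)

# Binary "same-region" targets derived from patch self-similarity.
target = (patch_feat @ patch_feat.T > 0.5).float()

for _ in range(100):
    score = patch_feat @ F.normalize(prompt, dim=-1)  # (N,) prompt-patch similarity
    p = torch.sigmoid(score / 0.07)                   # soft foreground assignment
    # Pairs of patches should share a cluster iff their features agree.
    same = p[:, None] * p[None, :] + (1 - p[:, None]) * (1 - p[None, :])
    loss = F.binary_cross_entropy(same.clamp(1e-6, 1 - 1e-6), target)
    # Keep the inverted prompt semantically anchored to the class text.
    loss = loss + 0.1 * (1 - torch.dot(F.normalize(prompt, dim=-1), text_emb))
    opt.zero_grad()
    loss.backward()
    opt.step()

mask = torch.sigmoid((patch_feat @ F.normalize(prompt, dim=-1)) / 0.07) > 0.5
```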
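OVDiff's idea of assembling a segmenter from synthesized support examples, with no training, can be sketched as follows. The generation step is shown as a comment using the real diffusers StableDiffusionPipeline API but stubbed with random features so the sketch stays self-contained; the feature extraction and mean-pooled prototype construction are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical training-free prototype segmenter built from synthesized
# support images, in the spirit of OVDiff.
import torch
import torch.nn.functional as F

# In a real pipeline, support images would be synthesised per category, e.g.:
#   from diffusers import StableDiffusionPipeline
#   pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
#   images = pipe(f"a photo of a {name}", num_images_per_prompt=4).images
# Here, features of those images are stubbed with random tensors.

D, K = 512, 4                                          # feature dim, examples per class
classes = ["cat", "dog", "tree"]

prototypes = []
for name in classes:
    sup = F.normalize(torch.randn(K, 196, D), dim=-1)  # K synthetic images -> patch feats (stub)
    fg = sup.mean(dim=(0, 1))                          # class prototype (the actual method
                                                       # would use mask-weighted pooling)
    prototypes.append(F.normalize(fg, dim=-1))
proto = torch.stack(prototypes)                        # (num_classes, D) "synthesised segmenter"

query = F.normalize(torch.randn(196, D), dim=-1)       # test-image patch features (stub)
pred = (query @ proto.T).argmax(dim=-1)                # nearest-prototype label per patch
```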
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.