DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2505.11676v1
- Date: Fri, 16 May 2025 20:25:42 GMT
- Title: DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
- Authors: Ziyu Zhao, Xiaoguang Li, Linjia Shi, Nasrin Imanpour, Song Wang,
- Abstract summary: Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP. We propose a dual prompting framework, DPSeg, for this task.
- Score: 16.64056234334767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
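To make the dual-prompt cost volume idea concrete, the snippet below is a minimal sketch of how per-class cost volumes can be computed from both text-prompt and visual-prompt embeddings and then fused. The tensor names, shapes, and the simple averaging fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of dual-prompt cost volume construction (illustrative only;
# tensor names, shapes, and the fusion rule are assumptions, not DPSeg's API).
import torch
import torch.nn.functional as F

def build_cost_volume(pixel_feats, class_embeds):
    """Cosine-similarity cost volume between pixel features and class embeddings.

    pixel_feats:  (B, D, H, W) dense image features from a CLIP-like encoder
    class_embeds: (K, D)       one embedding per candidate class
    returns:      (B, K, H, W) per-class similarity (cost) maps
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    class_embeds = F.normalize(class_embeds, dim=1)
    return torch.einsum("bdhw,kd->bkhw", pixel_feats, class_embeds)

# Dual prompting: one cost volume from text-prompt embeddings, one from
# visual-prompt embeddings, fused before being passed to a decoder.
B, D, H, W, K = 2, 512, 32, 32, 20
pixel_feats = torch.randn(B, D, H, W)
text_embeds = torch.randn(K, D)      # from a text prompt encoder (assumed)
visual_embeds = torch.randn(K, D)    # from a visual prompt encoder (assumed)
cost_text = build_cost_volume(pixel_feats, text_embeds)
cost_visual = build_cost_volume(pixel_feats, visual_embeds)
fused_cost = 0.5 * (cost_text + cost_visual)   # (B, K, H, W)
```

In the paper, the cost volumes additionally guide a decoder alongside shallow image features, and the visual prompts themselves are refined with a semantic-guided strategy; the sketch omits those stages.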
Related papers
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching [18.554618623593004]
Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. We propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2025-03-19T07:42:24Z) - The Power of One: A Single Example is All it Takes for Segmentation in VLMs [29.735863112700358]
Large-scale vision-language models (VLMs) exhibit strong multimodal understanding capabilities by implicitly learning associations between textual descriptions and image regions. This emergent ability enables zero-shot object detection and segmentation, using techniques that rely on text-image attention maps. We show that this approach yields strong zero-shot performance, further enhanced through fine-tuning with a single visual example.
arXiv Detail & Related papers (2025-03-13T18:18:05Z) - InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation. We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information. InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z) - Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z) - CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose Multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring any dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a rough illustrative sketch of this idea follows the list below).
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
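The CRIS entry above mentions contrastive learning for text-to-pixel alignment; the snippet below is a rough, hedged sketch of what such an alignment loss can look like. The function name, tensor shapes, and the binary cross-entropy formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a text-to-pixel alignment loss in the spirit of CRIS
# (illustrative; variable names and the exact loss form are assumptions).
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_embed, gt_mask, temperature=0.07):
    """Pull pixels inside the referred region toward the text embedding.

    pixel_feats: (B, D, H, W) decoded per-pixel features
    text_embed:  (B, D)       sentence-level embedding of the referring expression
    gt_mask:     (B, H, W)    binary ground-truth mask (1 = referred object)
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embed = F.normalize(text_embed, dim=1)
    # Per-pixel similarity logits between the text and every pixel feature.
    logits = torch.einsum("bdhw,bd->bhw", pixel_feats, text_embed) / temperature
    # Binary cross-entropy pulls foreground pixels toward the text embedding
    # and pushes background pixels away.
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())
```

In practice such a loss would be combined with a segmentation decoder and trained end-to-end; the sketch shows only the alignment term.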
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.