OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2405.20141v1
- Date: Thu, 30 May 2024 15:16:06 GMT
- Title: OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- Authors: Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum
- Abstract summary: We introduce a new task, domain adaptation for open-vocabulary segmentation.
We propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy.
Our approach outperforms other parameter-efficient adaptation strategies on open-vocabulary segment classification tasks.
- Score: 54.98688607911399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We therefore introduce a new task, domain adaptation for open-vocabulary segmentation, enhancing VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our approach outperforms other parameter-efficient adaptation strategies on open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, it is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be integrated plug-and-play into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K and OpenMask3D by +4.1% AP on ScanNet++ Offices, without any changes to the methods.
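The abstract describes the method at a high level: learnable prompts are tuned so that a segment's visual embedding moves toward its ground-truth class text embedding and away from hard negatives. The PyTorch sketch below illustrates one plausible instantiation of that idea; the prompt length, margin, cosine-distance formulation, and negative-sampling scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of triplet-loss-driven, parameter-efficient prompt tuning on a
# frozen CLIP-style text encoder. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptTuner(nn.Module):
    """Learnable context vectors prepended to frozen class-name token embeddings."""

    def __init__(self, text_encoder: nn.Module, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder.eval()      # frozen VLM text tower
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # The only trainable parameters: shared context tokens (CoOp-style).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_tok_emb: torch.Tensor) -> torch.Tensor:
        # class_tok_emb: (num_classes, n_tok, dim) frozen token embeddings
        ctx = self.ctx.expand(class_tok_emb.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, class_tok_emb], dim=1))


def triplet_prompt_loss(seg_emb, pos_txt, neg_txt, margin: float = 0.2):
    """Pull each segment embedding toward its class text embedding and push
    it away from a hard-negative class, using cosine distance."""
    seg_emb, pos_txt, neg_txt = (
        F.normalize(t, dim=-1) for t in (seg_emb, pos_txt, neg_txt)
    )
    d_pos = 1.0 - (seg_emb * pos_txt).sum(-1)
    d_neg = 1.0 - (seg_emb * neg_txt).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()


# Toy usage with a stand-in encoder (hypothetical; swap in CLIP's text tower).
class ToyEncoder(nn.Module):
    def forward(self, x):                            # (B, T, D) -> (B, D)
        return x.mean(dim=1)


tuner = PromptTuner(ToyEncoder(), n_ctx=8, dim=512)
txt = tuner(torch.randn(5, 16, 512))                 # embeddings for 5 classes
seg = torch.randn(4, 512)                            # 4 segment (visual) embeddings
loss = triplet_prompt_loss(seg, txt[[0, 1, 2, 3]], txt[[4, 4, 4, 4]])
loss.backward()                                      # gradients reach only tuner.ctx
```

In the actual method the positive would be the segment's ground-truth class and the negative a hard distractor from the query set; the indices above are arbitrary placeholders.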
Related papers
- OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models [28.56745509698125]
We propose OpenAVS, a training-free, language-based approach that aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS).
OpenAVS infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding-object segmentation.
It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively.
arXiv Detail & Related papers (2025-04-30T01:52:10Z) - LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation [16.021683473678515]
We propose a training-free method for semantic segmentation using Vision-and-Language Models (VLMs).
Our approach enhances the initial per-patch predictions of VLMs through label propagation.
Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing.
arXiv Detail & Related papers (2025-03-25T15:47:13Z) - VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation [3.776249047528669]
This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA).
We improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary (FROVSS) framework.
The resulting UDA-FROV framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
arXiv Detail & Related papers (2024-12-12T12:49:42Z) - Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z) - Overcoming Domain Limitations in Open-vocabulary Segmentation [24.169403141373927]
Open-vocabulary segmentation (OVS) has gained attention for its ability to recognize a broader range of classes.
OVS models show significant performance drops when applied to unseen domains beyond the previous training dataset.
We propose a method that allows OVS models to learn information from new domains while preserving prior knowledge.
arXiv Detail & Related papers (2024-10-15T12:11:41Z) - Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor [18.288738950822342]
Producing mask labels is labor-intensive, which limits the number of categories in segmentation datasets.
We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts.
Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
arXiv Detail & Related papers (2023-12-12T19:00:04Z) - Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models [44.146292819267956]
Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering.
In this paper, we propose a simple yet extremely effective training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task.
arXiv Detail & Related papers (2023-11-28T06:42:58Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets.
arXiv Detail & Related papers (2023-09-11T13:41:27Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
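Several entries above (OpenMask3D, OV-Seg, and the training-free pipelines) share the same final step: classify a segment by comparing its pooled image embedding against text embeddings of an arbitrary query vocabulary. Below is a minimal sketch of that step with OpenMask3D-style multi-view fusion; the mean-over-views fusion rule and the function name are illustrative assumptions.

```python
# Sketch of open-vocabulary segment classification: fuse per-view CLIP
# embeddings of one mask, then pick the closest text query by cosine score.
import torch
import torch.nn.functional as F


def classify_segment(view_embs: torch.Tensor, query_embs: torch.Tensor):
    """view_embs: (n_views, dim) CLIP image embeddings of one mask's crops.
    query_embs: (n_queries, dim) CLIP text embeddings of open-vocab queries.
    Returns (index of best query, per-query cosine similarities)."""
    mask_emb = F.normalize(view_embs.mean(dim=0), dim=-1)   # multi-view fusion
    query_embs = F.normalize(query_embs, dim=-1)
    scores = query_embs @ mask_emb                          # cosine similarity
    return int(scores.argmax()), scores


# Usage: 3 views of one 3D instance, scored against 4 free-form text queries
# (e.g. encodings of "office chair", "sofa", ...).
best, scores = classify_segment(torch.randn(3, 512), torch.randn(4, 512))
```

Domain adaptation as proposed in OpenDAS leaves this interface untouched, which is why the adapted VLM can be swapped into such pipelines without modifying them.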
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries and is not responsible for any consequences of their use.