OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2405.20141v1
- Date: Thu, 30 May 2024 15:16:06 GMT
- Title: OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- Authors: Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum
- Abstract summary: We introduce a new task, domain adaptation for open-vocabulary segmentation.
We propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy.
Our approach outperforms other parameter-efficient adaptation strategies on open-vocabulary segment classification tasks.
- Score: 54.98688607911399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We therefore introduce a new task, domain adaptation for open-vocabulary segmentation, enhancing VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our approach outperforms other parameter-efficient adaptation strategies on open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, it is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be integrated plug-and-play into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K and OpenMask3D by +4.1% AP on ScanNet++ Offices, without any changes to the methods.
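The abstract describes the method at a high level: learnable prompts are tuned so that a segment's visual embedding moves toward its ground-truth class text embedding and away from hard negatives. The PyTorch sketch below illustrates one plausible instantiation of that idea; the prompt length, margin, cosine-distance formulation, and negative-sampling scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of triplet-loss-driven, parameter-efficient prompt tuning on a
# frozen CLIP-style text encoder. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptTuner(nn.Module):
    """Learnable context vectors prepended to frozen class-name token embeddings."""

    def __init__(self, text_encoder: nn.Module, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder.eval()      # frozen VLM text tower
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # The only trainable parameters: shared context tokens (CoOp-style).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_tok_emb: torch.Tensor) -> torch.Tensor:
        # class_tok_emb: (num_classes, n_tok, dim) frozen token embeddings
        ctx = self.ctx.expand(class_tok_emb.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, class_tok_emb], dim=1))


def triplet_prompt_loss(seg_emb, pos_txt, neg_txt, margin: float = 0.2):
    """Pull each segment embedding toward its class text embedding and push
    it away from a hard-negative class, using cosine distance."""
    seg_emb, pos_txt, neg_txt = (
        F.normalize(t, dim=-1) for t in (seg_emb, pos_txt, neg_txt)
    )
    d_pos = 1.0 - (seg_emb * pos_txt).sum(-1)
    d_neg = 1.0 - (seg_emb * neg_txt).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()


# Toy usage with a stand-in encoder (hypothetical; swap in CLIP's text tower).
class ToyEncoder(nn.Module):
    def forward(self, x):                            # (B, T, D) -> (B, D)
        return x.mean(dim=1)


tuner = PromptTuner(ToyEncoder(), n_ctx=8, dim=512)
txt = tuner(torch.randn(5, 16, 512))                 # embeddings for 5 classes
seg = torch.randn(4, 512)                            # 4 segment (visual) embeddings
loss = triplet_prompt_loss(seg, txt[[0, 1, 2, 3]], txt[[4, 4, 4, 4]])
loss.backward()                                      # gradients reach only tuner.ctx
```

In the actual method the positive would be the segment's ground-truth class and the negative a hard distractor from the query set; the indices above are arbitrary placeholders.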
Related papers
- OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models [28.56745509698125]
We propose OpenAVS, a training-free, language-based approach that aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS).
OpenAVS infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding-object segmentation.
It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively.
arXiv Detail & Related papers (2025-04-30T01:52:10Z) - LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation [16.021683473678515]
We propose a training-free method for semantic segmentation using Vision-and-Language Models (VLMs).
Our approach enhances the initial per-patch predictions of VLMs through label propagation.
Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing.
arXiv Detail & Related papers (2025-03-25T15:47:13Z) - VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation [3.776249047528669]
This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA).
We improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary (FROVSS) framework.
The resulting UDA-FROV framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
arXiv Detail & Related papers (2024-12-12T12:49:42Z) - Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z) - Overcoming Domain Limitations in Open-vocabulary Segmentation [24.169403141373927]
Open-vocabulary segmentation (OVS) has gained attention for its ability to recognize a broader range of classes.
OVS models show significant performance drops when applied to unseen domains beyond the previous training dataset.
We propose a method that allows OVS models to learn information from new domains while preserving prior knowledge.
arXiv Detail & Related papers (2024-10-15T12:11:41Z) - Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor [18.288738950822342]
Producing mask labels is labor-intensive, which limits the number of categories in segmentation datasets.
We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts.
Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
arXiv Detail & Related papers (2023-12-12T19:00:04Z) - Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models [44.146292819267956]
Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering.
In this paper, we propose a simple yet extremely effective training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task.
arXiv Detail & Related papers (2023-11-28T06:42:58Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets.
arXiv Detail & Related papers (2023-09-11T13:41:27Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
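Several entries above (OpenMask3D, OV-Seg, and the training-free pipelines) share the same final step: classify a segment by comparing its pooled image embedding against text embeddings of an arbitrary query vocabulary. Below is a minimal sketch of that step with OpenMask3D-style multi-view fusion; the mean-over-views fusion rule and the function name are illustrative assumptions.

```python
# Sketch of open-vocabulary segment classification: fuse per-view CLIP
# embeddings of one mask, then pick the closest text query by cosine score.
import torch
import torch.nn.functional as F


def classify_segment(view_embs: torch.Tensor, query_embs: torch.Tensor):
    """view_embs: (n_views, dim) CLIP image embeddings of one mask's crops.
    query_embs: (n_queries, dim) CLIP text embeddings of open-vocab queries.
    Returns (index of best query, per-query cosine similarities)."""
    mask_emb = F.normalize(view_embs.mean(dim=0), dim=-1)   # multi-view fusion
    query_embs = F.normalize(query_embs, dim=-1)
    scores = query_embs @ mask_emb                          # cosine similarity
    return int(scores.argmax()), scores


# Usage: 3 views of one 3D instance, scored against 4 free-form text queries
# (e.g. encodings of "office chair", "sofa", ...).
best, scores = classify_segment(torch.randn(3, 512), torch.randn(4, 512))
```

Domain adaptation as proposed in OpenDAS leaves this interface untouched, which is why the adapted VLM can be swapped into such pipelines without modifying them.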
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries and is not responsible for any consequences of their use.