OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2405.20141v1
- Date: Thu, 30 May 2024 15:16:06 GMT
- Title: OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation
- Authors: Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum
- Abstract summary: We introduce a new task: domain adaptation for open-vocabulary segmentation.
We propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy.
Our approach outperforms other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks.
- Score: 54.98688607911399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We, therefore, introduce a new task, domain adaptation for open-vocabulary segmentation, enhancing VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our approach outperforms other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, it is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be integrated plug-and-play into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes to the methods.
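The abstract combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. Below is a minimal sketch, not the authors' released code, of how such a combination could look in PyTorch: a frozen text encoder is wrapped with learnable prompt vectors, and a triplet objective pulls a segment feature toward its ground-truth class text and away from a hard negative. The names `PromptedTextEncoder`, `encode_fn`, and `triplet_prompt_loss`, as well as the prompt length and margin, are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedTextEncoder(nn.Module):
    """Hypothetical parameter-efficient wrapper: only the learnable context
    vectors in `self.prompts` are trained; `encode_fn` stands in for a frozen
    CLIP-style text tower that maps token embeddings to a text feature."""
    def __init__(self, encode_fn, n_ctx=8, dim=512):
        super().__init__()
        self.encode_fn = encode_fn  # frozen VLM text encoder (assumed callable)
        self.prompts = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (B, L, dim) frozen token embeddings of the class names
        ctx = self.prompts.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return self.encode_fn(torch.cat([ctx, token_embeds], dim=1))

def triplet_prompt_loss(seg_feat, pos_text, neg_text, margin=0.2):
    """Triplet objective on cosine distance: anchor = visual segment feature,
    positive = text feature of the ground-truth class, negative = text feature
    of a different (ideally hard) class."""
    seg_feat = F.normalize(seg_feat, dim=-1)
    pos_text = F.normalize(pos_text, dim=-1)
    neg_text = F.normalize(neg_text, dim=-1)
    d_pos = 1.0 - (seg_feat * pos_text).sum(dim=-1)
    d_neg = 1.0 - (seg_feat * neg_text).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```

In this sketch only the prompt parameters receive gradients, which is what keeps the adaptation parameter-efficient, and the margin term encourages segment embeddings to stay discriminative against competing class texts, the property the paper relies on to preserve zero-shot behavior.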
Related papers
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
- Bridge the Modality and Capacity Gaps in Vision-Language Model Selection [60.049430086731846]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
A promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo.
We analyze two inherent challenges in assessing a VLM's ability under this Language-Only VLM Selection setting.
We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- OpenSD: Unified Open-Vocabulary Segmentation and Detection [24.08879095731279]
We present a universal transformer-based framework, abbreviated as OpenSD, to handle open-vocabulary segmentation and detection tasks.
To better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain.
The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings.
arXiv Detail & Related papers (2023-12-10T08:51:34Z)
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models [44.146292819267956]
Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering.
In this paper, we propose a simple, yet extremely effective, training-free technique for plug-and-play open-vocabulary semantic segmentation (OVSS).
arXiv Detail & Related papers (2023-11-28T06:42:58Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification [28.72126911321771]
Vision and Language Models (VLMs) have enabled visual recognition of a potentially unlimited set of categories described by text prompts.
For the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks.
arXiv Detail & Related papers (2023-09-13T08:59:54Z)
- Language Models as Black-Box Optimizers for Vision-Language Models [62.80817942316398]
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data.
We aim to develop a black-box approach to optimize VLMs through natural language prompts.
arXiv Detail & Related papers (2023-09-12T04:03:41Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)