OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation
- URL: http://arxiv.org/abs/2405.20141v4
- Date: Tue, 29 Oct 2024 23:03:34 GMT
- Title: OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation
- Authors: Gonca Yilmaz, Songyou Peng, Marc Pollefeys, Francis Engelmann, Hermann Blum
- Abstract summary: We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs).
Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries.
Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
- Score: 54.98688607911399
- Abstract: Recently, Vision-Language Models (VLMs) have advanced segmentation techniques by shifting from the traditional segmentation of a closed set of predefined object classes to open-vocabulary segmentation (OVS), allowing users to segment novel classes and concepts unseen during training of the segmentation model. However, this flexibility comes with a trade-off: fully-supervised closed-set methods still outperform OVS methods on base classes, that is, on classes on which they have been explicitly trained. This is due to the lack of pixel-aligned training masks for VLMs (which are trained on image-caption pairs), and the absence of knowledge specific to the target domain, such as autonomous driving. Therefore, we propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into VLMs while preserving their open-vocabulary nature. By doing so, we achieve improved performance on both base and novel classes. Existing VLM adaptation methods improve performance on base (training) queries, but fail to fully preserve the open-set capabilities of VLMs on novel queries. To address this shortcoming, we combine parameter-efficient prompt tuning with a triplet-loss-based training strategy that uses auxiliary negative queries. Notably, our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes. Our adapted VLMs can be seamlessly integrated into existing OVS pipelines, e.g., improving OVSeg by +6.0% mIoU on ADE20K for open-vocabulary 2D segmentation, and OpenMask3D by +4.1% AP on ScanNet++ Offices for open-vocabulary 3D instance segmentation, without any other changes. The project page is available at https://open-das.github.io/.
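The core recipe, prompt tuning combined with a triplet loss over text queries, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt length, the margin, and the `encode_text(tokens, prompt)` hook on a frozen CLIP-style text encoder are all hypothetical.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of prompt tuning with a triplet loss over text queries.
# Assumptions (not from the paper): prompt length 8, embedding dim 512, and
# a frozen `encode_text(tokens, prompt)` that prepends the learnable prompt
# vectors to the token embeddings of a CLIP-style text encoder.

class PromptTuner(torch.nn.Module):
    def __init__(self, clip_model, prompt_len=8, dim=512):
        super().__init__()
        self.clip = clip_model.eval()          # frozen VLM text encoder
        for p in self.clip.parameters():
            p.requires_grad_(False)
        self.prompt = torch.nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, tokens):
        # Hypothetical hook: inject learnable prompt tokens, then encode.
        return F.normalize(self.clip.encode_text(tokens, self.prompt), dim=-1)

def triplet_step(tuner, image_feat, pos_tokens, neg_tokens, margin=0.2):
    """One training step: pull the matching (base) query toward the image
    feature and push an auxiliary negative query away."""
    anchor = F.normalize(image_feat, dim=-1)       # e.g. pooled mask feature
    pos = tuner(pos_tokens)                        # ground-truth class query
    neg = tuner(neg_tokens)                        # auxiliary negative query
    d_pos = 1.0 - (anchor * pos).sum(-1)           # cosine distances
    d_neg = 1.0 - (anchor * neg).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()   # standard triplet loss
```

Freezing the encoder and training only the prompt keeps the update parameter-efficient; the auxiliary negatives are what push hard distractor queries away from the image feature.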
Related papers
- VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation [3.776249047528669]
This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA).
We improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation (see the sketch below), and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework.
The resulting UDA-FROV framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
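Prompt augmentation of this kind is commonly realized by averaging a class name's text embedding over several templates; a minimal sketch of that idea, using `open_clip` rather than the FROVSS code, with the template list chosen here as an assumption:

```python
import torch
import open_clip

# Sketch of prompt augmentation: average a class name's CLIP text embedding
# over several templates to get a more robust query. The template list is an
# illustrative assumption, not taken from the FROVSS paper.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a photo of the small {}."]

def robust_text_embedding(class_name: str) -> torch.Tensor:
    prompts = tokenizer([t.format(class_name) for t in TEMPLATES])
    with torch.no_grad():
        emb = model.encode_text(prompts)
        emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each template
    mean = emb.mean(dim=0)
    return mean / mean.norm()                        # renormalize the average
```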
arXiv Detail & Related papers (2024-12-12T12:49:42Z)
- DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction [80.67150791183126]
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can be seamlessly integrated into open-vocabulary object detection and image segmentation tasks, leading to notable performance improvements.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- Overcoming Domain Limitations in Open-vocabulary Segmentation [24.169403141373927]
Open-vocabulary segmentation (OVS) has gained attention for its ability to recognize a broader range of classes.
However, OVS models show significant performance drops when applied to unseen domains beyond their training datasets.
We propose a method that allows OVS models to learn information from new domains while preserving prior knowledge.
arXiv Detail & Related papers (2024-10-15T12:11:41Z)
- Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
- CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor [18.288738950822342]
Mask labels are labor-intensive to produce, which limits the number of categories in segmentation datasets.
We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort.
Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
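A plausible reading of the recurrent filtering loop, sketched with hypothetical `clip_segment` and `clip_mask_score` stand-ins for the paper's training-free, CLIP-based components:

```python
# Sketch of a recurrent filter loop in the spirit of CLIP-as-RNN: segment with
# all queries, score each query's mask against the image with CLIP, drop weak
# queries, and repeat. `clip_segment` and `clip_mask_score` are hypothetical
# stand-ins, not the paper's actual components.

def recurrent_filter(image, queries, clip_segment, clip_mask_score,
                     threshold=0.5, max_iters=5):
    for _ in range(max_iters):
        masks = clip_segment(image, queries)            # one mask per query
        scores = [clip_mask_score(image, m, q)          # image-text agreement
                  for m, q in zip(masks, queries)]
        kept = [q for q, s in zip(queries, scores) if s >= threshold]
        if len(kept) == len(queries):                   # converged: nothing dropped
            return masks, queries
        queries = kept
    return clip_segment(image, queries), queries
```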
arXiv Detail & Related papers (2023-12-12T19:00:04Z)
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
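The self-training step can be sketched as zero-shot classification of unlabeled region proposals, promoting confident novel-class hits to pseudo labels; the threshold and helper names below are illustrative assumptions, not DST-Det's exact procedure:

```python
import torch.nn.functional as F

# Sketch of mining pseudo labels for novel classes: zero-shot classify region
# features against novel-class text embeddings and keep confident hits as
# pseudo ground truth. Threshold and names are illustrative assumptions.

def mine_novel_pseudo_labels(region_feats, novel_text_embs, tau=0.8):
    """region_feats: (N, D) proposal features; novel_text_embs: (C, D)."""
    sims = F.normalize(region_feats, dim=-1) @ F.normalize(novel_text_embs, dim=-1).T
    probs = sims.softmax(dim=-1)          # zero-shot class posterior per proposal
    conf, labels = probs.max(dim=-1)
    keep = conf >= tau                    # promote only confident proposals
    return labels[keep], keep.nonzero(as_tuple=True)[0]
```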
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
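That aggregation step can be sketched as averaging CLIP image embeddings of the instance's 2D crops across the views where it is visible; the projection and cropping helpers below are simplified assumptions:

```python
import torch
import torch.nn.functional as F

# Sketch of multi-view CLIP feature fusion for one 3D instance mask: embed the
# mask's 2D crop in each view where it is visible, then average. Visibility
# and cropping logic are simplified assumptions.

def fuse_mask_feature(views, mask3d, project, crop, clip_encode_image):
    feats = []
    for view in views:
        mask2d = project(mask3d, view)          # project 3D mask into this view
        if mask2d is None:                      # instance not visible here
            continue
        image_crop = crop(view.image, mask2d)   # crop around the 2D mask
        feats.append(F.normalize(clip_encode_image(image_crop), dim=-1))
    if not feats:                               # mask visible in no view
        return None
    fused = torch.stack(feats).mean(dim=0)      # multi-view average
    return F.normalize(fused, dim=-1)           # final per-mask query feature
```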
arXiv Detail & Related papers (2023-06-23T17:36:44Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
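One way to read this pipeline: synthesize support images per class with a pre-trained text-to-image model, pool their dense features into class prototypes, and label pixels by nearest prototype. The sketch below follows that reading with hypothetical `generate` and `dense_features` wrappers:

```python
import torch
import torch.nn.functional as F

# Sketch of a prototype-based, training-free segmenter in the spirit of OVDiff:
# synthesize a few support images per class, pool their dense features into a
# class prototype, and label each pixel by its nearest prototype. `generate`
# and `dense_features` are hypothetical wrappers around pre-trained components
# (e.g. a diffusion model and a ViT feature extractor).

def build_prototypes(class_names, generate, dense_features, n_support=8):
    protos = []
    for name in class_names:
        images = [generate(f"a photo of a {name}") for _ in range(n_support)]
        feats = torch.cat([dense_features(im).flatten(0, 1) for im in images])
        protos.append(F.normalize(feats.mean(0), dim=-1))
    return torch.stack(protos)                      # (num_classes, D)

def segment(image, prototypes, dense_features):
    f = F.normalize(dense_features(image), dim=-1)  # (H, W, D) per-pixel feats
    return (f @ prototypes.T).argmax(-1)            # (H, W) class-index map
```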
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
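The auxiliary dense task can be sketched as per-pixel classification of visual features against prompt-conditioned text embeddings; the shapes and names below are assumptions rather than VTP-OVD's exact formulation:

```python
import torch
import torch.nn.functional as F

# Sketch of an auxiliary dense pixel-wise alignment task: classify every pixel
# feature against prompt-conditioned class text embeddings and supervise with
# segmentation labels. Shapes and the temperature are illustrative assumptions.

def dense_alignment_loss(pixel_feats, text_embs, labels, temperature=0.07):
    """pixel_feats: (B, D, H, W); text_embs: (C, D); labels: (B, H, W) ints."""
    f = F.normalize(pixel_feats, dim=1)
    t = F.normalize(text_embs, dim=-1)
    logits = torch.einsum("bdhw,cd->bchw", f, t) / temperature
    return F.cross_entropy(logits, labels)          # per-pixel classification
```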
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.