Related papers: OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

URL: http://arxiv.org/abs/2405.20141v4
Date: Tue, 29 Oct 2024 23:03:34 GMT
Title: OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation
Authors: Gonca Yilmaz, Songyou Peng, Marc Pollefeys, Francis Engelmann, Hermann Blum,
Abstract summary: We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs) Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries. Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
Score: 54.98688607911399
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Vision-Language Models (VLMs) have advanced segmentation techniques by shifting from the traditional segmentation of a closed-set of predefined object classes to open-vocabulary segmentation (OVS), allowing users to segment novel classes and concepts unseen during training of the segmentation model. However, this flexibility comes with a trade-off: fully-supervised closed-set methods still outperform OVS methods on base classes, that is on classes on which they have been explicitly trained. This is due to the lack of pixel-aligned training masks for VLMs (which are trained on image-caption pairs), and the absence of domain-specific knowledge, such as autonomous driving. Therefore, we propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into VLMs while preserving their open-vocabulary nature. By doing so, we achieve improved performance in base and novel classes. Existing VLM adaptation methods improve performance on base (training) queries, but fail to fully preserve the open-set capabilities of VLMs on novel queries. To address this shortcoming, we combine parameter-efficient prompt tuning with a triplet-loss-based training strategy that uses auxiliary negative queries. Notably, our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes. Our adapted VLMs can seamlessly be integrated into existing OVS pipelines, e.g., improving OVSeg by +6.0% mIoU on ADE20K for open-vocabulary 2D segmentation, and OpenMask3D by +4.1% AP on ScanNet++ Offices for open-vocabulary 3D instance segmentation without other changes. The project page is available at https://open-das.github.io/.

Related papers

LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation [16.021683473678515]
We propose a training-free method for semantic segmentation using Vision-and-Language Models (VLMs) Our approach enhances the initial per-patch predictions of VLMs through label propagation. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing.
arXiv Detail & Related papers (2025-03-25T15:47:13Z)
VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation with Unsupervised Domain Adaptation [3.776249047528669]
This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision-Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA) We improve the fine-grained segmentation capabilities of VLMs through multi-scale contextual data, robust text embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed Foundational-Retaining Open Vocabulary (FROVSS) framework. The resulting UDA-FROV framework is the first UDA approach to effectively adapt across domains without requiring shared categories.
arXiv Detail & Related papers (2024-12-12T12:49:42Z)
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
Overcoming Domain Limitations in Open-vocabulary Segmentation [24.169403141373927]
Open-vocabulary segmentation (OVS) has gained attention for its ability to recognize a broader range of classes. OVS models show significant performance drops when applied to unseen domains beyond the previous training dataset. We propose a method that allows OVS models to learn information from new domains while preserving prior knowledge.
arXiv Detail & Related papers (2024-10-15T12:11:41Z)
Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation. We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation. FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor [18.288738950822342]
Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. We introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. Experiments show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of data samples.
arXiv Detail & Related papers (2023-12-12T19:00:04Z)
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models [44.146292819267956]
Large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play-Vocabulary Semantic (OVSS) for this task.
arXiv Detail & Related papers (2023-11-28T06:42:58Z)
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes. Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model. Our method achieves panoptic segmentation performance similar to the state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica dataset.
arXiv Detail & Related papers (2023-09-11T13:41:27Z)
Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation. Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z)
Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD) During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task. Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.