On the Domain Robustness of Contrastive Vision-Language Models
- URL: http://arxiv.org/abs/2506.23663v1
- Date: Mon, 30 Jun 2025 09:39:33 GMT
- Title: On the Domain Robustness of Contrastive Vision-Language Models
- Authors: Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
- Abstract summary: Deepbench is a framework designed to assess domain-specific robustness of vision-language models. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains.
- Score: 2.169562514302842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
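The abstract describes a concrete pipeline: generate domain-aware image corruptions and measure how much a contrastive VLM's zero-shot behavior degrades under them, without labeled data. The sketch below illustrates that idea with a label-free prediction-consistency check on CLIP via `open_clip`. It is not the released Deepbench code: the model choice, class prompts, and hand-coded corruptions (stand-ins for the LLM-generated, context-aware corruptions described in the paper) are placeholder assumptions.

```python
# Minimal sketch (assumption-laden, not the released Deepbench code): probe the
# domain robustness of a contrastive VLM by checking how often its zero-shot
# prediction survives a deployment-style image corruption.
# Assumes torch, open_clip_torch, and Pillow are installed; class names and the
# two corruptions below are placeholders for LLM-proposed corruptions.
import torch
import open_clip
from PIL import Image, ImageEnhance, ImageFilter

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# Hypothetical label set for one deployment domain (e.g. maritime monitoring).
CLASS_NAMES = ["cargo ship", "fishing boat", "sailboat", "buoy"]
TEXT_TOKENS = tokenizer([f"a photo of a {c}" for c in CLASS_NAMES]).to(device)

# Hand-coded stand-ins for domain-specific corruptions (fog, low light).
CORRUPTIONS = {
    "fog": lambda im: im.filter(ImageFilter.GaussianBlur(radius=4)),
    "low_light": lambda im: ImageEnhance.Brightness(im).enhance(0.3),
}

@torch.no_grad()
def zero_shot_predict(image: Image.Image) -> int:
    """Return the index of the class prompt most similar to the image."""
    pixels = preprocess(image).unsqueeze(0).to(device)
    img = model.encode_image(pixels)
    txt = model.encode_text(TEXT_TOKENS)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return int((img @ txt.T).argmax(dim=-1))

def prediction_consistency(images, corrupt) -> float:
    """Label-free robustness proxy: fraction of images whose zero-shot
    prediction is unchanged after the corruption is applied."""
    kept = sum(zero_shot_predict(im) == zero_shot_predict(corrupt(im)) for im in images)
    return kept / max(len(images), 1)

# Usage (paths are placeholders): lower consistency under a corruption
# indicates a larger domain-specific robustness gap.
# images = [Image.open(p).convert("RGB") for p in image_paths]
# for name, fn in CORRUPTIONS.items():
#     print(name, prediction_consistency(images, fn))
```

A consistency score is used here because the abstract emphasizes that Deepbench operates without labeled data; with ground-truth labels available, the same loop could instead report the accuracy drop under each corruption.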
Related papers
- Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation [75.30238170051291]
Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies.
Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, which restricts their applicability in real-world scenarios.
Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either low-capacity model architectures or the reliance on domain-specific, small-scale datasets.
arXiv Detail & Related papers (2025-07-15T17:59:59Z) - ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization [48.147576833781386]
ForensicHub is the first unified benchmark for all-domain fake image detection and localization.
It decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators.
It offers 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards.
arXiv Detail & Related papers (2025-05-16T08:49:59Z) - Feature Based Methods in Domain Adaptation for Object Detection: A Review Paper [0.6437284704257459]
Domain adaptation aims to enhance the performance of machine learning models when deployed in target domains with distinct data distributions.
This review delves into advanced methodologies for domain adaptation, including adversarial learning, discrepancy-based, multi-domain, teacher-student, ensemble, and Vision Language Model approaches.
Special attention is given to strategies that minimize the reliance on extensive labeled data, particularly in scenarios involving synthetic-to-real domain shifts.
arXiv Detail & Related papers (2024-12-23T06:34:23Z) - LARE: Latent Augmentation using Regional Embedding with Vision-Language Model [2.0971479389679337]
Vision-language models embed images as a single point in a unified embedding space.
Latent Augmentation using Regional Embedding (LARE) embeds the image as a region in the unified embedding space learned by the VLM.
LARE achieves robust image classification on both in-domain and out-of-domain data by using augmented image embeddings to fine-tune VLMs.
arXiv Detail & Related papers (2024-09-19T09:21:42Z) - Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations [5.065947993017157]
This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model.
We amassed approximately 9.6 million vision-language pairs from very-high-resolution (VHR) imagery.
The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets.
arXiv Detail & Related papers (2024-09-11T06:36:08Z) - WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle the discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z) - Unified Language-driven Zero-shot Domain Adaptation [55.64088594551629]
Unified Language-driven Zero-shot Domain Adaptation (ULDA) is a novel task setting.
It enables a single model to adapt to diverse target domains without explicit domain-ID knowledge.
arXiv Detail & Related papers (2024-04-10T16:44:11Z) - VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision
Tuning [53.35114015288077]
We bridge the domain gap between natural and artificial scenarios with efficient tuning strategies.
We develop a novel framework called VLPose to extend the generalization and robustness of pose estimation models.
Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively.
arXiv Detail & Related papers (2024-02-22T11:21:54Z) - Domain Prompt Learning with Quaternion Networks [49.45309818782329]
We propose to leverage knowledge from domain-specific foundation models to transfer the robust recognition ability of Vision-Language Models to specialized domains.
We present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features.
Our proposed method achieves new state-of-the-art results in prompt learning.
arXiv Detail & Related papers (2023-12-12T08:49:39Z) - Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - Context-Conditional Adaptation for Recognizing Unseen Classes in Unseen Domains [48.17225008334873]
We propose a feature generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization.
The generated visual features better capture the underlying data distribution enabling us to generalize to unseen classes and domains at test-time.
We thoroughly evaluate and analyse our approach on the established large-scale benchmark DomainNet.
arXiv Detail & Related papers (2021-07-15T17:51:16Z)