CLASH: A Benchmark for Cross-Modal Contradiction Detection
- URL: http://arxiv.org/abs/2511.19199v1
- Date: Mon, 24 Nov 2025 15:09:07 GMT
- Title: CLASH: A Benchmark for Cross-Modal Contradiction Detection
- Authors: Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko,
- Abstract summary: CLASH is a novel benchmark for multimodal contradiction detection.<n>It features COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions.
- Score: 15.134491772506196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
Related papers
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection.<n> Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting.<n>We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z) - SetAD: Semi-Supervised Anomaly Learning in Contextual Sets [25.628827917857603]
Semi-supervised anomaly detection has shown great promise by effectively leveraging limited labeled data.<n>We propose SetAD, a novel framework that reframes semi-supervised AD as a Set-level Anomaly Detection task.<n>To enhance robustness and score calibration, we propose a context-calibrated anomaly scoring mechanism.
arXiv Detail & Related papers (2025-11-26T13:27:59Z) - CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution [20.823419395675412]
CrossCheck-Bench is a diagnostic benchmark for evaluating contradiction detection in multimodal inputs.<n>We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection.
arXiv Detail & Related papers (2025-11-19T12:17:15Z) - LegalWiz: A Multi-Agent Generation Framework for Contradiction Detection in Legal Documents [0.10260880679794955]
We present a multi-agent contradiction-aware benchmark framework for the legal domain.<n>It generates synthetic legal-style documents, injects six structured contradiction types, and models both self- and pairwise inconsistencies.<n>This benchmark offers one of the first structured resources for contradiction-aware evaluation in legal RAG pipelines.
arXiv Detail & Related papers (2025-10-03T18:24:27Z) - Unified Unsupervised Anomaly Detection via Matching Cost Filtering [113.43366521994396]
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data.<n>We present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model.
arXiv Detail & Related papers (2025-10-03T03:28:18Z) - METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content.<n>Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z) - Fair Deepfake Detectors Can Generalize [51.21167546843708]
We show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions.<n>Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals.<n>DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art
arXiv Detail & Related papers (2025-07-03T14:10:02Z) - CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation [24.952907733127223]
We propose a general framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD)<n>CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates mismatchs while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio)
arXiv Detail & Related papers (2025-05-21T08:11:07Z) - Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection [25.176984317213858]
Large Language Models (LLMs) suffer from hallucination problems, which hinder their reliability in sensitive applications.<n>We propose a budget-friendly, two-stage detection algorithm that calls the verifier model only for a subset of cases.
arXiv Detail & Related papers (2025-02-20T21:06:08Z) - Addressing Key Challenges of Adversarial Attacks and Defenses in the Tabular Domain: A Methodological Framework for Coherence and Consistency [25.830427564563422]
Class-Specific Anomaly Detection (CSAD) is an effective novel anomaly detection approach.<n> CSAD evaluates adversarial samples relative to their predicted class distribution, rather than a broad benign distribution.<n>Our evaluation incorporates both anomaly detection rates with SHAP-based assessments to provide a more comprehensive measure of adversarial sample quality.
arXiv Detail & Related papers (2024-12-10T09:17:09Z) - Evidential Deep Partial Multi-View Classification With Discount Fusion [24.139495744683128]
We propose a novel framework called Evidential Deep Partial Multi-View Classification (EDP-MVC)
We use K-means imputation to address missing views, creating a complete set of multi-view data.
The potential conflicts and uncertainties within this imputed data can affect the reliability of downstream inferences.
arXiv Detail & Related papers (2024-08-23T14:50:49Z) - ADC: Adversarial attacks against object Detection that evade Context
consistency checks [55.8459119462263]
We show that even context consistency checks can be brittle to properly crafted adversarial examples.
We propose an adaptive framework to generate examples that subvert such defenses.
Our results suggest that how to robustly model context and check its consistency, is still an open problem.
arXiv Detail & Related papers (2021-10-24T00:25:09Z) - Exploring Robustness of Unsupervised Domain Adaptation in Semantic
Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space.
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which pose a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, they fail to provide the critical supervision signals that could learn discriminative representation for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.