Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
- URL: http://arxiv.org/abs/2505.13979v1
- Date: Tue, 20 May 2025 06:25:02 GMT
- Title: Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
- Authors: Maya Srikanth, Run Chen, Julia Hirschberg
- Abstract summary: We examine cases where unimodal and multimodal predictions diverge. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
- Score: 4.87341465958982
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
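The abstract describes unimodal text, audio, and video models combined by a gated fusion model, with unimodal/multimodal disagreement used as a diagnostic. Below is a minimal PyTorch sketch of that setup; since the paper's architecture is not detailed here, the feature dimensions, gate design, and binary empathy label are assumptions.

```python
# Illustrative sketch only: feature sizes, gate design, and class count are
# assumptions, not the paper's published architecture.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse text/audio/video features with learned per-modality gates."""
    def __init__(self, dims=(768, 512, 512), num_classes=2):
        super().__init__()
        fused_dim = sum(dims)
        # One weight per modality, conditioned on all modalities jointly.
        self.gate = nn.Sequential(nn.Linear(fused_dim, len(dims)), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, text, audio, video):
        feats = [text, audio, video]
        concat = torch.cat(feats, dim=-1)
        weights = self.gate(concat)                        # (batch, 3) gate values
        gated = [w.unsqueeze(-1) * f for w, f in zip(weights.unbind(-1), feats)]
        return self.classifier(torch.cat(gated, dim=-1))   # fused empathy logits

def disagreement_flag(unimodal_logits, fused_logits):
    """Flag examples where any unimodal prediction diverges from the fused one."""
    fused_pred = fused_logits.argmax(-1)
    return torch.stack([l.argmax(-1) != fused_pred for l in unimodal_logits]).any(0)
```

Flagged examples can then be inspected against annotator agreement, mirroring the paper's use of disagreement to surface ambiguous cases.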
Related papers
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution [20.823419395675412]
CrossCheck-Bench is a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection.
arXiv Detail & Related papers (2025-11-19T12:17:15Z) - When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning [22.39245479538899]
We introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. A model-agnostic evaluation layer treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead).
arXiv Detail & Related papers (2025-11-04T18:20:13Z) - Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning [49.17801010041155]
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. We categorize multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
arXiv Detail & Related papers (2025-09-28T08:46:11Z) - Rethinking Explainability in the Era of Multimodal AI [9.57008593971486]
Multimodal AI systems have become ubiquitous and achieved remarkable performance across high-stakes applications. Most existing explainability techniques remain unimodal, generating modality-specific feature attributions, concepts, or circuit traces in isolation. This paper argues that such unimodal explanations systematically misrepresent and fail to capture the cross-modal influence that drives multimodal model decisions.
arXiv Detail & Related papers (2025-06-16T03:08:29Z) - Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z) - Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion [3.66486428341988]
Multimodal AI models are increasingly used in fields like healthcare, finance, and autonomous driving. Quantifying uncertainty arising from noise, insufficient evidence, or conflicts between modalities is crucial for reliable decision-making. We propose a novel multimodal learning method with order-invariant evidence fusion and introduce a conflict-based discounting mechanism (see the sketch after this list).
arXiv Detail & Related papers (2024-12-23T22:37:18Z) - Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition [52.522244807811894]
We propose a novel multimodal Transformer framework using prompt learning to address the issue of missing modalities.
Our method introduces three types of prompts: generative prompts, missing-signal prompts, and missing-type prompts.
Through prompt learning, we achieve a substantial reduction in the number of trainable parameters.
arXiv Detail & Related papers (2024-07-07T13:55:56Z) - Confidence-aware multi-modality learning for eye disease screening [58.861421804458395]
We propose a novel multi-modality evidential fusion pipeline for eye disease screening.
It provides a measure of confidence for each modality and elegantly integrates the multi-modality information.
Experimental results on both public and internal datasets demonstrate that our model excels in robustness.
arXiv Detail & Related papers (2024-05-28T13:27:30Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z) - Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [104.60508550106618]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs). We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Informative Data Selection with Uncertainty for Multi-modal Object Detection [25.602915381482468]
We propose a universal uncertainty-aware multi-modal fusion model.
Our model reduces the randomness in fusion and generates reliable output.
Our fusion model is shown to resist severe noise interference such as Gaussian noise, motion blur, and frost, with only slight degradation.
arXiv Detail & Related papers (2023-04-23T16:36:13Z) - Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions [49.4545260500952]
We introduce a novel multimodality evidential fusion pipeline for eye disease screening, EyeMoSt.
Our model estimates both local uncertainty for unimodality and global uncertainty for the fusion modality to produce reliable classification results.
Our experimental findings on both public and in-house datasets show that our model is more reliable than current methods.
arXiv Detail & Related papers (2023-03-17T06:18:16Z)
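Several of the related papers above (the discounted belief fusion, confidence-aware screening, and EyeMoSt entries) share the idea of weighting each modality's evidence by its agreement with the others before fusing. The sketch below is a generic illustration of that idea, not the method of any single paper; the Dirichlet parameterization and total-variation conflict score are simplifying assumptions.

```python
# Minimal sketch, assuming Dirichlet (subjective-logic style) evidence per
# modality and a simple conflict-based discount; not any paper's exact method.
import numpy as np

def dirichlet_from_evidence(evidence):
    """Non-negative per-class evidence e -> Dirichlet parameters alpha = e + 1."""
    return np.asarray(evidence, dtype=float) + 1.0

def expected_probs(alpha):
    return alpha / alpha.sum()

def conflict(alpha_a, alpha_b):
    """Conflict score in [0, 1]: total variation distance between expected class probs."""
    return 0.5 * np.abs(expected_probs(alpha_a) - expected_probs(alpha_b)).sum()

def discounted_fusion(evidences):
    """Discount each modality by its mean conflict with the rest, then sum evidence.
    Summation makes the fusion order-invariant."""
    alphas = [dirichlet_from_evidence(e) for e in evidences]
    fused = np.zeros_like(alphas[0])
    for i, a in enumerate(alphas):
        others = [b for j, b in enumerate(alphas) if j != i]
        c = np.mean([conflict(a, b) for b in others])
        fused += (1.0 - c) * (a - 1.0)   # discounted raw evidence (alpha - 1)
    return dirichlet_from_evidence(fused)

# Example: one modality strongly favors class 0, the other two mildly favor class 1;
# the conflicting, overconfident modality is down-weighted before fusion.
print(expected_probs(discounted_fusion([[9, 1], [1, 3], [1, 3]])))
```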
This list is automatically generated from the titles and abstracts of the papers on this site.