When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
- URL: http://arxiv.org/abs/2511.02794v1
- Date: Tue, 04 Nov 2025 18:20:13 GMT
- Title: When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
- Authors: Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang
- Abstract summary: We introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. A model-agnostic evaluation layer treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead).
- Score: 22.39245479538899
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
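The abstract describes the evaluation layer only at a high level, so the following is a minimal Python sketch of the idea: each modality "agent" reports a candidate label plus a self-assessed confidence, a simple confidence-weighted vote fuses the reports, and an audit step labels contributors, saboteurs, and sabotage events. The names (`ModalityReport`, `fuse_and_audit`), the voting rule, and the 0.9 confidence threshold are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a modality-as-agent diagnostic layer; names, the
# voting rule, and the sabotage threshold are assumptions for illustration.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ModalityReport:
    modality: str        # e.g. "audio", "vision", "text"
    label: str           # candidate label proposed by the unimodal "agent"
    confidence: float    # brief self-assessment in [0, 1], used for auditing

def fuse_and_audit(reports, gold_label):
    """Aggregate per-modality reports, then audit each modality's role."""
    # Simple fusion: confidence-weighted vote over candidate labels.
    scores = defaultdict(float)
    for r in reports:
        scores[r.label] += r.confidence
    fused = max(scores, key=scores.get)

    # Audit: contributors support the correct outcome, saboteurs mislead it.
    contributors = [r.modality for r in reports if r.label == gold_label]
    saboteurs = [r.modality for r in reports if r.label != gold_label]

    # Modality sabotage: a high-confidence unimodal error flips the fused result.
    sabotage = fused != gold_label and any(
        r.label == fused and r.confidence >= 0.9 for r in reports
    )
    return fused, contributors, saboteurs, sabotage

# Example: a confident but wrong text agent overrides two weaker correct modalities.
reports = [
    ModalityReport("vision", "sad", 0.4),
    ModalityReport("audio", "sad", 0.4),
    ModalityReport("text", "angry", 0.95),
]
print(fuse_and_audit(reports, gold_label="sad"))
# -> ('angry', ['vision', 'audio'], ['text'], True)
```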
Related papers
- A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification [2.173091573209431]
Existing auditing approaches rely on unimodal features or metadata-based subgroup analyses. We introduce the first automated auditing framework that extends slice discovery methods to multimodal representations. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset.
arXiv Detail & Related papers (2026-02-27T17:06:37Z) - ProbeLLM: Automating Principled Diagnosis of LLM Failures [89.44131968886184]
We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence.
arXiv Detail & Related papers (2026-02-13T14:33:13Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning [49.17801010041155]
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. We categorize multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
arXiv Detail & Related papers (2025-09-28T08:46:11Z) - A Closer Look at Multimodal Representation Collapse [12.399005128036746]
We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another. We propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities.
arXiv Detail & Related papers (2025-05-28T15:31:53Z) - Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection [4.87341465958982]
We examine cases where unimodal and multimodal predictions diverge. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
arXiv Detail & Related papers (2025-05-20T06:25:02Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion [3.66486428341988]
Multimodal AI models are increasingly used in fields like healthcare, finance, and autonomous driving. Accounting for uncertainty arising from noise, insufficient evidence, or conflicts between modalities is crucial for reliable decision-making. We propose a novel multimodal learning method with order-invariant evidence fusion and introduce a conflict-based discounting mechanism (a hedged sketch of this idea follows below).
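For intuition on what conflict-based discounting can look like, here is a minimal Python sketch in the subjective-logic / evidential-learning style: per-modality evidence is converted to belief masses, each modality is down-weighted by its average conflict with the others, and the discounted evidence is summed (a sum is order-invariant). The specific conflict measure and discount rule are assumptions for illustration, not necessarily the operator proposed in the paper.

```python
# Illustrative conflict-based discounting over per-modality Dirichlet evidence.
# The conflict measure and discount rule are assumptions, not the paper's method.
import numpy as np

def beliefs(evidence):
    """Evidence over K classes -> (belief masses, uncertainty mass)."""
    K = evidence.shape[-1]
    strength = evidence.sum(-1, keepdims=True) + K   # Dirichlet strength S
    return evidence / strength, K / strength.squeeze(-1)

def conflict(b1, b2):
    """Degree of conflict: total belief mass placed on mismatching classes."""
    return float(np.sum(np.outer(b1, b2)) - np.sum(b1 * b2))

def discounted_fusion(evidences):
    """Discount each modality by its average conflict with the others, then sum."""
    bs = [beliefs(e)[0] for e in evidences]
    fused = np.zeros_like(evidences[0], dtype=float)
    for i, e in enumerate(evidences):
        c = np.mean([conflict(bs[i], bs[j]) for j in range(len(bs)) if j != i])
        fused += (1.0 - c) * e      # heavier conflict -> smaller contribution
    return fused

# Example: a conflicting third modality is down-weighted in the fused evidence.
e_audio  = np.array([8.0, 1.0, 1.0])
e_vision = np.array([6.0, 2.0, 1.0])
e_text   = np.array([1.0, 9.0, 1.0])
print(discounted_fusion([e_audio, e_vision, e_text]))
```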
arXiv Detail & Related papers (2024-12-23T22:37:18Z) - Confidence-aware multi-modality learning for eye disease screening [58.861421804458395]
We propose a novel multi-modality evidential fusion pipeline for eye disease screening.
It provides a measure of confidence for each modality and elegantly integrates the multi-modality information.
Experimental results on both public and internal datasets demonstrate that our model excels in robustness.
arXiv Detail & Related papers (2024-05-28T13:27:30Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Reliable Multimodality Eye Disease Screening via Mixture of Student's t Distributions [49.4545260500952]
We introduce a novel multimodality evidential fusion pipeline for eye disease screening, EyeMoSt.
Our model estimates both local uncertainty for unimodality and global uncertainty for the fusion modality to produce reliable classification results.
Our experimental findings on both public and in-house datasets show that our model is more reliable than current methods.
arXiv Detail & Related papers (2023-03-17T06:18:16Z)