Related papers: ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning

URL: http://arxiv.org/abs/2511.06057v1
Date: Sat, 08 Nov 2025 15:56:24 GMT
Title: ReMoD: Rethinking Modality Contribution in Multimodal Stance Detection via Dual Reasoning
Authors: Bingbing Wang, Zhengda Jin, Bin Liang, Jing Li, Ruifeng Xu
Abstract summary: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. We propose **ReMoD**, a framework that **Re**thinks **Mo**dality contribution of stance expression through a **D**ual-reasoning paradigm.
Score: 29.400983680521733
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing work simply fuses information from various modalities to learn stance representations, overlooking the varying contributions of stance expression from different modalities. Therefore, stance misunderstanding noises may be drawn into the stance learning process due to the risk of learning errors by rough modality combination. To address this, we get inspiration from the dual-process theory of human cognition and propose **ReMoD**, a framework that **Re**thinks **Mo**dality contribution of stance expression through a **D**ual-reasoning paradigm. ReMoD integrates *experience-driven intuitive reasoning* to capture initial stance cues with *deliberate reflective reasoning* to adjust for modality biases, refine stance judgments, and thereby dynamically weight modality contributions based on their actual expressive power for the target stance. Specifically, the intuitive stage queries the Modality Experience Pool (MEP) and Semantic Experience Pool (SEP) to form an initial stance hypothesis, prioritizing historically impactful modalities. This hypothesis is then refined in the reflective stage via two reasoning chains: Modality-CoT updates MEP with adaptive fusion strategies to amplify relevant modalities, while Semantic-CoT refines SEP with deeper contextual insights of stance semantics. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the public MMSD benchmark demonstrate that our ReMoD significantly outperforms most baseline models and exhibits strong generalization capabilities.
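
The two-stage loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ModalityExperiencePool`, the functions `intuitive_stage` and `reflective_stage`, and the reward rule are all hypothetical, and the LLM-based Modality-CoT reasoning is swapped for a simple "reward the modality that agreed with the gold label" update. It assumes each modality has already produced a stance distribution and only shows how stored experience could reweight modalities over time.

```python
import math


def softmax(scores):
    """Turn raw per-modality scores into weights that sum to 1."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


class ModalityExperiencePool:
    """Hypothetical stand-in for the MEP: one running score per modality."""

    def __init__(self, modalities):
        self.scores = {name: 0.0 for name in modalities}

    def weights(self):
        return softmax(self.scores)

    def update(self, modality, reward, lr=0.5):
        self.scores[modality] += lr * reward


def intuitive_stage(pool, stance_probs):
    """Form an initial hypothesis by fusing modalities with the pool's weights."""
    w = pool.weights()
    labels = list(next(iter(stance_probs.values())).keys())
    fused = {
        label: sum(w[m] * stance_probs[m][label] for m in stance_probs)
        for label in labels
    }
    return max(fused, key=fused.get), fused


def reflective_stage(pool, stance_probs, gold):
    """Assumed update rule: raise modalities whose own prediction matched
    the gold stance and lower the ones that did not."""
    for modality, probs in stance_probs.items():
        predicted = max(probs, key=probs.get)
        pool.update(modality, 1.0 if predicted == gold else -1.0)


# Made-up example: the text looks supportive, the image carries the real stance.
sample = {
    "text": {"favor": 0.9, "against": 0.1},
    "image": {"favor": 0.3, "against": 0.7},
}
pool = ModalityExperiencePool(["text", "image"])

before, _ = intuitive_stage(pool, sample)        # equal weights at the start
reflective_stage(pool, sample, gold="against")   # one training-time update
after, _ = intuitive_stage(pool, sample)         # the image now dominates
```

With equal starting weights the fused hypothesis follows the confident text modality; after a single reflective update the image modality receives more weight and the hypothesis flips. In the paper the pools are refined by reasoning chains during training rather than by a scalar reward, so this only conveys the direction of the mechanism.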

Related papers

MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization [4.088161686930475]
We propose Modality-Decoupled Direct Preference Optimization (MoD-DPO) for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities. Experiments demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines.
arXiv Detail & Related papers (2026-03-03T17:50:24Z)
Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind [8.740788873949471]
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks. They still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed.
arXiv Detail & Related papers (2026-02-14T16:01:59Z)
Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding [50.014363382140324]
Modality Importance (MI) is a simple yet effective mechanism for identifying the emotion-dominant modality. MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion. Results show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations.
arXiv Detail & Related papers (2025-12-02T12:29:41Z)
Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning [137.33138614095435]
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions.
arXiv Detail & Related papers (2025-11-12T08:29:39Z)
From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance [10.843417240658992]
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs). We argue that existing benchmarks for evaluating this ability have critical shortcomings. We introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB).
arXiv Detail & Related papers (2025-08-22T08:17:31Z)
Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models [4.064135211977999]
Large language models (LLMs) and vision-language models (LVLMs) struggle with complex, multi-step, cross-modal common sense reasoning tasks. We propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors.
arXiv Detail & Related papers (2025-08-04T20:33:58Z)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [26.757458496178437]
We introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. We construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL.
arXiv Detail & Related papers (2025-03-13T17:56:05Z)
CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection [12.780661306169474]
We introduce the Conditional Dropout and LAnguage-driven (CoLA) framework. The framework mitigates the impact of noisy inputs, while preserving its performance under complete modalities. It outperforms state-of-the-art dual-modal SOD models, under both modality-complete and modality-missing conditions.
arXiv Detail & Related papers (2024-07-09T11:49:24Z)
Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN). We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.