Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding
- URL: http://arxiv.org/abs/2512.02699v1
- Date: Tue, 02 Dec 2025 12:29:41 GMT
- Title: Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding
- Authors: Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro
- Abstract summary: Modality Importance (MI) is a simple yet effective mechanism for identifying the emotion-dominant modality. MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion. Results show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework, comprising modality-aligned supervised fine-tuning and modality-aware reward optimization, encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.
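The paper's code is not included in this digest, so the following is a minimal Python sketch, under stated assumptions, of the core idea: score each modality's importance for the target emotion and reorder the per-modality evidence so that reasoning begins from the dominant modality. The function names, the MI scoring rule, and the prompt layout below are all illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of MI-guided reasoning reordering.
# The MI scores here are assumed inputs (e.g., similarity between each
# modality's features and a target-emotion embedding in the real system).

from typing import Dict, List


def rank_modalities(mi_scores: Dict[str, float]) -> List[str]:
    """Rank modalities by Modality Importance, most emotion-dominant first."""
    return sorted(mi_scores, key=mi_scores.get, reverse=True)


def build_reasoning_prompt(evidence: Dict[str, str],
                           mi_scores: Dict[str, float]) -> str:
    """Reorder per-modality evidence so the reasoning sequence starts
    from the emotion-dominant modality, as MIGR prescribes."""
    ordered = rank_modalities(mi_scores)
    parts = [f"[{m.upper()}] {evidence[m]}" for m in ordered]
    return ("Explain the speaker's emotion, reasoning over the evidence "
            "in the order given:\n" + "\n".join(parts))


if __name__ == "__main__":
    evidence = {
        "audio": "trembling voice, rising pitch",
        "visual": "neutral face, averted gaze",
        "text": "transcript: 'I said I'm fine.'",
    }
    # Assumed MI scores; audio dominates, so reasoning starts there.
    mi = {"audio": 0.62, "visual": 0.21, "text": 0.17}
    print(build_reasoning_prompt(evidence, mi))
```

In the full framework, this ordering would presumably guide the two-stage training (modality-aligned supervised fine-tuning and modality-aware reward optimization) rather than a one-off prompt as shown here.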
Related papers
- Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieval-Augmented Language Models
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier reason over retrieved evidence and critique each other's logic while being guided by a process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in fewer than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning. It tests the use of one modality to guide, verify, or refine outputs in the other.
arXiv Detail & Related papers (2025-11-03T02:27:46Z) - Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
We propose the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion. We show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions.
arXiv Detail & Related papers (2025-10-27T16:40:17Z) - Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. We categorize multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
arXiv Detail & Related papers (2025-09-28T08:46:11Z) - ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models
Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs), but they often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers. We introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains.
arXiv Detail & Related papers (2025-08-17T14:50:23Z) - Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis
Implicit sentiment analysis aims to uncover emotions that are subtly expressed, often obscured by ambiguity and figurative language. We propose a novel Dual Reverse Chain Reasoning framework to enhance the performance of implicit sentiment analysis. We also introduce a Triple Reverse Chain Reasoning framework to address the limitations of random hypotheses.
arXiv Detail & Related papers (2025-03-10T10:10:50Z) - Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
CauseMotion is a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. By integrating RAG with a sliding-window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments (a toy retrieval sketch follows this list). A GLM-4 model integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. On the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
arXiv Detail & Related papers (2025-01-01T09:10:32Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z)
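As referenced in the CauseMotion entry above, here is a toy Python sketch of sliding-window retrieval over a dialogue. That paper's actual retriever, window size, and scoring are not given in this digest, so every name and value below is an illustrative assumption, not the published method.

```python
# Toy sliding-window retrieval over dialogue turns, in the spirit of
# CauseMotion's RAG + sliding-window design. Lexical overlap stands in
# for whatever learned retriever the real system uses.

from typing import List, Tuple


def window_segments(turns: List[str], size: int = 3, stride: int = 1) -> List[str]:
    """Slide a fixed-size window over turns to form retrievable segments
    that preserve local conversational context."""
    return [" ".join(turns[i:i + size])
            for i in range(0, max(len(turns) - size + 1, 1), stride)]


def jaccard(a: str, b: str) -> float:
    """Toy lexical-overlap score; a real retriever would use embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve(query: str, turns: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the top-k window segments most relevant to the query,
    which would then be passed to the LLM for causal reasoning."""
    scored = [(jaccard(query, seg), seg) for seg in window_segments(turns)]
    return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    dialogue = [
        "A: You forgot my birthday again.",
        "B: Work has been crushing me lately.",
        "A: That's what you said last year.",
        "B: I know. I'm sorry, I really am.",
    ]
    for score, seg in retrieve("why is speaker A upset about the birthday", dialogue):
        print(f"{score:.2f}  {seg}")
```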