Fugu-MT Paper Translation (Abstract): MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

Paper overview: MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

arxiv url: http://arxiv.org/abs/2604.00013v1
Date: Tue, 10 Mar 2026 12:48:41 GMT
Status: translation complete
Last updated in system: 2026-04-06 02:36:13.189361
Title: MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis
Title (reference translation): MSA-Thinker: Discrimination-Calibration Reasoning for Multimodal Sentiment Analysis Using Hint-Guided Reinforcement Learning
Authors: Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu, Sijie Mai
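Abstract summary: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Existing methods that incorporate Chain-of-Thought (CoT) reasoning are hindered by high annotation costs. This work proposes a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning.
Reference score (independently computed attention score): 5.1150258716324055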
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.
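Abstract (reference translation): Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance through supervised fine-tuning (SFT), but their end-to-end "black-box" nature limits interpretability. Existing methods that incorporate Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, the authors propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, cold-start SFT is performed using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model, from the initial stage, with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration. Building on this, the authors propose Hint-GRPO, which uses the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model show that the method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Importantly, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.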
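
Illustrative sketch (not from the paper): the abstract describes Hint-GRPO only at a high level, so the Python below is a rough guess at the general idea, not the authors' implementation. It shows the standard GRPO-style group-relative advantage (each reward standardized within its rollout group) and why a group whose rewards are all identical yields no learning signal, which is one way reward sparsity appears on hard samples. The hint rule itself is an assumption: `coarse_polarity`, `maybe_add_hint`, the `hard_threshold` parameter, the hint wording, and the choice of population standard deviation are all hypothetical names and choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, the paper may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def coarse_polarity(label):
    """Hypothetical discrimination target: map a regression label to a coarse class."""
    if label > 0:
        return "positive"
    if label < 0:
        return "negative"
    return "neutral"


def maybe_add_hint(prompt, rewards, label, hard_threshold=0.0):
    """Assumed rule: a sample is 'hard' when no rollout scores above the threshold.

    For hard samples, append a directional hint derived from the coarse polarity
    of the ground-truth label; otherwise return the prompt unchanged.
    """
    if max(rewards) > hard_threshold:
        return prompt
    return f"{prompt}\nHint: the overall sentiment is {coarse_polarity(label)}."


# A group with mixed rewards produces non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0])

# A group where every rollout fails produces all-zero advantages (no signal).
sparse = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Under the assumed rule, only the all-failed group receives a hint.
hinted = maybe_add_hint("Rate the sentiment.", [0.0, 0.0, 0.0, 0.0], label=1.8)
unhinted = maybe_add_hint("Rate the sentiment.", [0.0, 1.0], label=1.8)
```

In this sketch the hint reveals only the coarse direction, leaving the fine-grained value for the model to calibrate, which mirrors the discrimination-then-calibration ordering the abstract describes.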

Related papers