Fugu-MT Paper Translation (Abstract): Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

Paper overview: Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

arxiv url: http://arxiv.org/abs/2510.22836v1
Date: Sun, 26 Oct 2025 21:06:13 GMT
Status: Translation complete
System update date: 2025-10-28 15:28:15.386739
Title: Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes
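Title (reference translation): Rethinking the Text-Vision Reasoning Imbalance in MLLMs
Authors: Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang
Abstract summary: Multimodal large language models (MLLMs) show strong capabilities on vision-and-language tasks. Recent studies point to an imbalance in their reasoning abilities across the visual and textual modalities. The authors call this phenomenon the modality gap, defined as the performance disparity between text-centric and vision-centric inputs.
Reference score (independently computed attention metric): 54.374410871041164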
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.
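Abstract (reference translation): Multimodal large language models (MLLMs) show strong capabilities on vision-and-language tasks. However, recent studies point to an imbalance in their reasoning abilities across the visual and textual modalities. In particular, current MLLMs tend to rely too heavily on textual cues while paying too little attention to visual content, which leads to suboptimal performance on tasks that require genuine visual reasoning. We call this phenomenon the modality gap and define it as the performance disparity between text-centric and vision-centric inputs. This paper analyzes the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify the gap. We then systematically examine strategies for bridging it from two complementary perspectives: data and loss design. The findings offer insight into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.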
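
The abstract defines the modality gap only as a performance disparity between text-centric and vision-centric inputs and does not give a formula. As a minimal sketch, assuming performance is measured as exact-match accuracy and the gap is taken as text-centric accuracy minus vision-centric accuracy, it could be computed as below. The function names, the metric, and the toy data are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def modality_gap(
    text_predictions: Sequence[str],
    vision_predictions: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Assumed definition: text-centric accuracy minus vision-centric accuracy.

    Both prediction lists answer the same questions; only the input
    modality (text-centric vs. vision-centric) differs.
    """
    return accuracy(text_predictions, answers) - accuracy(vision_predictions, answers)


# Toy data (hypothetical): four questions posed in both input forms.
answers = ["A", "C", "B", "D"]
text_predictions = ["A", "C", "B", "A"]    # 3 of 4 correct
vision_predictions = ["A", "B", "B", "A"]  # 2 of 4 correct

gap = modality_gap(text_predictions, vision_predictions, answers)
print(f"modality gap: {gap:.2f}")  # prints "modality gap: 0.25"
```

Under this assumed definition, a positive value means the model does better on text-centric inputs, and a value near zero indicates balanced performance across the two input forms.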

Related papers