Fugu-MT Paper Translation (Abstract): Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

Paper overview: Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

arxiv url: http://arxiv.org/abs/2605.11931v1
Date: Tue, 12 May 2026 10:44:35 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:56.803437
Title: Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
Title (reference translation): Improving Multimodal Reasoning through Visual Self-Improvement Training
Authors: Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao
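Abstract summary: Post-training with explicit reasoning traces is commonly used to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). The authors propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs.
Reference score (independently computed attention measure): 82.17582358979884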
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training with explicit reasoning traces is commonly used to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs rely too heavily on linguistic priors while neglecting visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances multimodal reasoning performance across various MLLMs and tasks, e.g., bringing average performance gains of up to +13.66% for Qwen2.5-VL-3B-Instruct.
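The abstract names a prefix resampling strategy but does not spell out the procedure. The following is a minimal Python sketch of one plausible reading, not the authors' implementation: a seed trace is truncated at several points and continuations are resampled from each prefix, keeping only those that reach the known answer. The names `prefix_resample`, `sampler`, `is_correct`, `keep_ratios`, and `tries_per_prefix` are all hypothetical, and the toy sampler merely stands in for an MLLM.

```python
from typing import Callable, List

Trace = List[str]  # a reasoning trace as an ordered list of steps


def prefix_resample(
    question: str,
    answer: str,
    seed_trace: Trace,
    sampler: Callable[[str, Trace], Trace],
    is_correct: Callable[[Trace, str], bool],
    keep_ratios=(0.25, 0.5, 0.75),
    tries_per_prefix: int = 2,
) -> List[Trace]:
    """Resample continuations from truncated prefixes of seed_trace.

    Hypothetical interface: sampler(question, prefix) returns a full trace
    that starts with prefix; is_correct checks the trace's final answer.
    """
    collected: List[Trace] = []
    for ratio in keep_ratios:
        cut = max(1, int(len(seed_trace) * ratio))  # keep at least one step
        prefix = seed_trace[:cut]
        for _ in range(tries_per_prefix):
            trace = sampler(question, prefix)
            if is_correct(trace, answer):
                collected.append(trace)
    return collected


# Toy stand-ins for a real MLLM sampler and answer checker (assumptions).
def toy_sampler(question: str, prefix: Trace) -> Trace:
    # Pretend the model only succeeds when given at least two prefix steps.
    final = "answer: 4" if len(prefix) >= 2 else "answer: 5"
    return list(prefix) + [final]


def toy_is_correct(trace: Trace, answer: str) -> bool:
    return trace[-1] == f"answer: {answer}"


seed = ["count the red blocks", "there are 2", "count the blue blocks", "there are 2"]
traces = prefix_resample("How many blocks?", "4", seed, toy_sampler, toy_is_correct)
print(len(traces))
```

In this reading, longer prefixes make hard samples easier to complete, which is one way the under-trained samples mentioned in the abstract could be collected more efficiently. How VISTA actually chooses prefixes is not stated in the abstract.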
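Abstract (reference translation): Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of MLLMs (Multimodal Large Language Models). However, obtaining high-quality reasoning traces is often costly and time-consuming. The self-improvement paradigm has therefore emerged, allowing MLLMs to self-generate reasoning traces for training without external supervision. The paper identifies two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while difficult but important samples are under-trained; 2) language prior bias, where MLLMs rely excessively on linguistic priors while ignoring visual cues. The paper therefore proposes VISTA, a vision-aware self-improvement training framework for strengthening the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, namely supervised fine-tuning and preference learning, and effectively improves multimodal reasoning performance across various MLLMs and tasks, with average performance gains of up to +13.66% for Qwen2.5-VL-3B-Instruct.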
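The abstract mentions a vision-aware attention score that quantifies the model's focus on visual information, without giving its definition. As a rough illustration only, the sketch below assumes the score is the share of attention mass that generated tokens place on visual-token positions, averaged over the generated tokens. The function name `vision_attention_score` and its inputs are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Sequence


def vision_attention_score(
    attn_rows: Sequence[Sequence[float]],
    visual_positions: Iterable[int],
) -> float:
    """Average fraction of attention mass that falls on visual tokens.

    attn_rows: one row per generated token; each row holds non-negative
    attention weights over the input positions (assumed layout).
    visual_positions: indices of the input positions that are image tokens.
    """
    visual = set(visual_positions)
    shares: List[float] = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            continue  # skip degenerate rows rather than divide by zero
        on_visual = sum(w for i, w in enumerate(row) if i in visual)
        shares.append(on_visual / total)
    return sum(shares) / len(shares) if shares else 0.0


# Two generated tokens attending over three input positions;
# positions 0 and 1 are treated as image tokens in this toy example.
rows = [
    [0.5, 0.25, 0.25],     # 0.75 of the mass on visual positions
    [0.125, 0.125, 0.75],  # 0.25 of the mass on visual positions
]
score = vision_attention_score(rows, visual_positions=[0, 1])
print(score)
```

A low value under this definition would suggest that a trace was produced mostly from the text, which is the language prior bias the abstract describes. The paper's actual score may aggregate over layers and heads differently.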

Related papers