Fugu-MT Paper Translation (Abstract): M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

Paper abstract: M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

arxiv url: http://arxiv.org/abs/2508.12458v1
Date: Sun, 17 Aug 2025 18:07:55 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.77966
Title: M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following
Authors: Ruirui Gao, Emily Johnson, Bowen Tan, Yanfei Qian
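Abstract summary: Large Vision-Language Models (LVLMs) hold great potential for complex multimodal instruction following. M3PO is a novel, data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates.
Reference score (attention score computed by this site): 4.119014132092875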
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods like RLHF and DPO frequently struggle to efficiently leverage the model's own generation space to identify highly informative "hard negative" samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates. This selection is driven by a sophisticated mechanism that integrates two crucial signals: a Multimodal Alignment Score (MAS) to assess external quality and the model's Self-Consistency / Confidence (log-probability) to gauge internal belief. These are combined into a novel M3P-Score, which specifically identifies preferred responses and challenging dispreferred responses that the model might confidently generate despite being incorrect. These high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning on base LVLMs like LLaVA-1.5 (7B/13B) using LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).
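The pair-selection and DPO steps described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the abstract does not give the M3P-Score formula, so the linear combination of MAS and log-probability, the weight `lam`, and the names `Candidate`, `select_pair`, and `dpo_loss` are all hypothetical. Only the per-pair DPO objective itself follows the standard published form.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    mas: float      # Multimodal Alignment Score (external quality); assumed to lie in [0, 1]
    logprob: float  # length-normalized log-probability under the model (<= 0)


def select_pair(cands: List[Candidate], lam: float = 0.5) -> Tuple[Candidate, Candidate]:
    """Pick a (preferred, dispreferred) pair from a candidate pool.

    Assumed scoring (not taken from the paper):
      preferred    maximizes  mas + lam * logprob
      dispreferred maximizes  (1 - mas) + lam * logprob
    so the "hard negative" is a low-quality response the model is confident in.
    """
    if len(cands) < 2:
        raise ValueError("need at least two candidates to form a pair")
    chosen = max(cands, key=lambda c: c.mas + lam * c.logprob)
    rest = [c for c in cands if c is not chosen]
    rejected = max(rest, key=lambda c: (1.0 - c.mas) + lam * c.logprob)
    return chosen, rejected


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    Inputs are sequence log-probabilities under the policy (pi_*) and the
    frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(margin)) = softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical candidate pool for one image-instruction prompt.
pool = [
    Candidate("A", mas=0.9, logprob=-0.4),  # high quality, fairly confident
    Candidate("B", mas=0.2, logprob=-0.2),  # low quality, very confident
    Candidate("C", mas=0.3, logprob=-2.0),  # low quality, unconfident
]
chosen, rejected = select_pair(pool)
```

With this toy pool, `B` is selected over `C` as the dispreferred response because the model assigns it a much higher log-probability despite its low alignment score, which is the "confidently incorrect" case the abstract targets. In practice the selected pairs would be fed to a DPO trainer with LoRA adapters rather than scored one pair at a time.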
