Fugu-MT Paper Translation (Abstract): Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Paper overview: Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

arxiv url: http://arxiv.org/abs/2605.18740v3
Date: Wed, 27 May 2026 17:29:28 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:54.760801
Title: Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Title (reference translation): Vision-OPD: Fine-Detail Learning for Multimodal LLMs via On-Policy Self-Distillation
Authors: Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu
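Abstract summary: Multimodal Large Language Models (MLLMs) struggle with fine-grained visual understanding. We propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework. Vision-OPD instantiates two conditional policies from the same MLLM.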
Reference score (independently computed attention score): 71.92541392470103
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models.
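Abstract (reference translation): Multimodal Large Language Models (MLLMs) struggle with fine-grained visual understanding. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately on evidence-centered crops than on the corresponding full images, which suggests that many failures arise from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability. We propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes the token-level divergence between the teacher and student next-token distributions along these rollouts. This lets the model internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against larger open-source, closed-source, and "Thinking-with-Images" agentic models.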
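
The abstract's core training signal, a token-level divergence between teacher and student next-token distributions along student-generated rollouts, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which divergence or direction is used, so the choice of KL(student || teacher) here is an assumption, and the function names (`softmax`, `token_kl`, `rollout_divergence`) and the toy logits are hypothetical. In a real setup both logit sequences would come from the same MLLM scoring the same student-sampled tokens, once conditioned on the full image (student) and once on the evidence-centered crop (teacher).

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution.

    The direction of the divergence is an assumption; the abstract only
    says "token-level divergence".
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def rollout_divergence(student_logits, teacher_logits):
    """Mean token-level divergence along one student-generated rollout.

    Each argument is a list of per-position logit vectors over the vocabulary,
    both evaluated on the same sampled token sequence.
    """
    per_token = [
        token_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout of two positions over a three-token vocabulary (made-up numbers).
student_logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]  # full-image conditioned
teacher_logits = [[0.5, 2.0, 0.1], [0.2, 0.2, 0.2]]  # crop conditioned
loss = rollout_divergence(student_logits, teacher_logits)
print(f"mean token-level KL along the rollout: {loss:.4f}")
```

Only the first position contributes to the loss in this toy case, because the two distributions agree at the second position. In an actual training loop the teacher distribution would typically be treated as a fixed target, so that gradients flow only through the student; that detail is also an assumption rather than something the abstract states.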

Related papers