Fugu-MT Paper Translation (Abstract): Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

Paper abstract: Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

arxiv url: http://arxiv.org/abs/2601.01483v1
Date: Sun, 04 Jan 2026 11:09:33 GMT
Status: Translation complete
Last updated in system: 2026-01-06 16:25:22.43144
Title: Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
Authors: Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu
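Abstract summary: This paper proposes a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO achieves up to 34.1% higher verification AUC and 53.5% lower inference time, with gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
Reference score (independently computed attention metric): 48.078132893679744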
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
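The two mechanisms described in the abstract can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation: the function names, the fallback threshold of 0.5, the exact thresholding rule (a correct sample is rewarded when its verification score exceeds the mean score of incorrect samples, and an incorrect sample when its score does not exceed the mean score of correct samples), and the mask layout are all assumptions made for illustration. The policy-ratio and clipping terms of the GRPO objective are omitted.

```python
import numpy as np

def preference_verification_reward(scores, correct):
    """Hypothetical reward: 1.0 when the thresholded verification score
    agrees with the actual correctness of the answer, else 0.0."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Mean verification scores of positive and negative samples act as thresholds.
    # Assumption: fall back to 0.5 when a group has no positives or no negatives.
    pos_mean = scores[correct].mean() if correct.any() else 0.5
    neg_mean = scores[~correct].mean() if (~correct).any() else 0.5
    # Correct answers are compared against the negative mean,
    # incorrect answers against the positive mean.
    predicted_correct = np.where(correct, scores > neg_mean, scores > pos_mean)
    return (predicted_correct == correct).astype(float)

def group_advantage(rewards, eps=1e-6):
    """Group-normalized advantage in the style of GRPO."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def decoupled_token_advantages(gen_rewards, ver_rewards, gen_mask, ver_mask):
    """Compute separate advantages for generation and verification and
    route each to its own tokens through 0/1 masks of shape (group, tokens)."""
    a_gen = group_advantage(gen_rewards)[:, None]
    a_ver = group_advantage(ver_rewards)[:, None]
    return a_gen * np.asarray(gen_mask, dtype=float) + a_ver * np.asarray(ver_mask, dtype=float)

# Toy group of four sampled responses, four tokens each:
# the first two tokens are the answer, the last two are the self-verification.
correct = np.array([True, True, False, False])
scores = np.array([0.9, 0.5, 0.4, 0.8])
gen_mask = np.tile([1, 1, 0, 0], (4, 1))
ver_mask = np.tile([0, 0, 1, 1], (4, 1))

ver_rewards = preference_verification_reward(scores, correct)
token_adv = decoupled_token_advantages(correct.astype(float), ver_rewards, gen_mask, ver_mask)
```

In this toy group the second response is correct but receives a low verification score, so its answer tokens get a positive advantage while its verification tokens get a negative one. Keeping the two advantages separate in this way means a miscalibrated verification score does not penalize a correct answer.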


Related papers