Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
- URL: http://arxiv.org/abs/2601.01483v1
- Date: Sun, 04 Jan 2026 11:09:33 GMT
- Title: Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
- Authors: Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu
- Abstract summary: We propose a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO achieves up to 34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
- Score: 48.078132893679744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward that improves verification capability, and a decoupled optimization mechanism that enables synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to 34.1% higher verification AUC and 53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
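To make the mechanism concrete, below is a minimal Python sketch of the two components as described in the abstract. The function names, the averaged two-mean decision threshold, and the simplified clip-free policy-gradient term are our assumptions, not the authors' reference implementation.

```python
# Illustrative sketch of ADPO's two components (assumptions, not the
# paper's code): a preference verification reward and decoupled,
# token-masked GRPO-style advantages.
import numpy as np

def preference_verification_reward(ver_scores, is_correct):
    """Mean scores of positive (correct) and negative (incorrect) samples
    form a decision threshold; a sample is rewarded when its predicted
    correctness matches its actual answer correctness."""
    s = np.asarray(ver_scores, float)
    c = np.asarray(is_correct, bool)
    if c.all() or (~c).all():              # no contrast within the group
        return np.zeros_like(s)
    threshold = 0.5 * (s[c].mean() + s[~c].mean())
    return ((s > threshold) == c).astype(float)

def decoupled_advantages(gen_rewards, ver_rewards):
    """Group-normalized advantages, computed separately per objective."""
    def z(r):
        r = np.asarray(r, float)
        return (r - r.mean()) / (r.std() + 1e-8)
    return z(gen_rewards), z(ver_rewards)

def masked_objective(token_logp, adv_gen, adv_ver, gen_mask, ver_mask):
    """Token masks isolate gradients: answer tokens are trained by the
    generation advantage, verification tokens by the verification one."""
    token_logp = np.asarray(token_logp, float)
    return -(adv_gen * (token_logp * gen_mask).sum()
             + adv_ver * (token_logp * ver_mask).sum())
```

The masks are the crux: a single policy emits both the answer and its verification score, yet each advantage only reaches its own tokens, which is how the abstract describes preserving generation quality while calibrating verification.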
Related papers
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization (GRPO) is often applied in multi-reward settings without examining its suitability; we analyze the issues this raises. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
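As a rough illustration of what reward-decoupled group normalization could mean in a GRPO-style setup (the paper's exact formulation may differ), compare normalizing a summed reward against normalizing each reward separately before combining:

```python
# Contrast sketch: joint vs. per-reward group normalization. The
# column-wise form and the weighted combination are our assumptions.
import numpy as np

def grpo_advantage(summed_rewards):
    # rewards already summed across objectives: one scale dominates easily
    r = np.asarray(summed_rewards, float)
    return (r - r.mean()) / (r.std() + 1e-8)

def decoupled_advantage(reward_matrix, weights):
    # reward_matrix: (group, n_rewards); normalize each reward column
    # first, so no single reward's variance dominates the group signal
    R = np.asarray(reward_matrix, float)
    normed = (R - R.mean(axis=0)) / (R.std(axis=0) + 1e-8)
    return normed @ np.asarray(weights, float)
```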
- AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards [60.2998874976509]
We propose Advantage-Weighted Policy Optimization (AWPO), which integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate the advantages derived from reasoning signals. Experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks.
arXiv Detail & Related papers (2025-12-22T08:07:00Z)
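A hedged sketch of how a reasoning-reward advantage might be gated and weighted before joining the outcome advantage; the specific gate and weight functions below are illustrative assumptions, not the paper's formulas:

```python
# Sketch: variance-aware gating damps noisy reasoning signals, and a
# difficulty-aware weight (here, group failure rate) emphasizes hard
# prompts. Both functional forms are assumptions for illustration.
import numpy as np

def awpo_style_advantage(outcome_rewards, reasoning_rewards):
    out = np.asarray(outcome_rewards, float)    # e.g. tool-call success
    rea = np.asarray(reasoning_rewards, float)  # explicit reasoning reward
    z = lambda r: (r - r.mean()) / (r.std() + 1e-8)
    gate = 1.0 / (1.0 + rea.var())   # variance-aware: trust stable signals
    difficulty = 1.0 - out.mean()    # difficulty-aware: harder = heavier
    return z(out) + gate * difficulty * z(rea)
```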
arXiv Detail & Related papers (2025-12-22T08:07:00Z) - Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates, and it replaces standard group normalization with a novel normalization enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z)
- Beyond Outliers: A Study of Optimizers Under Quantization [82.75879062804955]
We study the impact of optimizer choice on model robustness under quantization. We evaluate how model performance degrades when models trained with different optimizers are quantized, and we derive scaling laws for quantization-aware training under different optimizers.
arXiv Detail & Related papers (2025-09-27T21:15:22Z)
- The Unanticipated Asymmetry Between Perceptual Optimization and Assessment [15.11427750828098]
We show that fidelity metrics that excel in image quality assessment (IQA) are not necessarily effective for perceptual optimization. We also show that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives.
arXiv Detail & Related papers (2025-09-25T08:08:26Z)
- Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling [8.593971468631613]
Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, owing to the quality and computational cost of verification. We introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g.
arXiv Detail & Related papers (2025-05-16T22:24:48Z)
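The granularity parameter g has a compact reading: verify and prune candidates every g generation steps, so small g behaves like beam search while g at or beyond the maximum length reduces to Best-of-N. The sketch below assumes generate_step() and verify() interfaces and a group size divisible by the kept beam count; the real algorithm's scoring and compute accounting will differ:

```python
# Tunable-granularity search sketch: g interpolates between beam search
# (g = 1) and Best-of-N (g >= max_steps). All interfaces are assumed.
def vg_search(prompt, generate_step, verify, n=8, keep=2, g=4, max_steps=32):
    assert n % keep == 0, "refill below assumes keep divides n"
    beams = [prompt] * n
    for step in range(1, max_steps + 1):
        beams = [generate_step(b) for b in beams]   # extend every candidate
        if step % g == 0 and step < max_steps:
            beams.sort(key=verify, reverse=True)    # each verify call costs compute
            beams = beams[:keep] * (n // keep)      # prune, then refill to n
    return max(beams, key=verify)                   # final Best-of-N style pick
```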
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
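In spirit this is a decode-critique-refine loop; the sketch below uses assumed generate(), critique(), and accept() interfaces and omits the paper's sampling and evaluation details:

```python
# Minimal iterative decoding loop in the spirit of IAD (interfaces and
# stopping rule are assumptions, not the paper's procedure).
def iterative_agent_decoding(task, generate, critique, accept, max_rounds=4):
    feedback = None
    candidate = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)    # condition on prior critique
        if accept(task, candidate):             # stop once evaluation passes
            break
        feedback = critique(task, candidate)    # extract feedback for next round
    return candidate
```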
- UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization [19.673388630963807]
We propose UniCBE, a unified uniformity-driven CBE framework. On the AlpacaEval benchmark, UniCBE saves over 17% of evaluation budgets while achieving a Pearson correlation with ground truth exceeding 0.995. In scenarios where new models are continuously introduced, UniCBE can even save over 50% of evaluation costs.
arXiv Detail & Related papers (2025-02-17T05:28:12Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We identify that DPO alone is insufficient to model correlations in pairwise data and capture their nuanced variations. We therefore propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC, which increases the consistency and informativeness of the pairwise preference signals through targeted modifications.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
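For intuition, the iterative scheme can be summarized (up to normalization) as an exponential-weights update on the estimated win probability; this paraphrase is ours, not the paper's exact loss:

```latex
% Conceptual SPPO-style update: reweight the policy by how often a
% response is preferred over the current policy's own samples. Its
% fixed point is the Nash equilibrium of the constant-sum game.
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
    \exp\!\bigl( \eta \, \mathbb{P}(y \succ \pi_t \mid x) \bigr)
```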