VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
- URL: http://arxiv.org/abs/2601.02256v1
- Date: Mon, 05 Jan 2026 16:36:40 GMT
- Title: VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
- Authors: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
- Abstract summary: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. We propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts.
- Score: 31.201343197395573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
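As a rough illustration of how the first two components could compose in a GRPO-style update, here is a minimal sketch. It assumes simple placeholder forms for the intermediate-reward blend, the step-reweighting schedule, and the clipped surrogate; the mask propagation component is omitted, and none of the names, schedules, or constants come from the paper.

```python
# A minimal, hypothetical sketch of a GRPO-style update with (1) an
# intermediate-reward blend and (2) per-step reweighting. All names,
# schedules, and constants are illustrative assumptions, not the
# paper's implementation.
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within one prompt group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def blended_reward(final_reward, intermediate_rewards, alpha=0.1):
    """Blend the final reward with a stabilizing intermediate signal."""
    return final_reward + alpha * float(np.mean(intermediate_rewards))

def step_weights(num_steps, gamma=1.5):
    """Assumed schedule: later (finer) VAR scales receive larger weight."""
    w = gamma ** np.arange(num_steps, dtype=np.float64)
    return w / w.sum()

def grpo_loss(prob_ratios, advantage, weights, clip_eps=0.2):
    """Clipped surrogate averaged over generation steps with weights.
    prob_ratios[t] = pi_theta(a_t | s_t) / pi_old(a_t | s_t)."""
    ratios = np.asarray(prob_ratios, dtype=np.float64)
    unclipped = ratios * advantage
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * advantage
    return float(np.sum(weights * -np.minimum(unclipped, clipped)))

# Toy usage: one group of 4 samples, 10 generation steps each.
group_rewards = [blended_reward(fr, [0.5, 0.6]) for fr in (0.2, 0.9, 0.4, 0.7)]
adv = grpo_advantages(group_rewards)
w = step_weights(10)
loss = grpo_loss(np.random.uniform(0.9, 1.1, size=10), adv[1], w)
print(f"loss for sample 1: {loss:.4f}")
```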
Related papers
- ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization [13.916180996567128]
Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. We present a novel optimization framework for VAR models that fundamentally differs from prior approaches. Our approach achieves aggressive acceleration of the generation process while largely preserving semantic fidelity and fine details.
arXiv Detail & Related papers (2026-02-26T12:36:56Z) - Discovering Multiagent Learning Algorithms with Large Language Models [8.649235365712004]
We propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning.
arXiv Detail & Related papers (2026-02-18T22:41:00Z) - OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for this task.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - Game-Theoretic Co-Evolution for LLM-Based Heuristic Discovery [37.96481049421407]
Large language models (LLMs) have enabled rapid progress in automatic discovery. We propose a game-theoretic framework that reframes discovery as program-level co-evolution between a solver and an instance generator.
arXiv Detail & Related papers (2026-01-30T12:14:52Z) - Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
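Read literally, the reframing in this summary can be formalized as follows; this is one plausible reading, not the paper's notation, and the return-conditioned sampling step in particular is an assumption:

```latex
p_\theta(\tau, R) \;\approx\; p^{\pi}(\tau, R),
\qquad
\hat V(s_0) \;=\; \mathbb{E}_{p_\theta}\!\left[\, R \mid s_0 \,\right],
\qquad
\tau \;\sim\; p_\theta\!\left(\tau \mid R \ge R^{\star}\right).
```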
arXiv Detail & Related papers (2025-12-25T06:31:11Z) - Resolving Conflicts in Lifelong Learning via Aligning Updates in Subspaces [12.630494786258842]
Low-Rank Adaptation (LoRA) enables efficient continual learning but often suffers from catastrophic forgetting. We propose PS-LoRA, a framework designed to resolve conflicts by aligning updates within the optimization subspace. Our approach employs a dual-regularization objective that penalizes conflicting directions and constrains magnitude deviations to ensure consistency with prior knowledge.
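A hedged sketch of what a dual-regularization objective of this shape could look like follows; the subspace-projection penalty, the magnitude term, and all names are illustrative assumptions, not PS-LoRA's actual formulation.

```python
# A hedged sketch in the spirit of the PS-LoRA summary: penalize update
# directions that conflict with subspaces important to earlier tasks,
# and constrain magnitude drift. All names and penalty forms are
# illustrative assumptions.
import torch

def dual_reg_loss(task_loss, delta_w, prior_basis, prev_norm,
                  lam_dir=1.0, lam_mag=0.1):
    """delta_w: flattened LoRA update, shape (d,).
    prior_basis: (k, d) orthonormal rows spanning prior-task directions."""
    proj = prior_basis @ delta_w                 # component inside the subspace
    conflict = proj.pow(2).sum()                 # directional-conflict penalty
    mag_dev = (delta_w.norm() - prev_norm) ** 2  # magnitude-deviation penalty
    return task_loss + lam_dir * conflict + lam_mag * mag_dev

# Toy usage: 64-dim update, 8-dim prior subspace.
d, k = 64, 8
basis, _ = torch.linalg.qr(torch.randn(d, k))    # (d, k) orthonormal columns
loss = dual_reg_loss(torch.tensor(0.5), torch.randn(d), basis.T, prev_norm=1.0)
```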
arXiv Detail & Related papers (2025-11-28T15:34:36Z) - Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models [64.92045568376705]
Coherent Contextual Decoding (CCD) is a novel inference framework built upon two core innovations. CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step.
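A hedged sketch of an adaptive unmasking budget for a masked diffusion language model, loosely following this summary: rather than unmasking a fixed number of tokens per step, scale the budget with model confidence. The thresholding rule and all names are illustrative assumptions, not CCD's actual strategy.

```python
# Hypothetical adaptive unmasking: unmask every masked position whose top
# probability clears a threshold, with a fixed floor of the most confident
# positions. Illustrative only.
import numpy as np

def adaptive_unmask(probs, masked_idx, base_budget, tau=0.9):
    """probs: (seq_len, vocab) per-position token distributions this step.
    Returns the positions to unmask."""
    conf = probs[masked_idx].max(axis=-1)          # top prob per masked slot
    confident = [i for i, c in zip(masked_idx, conf) if c >= tau]
    if len(confident) >= base_budget:
        return confident
    order = np.argsort(-conf)                      # fall back to top-k
    return [masked_idx[i] for i in order[:base_budget]]

# Toy usage: 16 positions, vocab of 50, 4 positions still masked.
probs = np.random.dirichlet(np.ones(50), size=16)
picked = adaptive_unmask(probs, masked_idx=[3, 7, 8, 12], base_budget=2)
```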
arXiv Detail & Related papers (2025-11-26T09:49:48Z) - Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to the large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in such complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z) - Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization [1.1510009152620668]
Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs with human preferences. We show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds.
arXiv Detail & Related papers (2025-05-29T10:45:38Z) - Flow-GRPO: Training Flow Matching Models via Online RL [80.62659379624867]
We propose Flow-GRPO, the first method to integrate online policy reinforcement learning into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps.
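The ODE-to-SDE conversion described here matches a standard marginal-preserving identity: if the flow-matching ODE $\mathrm{d}x_t = v_t(x_t)\,\mathrm{d}t$ has marginals $p_t$, then for any noise scale $\sigma_t \ge 0$ the SDE below shares those marginals (the concrete $\sigma_t$ schedule used by Flow-GRPO is not given in this summary):

```latex
\mathrm{d}x_t \;=\; \Big[\, v_t(x_t) \;+\; \tfrac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t) \,\Big]\,\mathrm{d}t \;+\; \sigma_t\,\mathrm{d}W_t .
```

The injected Brownian term supplies the stochasticity that online policy-gradient methods need for exploration and likelihood ratios.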
arXiv Detail & Related papers (2025-05-08T17:58:45Z) - Adaptive Multi-Fidelity Reinforcement Learning for Variance Reduction in Engineering Design Optimization [0.0]
Multi-fidelity Reinforcement Learning (RL) frameworks efficiently utilize computational resources by integrating analysis models of varying accuracy and cost. This work proposes a novel adaptive multi-fidelity RL framework in which multiple heterogeneous, non-hierarchical low-fidelity models are dynamically leveraged alongside a high-fidelity model. The effectiveness of the approach is demonstrated on an octocopter design optimization problem, utilizing two low-fidelity models alongside a high-fidelity simulator.
arXiv Detail & Related papers (2025-03-23T22:29:08Z) - ROCM: RLHF on consistency models [8.905375742101707]
We propose a reward optimization framework for applying RLHF to consistency models. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency.
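The generic template for such an $f$-divergence-regularized reward objective (a standard form; the paper's exact variant may differ) is:

```latex
\max_{\theta}\;\; \mathbb{E}_{x \sim p_\theta}\!\left[\, r(x) \,\right] \;-\; \lambda\, D_f\!\left( p_\theta \,\middle\|\, p_{\mathrm{ref}} \right),
\qquad
D_f\!\left( p \,\middle\|\, q \right) \;=\; \mathbb{E}_{x \sim q}\!\left[\, f\!\Big(\tfrac{p(x)}{q(x)}\Big) \right],
```

where choosing $f(u) = u \log u$ recovers the familiar KL penalty.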
arXiv Detail & Related papers (2025-03-08T11:19:48Z) - Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to jointly train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
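One plausible instantiation of a single-stage objective over both signals (an assumption for illustration; AIHF's actual coupling may differ) blends a behavior-cloning term on demonstrations with a Bradley-Terry preference term:

```latex
\min_{\theta,\phi}\;\; \mathcal{L}_{\mathrm{demo}}(\theta) \;+\; \lambda\,\mathcal{L}_{\mathrm{pref}}(\phi),
\quad\text{where}\quad
\begin{aligned}
\mathcal{L}_{\mathrm{demo}}(\theta) &= -\,\mathbb{E}_{(s,a)\sim \mathcal{D}_{\mathrm{demo}}}\!\left[ \log \pi_\theta(a \mid s) \right],\\
\mathcal{L}_{\mathrm{pref}}(\phi) &= -\,\mathbb{E}_{(y^{+}\succ\, y^{-})\sim \mathcal{D}_{\mathrm{pref}}}\!\left[ \log \sigma\!\big( r_\phi(y^{+}) - r_\phi(y^{-}) \big) \right],
\end{aligned}
```

together with a term tying $\pi_\theta$ to the learned reward $r_\phi$ (e.g., via entropy-regularized RL).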
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to PPO and DPO.
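Schematically, REBEL's core update is an iterative least-squares regression of relative rewards onto relative log-probability ratios (notation paraphrased from the REBEL paper; consult it for the exact statement):

```latex
\theta_{t+1} \;=\; \arg\min_{\theta}\;
\mathbb{E}_{(x,\,y,\,y')}
\left[
\left(
\frac{1}{\eta}\left(
\log \frac{\pi_\theta(y \mid x)}{\pi_{\theta_t}(y \mid x)}
\;-\;
\log \frac{\pi_\theta(y' \mid x)}{\pi_{\theta_t}(y' \mid x)}
\right)
\;-\;
\big( r(x, y) - r(x, y') \big)
\right)^{2}
\right].
```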
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Adjustable Robust Reinforcement Learning for Online 3D Bin Packing [11.157035538606968]
Current deep reinforcement learning methods for online 3D bin packing (3D-BPP) fail in real-world settings where worst-case scenarios can materialize.
We propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights.
Experiments demonstrate that AR2L is versatile: it improves policy robustness while maintaining an acceptable level of performance in the nominal case (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-06T15:34:21Z)
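For the AR2L entry above, here is a sketch of what an adjustable robust objective with robustness weight $\alpha$ could look like (an illustrative form, not necessarily the paper's exact objective):

```latex
J_{\alpha}(\pi) \;=\; \alpha\,\mathbb{E}_{\xi \sim p_{\mathrm{nom}}}\!\left[ R(\pi, \xi) \right] \;+\; (1-\alpha)\,\min_{\xi \in \Xi_{\mathrm{adv}}} \mathbb{E}\!\left[ R(\pi, \xi) \right],
```

where $\xi$ is an item-arrival sequence, $p_{\mathrm{nom}}$ the nominal distribution, and $\Xi_{\mathrm{adv}}$ an adversarial set; $\alpha$ trades off nominal performance against robustness.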
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.