Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
- URL: http://arxiv.org/abs/2511.18919v1
- Date: Mon, 24 Nov 2025 09:29:30 GMT
- Title: Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
- Authors: Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang,
- Abstract summary: We introduce BPGO, a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor.<n>BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
- Score: 18.178060190776858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - Latent Adversarial Regularization for Offline Preference Optimization [21.271580780278473]
We introduce GANPO, which achieves latent-space regularization by penalizing divergence between internal representations of a policy model and a reference model.<n>Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization.
arXiv Detail & Related papers (2026-01-29T18:21:57Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs)<n>We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck.<n>We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Information-Consistent Language Model Recommendations through Group Relative Policy Optimization [3.969918993270049]
Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support.<n>LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent.<n>We propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to optimize for consistency.
arXiv Detail & Related papers (2025-12-14T21:52:31Z) - Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions.<n>It transforms the sparse terminal reward into dense, process-aware value estimates.<n>It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z) - Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Models framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec)<n>Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z) - GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks.<n>It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates.<n>It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
arXiv Detail & Related papers (2025-10-25T14:51:17Z) - Diverse Text-to-Image Generation via Contrastive Noise Optimization [60.48914865049489]
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images.<n>Existing approaches typically optimize intermediate latents or text conditions during inference.<n>We introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective.
arXiv Detail & Related papers (2025-10-04T13:51:32Z) - Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning [34.75717081153747]
Current methods for scoring generated images are susceptible to reward hacking.<n>We propose Pref-GRPO, which shifts the optimization objective from score to preference fitting, ensuring more stable training.<n>Existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment.<n>We introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes.
arXiv Detail & Related papers (2025-08-28T13:11:24Z) - Feedback Guidance of Diffusion Models [14.162420300295365]
Interval-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models.<n>We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need.
arXiv Detail & Related papers (2025-06-06T13:46:32Z) - Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains.<n>The training loop benefits from GRPO's expansive search and DPO's targeted optimization.<n>Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.