Relative Trajectory Balance is equivalent to Trust-PCL
- URL: http://arxiv.org/abs/2509.01632v1
- Date: Mon, 01 Sep 2025 17:17:25 GMT
- Title: Relative Trajectory Balance is equivalent to Trust-PCL
- Authors: Tristan Deleu, Padideh Nouri, Yoshua Bengio, Doina Precup
- Abstract summary: Relative Trajectory Balance (RTB) aims to improve fine-tuning in sequential generative models. This paper establishes an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization.
- Score: 72.58731629381032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in generative modeling has highlighted the importance of Reinforcement Learning (RL) for fine-tuning, with KL-regularized methods in particular proving to be highly effective for both autoregressive and diffusion models. Complementing this line of work, the Relative Trajectory Balance (RTB) objective was recently introduced in the context of Generative Flow Networks (GFlowNets) to serve the same role of improving fine-tuning in sequential generative models. Building on prior work linking GFlowNets and maximum-entropy RL, we establish in this paper an equivalence between RTB and Trust-PCL, an off-policy RL method with KL regularization. This equivalence situates RTB within the broader theoretical landscape of KL-regularized RL, and clarifies its relationship to earlier methods. Leveraging this insight, we revisit an illustrative example from the RTB paper and show that KL-regularized RL methods achieve comparable performance, offering an alternative perspective to what was previously reported.
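As a rough illustration (not drawn from this paper's text), the RTB objective described in the abstract can be sketched as a squared log-ratio loss: it drives the fine-tuned model's trajectory likelihood, scaled by a learned normalizer Z, to match the prior's trajectory likelihood scaled by the reward. The function name and signature below are hypothetical, a minimal sketch of the general form rather than any paper's implementation.

```python
def rtb_loss(log_p_post: float, log_p_prior: float,
             log_reward: float, log_z: float) -> float:
    """Squared log-ratio loss for one trajectory (RTB-style sketch).

    log_p_post  -- summed log-probs of the trajectory under the fine-tuned model
    log_p_prior -- summed log-probs under the frozen prior model
    log_reward  -- log R(x) for the terminal sample x
    log_z       -- learned log normalizing constant
    """
    delta = (log_z + log_p_post) - (log_reward + log_p_prior)
    return delta ** 2

# At the optimum the loss is zero, i.e. p_post(tau) is proportional to
# R(x) * p_prior(tau) -- the same reward-tilted prior that KL-regularized
# RL methods such as Trust-PCL target, which is the intuition behind the
# equivalence the paper establishes.
```

In practice each log-probability would be a sum of per-step log-probs accumulated along a sampled trajectory, and the loss would be minimized jointly over the model parameters and log_z.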
Related papers
- A Comedy of Estimators: On KL Regularization in RL Training of LLMs [81.7906270099878]
Reinforcement learning (RL) can substantially improve the reasoning performance of large language models (LLMs). The RL objective for LLM training involves a regularization term: the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. We study the gradients of several estimator configurations, revealing how design choices shape gradient bias.
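To make the reverse-KL term concrete: it is commonly estimated per token from samples drawn from the trained policy. The sketch below shows the three standard Monte Carlo estimators (often called k1, k2, k3 in community usage); these are illustrative of the design space the blurb alludes to, not necessarily the exact configurations studied in the cited paper.

```python
import math

def kl_estimators(logp_policy: float, logp_ref: float):
    """Per-sample estimators of reverse KL(policy || ref) for a token
    sampled from the policy. Names k1/k2/k3 follow common usage.
    """
    log_ratio = logp_policy - logp_ref   # log pi(x) / pi_ref(x)
    r = math.exp(-log_ratio)             # pi_ref(x) / pi(x)
    k1 = log_ratio                       # unbiased, high variance, can be negative
    k2 = 0.5 * log_ratio ** 2            # biased, low variance, always >= 0
    k3 = (r - 1.0) + log_ratio           # unbiased and always >= 0
    return k1, k2, k3
```

Averaged over many sampled tokens, k1 and k3 both converge to the true reverse KL; which estimator is used, and whether its gradient is taken through the sampling distribution, is exactly the kind of design choice that determines gradient bias.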
arXiv Detail & Related papers (2025-12-26T04:20:58Z) - Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z) - Principled and Tractable RL for Reasoning with Diffusion Language Models [0.0]
Diffusion large language models (dLLMs) are trained to predict multiple tokens in parallel and generate text via iterative unmasking. Recent works have successfully pretrained dLLMs to parity with autoregressive LLMs at the 8B scale, but dLLMs have yet to benefit from modern post-training techniques. We present Amortized Group Relative Policy Optimization (AGRPO), a principled on-policy RL algorithm designed specifically for dLLMs.
arXiv Detail & Related papers (2025-10-05T03:53:16Z) - Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends [64.71326476563213]
Off-policy reinforcement learning for large language models (LLMs) is attracting growing interest. We present a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution. This perspective yields two general principles for adapting REINFORCE to off-policy settings.
arXiv Detail & Related papers (2025-09-29T02:34:54Z) - Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance [46.06527859746679]
We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG). RLG consistently improves the performance of RL fine-tuned models across RL algorithms and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference time.
arXiv Detail & Related papers (2025-08-28T17:18:31Z) - Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws [52.10468229008941]
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting. We provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. Building on these insights, we introduce a novel method for Contrastive Language-Image Pretraining with a reference model, termed DRRho-CLIP.
arXiv Detail & Related papers (2025-05-10T16:55:03Z) - Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization [14.320131946691268]
We propose an easy-to-use and theoretically sound fine-tuning method for flow-based generative models. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. Our method achieves optimal policy convergence while allowing controllable trade-offs between reward and diversity.
arXiv Detail & Related papers (2025-02-09T22:45:15Z) - Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning [22.333460316347264]
We introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies. We develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint.
arXiv Detail & Related papers (2025-02-07T09:30:35Z) - Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning [15.789898162610529]
Reinforcement learning from human feedback (RLHF) has become a crucial step in building reliable generative AI models. This study develops a disciplined approach to fine-tuning diffusion models using continuous-time RL.
arXiv Detail & Related papers (2025-02-03T20:50:05Z) - Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods like offline PPO and offline DPO: the lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.