Rectifying Reinforcement Learning for Reward Matching
- URL: http://arxiv.org/abs/2406.02213v1
- Date: Tue, 4 Jun 2024 11:11:53 GMT
- Title: Rectifying Reinforcement Learning for Reward Matching
- Authors: Haoran He, Emmanuel Bengio, Qingpeng Cai, Ling Pan
- Abstract summary: We establish a new connection between GFlowNets and policy evaluation for a uniform policy.
We propose a novel rectified policy evaluation algorithm, which achieves the same reward-matching effect as GFlowNets.
- Score: 12.294107455811496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. Owing to their sequential decision-making processes, GFlowNets bear a strong resemblance to reinforcement learning (RL), which typically aims to maximize reward. Recent works have studied connections between GFlowNets and maximum entropy (MaxEnt) RL, which augments the standard RL objective with an entropy-regularization term. However, a critical theoretical gap persists: despite the apparent similarities in their sequential decision-making nature, a direct link between GFlowNets and standard (non-MaxEnt) RL has yet to be discovered, and bridging this gap could further unlock the potential of both fields. In this paper, we establish a new connection between GFlowNets and policy evaluation for a uniform policy. Surprisingly, we find that the resulting value function for the uniform policy has a close relationship to the flows in GFlowNets. Leveraging these insights, we further propose a novel rectified policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets, offering a new perspective. We compare RPE, MaxEnt RL, and GFlowNets on a number of benchmarks, and show that RPE achieves competitive results compared to previous approaches. This work sheds light on the previously unexplored connection between (non-MaxEnt) RL and GFlowNets, potentially opening new avenues for future research in both fields.
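To make the abstract's central object concrete, here is a minimal sketch of exact policy evaluation for a uniform policy on a hypothetical toy DAG with rewards only at terminal states. The state names, rewards, and the `uniform_policy_value` helper are invented for illustration; this is not the paper's RPE algorithm, and the rectification step that relates this value function to GFlowNet flows follows the paper's derivation and is not reproduced here.

```python
# Minimal sketch (assumptions: toy DAG, made-up rewards): exact policy
# evaluation for a *uniform* policy, with reward only on transitions into
# terminal states.  Not the paper's RPE implementation.

# children[s] lists the states reachable from s; terminal states have no children.
children = {
    "s0": ["s1", "s2"],
    "s1": ["x1", "x2"],
    "s2": ["x2", "x3"],
    "x1": [], "x2": [], "x3": [],
}
# Unnormalized terminal rewards R(x); reward is received on the transition into x.
reward = {"x1": 1.0, "x2": 2.0, "x3": 3.0}

def uniform_policy_value(children, reward):
    """Backward-induction policy evaluation: V(s) = mean over children of [r(s->s') + V(s')]."""
    value = {}
    def v(s):
        if s in value:
            return value[s]
        if not children[s]:          # terminal state: episode ends, no future reward
            value[s] = 0.0
            return 0.0
        # A uniform policy picks each child with probability 1 / |children(s)|.
        value[s] = sum(reward.get(c, 0.0) + v(c) for c in children[s]) / len(children[s])
        return value[s]
    for s in children:
        v(s)
    return value

print(uniform_policy_value(children, reward))
# e.g. V(s1) = (R(x1) + R(x2)) / 2 = 1.5 and V(s0) = (V(s1) + V(s2)) / 2 = 2.0
```

Backward induction suffices here because the environment is a finite DAG with reward only at termination; the paper's contribution is the observation that value functions of this form are closely tied to GFlowNet flows.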
Related papers
- Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization [4.158255103170876]
GFlowNets are a family of generative models that learn to sample objects proportional to a given reward function.
Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning problems.
We introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process.
arXiv Detail & Related papers (2024-10-20T19:12:14Z) - GFlowNet Training by Policy Gradients [11.02335801879944]
We propose a new GFlowNet training framework, with policy-dependent rewards, that bridges the flow-balance conditions of GFlowNets and the maximization of expected accumulated reward in traditional Reinforcement Learning (RL).
This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL.
arXiv Detail & Related papers (2024-08-12T01:24:49Z) - Looking Backward: Retrospective Backward Synthesis for Goal-Conditioned GFlowNets [27.33222647437964]
Generative Flow Networks (GFlowNets) are amortized sampling methods for learning a policy to sequentially generate objects with probabilities proportional to their rewards.
GFlowNets exhibit a remarkable ability to generate diverse sets of high-reward objects, in contrast to standard reinforcement learning approaches.
Recent works have explored learning goal-conditioned GFlowNets that acquire various useful properties, aiming to train a single GFlowNet capable of achieving different goals as the task specifies.
We propose a novel method named Retrospective Backward Synthesis (RBS) to address these challenges. Specifically, RBS synthesizes a new backward trajectory
arXiv Detail & Related papers (2024-06-03T09:44:10Z) - Generative Flow Networks as Entropy-Regularized RL [4.857649518812728]
Generative flow networks (GFlowNets) are a method of training a policy to sample compositional objects, via a sequence of actions, with probabilities proportional to a given reward (a minimal numerical sketch of this reward-matching property appears after this list).
We demonstrate how the task of learning a generative flow network can be efficiently recast as an entropy-regularized reinforcement learning problem.
Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods.
arXiv Detail & Related papers (2023-10-19T17:31:40Z) - An Empirical Study of the Effectiveness of Using a Replay Buffer on Mode Discovery in GFlowNets [47.82697599507171]
Reinforcement Learning (RL) algorithms aim to learn an optimal policy by iteratively sampling actions so as to maximize the total expected return, $R(x)$.
GFlowNets are a special class of algorithms designed to generate diverse candidates, $x$, from a discrete set, by learning a policy that approximately samples in proportion to $R(x)$.
arXiv Detail & Related papers (2023-07-15T01:17:14Z) - Towards Understanding and Improving GFlowNet Training [71.85707593318297]
We introduce an efficient evaluation strategy to compare the learned sampling distribution to the target reward distribution.
We propose prioritized replay training of high-reward $x$, relative edge flow policy parametrization, and a novel guided trajectory balance objective.
arXiv Detail & Related papers (2023-05-11T22:50:41Z) - Stochastic Generative Flow Networks [89.34644133901647]
Generative Flow Networks (or GFlowNets) learn to sample complex structures through the lens of "inference as control".
Existing GFlowNets can be applied only to deterministic environments, and fail in more general tasks with stochastic dynamics.
This paper introduces Stochastic GFlowNets, a new algorithm that extends GFlowNets to stochastic environments.
arXiv Detail & Related papers (2023-02-19T03:19:40Z) - Distributional GFlowNets with Quantile Flows [73.73721901056662]
Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a policy for generating complex structure through a series of decision-making steps.
In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training.
Our proposed quantile matching GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty.
arXiv Detail & Related papers (2023-02-11T22:06:17Z) - A theory of continuous generative flow networks [104.93913776866195]
Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target distributions.
We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces.
arXiv Detail & Related papers (2023-01-30T00:37:56Z) - Generative Augmented Flow Networks [88.50647244459009]
We propose Generative Augmented Flow Networks (GAFlowNets) to incorporate intermediate rewards into GFlowNets.
GAFlowNets can leverage edge-based and state-based intrinsic rewards in a joint way to improve exploration.
arXiv Detail & Related papers (2022-10-07T03:33:56Z) - Learning GFlowNets from partial episodes for improved convergence and stability [56.99229746004125]
Generative flow networks (GFlowNets) are algorithms for training a sequential sampler of discrete objects under an unnormalized target density.
Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory.
Inspired by the TD($\lambda$) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB($\lambda$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths.
arXiv Detail & Related papers (2022-09-26T15:44:24Z)
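Several entries above restate the defining GFlowNet property: a forward policy derived from consistent flows samples terminal objects with probability proportional to $R(x)$. The sketch below checks this reward-matching property by exact enumeration on a hypothetical toy DAG (all state names and reward values are invented), using a uniform backward policy to assign edge flows; it illustrates the general idea rather than any specific paper's training procedure.

```python
from collections import defaultdict

# Toy DAG (hypothetical): edges go from partially-built to more-complete objects.
children = {
    "s0": ["s1", "s2"],
    "s1": ["x1", "x2"],
    "s2": ["x2", "x3"],
    "x1": [], "x2": [], "x3": [],
}
reward = {"x1": 1.0, "x2": 2.0, "x3": 3.0}   # unnormalized terminal rewards R(x)

parents = defaultdict(list)
for s, cs in children.items():
    for c in cs:
        parents[c].append(s)

# Exact flows under a *uniform backward* policy:
#   F(x) = R(x) for terminal x, and each parent of s' receives F(s') / |parents(s')|.
state_flow, edge_flow = {}, {}
def F(s):
    if s in state_flow:
        return state_flow[s]
    if not children[s]:                      # terminal state
        state_flow[s] = reward[s]
    else:                                    # outgoing flow of s = sum of its edge flows
        state_flow[s] = sum(F(c) / len(parents[c]) for c in children[s])
    return state_flow[s]
for s in children:
    F(s)
for s, cs in children.items():
    for c in cs:
        edge_flow[(s, c)] = F(c) / len(parents[c])

# Forward policy induced by the flows: P_F(s'|s) = F(s->s') / F(s).
def terminal_probs(s, p=1.0, out=None):
    out = {} if out is None else out
    if not children[s]:
        out[s] = out.get(s, 0.0) + p
        return out
    for c in children[s]:
        terminal_probs(c, p * edge_flow[(s, c)] / state_flow[s], out)
    return out

probs = terminal_probs("s0")
Z = sum(reward.values())
for x in reward:
    # Reward matching: the sampling probability of x should equal R(x) / Z.
    print(x, round(probs[x], 4), round(reward[x] / Z, 4))
```

Because the flows here are computed exactly rather than learned, the printed sampling probabilities match $R(x)/Z$ to numerical precision; GFlowNet training objectives such as flow matching or trajectory balance aim to reach this same condition with learned, parameterized flows.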