Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions
- URL: http://arxiv.org/abs/2406.10795v1
- Date: Sun, 16 Jun 2024 03:43:55 GMT
- Title: Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions
- Authors: Kai Xu, Farid Tajaddodianfar, Ben Allison
- Abstract summary: Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning.
We show that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods.
We show that RCP performance can be enhanced by marginalizing rewards with normalized weight functions whose values may be negative. We refer to this technique as generalized marginalization; its advantage is that negative weights for policies conditioned on low rewards can make the resulting policies more distinct from them.
- Score: 8.90692770076582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently proposed reward-conditioned policies (RCPs) offer an appealing alternative in reinforcement learning. Compared with policy gradient methods, policy learning in RCPs is simpler since it is based on supervised learning, and unlike value-based methods, it does not require optimization in the action space to take actions. However, for multi-armed bandit (MAB) problems, we find that RCPs are slower to converge and have inferior expected rewards at convergence, compared with classic methods such as the upper confidence bound and Thompson sampling. In this work, we show that the performance of RCPs can be enhanced by constructing policies through the marginalization of rewards using normalized weight functions, whose sum or integral equals $1$, although the function values may be negative. We refer to this technique as generalized marginalization, whose advantage is that negative weights for policies conditioned on low rewards can make the resulting policies more distinct from them. Strategies to perform generalized marginalization in MAB with discrete action spaces are studied. Through simulations, we demonstrate that the proposed technique improves RCPs and makes them competitive with classic methods, showing superior performance on challenging MABs with large action spaces and sparse reward signals.
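To make the abstract's idea concrete, below is a minimal, illustrative sketch of generalized marginalization for a discrete-action MAB: a reward-conditioned policy $\pi(a \mid r)$ is collapsed into an acting policy by a weighted sum over reward levels, where the weights sum to $1$ but may be negative, and the result is projected back onto the probability simplex. The two-level weight scheme, the clip-and-renormalize projection, and all names below are assumptions made for illustration; the paper studies concrete weighting strategies that the abstract does not spell out.

```python
import numpy as np

def generalized_marginalization(cond_policies, weights, eps=1e-12):
    """Combine reward-conditioned action distributions pi(a | r) using
    normalized weights (summing to 1, possibly negative), then project
    the result back onto the probability simplex."""
    assert abs(sum(weights.values()) - 1.0) < 1e-8, "weights must sum to 1"
    n_actions = len(next(iter(cond_policies.values())))
    combined = np.zeros(n_actions)
    for r, w in weights.items():
        combined += w * np.asarray(cond_policies[r], dtype=float)
    # Negative weights can push entries below zero; clip and renormalize
    # so the combined vector is a valid distribution over actions.
    combined = np.clip(combined, 0.0, None)
    return combined / (combined.sum() + eps)

# Toy example with 3 arms and two reward levels.
pi_high = np.array([0.6, 0.3, 0.1])   # pi(a | r = high)
pi_low  = np.array([0.2, 0.3, 0.5])   # pi(a | r = low)
alpha = 0.5                           # illustrative weight magnitude
acting_policy = generalized_marginalization(
    {"high": pi_high, "low": pi_low},
    {"high": 1.0 + alpha, "low": -alpha},   # weights sum to 1
)
print(acting_policy)  # mass shifted away from the low-reward-conditioned policy
```

With $\alpha = 0$ this reduces to ordinary marginalization with non-negative weights; increasing $\alpha$ pushes the acting policy further from the policy conditioned on low rewards, which is the effect the abstract highlights.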
Related papers
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) as well as classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Potential-Based Reward Shaping For Intrinsic Motivation [4.798097103214276]
Intrinsic motivation (IM) reward-shaping methods can inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior.
We present an extension to potential-based reward shaping (PBRS) that we prove preserves the set of optimal policies under a more general set of functions.
We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies.
arXiv Detail & Related papers (2024-02-12T05:12:09Z) - Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning [2.793095554369282]
Most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data.
We revisit this strategy in environments with "mixed-sign" reward functions.
We find that the second approach we consider, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting.
arXiv Detail & Related papers (2023-11-30T16:31:04Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z) - Difference Rewards Policy Gradients [17.644110838053134]
We propose a novel algorithm, Dr.Reinforce, that combines difference rewards with policy gradients to allow for learning decentralized policies.
By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function.
We show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
arXiv Detail & Related papers (2020-12-21T11:23:17Z) - Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
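Since the main paper above builds on this RCP formulation, a toy caricature of reward-conditioned supervised learning on a Bernoulli bandit may help fix the terminology. The tabular conditional model, the Laplace smoothing, and the fixed target reward are assumptions made for illustration, not the authors' implementation; the slow concentration of such a naive scheme loosely illustrates the convergence issues the main abstract reports.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli bandit
n_arms = len(true_means)

# Tabular reward-conditioned policy: counts[r, a] collects supervised
# (reward, action) pairs, so counts[r] normalized approximates pi(a | r).
counts = np.ones((2, n_arms))            # Laplace smoothing

target = 1                               # condition on high reward when acting
for t in range(2000):
    pi_target = counts[target] / counts[target].sum()   # pi(a | r = target)
    a = rng.choice(n_arms, p=pi_target)                  # act
    r = int(rng.random() < true_means[a])                # observe reward
    counts[r, a] += 1                                    # supervised update

print(counts[1] / counts[1].sum())  # pi(a | r = 1) should favor the best arm
```

In this caricature the acting policy is simply pi(a | r = 1); the main paper instead builds the acting policy from a normalized, possibly negative-weighted combination of reward-conditioned policies, as sketched after the abstract above.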
This list is automatically generated from the titles and abstracts of the papers on this site.