Model-free Policy Learning with Reward Gradients
- URL: http://arxiv.org/abs/2103.05147v4
- Date: Wed, 1 Nov 2023 18:34:00 GMT
- Title: Model-free Policy Learning with Reward Gradients
- Authors: Qingfeng Lan, Samuele Tosatto, Homayoon Farrahi, A. Rupam Mahmood
- Abstract summary: We develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model.
Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
- Score: 9.847875182113137
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications such as robotics. Sample efficiency could be improved by making the best use of the available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access not only to scalar reward signals but also to reward gradients. To benefit from reward gradients, previous works require knowledge of the environment dynamics, which is hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in higher sample efficiency, as shown in our empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
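A minimal sketch of the core idea, assuming a Gaussian policy with a known, differentiable reward function: the immediate reward is differentiated directly through reparameterized actions, while the remainder of the return is handled with the usual likelihood-ratio term. The names `policy`, `reward_fn`, and `advantages` are illustrative, not the paper's API, and the exact Reward Policy Gradient decomposition is derived in the paper.

```python
import torch

def rpg_loss(policy, states, advantages, reward_fn):
    """Hedged sketch. policy(states) -> (mean, std) of a Gaussian policy;
    reward_fn(s, a) is the known, differentiable reward; `advantages`
    approximate the return beyond the immediate reward."""
    mean, std = policy(states)
    dist = torch.distributions.Normal(mean, std)

    # Reparameterized actions: gradients flow through reward_fn directly.
    actions = dist.rsample()
    reward_term = reward_fn(states, actions).mean()

    # Likelihood-ratio term for the part of the return that is not
    # differentiable with respect to the action.
    log_prob = dist.log_prob(actions.detach()).sum(-1)
    likelihood_ratio_term = (log_prob * advantages).mean()

    # Minimize the negative of the combined objective.
    return -(reward_term + likelihood_ratio_term)
```

The appeal of this construction is that the reward-gradient term is computed exactly from the known reward function, so no dynamics model is needed and no model bias is introduced.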
Related papers
- A Novel Variational Lower Bound for Inverse Reinforcement Learning [5.370126167091961]
Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories.
We present a new Variational Lower Bound for IRL (VLB-IRL).
Our method simultaneously learns the reward function and policy under the learned reward function.
arXiv Detail & Related papers (2023-11-07T03:50:43Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
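A minimal sketch of the confidence-based pseudo-labeling step described above: an unlabeled segment pair receives a preference label only when the current preference predictor is sufficiently confident. The function name and the 0.95 threshold are illustrative assumptions, not SURF's actual interface.

```python
import numpy as np

def pseudo_label(pref_probs, threshold=0.95):
    """pref_probs: predicted P(first segment preferred) for unlabeled
    pairs. Returns indices of confident pairs and their pseudo-labels
    (0 = first segment preferred, 1 = second)."""
    pref_probs = np.asarray(pref_probs)
    # Keep only pairs where the predictor is confident either way.
    confident = np.maximum(pref_probs, 1.0 - pref_probs) >= threshold
    labels = (pref_probs < 0.5).astype(int)
    return np.nonzero(confident)[0], labels[confident]
```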
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioral preference modelling.
Our model can automatically learn the rewards from the user's actions based on a discriminative actor-critic network and a Wasserstein GAN.
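A hedged sketch of the Wasserstein-GAN flavor of this idea: a critic is trained to score expert (user) state-action pairs above the policy's, and its score is then used as a learned reward. The gradient penalty and actor-critic details of the paper are omitted; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

sa_dim = 10  # illustrative state-action feature size
critic = nn.Sequential(nn.Linear(sa_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def critic_loss(expert_sa, policy_sa):
    # Minimizing this pushes expert scores up and policy scores down.
    return critic(policy_sa).mean() - critic(expert_sa).mean()

def learned_reward(sa):
    return critic(sa)  # higher score = more expert-like behavior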
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
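A heavily simplified, hedged sketch of this idea: a classifier trained on example success states (label 1) versus replay-buffer states (label 0), whose output stands in for a reward. The paper's actual method uses a recursive, bootstrapped classification update; this only conveys the flavor.

```python
import torch
import torch.nn as nn

obs_dim = 8  # illustrative observation size
classifier = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def classifier_loss(success_states, replay_states):
    # Binary cross-entropy: success examples vs. ordinary visited states.
    logits = classifier(torch.cat([success_states, replay_states]))
    labels = torch.cat([torch.ones(len(success_states), 1),
                        torch.zeros(len(replay_states), 1)])
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

def proxy_reward(states):
    return torch.sigmoid(classifier(states))  # estimated success probability
```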
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
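A hedged sketch of a geometric median-of-means (GMOM-style) robust aggregate of per-sample gradients, in the spirit of what the paper substitutes for the clipping tricks; the paper's exact estimator differs. The Weiszfeld iteration below computes the geometric median of block means, which is robust to a minority of heavy-tailed blocks.

```python
import numpy as np

def gmom(per_sample_grads, n_blocks=8, n_iters=50, eps=1e-8):
    """per_sample_grads: (n_samples, dim) array of gradient estimates."""
    blocks = np.array_split(per_sample_grads, n_blocks)
    means = np.stack([b.mean(axis=0) for b in blocks])
    median = means.mean(axis=0)  # initial guess
    for _ in range(n_iters):  # Weiszfeld's algorithm
        dists = np.linalg.norm(means - median, axis=1) + eps
        weights = 1.0 / dists
        median = (weights[:, None] * means).sum(axis=0) / weights.sum()
    return median
```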
arXiv Detail & Related papers (2021-02-20T05:51:28Z)
- Difference Rewards Policy Gradients [17.644110838053134]
We propose Dr.Reinforce, a novel algorithm that combines difference rewards with policy gradients to enable learning decentralized policies.
By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function.
We show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
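A minimal sketch of a difference reward for agent i: the global reward minus a counterfactual baseline in which agent i's action is marginalized out (here, a plain average over its action set). Dr.Reinforce estimates this quantity with a learned reward network; `reward_fn` below is an illustrative stand-in.

```python
import numpy as np

def difference_reward(reward_fn, state, joint_action, i, action_set):
    counterfactuals = []
    for a_i in action_set:
        cf = list(joint_action)
        cf[i] = a_i  # replace only agent i's action
        counterfactuals.append(reward_fn(state, tuple(cf)))
    # Global reward minus agent i's counterfactual baseline.
    return reward_fn(state, joint_action) - np.mean(counterfactuals)
```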
arXiv Detail & Related papers (2020-12-21T11:23:17Z)
- f-IRL: Inverse Reinforcement Learning via State Marginal Matching [13.100127636586317]
We propose a method for learning the reward function (and the corresponding policy) to match the expert state density.
We present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent.
Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories.
arXiv Detail & Related papers (2020-11-09T19:37:48Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
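A hedged sketch of adaptively using a given shaping reward: a learned, state-dependent weight z(s) in [0, 1] gates the shaping term, so it is exploited where helpful and suppressed elsewhere. The paper's bi-level optimization of this weight is omitted; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

obs_dim = 8  # illustrative observation size
weight_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))

def shaped_reward(state, r_env, r_shaping):
    z = torch.sigmoid(weight_net(state)).squeeze(-1)  # per-state weight
    return r_env + z * r_shaping
```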
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
- Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient [34.16700176918835]
Off-policy Reinforcement Learning holds the promise of better data efficiency.
Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates.
We propose a nonparametric Bellman equation, which can be solved in closed form.
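A hedged sketch of why a closed form becomes available: with a nonparametric (kernel-based) estimate of the transition operator over a finite set of support states, the Bellman equation V = r + γPV is linear and can be solved directly. `P_hat` and `r_hat` are illustrative stand-ins for the paper's kernel estimates.

```python
import numpy as np

def closed_form_values(P_hat, r_hat, gamma=0.99):
    """P_hat: (n, n) row-stochastic transition estimate between support
    states; r_hat: (n,) estimated rewards at those states."""
    n = len(r_hat)
    # Solve (I - gamma * P) V = r for V.
    return np.linalg.solve(np.eye(n) - gamma * P_hat, r_hat)
```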
arXiv Detail & Related papers (2020-10-27T13:40:06Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
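A minimal sketch of a reward-conditioned policy: a network mapping (state, target return) to an action, trained by ordinary supervised regression on logged transitions labeled with the returns they actually achieved. Dimensions and names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative sizes
policy = nn.Sequential(nn.Linear(obs_dim + 1, 64), nn.ReLU(),
                       nn.Linear(64, act_dim))

def supervised_loss(states, actions, returns):
    # Condition the policy on the return each action actually achieved.
    inputs = torch.cat([states, returns.unsqueeze(-1)], dim=-1)
    return nn.functional.mse_loss(policy(inputs), actions)
```

At evaluation time, conditioning on a high target return asks the policy for behavior that achieved high reward in the data.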
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.