Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement
Learning from Suboptimal Demonstrations
- URL: http://arxiv.org/abs/2210.07432v1
- Date: Fri, 14 Oct 2022 00:23:37 GMT
- Title: Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement
Learning from Suboptimal Demonstrations
- Authors: Albert Wilcox, Ashwin Balakrishna, Jules Dedieu, Wyame Benslimane,
Daniel Brown, Ken Goldberg
- Abstract summary: Monte Carlo Augmented Actor Critic (MCAC) is a parameter-free modification to standard actor-critic algorithms.
MCAC computes a modified $Q$-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go.
Experiments across $5$ continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across $6$ commonly used RL and RL-from-demonstrations algorithms.
- Score: 17.08814685657957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Providing densely shaped reward functions for RL algorithms is often
exceedingly challenging, motivating the development of RL algorithms that can
learn from easier-to-specify sparse reward functions. This sparsity poses new
exploration challenges. One common way to address this problem is using
demonstrations to provide initial signal about regions of the state space with
high rewards. However, prior RL from demonstrations algorithms introduce
significant complexity and many hyperparameters, making them hard to implement
and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a
parameter-free modification to standard actor-critic algorithms which
initializes the replay buffer with demonstrations and computes a modified
$Q$-value by taking the maximum of the standard temporal difference (TD)
target and a Monte Carlo
estimate of the reward-to-go. This encourages exploration in the neighborhood
of high-performing trajectories by encouraging high $Q$-values in corresponding
regions of the state space. Experiments across $5$ continuous control domains
suggest that MCAC can be used to significantly increase learning efficiency
across $6$ commonly used RL and RL-from-demonstrations algorithms. See
https://sites.google.com/view/mcac-rl for code and supplementary material.
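As a rough sketch of the target computation described above (not the authors' released implementation; the function names, the NumPy types, and the `done` handling are illustrative assumptions), MCAC's modified critic target is the elementwise maximum of the usual one-step TD target and the discounted Monte Carlo reward-to-go stored with each transition:

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """Monte Carlo reward-to-go for every step of a finished trajectory,
    computed once when the trajectory (or demonstration) enters the buffer."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mcac_target(reward, next_value, mc_return, done, gamma=0.99):
    """Modified critic target: max of the TD target and the MC reward-to-go.

    reward:     r_t for the sampled transition
    next_value: bootstrapped value of the next state (e.g. from a target critic)
    mc_return:  stored Monte Carlo reward-to-go from discounted_reward_to_go
    done:       1.0 if the episode terminated at this step, else 0.0
    """
    td_target = reward + gamma * (1.0 - done) * next_value
    return np.maximum(td_target, mc_return)
```

Because the maximum can only raise the target, $Q$-values along (possibly suboptimal) demonstration trajectories stay high early in training, which is the exploration effect the abstract describes.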
Related papers
- Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
In this paper, we propose a reward-free reinforcement learning algorithm called GFA-RFE.
The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment.
Experiment results show that GFA-RFE outperforms or is comparable to state-of-the-art unsupervised RL algorithms.
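One common way to realize an uncertainty-aware intrinsic reward, shown here purely as a generic illustration rather than GFA-RFE's actual construction, is to reward disagreement across an ensemble of value estimates:

```python
import numpy as np

def ensemble_disagreement_bonus(q_values, scale=1.0):
    """Generic uncertainty bonus: standard deviation across an ensemble of
    value estimates for the same (state, action) pair. Large where the
    ensemble disagrees, i.e. where the agent is uncertain."""
    return scale * float(np.std(q_values))

print(ensemble_disagreement_bonus(np.array([0.1, 0.9, 0.4])))  # ~0.33 (uncertain)
print(ensemble_disagreement_bonus(np.array([0.5, 0.5, 0.5])))  # 0.0 (certain)
```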
arXiv Detail & Related papers (2024-06-24T01:37:18Z)
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference.
The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
- The Effective Horizon Explains Deep RL Performance in Stochastic Environments [21.148001945560075]
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds.
We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts.
We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" look-ahead and in the complexity of the class used for approximation.
arXiv Detail & Related papers (2023-12-13T18:58:56Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
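The Langevin-based posterior sampling mentioned above can be sketched as a stochastic gradient Langevin (SGLD-style) step on the critic parameters; the step size, temperature, and loss are placeholder assumptions rather than the paper's exact update rule:

```python
import numpy as np

def langevin_q_update(theta, grad_td_loss, step_size=1e-3, temperature=1.0, rng=None):
    """One noisy gradient step on the Q-function parameters.

    Adding Gaussian noise with scale sqrt(2 * step_size * temperature) turns a
    plain gradient step into an approximate draw from a posterior over Q
    functions, which is how Thompson-sampling-style exploration is obtained.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=theta.shape)
    return theta - step_size * grad_td_loss + np.sqrt(2.0 * step_size * temperature) * noise
```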
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
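A minimal sketch of the two-policy idea, under the assumption of a Gym-style environment and a fixed switch point (JSRL itself adapts the guide horizon with a curriculum):

```python
def jump_start_rollout(env, guide_policy, explore_policy, guide_steps, max_steps=1000):
    """Collect one episode in which a guide policy (e.g. from demonstrations or
    offline RL) acts for the first guide_steps steps and the learning policy
    takes over afterwards; transitions can feed the learner's replay buffer."""
    transitions = []
    obs, _ = env.reset()
    for t in range(max_steps):
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, terminated))
        obs = next_obs
        if terminated or truncated:
            break
    return transitions
```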
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
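Reading the "advantage over the average case" as the positive item's $Q$-value minus the mean $Q$-value of the sampled negatives (an assumption about the exact estimator), a sketch looks like:

```python
import numpy as np

def sampled_advantage(q_positive, q_negatives):
    """Advantage of the observed (positive) item over randomly sampled
    negative items for the same state; can be used to weight the actor's
    supervised sequential-learning loss."""
    return q_positive - float(np.mean(q_negatives))

print(sampled_advantage(2.0, np.array([0.5, 1.0, 0.3])))  # 1.4
```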
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- MADE: Exploration via Maximizing Deviation from Explored Regions [48.49228309729319]
In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
arXiv Detail & Related papers (2021-06-18T17:57:00Z)
- On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension [7.200655637873445]
Hamiltonian Monte Carlo (HMC) sampling offers a tractable way to generate data for training RL algorithms.
We introduce a framework, called Hamiltonian $Q$-Learning, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions.
arXiv Detail & Related papers (2020-11-11T17:35:25Z)