Reinforcement Learning with Sparse Rewards using Guidance from Offline
Demonstration
- URL: http://arxiv.org/abs/2202.04628v1
- Date: Wed, 9 Feb 2022 18:45:40 GMT
- Title: Reinforcement Learning with Sparse Rewards using Guidance from Offline
Demonstration
- Authors: Desik Rengarajan, Gargi Vaidya, Akshay Sarvesh, Dileep Kalathil,
Srinivas Shakkottai
- Abstract summary: A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback.
We develop an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy.
We demonstrate the superior performance of our algorithm over state-of-the-art approaches.
- Score: 9.017416068706579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major challenge in real-world reinforcement learning (RL) is the sparsity
of reward feedback. Often, what is available is an intuitive but sparse reward
function that only indicates whether the task is completed partially or fully.
However, the lack of carefully designed, fine-grained feedback implies that most
existing RL algorithms fail to learn an acceptable policy in a reasonable time
frame. This is because of the large number of exploration actions that the
policy has to perform before it gets any useful feedback that it can learn
from. In this work, we address this challenging problem by developing an
algorithm that exploits the offline demonstration data generated by a
sub-optimal behavior policy for faster and more efficient online RL in such sparse
reward settings. The proposed algorithm, which we call the Learning Online with
Guidance Offline (LOGO) algorithm, merges a policy improvement step with an
additional policy guidance step by using the offline demonstration data. The
key idea is that by obtaining guidance from - not imitating - the offline data,
LOGO orients its policy in the manner of the sub-optimal policy while still
being able to learn beyond it and approach optimality. We provide a theoretical
analysis of our algorithm, deriving a lower bound on the performance
improvement in each learning episode. We also extend our algorithm to the even
more challenging incomplete observation setting, where the demonstration data
contains only a censored version of the true state observation. We demonstrate
the superior performance of our algorithm over state-of-the-art approaches on a
number of benchmark environments with sparse rewards and censored state.
Further, we demonstrate the value of our approach by implementing LOGO on a
mobile robot for trajectory tracking and obstacle avoidance, where it shows
excellent performance.
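The guidance idea described in the abstract can be sketched as an alternation of a standard policy-improvement step with a step that pulls the learner toward a behavior policy cloned from the offline demonstrations, with the guidance weight decaying so the learner can eventually surpass the sub-optimal demonstrator. The snippet below is a minimal toy sketch of that structure, not the authors' LOGO implementation (which builds on trust-region updates); the chain environment, learning rates, and decay schedule are illustrative assumptions.

```python
"""
Toy sketch of guided online RL with sparse rewards: alternate a policy-improvement
step with a guidance step toward a behavior policy cloned from demonstrations,
and decay the guidance over time.  Illustrative only, not the LOGO algorithm.
"""
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 10, 4          # chain MDP: reward 1 only at the goal state
GOAL = N_STATES - 1


def step(state: int, action: int) -> tuple[int, float, bool]:
    """Sparse-reward chain: action 0 moves right, any other action moves left."""
    nxt = min(state + 1, GOAL) if action == 0 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, float(done), done


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rollout(theta: np.ndarray, horizon: int = 30):
    """Collect one episode under the tabular softmax policy parameterised by theta."""
    s, traj = 0, []
    for _ in range(horizon):
        probs = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=probs)
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
        if done:
            break
    return traj


# --- offline demonstrations from a sub-optimal behavior policy -----------------
# The demonstrator mostly moves right but is noisy; we clone it from its
# state-action pairs into an estimated policy pi_b (with Laplace smoothing).
demo_counts = np.ones((N_STATES, N_ACTIONS))
for _ in range(50):
    s = 0
    for _ in range(30):
        a = 0 if rng.random() < 0.7 else rng.integers(N_ACTIONS)
        demo_counts[s, a] += 1
        s, _, done = step(s, a)
        if done:
            break
pi_b = demo_counts / demo_counts.sum(axis=1, keepdims=True)

# --- online learning: policy improvement + decaying guidance -------------------
theta = np.zeros((N_STATES, N_ACTIONS))
lr, guide_lr = 0.5, 0.5
for k in range(200):
    traj = rollout(theta)
    ret = sum(r for _, _, r in traj)                  # sparse return: 0 or 1

    # 1) Policy-improvement step: REINFORCE on the sparse reward.
    for s, a, _ in traj:
        probs = softmax(theta[s])
        grad_logp = -probs
        grad_logp[a] += 1.0                           # grad of log pi(a|s) wrt logits
        theta[s] += lr * ret * grad_logp

    # 2) Guidance step: move pi(.|s) toward the cloned behavior policy pi_b(.|s)
    #    on visited states (a cross-entropy gradient step), with a coefficient
    #    delta_k that decays so the learner can surpass the demonstrator.
    delta_k = max(0.0, 1.0 - k / 100)
    for s, _, _ in traj:
        probs = softmax(theta[s])
        theta[s] += guide_lr * delta_k * (pi_b[s] - probs)
```

With the guidance term switched off (delta_k fixed to 0), the same loop almost never sees a nonzero return within this budget, which illustrates the sparse-reward failure mode the abstract describes.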
Related papers
- Rethinking Optimal Transport in Offline Reinforcement Learning [64.56896902186126]
In offline reinforcement learning, the data is provided by various experts and some of them can be sub-optimal.
To extract an efficient policy, it is necessary to stitch the best behaviors from the dataset.
We present an algorithm that aims to find a policy that maps states to a partial distribution of the best expert actions for each given state.
arXiv Detail & Related papers (2024-10-17T22:36:43Z)
- Understanding the performance gap between online and offline alignment algorithms [63.137832242488926]
We show that offline algorithms train the policy to become good at pairwise classification, while online algorithms are good at generation.
This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process.
Our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
arXiv Detail & Related papers (2024-05-14T09:12:30Z)
- Trajectory-Oriented Policy Optimization with Sparse Rewards [2.9602904918952695]
We introduce an approach leveraging offline demonstration trajectories for swifter and more efficient online RL in environments with sparse rewards.
Our pivotal insight involves treating offline demonstration trajectories as guidance, rather than mere imitation.
We then illustrate that this optimization problem can be streamlined into a policy-gradient algorithm, integrating rewards shaped by insights from offline demonstrations.
arXiv Detail & Related papers (2024-01-04T12:21:01Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Behavior Prior Representation learning for Offline Reinforcement Learning [23.200489608592694]
We introduce a simple, yet effective approach for learning state representations.
Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset.
We show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.
arXiv Detail & Related papers (2022-11-02T04:15:20Z)
- Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality [57.91411772725183]
In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite.
We design simple value-based algorithms for tackling both the offline policy evaluation (OPE) and offline policy learning tasks.
Our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal.
arXiv Detail & Related papers (2022-06-10T07:44:56Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms (see the sketch after this list).
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
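For the Jump-Start Reinforcement Learning entry above, the two-policy idea can be illustrated with a short roll-in/roll-out loop: a pre-existing guide policy controls the first h steps of each episode, an exploration policy controls the remainder, and h shrinks as recent returns improve. The sketch below is an illustration under assumed interfaces (the toy chain environment and both policies are placeholders, and the learner is left untrained to keep the example short); it is not JSRL's actual code.

```python
"""
Illustrative roll-in/roll-out loop in the spirit of the JSRL entry above:
a guide policy controls the first `h` steps of each episode, an exploration
policy controls the rest, and `h` shrinks whenever recent returns stay high.
Toy environment and policies are placeholders only.
"""
import random

GOAL, HORIZON = 9, 30


def env_step(state: int, action: int) -> tuple[int, float, bool]:
    """Sparse-reward chain: action 0 moves right, action 1 moves left."""
    nxt = min(state + 1, GOAL) if action == 0 else max(state - 1, 0)
    return nxt, float(nxt == GOAL), nxt == GOAL


def guide_policy(state: int) -> int:
    """Pre-existing sub-optimal policy: usually moves toward the goal."""
    return 0 if random.random() < 0.8 else 1


def explore_policy(state: int) -> int:
    """Stand-in for the learner; in JSRL this policy would be trained online,
    here it acts uniformly at random purely to keep the sketch short."""
    return random.randrange(2)


def episode(h: int) -> float:
    """Guide policy rolls in for the first h steps, exploration policy rolls out."""
    state, ret = 0, 0.0
    for t in range(HORIZON):
        action = guide_policy(state) if t < h else explore_policy(state)
        state, reward, done = env_step(state, action)
        ret += reward
        if done:
            break
    return ret


# Curriculum: start with a long jump-start, hand over more of each episode
# to the exploration policy whenever the average return stays high enough.
h = HORIZON
for iteration in range(20):
    returns = [episode(h) for _ in range(20)]
    avg = sum(returns) / len(returns)
    if avg > 0.5 and h > 0:
        h = max(0, h - 5)
    print(f"iter={iteration:02d} h={h:2d} avg_return={avg:.2f}")
```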