A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning
- URL: http://arxiv.org/abs/2205.05212v1
- Date: Wed, 11 May 2022 00:06:29 GMT
- Title: A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning
- Authors: Archit Sharma, Rehaan Ahmad, Chelsea Finn
- Abstract summary: A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
- Score: 61.406020873047794
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning (RL) provides a framework for learning through
trial and error, translating RL algorithms into the real world has remained
challenging. A major hurdle to real-world application arises from the
development of algorithms in an episodic setting where the environment is reset
after every trial, in contrast with the continual and non-episodic nature of
the real world encountered by embodied agents such as humans and robots. Prior
works have considered an alternating approach in which a forward policy learns to
solve the task and a backward policy learns to reset the environment, but
what initial state distribution should the backward policy reset the agent to?
Assuming access to a few demonstrations, we propose a new method, MEDAL, that
trains the backward policy to match the state distribution in the provided
demonstrations. This keeps the agent close to the task-relevant states,
allowing for a mix of easy and difficult starting states for the forward
policy. Our experiments show that MEDAL matches or outperforms prior methods on
three sparse-reward continuous control tasks from the EARL benchmark, with 40%
gains on the hardest task, while making fewer assumptions than prior works.
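To make the distribution-matching idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of one standard way to realize it: a classifier is trained to separate demonstration states from states visited by the backward policy, and the backward policy is rewarded for reaching states the classifier scores as demonstration-like. Network sizes, function names, and the surrounding training loop are illustrative assumptions.

import torch
import torch.nn as nn

class StateDiscriminator(nn.Module):
    # Scores how demonstration-like a state looks (logit > 0 means demo-like).
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):
        return self.net(states)

def discriminator_loss(disc, demo_states, backward_states):
    # Binary cross-entropy: demonstration states labeled 1,
    # states collected by the backward policy labeled 0.
    bce = nn.BCEWithLogitsLoss()
    demo_logits = disc(demo_states)
    back_logits = disc(backward_states)
    return (bce(demo_logits, torch.ones_like(demo_logits))
            + bce(back_logits, torch.zeros_like(back_logits)))

def backward_policy_reward(disc, states):
    # Reward for the backward policy: higher when the discriminator believes the
    # state came from the demonstrations, which pushes the backward policy's
    # state distribution toward the demonstration state distribution.
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(states)) + 1e-8)

Under this sketch, the forward policy would still optimize the task reward; only the backward policy optimizes the learned reward during the reset phases that alternate with the forward policy's attempts at the task.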
Related papers
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations [5.076419064097735]
Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage.
Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy.
We propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states.
arXiv Detail & Related papers (2024-03-06T20:52:49Z)
- Self-Supervised Curriculum Generation for Autonomous Reinforcement Learning without Task-Specific Knowledge [25.168236693829783]
A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between episodes.
We propose a novel ARL algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge.
arXiv Detail & Related papers (2023-11-15T18:40:10Z)
- Behavior Prior Representation learning for Offline Reinforcement Learning [23.200489608592694]
We introduce a simple, yet effective approach for learning state representations.
Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset.
We show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.
arXiv Detail & Related papers (2022-11-02T04:15:20Z)
- You Only Live Once: Single-Life Reinforcement Learning [124.1738675154651]
In many real-world situations, the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once in a single trial.
We formalize this problem setting, where an agent must complete a task within a single episode without interventions.
We propose an algorithm, $Q$-weighted adversarial learning (QWALE), which employs a distribution matching strategy.
arXiv Detail & Related papers (2022-10-17T09:00:11Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta-algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Robust Learning from Observation with Model Misspecification [33.92371002674386]
Imitation learning (IL) is a popular paradigm for training policies in robotic systems.
We propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning.
arXiv Detail & Related papers (2022-02-12T07:04:06Z)
- Persistent Reinforcement Learning via Subgoal Curricula [114.83989499740193]
Value-accelerated Persistent Reinforcement Learning (VaPRL) generates a curriculum of initial states.
VaPRL reduces the interventions required by three orders of magnitude compared to episodic reinforcement learning.
arXiv Detail & Related papers (2021-07-27T16:39:45Z)
- Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning [75.56839075060819]
Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state.
Reinforcement learning approaches, in contrast, can operate directly from raw sensory inputs with only a reward signal to describe the task, but they are extremely sample-inefficient and brittle.
In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline.
arXiv Detail & Related papers (2020-05-21T19:47:05Z)