The Value-Improvement Path: Towards Better Representations for
Reinforcement Learning
- URL: http://arxiv.org/abs/2006.02243v2
- Date: Mon, 4 Jan 2021 12:32:29 GMT
- Title: The Value-Improvement Path: Towards Better Representations for
Reinforcement Learning
- Authors: Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan,
Marc G. Bellemare, David Silver
- Abstract summary: We argue that the value prediction problems faced by an RL agent should not be addressed in isolation, but as a single, holistic, prediction problem.
An RL algorithm generates a sequence of policies that, at least approximately, improve towards the optimal policy.
We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements.
- Score: 46.70945548475075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In value-based reinforcement learning (RL), unlike in supervised learning,
the agent faces not a single, stationary, approximation problem, but a sequence
of value prediction problems. Each time the policy improves, the nature of the
problem changes, shifting both the distribution of states and their values. In
this paper we take a novel perspective, arguing that the value prediction
problems faced by an RL agent should not be addressed in isolation, but rather
as a single, holistic, prediction problem. An RL algorithm generates a sequence
of policies that, at least approximately, improve towards the optimal policy.
We explicitly characterize the associated sequence of value functions and call
it the value-improvement path. Our main idea is to approximate the
value-improvement path holistically, rather than to solely track the value
function of the current policy. Specifically, we discuss the impact that this
holistic view of RL has on representation learning. We demonstrate that a
representation that spans the past value-improvement path will also provide an
accurate value approximation for future policy improvements. We use this
insight to better understand existing approaches to auxiliary tasks and to
propose new ones. To test our hypothesis empirically, we augmented a standard
deep RL agent with an auxiliary task of learning the value-improvement path. In
a study of Atari 2600 games, the augmented agent achieved approximately double
the mean and median performance of the baseline agent.
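To make the central object concrete, the following restates the value-improvement path
in standard policy-iteration notation. This formalization is a reading of the abstract
above, not text taken from the paper: (approximate) policy improvement, e.g. greedification,
produces a sequence of policies
\[
  \pi_{k+1}(s) \in \arg\max_{a} Q^{\pi_k}(s, a), \qquad k = 0, 1, 2, \ldots,
\]
and the value-improvement path is the associated sequence of value functions
\[
  \mathcal{V} = \bigl( Q^{\pi_0}, Q^{\pi_1}, Q^{\pi_2}, \ldots \bigr),
\]
which, under exact improvement, converges to $Q^{\pi^*}$. The representation-learning claim
is that a feature map $\phi$ spanning the past elements of $\mathcal{V}$, i.e. one for which
$Q^{\pi_k}(s,a) \approx w_k^{\top}\phi(s,a)$ for all past $k$, also supports accurate
approximation of the value functions produced by future improvements.
The auxiliary task mentioned above can be pictured as a shared torso with extra value heads
that regress toward value estimates recorded from earlier policies. The code below is a
minimal, hypothetical sketch of that idea; the names (PathValueNet, aux_targets) and the
simplified losses are illustrative assumptions, not the agent used in the Atari study.
```python
# Illustrative sketch only: a value network whose shared representation is also
# trained to fit value estimates of earlier policies along the improvement path.
# All names here (PathValueNet, aux_targets, ...) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PathValueNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, n_aux_heads: int, hidden: int = 256):
        super().__init__()
        # Shared representation phi(s)
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Main head: Q-values used for control
        self.q_head = nn.Linear(hidden, n_actions)
        # Auxiliary heads: one per retained snapshot of the value-improvement path
        self.aux_heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_aux_heads)]
        )

    def forward(self, obs: torch.Tensor):
        phi = self.torso(obs)
        q = self.q_head(phi)
        # aux has shape (batch, n_aux_heads, n_actions)
        aux = torch.stack([head(phi) for head in self.aux_heads], dim=1)
        return q, aux


def loss_fn(net, obs, td_target, aux_targets, aux_weight=1.0):
    """TD-style regression on the main head plus regression of the auxiliary
    heads toward stored value estimates of earlier policies."""
    q, aux = net(obs)
    td_loss = F.smooth_l1_loss(q, td_target)
    aux_loss = F.mse_loss(aux, aux_targets)
    return td_loss + aux_weight * aux_loss


if __name__ == "__main__":
    net = PathValueNet(obs_dim=8, n_actions=4, n_aux_heads=3)
    obs = torch.randn(32, 8)
    td_target = torch.randn(32, 4)       # placeholder bootstrapped targets
    aux_targets = torch.randn(32, 3, 4)  # placeholder past-policy value estimates
    print(loss_fn(net, obs, td_target, aux_targets).item())
```
In a full agent, the auxiliary targets would come from periodically frozen copies of the
network or stored past predictions rather than the random placeholders used here.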
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic [42.57662196581823]
Learning high-quality $Q$-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms.
Deviating from the common viewpoint, we observe that $Q$-values are often underestimated in the latter stage of the RL training process.
We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates $Q$-values using both historical best-performing actions and the current policy.
arXiv Detail & Related papers (2023-06-05T13:38:14Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm that actively leads the agent back to regions with which it is familiar.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent in the training dataset.
arXiv Detail & Related papers (2021-11-09T22:38:58Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)