The Value-Improvement Path: Towards Better Representations for
Reinforcement Learning
- URL: http://arxiv.org/abs/2006.02243v2
- Date: Mon, 4 Jan 2021 12:32:29 GMT
- Title: The Value-Improvement Path: Towards Better Representations for
Reinforcement Learning
- Authors: Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan,
Marc G. Bellemare, David Silver
- Abstract summary: We argue that the value prediction problems faced by an RL agent should not be addressed in isolation, but as a single, holistic, prediction problem.
An RL algorithm generates a sequence of policies that, at least approximately, improve towards the optimal policy.
We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements.
- Score: 46.70945548475075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In value-based reinforcement learning (RL), unlike in supervised learning,
the agent faces not a single, stationary, approximation problem, but a sequence
of value prediction problems. Each time the policy improves, the nature of the
problem changes, shifting both the distribution of states and their values. In
this paper we take a novel perspective, arguing that the value prediction
problems faced by an RL agent should not be addressed in isolation, but rather
as a single, holistic, prediction problem. An RL algorithm generates a sequence
of policies that, at least approximately, improve towards the optimal policy.
We explicitly characterize the associated sequence of value functions and call
it the value-improvement path. Our main idea is to approximate the
value-improvement path holistically, rather than to solely track the value
function of the current policy. Specifically, we discuss the impact that this
holistic view of RL has on representation learning. We demonstrate that a
representation that spans the past value-improvement path will also provide an
accurate value approximation for future policy improvements. We use this
insight to better understand existing approaches to auxiliary tasks and to
propose new ones. To test our hypothesis empirically, we augmented a standard
deep RL agent with an auxiliary task of learning the value-improvement path. In
a study of Atari 2600 games, the augmented agent achieved approximately double
the mean and median performance of the baseline agent.
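To make the central object concrete, the following restates the value-improvement path
in standard policy-iteration notation. This formalization is a reading of the abstract
above, not text taken from the paper: (approximate) policy improvement, e.g. greedification,
produces a sequence of policies
\[
  \pi_{k+1}(s) \in \arg\max_{a} Q^{\pi_k}(s, a), \qquad k = 0, 1, 2, \ldots,
\]
and the value-improvement path is the associated sequence of value functions
\[
  \mathcal{V} = \bigl( Q^{\pi_0}, Q^{\pi_1}, Q^{\pi_2}, \ldots \bigr),
\]
which, under exact improvement, converges to $Q^{\pi^*}$. The representation-learning claim
is that a feature map $\phi$ spanning the past elements of $\mathcal{V}$, i.e. one for which
$Q^{\pi_k}(s,a) \approx w_k^{\top}\phi(s,a)$ for all past $k$, also supports accurate
approximation of the value functions produced by future improvements.
The auxiliary task mentioned above can be pictured as a shared torso with extra value heads
that regress toward value estimates recorded from earlier policies. The code below is a
minimal, hypothetical sketch of that idea; the names (PathValueNet, aux_targets) and the
simplified losses are illustrative assumptions, not the agent used in the Atari study.
```python
# Illustrative sketch only: a value network whose shared representation is also
# trained to fit value estimates of earlier policies along the improvement path.
# All names here (PathValueNet, aux_targets, ...) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PathValueNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, n_aux_heads: int, hidden: int = 256):
        super().__init__()
        # Shared representation phi(s)
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Main head: Q-values used for control
        self.q_head = nn.Linear(hidden, n_actions)
        # Auxiliary heads: one per retained snapshot of the value-improvement path
        self.aux_heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_aux_heads)]
        )

    def forward(self, obs: torch.Tensor):
        phi = self.torso(obs)
        q = self.q_head(phi)
        # aux has shape (batch, n_aux_heads, n_actions)
        aux = torch.stack([head(phi) for head in self.aux_heads], dim=1)
        return q, aux


def loss_fn(net, obs, td_target, aux_targets, aux_weight=1.0):
    """TD-style regression on the main head plus regression of the auxiliary
    heads toward stored value estimates of earlier policies."""
    q, aux = net(obs)
    td_loss = F.smooth_l1_loss(q, td_target)
    aux_loss = F.mse_loss(aux, aux_targets)
    return td_loss + aux_weight * aux_loss


if __name__ == "__main__":
    net = PathValueNet(obs_dim=8, n_actions=4, n_aux_heads=3)
    obs = torch.randn(32, 8)
    td_target = torch.randn(32, 4)       # placeholder bootstrapped targets
    aux_targets = torch.randn(32, 3, 4)  # placeholder past-policy value estimates
    print(loss_fn(net, obs, td_target, aux_targets).item())
```
In a full agent, the auxiliary targets would come from periodically frozen copies of the
network or stored past predictions rather than the random placeholders used here.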
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic [42.57662196581823]
Learning high-quality $Q$-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms.
Deviating from the common viewpoint, we observe that $Q$-values are often underestimated in the latter stage of the RL training process.
We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates $Q$-values using both historical best-performing actions and the current policy.
arXiv Detail & Related papers (2023-06-05T13:38:14Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Dealing with the Unknown: Pessimistic Offline Reinforcement Learning [25.30634466168587]
We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm that actively leads the agent back to regions with which it is familiar.
We focus on problems caused by out-of-distribution (OOD) states, and deliberately penalize high values at states that are absent in the training dataset.
arXiv Detail & Related papers (2021-11-09T22:38:58Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)