Proper Value Equivalence
- URL: http://arxiv.org/abs/2106.10316v1
- Date: Fri, 18 Jun 2021 19:05:20 GMT
- Title: Proper Value Equivalence
- Authors: Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, Satinder Singh
- Abstract summary: We construct a loss function for learning proper value equivalent (PVE) models and argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss. We propose a modification to MuZero and show that it can lead to improved performance in practice.
- Score: 37.565244088924906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the main challenges in model-based reinforcement learning (RL) is to
decide which aspects of the environment should be modeled. The
value-equivalence (VE) principle proposes a simple answer to this question: a
model should capture the aspects of the environment that are relevant for
value-based planning. Technically, VE distinguishes models based on a set of
policies and a set of functions: a model is said to be VE to the environment if
the Bellman operators it induces for the policies yield the correct result when
applied to the functions. As the number of policies and functions increases, the
set of VE models shrinks, eventually collapsing to a single point corresponding
to a perfect model. A fundamental question underlying the VE principle is thus
how to select the smallest sets of policies and functions that are sufficient
for planning. In this paper we take an important step towards answering this
question. We start by generalizing the concept of VE to order-$k$ counterparts
defined with respect to $k$ applications of the Bellman operator. This leads to
a family of VE classes that increase in size as $k \rightarrow \infty$. In the
limit, all functions become value functions, and we have a special
instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE
class may contain multiple models even in the limit when all value functions
are used. Crucially, all these models are sufficient for planning, meaning that
they will yield an optimal policy despite the fact that they may ignore many
aspects of the environment. We construct a loss function for learning PVE
models and argue that popular algorithms such as MuZero and Muesli can be
understood as minimizing an upper bound for this loss. We leverage this
connection to propose a modification to MuZero and show that it can lead to
improved performance in practice.
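To make the definitions above concrete, here is one way to write them down (a sketch consistent with the abstract, using standard notation: $\mathcal{T}_\pi$ is the environment's Bellman operator for policy $\pi$, $\tilde{\mathcal{T}}_\pi$ is the model's, and $v_\pi$ is the true value function of $\pi$):
  VE: a model is VE w.r.t. a policy set $\Pi$ and function set $\mathcal{V}$ iff $\tilde{\mathcal{T}}_\pi v = \mathcal{T}_\pi v$ for all $\pi \in \Pi$, $v \in \mathcal{V}$;
  order-$k$ VE: $\tilde{\mathcal{T}}_\pi^k v = \mathcal{T}_\pi^k v$ for all $\pi \in \Pi$, $v \in \mathcal{V}$;
  proper VE (the $k \rightarrow \infty$ limit): $\tilde{v}_\pi = v_\pi$ for all $\pi \in \Pi$, i.e. the model reproduces the true value function of every policy in $\Pi$.
A minimal tabular sketch of the order-$k$ check (illustrative code, not from the paper; env and model are assumed to map each policy to its transition matrix and reward vector):

    import numpy as np

    def bellman(P_pi, r_pi, gamma, v):
        # One application of the Bellman operator: T_pi v = r_pi + gamma * P_pi v.
        return r_pi + gamma * P_pi @ v

    def is_order_k_ve(env, model, gamma, policies, functions, k, tol=1e-8):
        # Order-k VE w.r.t. (policies, functions): k applications of the
        # environment's and the model's Bellman operators agree on every v.
        for pi in policies:
            for v in functions:
                u_env, u_mod = v, v
                for _ in range(k):
                    u_env = bellman(*env[pi], gamma, u_env)
                    u_mod = bellman(*model[pi], gamma, u_mod)
                if not np.allclose(u_env, u_mod, atol=tol):
                    return False
        return True

Since $\mathcal{T}_\pi$ is a contraction, larger $k$ weakens the constraint toward matching only $v_\pi$, which is why the VE classes grow with $k$ and why PVE models can discard many aspects of the environment while remaining sufficient for planning.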
Related papers
- $V_0$: A Generalist Value Model for Any Policy at State Zero [80.7505802128501]
Policy methods rely on a baseline to measure the relative advantage of an action (a minimal illustration appears after this entry). This baseline is typically estimated by a Value Model (Critic), often as large as the policy model itself. We propose a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts.
arXiv Detail & Related papers (2026-02-03T14:35:23Z)
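For orientation, a minimal illustration of the baseline idea the summary above refers to (generic policy-gradient background, not code from the paper; the numbers are made up):

    import numpy as np

    def advantages(returns, baseline):
        # Relative advantage of each sampled action: its return minus the
        # critic's value estimate for the state where it was taken.
        return np.asarray(returns) - np.asarray(baseline)

    print(advantages([1.0, 0.5, 2.0], [0.8, 0.9, 1.5]))  # [ 0.2 -0.4  0.5]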
- Is there Value in Reinforcement Learning? [1.534667887016089]
Action-values play a central role in popular Reinforcement Learning (RL) models of behavior. Critics have suggested that policy-gradient (PG) models should be favored over value-based (VB) ones.
arXiv Detail & Related papers (2025-05-07T21:50:27Z)
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
This paper provides a comprehensive analysis of partial identifiability and misspecification in IRL (a classic example of the ambiguity is sketched after this entry).
arXiv Detail & Related papers (2024-11-24T18:35:46Z)
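A classic instance of the ambiguity analysed in this line of work (standard IRL background, not a result quoted from the paper): for any potential function $\Phi : S \rightarrow \mathbb{R}$, the shaped reward

  $R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)$

induces the same optimal policies as $R$, so observing $\pi$ alone can identify $R$ at best up to such transformations.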
- Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes (NCMDPs) to standard MDPs (a toy instance of such a mapping is sketched after this entry).
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z)
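A toy instance of such a mapping (an assumed construction for a maximum-reward objective; the paper's general mapping may differ): augment the state with the running maximum $m$ and pay out only increases in it, so the shaped rewards over a trajectory sum to $\max_t r_t - m_0$ (the max, up to a constant) when $m$ is initialized to a lower bound $m_0$ on the reward.

    def augmented_step(step, state, m, action):
        # `step` is the original environment's transition function (assumed
        # signature: state, action -> next_state, reward). The shaped reward
        # m_next - m telescopes: its sum is the maximum reward seen, minus
        # the initial m.
        next_state, r = step(state, action)
        m_next = max(m, r)
        return (next_state, m_next), m_next - m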
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function (the underlying quantile loss is sketched after this entry).
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
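For reference, the quantile-regression ("pinball") loss that value-distribution methods of this kind typically build on (a generic sketch; EQR's exact objective is given in the paper):

    import numpy as np

    def pinball_loss(pred, target, tau):
        # Asymmetric absolute error; its expectation is minimized when
        # `pred` equals the tau-quantile of the target distribution.
        diff = np.asarray(target) - np.asarray(pred)
        return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))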
- IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models [77.0577928874177]
We develop a framework that decomposes vision-and-language (VL) reasoning using large language models (LLMs).
In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
arXiv Detail & Related papers (2023-05-24T10:19:57Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
In practice, the models of human behaviour that IRL relies on are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Model Selection in Reinforcement Learning with General Function Approximations [10.97775622611135]
We consider model selection for Reinforcement Learning environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs).
In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, where the true models lie.
We show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes.
arXiv Detail & Related papers (2022-07-06T21:52:07Z)
- On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function [14.205660708980988]
We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model.
A recent lower bound established that the related problem, in which the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries.
In this work, we establish that $\mathrm{poly}(H, d)$ learning is possible (with state value function realizability) whenever the action set is small.
arXiv Detail & Related papers (2021-02-03T13:23:15Z)
- The Value Equivalence Principle for Model-Based Reinforcement Learning [29.368870568214007]
We argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning.
We show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks.
We argue that the principle of value equivalence underlies a number of recent empirical successes in RL.
arXiv Detail & Related papers (2020-11-06T18:25:54Z)
- Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, an agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially (the greedy selection that submodularity enables is sketched after this entry).
arXiv Detail & Related papers (2020-09-21T09:11:36Z)
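A minimal sketch of the greedy selection that submodular value functions make tractable (illustrative; `value_of` and `budget` are assumed names, and the (1 - 1/e) guarantee holds for monotone submodular objectives):

    def greedy_select(sensors, value_of, budget):
        # Repeatedly add the sensor with the largest marginal gain; for a
        # monotone submodular value_of, the result is within (1 - 1/e)
        # of the best subset of the same size.
        chosen = set()
        for _ in range(budget):
            remaining = [s for s in sensors if s not in chosen]
            if not remaining:
                break
            best = max(remaining,
                       key=lambda s: value_of(chosen | {s}) - value_of(chosen))
            chosen.add(best)
        return chosen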
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.