Proper Value Equivalence
- URL: http://arxiv.org/abs/2106.10316v1
- Date: Fri, 18 Jun 2021 19:05:20 GMT
- Title: Proper Value Equivalence
- Authors: Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, Satinder Singh
- Abstract summary: We construct a loss function for learning proper value equivalent (PVE) models and argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss. We propose a modification to MuZero and show that it can lead to improved performance in practice.
- Score: 37.565244088924906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the main challenges in model-based reinforcement learning (RL) is to
decide which aspects of the environment should be modeled. The
value-equivalence (VE) principle proposes a simple answer to this question: a
model should capture the aspects of the environment that are relevant for
value-based planning. Technically, VE distinguishes models based on a set of
policies and a set of functions: a model is said to be VE to the environment if
the Bellman operators it induces for the policies yield the correct result when
applied to the functions. As the number of policies and functions increases, the
set of VE models shrinks, eventually collapsing to a single point corresponding
to a perfect model. A fundamental question underlying the VE principle is thus
how to select the smallest sets of policies and functions that are sufficient
for planning. In this paper we take an important step towards answering this
question. We start by generalizing the concept of VE to order-$k$ counterparts
defined with respect to $k$ applications of the Bellman operator. This leads to
a family of VE classes that increase in size as $k \rightarrow \infty$. In the
limit, all functions become value functions, and we have a special
instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE
class may contain multiple models even in the limit when all value functions
are used. Crucially, all these models are sufficient for planning, meaning that
they will yield an optimal policy despite the fact that they may ignore many
aspects of the environment. We construct a loss function for learning PVE
models and argue that popular algorithms such as MuZero and Muesli can be
understood as minimizing an upper bound for this loss. We leverage this
connection to propose a modification to MuZero and show that it can lead to
improved performance in practice.
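To make the definitions above concrete, here is one way to write them down (a sketch consistent with the abstract, using standard notation: $\mathcal{T}_\pi$ is the environment's Bellman operator for policy $\pi$, $\tilde{\mathcal{T}}_\pi$ is the model's, and $v_\pi$ is the true value function of $\pi$):
  VE: a model is VE w.r.t. a policy set $\Pi$ and function set $\mathcal{V}$ iff $\tilde{\mathcal{T}}_\pi v = \mathcal{T}_\pi v$ for all $\pi \in \Pi$, $v \in \mathcal{V}$;
  order-$k$ VE: $\tilde{\mathcal{T}}_\pi^k v = \mathcal{T}_\pi^k v$ for all $\pi \in \Pi$, $v \in \mathcal{V}$;
  proper VE (the $k \rightarrow \infty$ limit): $\tilde{v}_\pi = v_\pi$ for all $\pi \in \Pi$, i.e. the model reproduces the true value function of every policy in $\Pi$.
A minimal tabular sketch of the order-$k$ check (illustrative code, not from the paper; env and model are assumed to map each policy to its transition matrix and reward vector):

    import numpy as np

    def bellman(P_pi, r_pi, gamma, v):
        # One application of the Bellman operator: T_pi v = r_pi + gamma * P_pi v.
        return r_pi + gamma * P_pi @ v

    def is_order_k_ve(env, model, gamma, policies, functions, k, tol=1e-8):
        # Order-k VE w.r.t. (policies, functions): k applications of the
        # environment's and the model's Bellman operators agree on every v.
        for pi in policies:
            for v in functions:
                u_env, u_mod = v, v
                for _ in range(k):
                    u_env = bellman(*env[pi], gamma, u_env)
                    u_mod = bellman(*model[pi], gamma, u_mod)
                if not np.allclose(u_env, u_mod, atol=tol):
                    return False
        return True

Since $\mathcal{T}_\pi$ is a contraction, larger $k$ weakens the constraint toward matching only $v_\pi$, which is why the VE classes grow with $k$ and why PVE models can discard many aspects of the environment while remaining sufficient for planning.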
Related papers
- $V_0$: A Generalist Value Model for Any Policy at State Zero [80.7505802128501]
Policy methods rely on a baseline to measure the relative advantage of an action (a minimal illustration appears after this entry). This baseline is typically estimated by a Value Model (Critic), often as large as the policy model itself. We propose a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts.
arXiv Detail & Related papers (2026-02-03T14:35:23Z)
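For orientation, a minimal illustration of the baseline idea the summary above refers to (generic policy-gradient background, not code from the paper; the numbers are made up):

    import numpy as np

    def advantages(returns, baseline):
        # Relative advantage of each sampled action: its return minus the
        # critic's value estimate for the state where it was taken.
        return np.asarray(returns) - np.asarray(baseline)

    print(advantages([1.0, 0.5, 2.0], [0.8, 0.9, 1.5]))  # [ 0.2 -0.4  0.5]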
- Is there Value in Reinforcement Learning? [1.534667887016089]
Action-values play a central role in popular Reinforcement Learning (RL) models of behavior. Critics have suggested that policy-gradient (PG) models should be favored over value-based (VB) ones.
arXiv Detail & Related papers (2025-05-07T21:50:27Z)
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
This paper provides a comprehensive analysis of partial identifiability and misspecification in IRL (a classic example of the ambiguity is sketched after this entry).
arXiv Detail & Related papers (2024-11-24T18:35:46Z)
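A classic instance of the ambiguity analysed in this line of work (standard IRL background, not a result quoted from the paper): for any potential function $\Phi : S \rightarrow \mathbb{R}$, the shaped reward

  $R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)$

induces the same optimal policies as $R$, so observing $\pi$ alone can identify $R$ at best up to such transformations.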
- Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning [0.0]
We introduce a general mapping of non-cumulative Markov decision processes (NCMDPs) to standard MDPs (a toy instance of such a mapping is sketched after this entry).
This allows all techniques developed to find optimal policies for MDPs to be directly applied to the larger class of NCMDPs.
We show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems.
arXiv Detail & Related papers (2024-05-22T13:01:37Z)
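A toy instance of such a mapping (an assumed construction for a maximum-reward objective; the paper's general mapping may differ): augment the state with the running maximum $m$ and pay out only increases in it, so the shaped rewards over a trajectory sum to $\max_t r_t - m_0$ (the max, up to a constant) when $m$ is initialized to a lower bound $m_0$ on the reward.

    def augmented_step(step, state, m, action):
        # `step` is the original environment's transition function (assumed
        # signature: state, action -> next_state, reward). The shaped reward
        # m_next - m telescopes: its sum is the maximum reward seen, minus
        # the initial m.
        next_state, r = step(state, action)
        m_next = max(m, r)
        return (next_state, m_next), m_next - m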
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function (the underlying quantile loss is sketched after this entry).
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
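For reference, the quantile-regression ("pinball") loss that value-distribution methods of this kind typically build on (a generic sketch; EQR's exact objective is given in the paper):

    import numpy as np

    def pinball_loss(pred, target, tau):
        # Asymmetric absolute error; its expectation is minimized when
        # `pred` equals the tau-quantile of the target distribution.
        diff = np.asarray(target) - np.asarray(pred)
        return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))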
- IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models [77.0577928874177]
We develop a framework that decomposes vision-and-language (VL) reasoning using large language models (LLMs).
In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE.
arXiv Detail & Related papers (2023-05-24T10:19:57Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
In practice, the models of human behaviour that IRL relies on are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Model Selection in Reinforcement Learning with General Function Approximations [10.97775622611135]
We consider model selection for Reinforcement Learning environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs).
In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, where the true models lie.
We show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes.
arXiv Detail & Related papers (2022-07-06T21:52:07Z)
- On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function [14.205660708980988]
We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model.
A recent lower bound established that the related problem, in which the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries.
In this work, we establish that $\mathrm{poly}(H, d)$ learning is possible (with state value function realizability) whenever the action set is small.
arXiv Detail & Related papers (2021-02-03T13:23:15Z)
- The Value Equivalence Principle for Model-Based Reinforcement Learning [29.368870568214007]
We argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning.
We show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks.
We argue that the principle of value equivalence underlies a number of recent empirical successes in RL.
arXiv Detail & Related papers (2020-11-06T18:25:54Z)
- Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, an agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially (the greedy selection that submodularity enables is sketched after this entry).
arXiv Detail & Related papers (2020-09-21T09:11:36Z)
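A minimal sketch of the greedy selection that submodular value functions make tractable (illustrative; `value_of` and `budget` are assumed names, and the (1 - 1/e) guarantee holds for monotone submodular objectives):

    def greedy_select(sensors, value_of, budget):
        # Repeatedly add the sensor with the largest marginal gain; for a
        # monotone submodular value_of, the result is within (1 - 1/e)
        # of the best subset of the same size.
        chosen = set()
        for _ in range(budget):
            remaining = [s for s in sensors if s not in chosen]
            if not remaining:
                break
            best = max(remaining,
                       key=lambda s: value_of(chosen | {s}) - value_of(chosen))
            chosen.add(best)
        return chosen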
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.