Learning Bellman Complete Representations for Offline Policy Evaluation
- URL: http://arxiv.org/abs/2207.05837v1
- Date: Tue, 12 Jul 2022 21:02:02 GMT
- Title: Learning Bellman Complete Representations for Offline Policy Evaluation
- Authors: Jonathan D. Chang and Kaiwen Wang and Nathan Kallus and Wen Sun
- Abstract summary: Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage.
We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL.
- Score: 51.96704525783913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study representation learning for Offline Reinforcement Learning (RL),
focusing on the important task of Offline Policy Evaluation (OPE). Recent work
shows that, in contrast to supervised learning, realizability of the Q-function
is not enough for learning it. Two sufficient conditions for sample-efficient
OPE are Bellman completeness and coverage. Prior work often assumes that
representations satisfying these conditions are given, with results being
mostly theoretical in nature. In this work, we propose BCRL, which directly
learns from data an approximately linear Bellman complete representation with
good coverage. With this learned representation, we perform OPE using Least Squares Policy Evaluation (LSPE) with linear functions in our learned
representation. We present an end-to-end theoretical analysis, showing that our
two-stage algorithm enjoys polynomial sample complexity provided some
representation in the rich class considered is linear Bellman complete.
Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the DeepMind Control Suite. We show that our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieves competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE) and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both the linear Bellman completeness and coverage components of our method are crucial.
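For readers skimming this page, the linear Bellman completeness condition named in the abstract can be stated compactly; this is the standard definition for a feature map phi, not text taken from the paper:

```latex
% Linear Bellman completeness of a representation \phi:
% the Bellman backup of every linear function in \phi is again linear in \phi.
\[
\forall\, \theta \;\; \exists\, w \;\; \text{such that} \quad
\phi(s,a)^{\top} w
  \;=\; \mathbb{E}\!\left[\, r(s,a) + \gamma\, \phi\big(s',\pi(s')\big)^{\top}\theta \;\middle|\; s,a \,\right]
  \quad \text{for all } (s,a).
\]
```

The second stage described above, LSPE with linear functions in the learned features, can be sketched in a few lines. This is a minimal illustration assuming the learned representation is already available as precomputed feature matrices; the function and argument names are placeholders, not the paper's implementation:

```python
import numpy as np

def lspe(phi_sa, rewards, phi_next_pi, gamma=0.99, reg=1e-3, num_iters=100):
    """Least Squares Policy Evaluation with a linear Q-function.

    phi_sa      : [N, d] learned features of the logged (state, action) pairs
    rewards     : [N]    logged rewards
    phi_next_pi : [N, d] features of (next_state, action drawn from the target policy)
    Returns theta such that Q^pi(s, a) is approximated by phi(s, a) @ theta.
    """
    _, d = phi_sa.shape
    theta = np.zeros(d)
    # Regularized Gram matrix inverse, reused at every iteration.
    gram_inv = np.linalg.inv(phi_sa.T @ phi_sa + reg * np.eye(d))
    for _ in range(num_iters):
        # Regression target: one-step Bellman backup under the current theta.
        targets = rewards + gamma * (phi_next_pi @ theta)
        theta = gram_inv @ (phi_sa.T @ targets)
    return theta

# Hypothetical usage: the OPE estimate is the average predicted value at the
# initial state distribution, e.g. (phi_init @ theta).mean().
```

If the features are (approximately) linear Bellman complete, every regression target in the loop stays (approximately) inside the linear function class, and good coverage keeps the Gram matrix well conditioned; these are the two properties the abstract says BCRL learns its representation to satisfy.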
Related papers
- iQRL -- Implicitly Quantized Representations for Sample-efficient Reinforcement Learning [24.684363928059113]
We propose an efficient representation learning method using only a self-supervised latent-state consistency loss.
We achieve high performance and prevent representation collapse by quantizing the latent representation.
Our method, named iQRL (implicitly Quantized Reinforcement Learning), is straightforward and compatible with any model-free RL algorithm.
arXiv Detail & Related papers (2024-06-04T18:15:44Z)
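The iQRL entry above names two concrete ingredients: quantizing the latent representation and a self-supervised latent-state consistency loss. The sketch below shows one common way such losses are wired together; the straight-through quantizer and the module names are assumptions for illustration only, not the paper's actual architecture:

```python
import torch
import torch.nn.functional as F

def quantize(z, levels=16):
    # Bound each latent dimension and snap it to a fixed grid; the
    # straight-through trick keeps gradients flowing to the encoder.
    z = torch.tanh(z)
    z_q = torch.round(z * (levels / 2)) / (levels / 2)
    return z + (z_q - z).detach()

def latent_consistency_loss(encoder, dynamics, obs, action, next_obs):
    # The latent predicted by the dynamics model should match the quantized
    # encoding of the observed next state (stop-gradient target branch).
    z = quantize(encoder(obs))
    with torch.no_grad():
        z_next_target = quantize(encoder(next_obs))
    z_next_pred = quantize(dynamics(z, action))
    return F.mse_loss(z_next_pred, z_next_target)
```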
- Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning [22.287106840756483]
We show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent challenges of Bellman completeness.
We propose a simple framework called MBRCSL, granting RCSL methods the dynamic-programming ability to stitch together segments from distinct trajectories.
arXiv Detail & Related papers (2023-10-30T07:03:14Z)
- Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z)
- Stackelberg Batch Policy Learning [3.5426153040167754]
Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration.
Worst-case optimality algorithms, which calibrate a value-function model class from logged experience, have emerged as a promising paradigm for batch RL.
We propose a novel gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient.
arXiv Detail & Related papers (2023-09-28T06:18:34Z)
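The StackelbergLearner entry above rests on the distinction between the leader's total derivative and its individual (partial) gradient. In a generic leader-follower formulation, with the follower's parameters treated as a best response to the leader's (notation mine, not taken from the paper):

```latex
% Leader objective f(\theta_L, \theta_F); follower best response \theta_F^{*}(\theta_L).
% The individual gradient \partial f / \partial \theta_L holds \theta_F fixed,
% whereas the total derivative also accounts for the follower's reaction:
\[
\frac{\mathrm{d}}{\mathrm{d}\theta_L}\, f\big(\theta_L, \theta_F^{*}(\theta_L)\big)
  \;=\; \frac{\partial f}{\partial \theta_L}
  \;+\; \left(\frac{\partial \theta_F^{*}(\theta_L)}{\partial \theta_L}\right)^{\!\top}
        \frac{\partial f}{\partial \theta_F}
\]
```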
- Provable Benefit of Multitask Representation Learning in Reinforcement Learning [46.11628795660159]
This paper theoretically characterizes the benefit of representation learning under the low-rank Markov decision process (MDP) model.
To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask reinforcement learning.
arXiv Detail & Related papers (2022-06-13T04:29:02Z)
- Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism [65.46524775457928]
Offline reinforcement learning seeks to utilize offline/historical data to optimize sequential decision-making strategies.
We study the statistical limits of offline reinforcement learning with linear model representations.
arXiv Detail & Related papers (2022-03-11T09:00:12Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
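The "advantage of a positive action over the average case" in the entry above can be computed directly from sampled negatives; a minimal sketch with hypothetical names:

```python
import torch

def sampled_advantage(q_net, state, pos_action, neg_actions):
    # A(s, a+) = Q(s, a+) - mean_i Q(s, a_i-), with a_i- the sampled negative items.
    q_pos = q_net(state, pos_action)                              # shape [batch]
    q_neg = torch.stack([q_net(state, a) for a in neg_actions])   # shape [num_neg, batch]
    return q_pos - q_neg.mean(dim=0)
```

Per the summary, this advantage is what couples the RL component to the supervised sequential-learning objective, presumably by weighting the positive action's supervised loss.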
- Provably Efficient Representation Selection in Low-rank Markov Decision Processes: From Online to Offline RL [84.14947307790361]
We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline reinforcement learning.
We show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection.
For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space.
arXiv Detail & Related papers (2021-06-22T17:16:50Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how value iteration under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.