Related papers: A Unifying View of Coverage in Linear Off-Policy Evaluation

A Unifying View of Coverage in Linear Off-Policy Evaluation

URL: http://arxiv.org/abs/2601.19030v1
Date: Mon, 26 Jan 2026 23:30:24 GMT
Title: A Unifying View of Coverage in Linear Off-Policy Evaluation
Authors: Philip Amortila, Audrey Huang, Akshay Krishnamurthy, Nan Jiang,
Abstract summary: We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ.<n>Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage.
Score: 36.79977028763131
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $$ \textrm{Evaluation error} \le \textrm{poly}(C^π, d, 1/n,\log(1/δ)), $$ where $d$ is the dimension of the features and $C^π$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. With further assumptions -- such as Bellman-completeness -- our definition successfully recovers the coverage parameters specialized to those settings, finally yielding a unified understanding for coverage in linear OPE.

Related papers

A Covering Framework for Offline POMDPs Learning using Belief Space Metric [3.540245474029962]
This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space.<n>By assuming value relevant functions are Lipschitz continuous in the belief space, we derive error bounds that exponential blow ups in horizon and memory length.
arXiv Detail & Related papers (2026-03-03T17:48:20Z)
Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level) We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z)
Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL) We show an analogous result for vanilla Q-functions under a soft margin condition. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes. We propose the underlineProxy variable underlinePessimistic underlinePolicy underlineOptimization (textttP3O) algorithm. textttP3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
A Sharp Characterization of Linear Estimators for Offline Policy Evaluation [33.37672297925897]
offline policy evaluation is a fundamental statistical problem in reinforcement learning. We identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods. Our results provide a complete picture of the behavior of linear estimators for offline policy evaluation.
arXiv Detail & Related papers (2022-03-08T17:52:57Z)
Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage [33.766012922307084]
We study model-based offline Reinforcement Learning with general function approximation. We present an algorithm named Constrained Policy Optimization (CPPO) which leverages a general function class and uses a constraint to encode pessimism.
arXiv Detail & Related papers (2021-07-13T16:30:01Z)
What are the Statistical Limits of Offline RL with Linear Function Approximation? [70.33301077240763]
offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of sequential decision making strategies. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning.
arXiv Detail & Related papers (2020-10-22T17:32:13Z)
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation. We present an provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.