Related papers: A maximum-entropy approach to off-policy evaluation in average-reward MDPs

A maximum-entropy approach to off-policy evaluation in average-reward MDPs

URL: http://arxiv.org/abs/2006.12620v1
Date: Wed, 17 Jun 2020 18:13:37 GMT
Title: A maximum-entropy approach to off-policy evaluation in average-reward MDPs
Authors: Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Gorur, Chris Harris, Dale Schuurmans
Abstract summary: This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs) We provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning.
Score: 54.967872716145656
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.

Related papers

Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge.<n>How do we assign credit to the right actions?<n>This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z)
Off-policy estimation with adaptively collected data: the power of online learning [20.023469636707635]
We consider estimation of a linear functional of the treatment effect using adaptively collected data. We propose a general reduction scheme that allows one to produce a sequence of estimates for the treatment effect via online learning.
arXiv Detail & Related papers (2024-11-19T10:18:27Z)
Maximum a Posteriori Estimation for Linear Structural Dynamics Models Using Bayesian Optimization with Rational Polynomial Chaos Expansions [0.01578888899297715]
We propose an extension to an existing sparse Bayesian learning approach for MAP estimation. We introduce a Bayesian optimization approach, which allows to adaptively enrich the experimental design. By combining the sparsity-inducing learning procedure with the experimental design, we effectively reduce the number of model evaluations.
arXiv Detail & Related papers (2024-08-07T06:11:37Z)
Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return [1.3162012586770577]
We study multi-objective reinforcement learning with nonlinear preferences over trajectories. We derive an extended form of Bellman optimality for nonlinear optimization. We show that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.
arXiv Detail & Related papers (2023-11-05T02:11:07Z)
Entropic Matching for Expectation Propagation of Markov Jump Processes [31.376561087029454]
We propose a novel, tractable latent inference scheme for Markov jump processes. Our approach is based on an entropic matching framework that can be embedded into the well-known expectation propagation algorithm.
arXiv Detail & Related papers (2023-09-27T12:07:21Z)
Efficient expectation propagation for posterior approximation in high-dimensional probit models [1.433758865948252]
We focus on the expectation propagation (EP) approximation of the posterior distribution in Bayesian probit regression. We show how to leverage results on the extended multivariate skew-normal distribution to derive an efficient implementation of the EP routine. This makes EP computationally feasible also in challenging high-dimensional settings, as shown in a detailed simulation study.
arXiv Detail & Related papers (2023-09-04T14:07:19Z)
Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal. We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths. We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates. The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes. We propose the underlineProxy variable underlinePessimistic underlinePolicy underlineOptimization (textttP3O) algorithm. textttP3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z)
Under-Approximating Expected Total Rewards in POMDPs [68.8204255655161]
We consider the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well.
arXiv Detail & Related papers (2022-01-21T16:43:03Z)
Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation. We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency [83.02999769628593]
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning. We show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions. We present the first finite-sample result with first-order efficiency in non-tabular environments.
arXiv Detail & Related papers (2021-02-05T03:20:39Z)
Distributed Stochastic Nonconvex Optimization and Learning based on Successive Convex Approximation [26.11677569331688]
We introduce a novel framework for the distributed algorithmic minimization of the sum of the sum of the agents in a network. We show that the proposed method can be applied to distributed neural networks.
arXiv Detail & Related papers (2020-04-30T15:36:46Z)
Communication-Efficient Distributed Estimator for Generalized Linear Models with a Diverging Number of Covariates [7.427903819459701]
A novel method is proposed to obtain anally efficient estimator for large-scale distributed data by two rounds of communication. In this novel method, the assumption on the number of servers is more relaxed and thus practical for real-world applications.
arXiv Detail & Related papers (2020-01-17T08:51:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.