Towards Optimal Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2503.12283v1
- Date: Sat, 15 Mar 2025 22:41:55 GMT
- Title: Towards Optimal Offline Reinforcement Learning
- Authors: Mengmeng Li, Daniel Kuhn, Tobias Sutter
- Abstract summary: We study offline reinforcement learning problems with a long-run average reward objective. The state-action pairs generated by any fixed behavioral policy thus follow a Markov chain, and the empirical state-action-next-state distribution satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution.
- Score: 9.13232872223434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study offline reinforcement learning problems with a long-run average reward objective. The state-action pairs generated by any fixed behavioral policy thus follow a Markov chain, and the empirical state-action-next-state distribution satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution. We also construct a distribution shift transformation that maps any distribution in this uncertainty set to a state-action-next-state distribution of the Markov chain generated by a fixed evaluation policy, which may differ from the unknown behavioral policy. We prove that the worst-case average reward of the evaluation policy with respect to all distributions in the shifted uncertainty set provides, in a rigorous statistical sense, the least conservative estimator for the average reward under the unknown true distribution. This guarantee is available even if one only has access to a single trajectory of serially correlated state-action pairs. The emerging robust optimization problem can be viewed as a robust Markov decision process with a non-rectangular uncertainty set. We adapt an efficient policy gradient algorithm to solve this problem. Numerical experiments show that our methods compare favorably against state-of-the-art methods.
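To make the construction concrete, below is a minimal Python sketch under simplifying assumptions: a small finite state-action space, a single trajectory of (s, a, r, s') tuples, and a plain KL ball around the empirical state-action-next-state distribution in place of the paper's large-deviations rate function. The distribution shift transformation and the policy gradient solver are not reproduced, and all names (empirical_sas_distribution, kl_worst_case_average_reward) are illustrative, not the authors' code.

```python
import numpy as np

def empirical_sas_distribution(trajectory, n_states, n_actions):
    """Empirical state-action-next-state distribution from a single trajectory.

    `trajectory` is a list of (s, a, r, s_next) tuples collected under the
    (unknown) behavioral policy; serial correlation is simply ignored here.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in trajectory:
        counts[s, a, s_next] += 1.0
        reward_sums[s, a, s_next] += r
    p_hat = counts / counts.sum()
    # mean observed reward per (s, a, s') triple (zero where unobserved)
    r_bar = np.divide(reward_sums, counts,
                      out=np.zeros_like(reward_sums), where=counts > 0)
    return p_hat, r_bar

def kl_worst_case_average_reward(p_hat, r_bar, radius):
    """Pessimistic average reward over a KL ball around p_hat.

    Uses the standard dual of KL-constrained distributionally robust
    optimization:
        min_{KL(q||p) <= radius} E_q[r]
          = -min_{lam > 0} [ lam * radius + lam * log E_p[exp(-r / lam)] ],
    minimized here by a crude grid search over the dual variable lam.
    """
    p, r = p_hat.ravel(), r_bar.ravel()
    best = np.inf
    for lam in np.logspace(-3, 3, 200):
        dual = lam * radius + lam * np.log(np.sum(p * np.exp(-r / lam)))
        best = min(best, dual)
    return -best
```

For example, p_hat, r_bar = empirical_sas_distribution(traj, n_states=5, n_actions=2) followed by kl_worst_case_average_reward(p_hat, r_bar, radius=0.05) returns a conservative estimate of the expected reward over all distributions in this simplified uncertainty set; the paper's estimator additionally shifts the set to the evaluation policy before taking the worst case.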
Related papers
- Likelihood Reward Redistribution [0.0]
We propose a Likelihood Reward Redistribution (LRR) framework.
When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals.
arXiv Detail & Related papers (2025-03-20T20:50:49Z) - Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem: robust policy learning under concept drift. We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
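For context, a reminder of the classical (non-robust) doubly robust value estimator in the contextual-bandit setting, which worst-case estimators of this kind typically build on (generic notation, not the paper's):

\[
\widehat{V}_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\Big[\widehat{\mu}\big(x_i,\pi(x_i)\big) + \frac{\mathbf{1}\{a_i=\pi(x_i)\}}{\widehat{e}(a_i\mid x_i)}\big(y_i-\widehat{\mu}(x_i,a_i)\big)\Big],
\]

where \(\widehat{\mu}\) is a fitted outcome model and \(\widehat{e}\) the estimated propensity; the estimator remains consistent if either model is correctly specified, which is the doubly robust property that the worst-case version extends over a drift set.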
arXiv Detail & Related papers (2024-12-18T19:53:56Z) - Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules, we derive a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy. This is the first time that such strong convergence properties have been established for policy gradient methods.
arXiv Detail & Related papers (2024-09-28T18:56:48Z) - Probabilistic Conformal Prediction with Approximate Conditional Validity [81.30551968980143]
We develop a new method for generating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution.
Our method consistently outperforms existing approaches in terms of conditional coverage.
arXiv Detail & Related papers (2024-07-01T20:44:48Z) - Simultaneous Inference for Local Structural Parameters with Random Forests [19.014535120129338]
We construct simultaneous confidence intervals for solutions to conditional moment equations.
As a by-product, we obtain several new order-explicit results on the concentration and normal approximation of high-dimensional U-statistics.
arXiv Detail & Related papers (2024-05-13T15:46:11Z) - Uncertainty Quantification via Stable Distribution Propagation [60.065272548502]
We propose a new approach for propagating stable probability distributions through neural networks.
Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity.
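A minimal illustration of first-order (local linearization) moment propagation through a ReLU under a Gaussian input, which is the basic operation analyzed here; the snippet is a generic sketch with illustrative names, not the authors' implementation, and it does not reproduce their total-variation optimality argument.

```python
import numpy as np

def relu_local_linearization(mu, var):
    """Propagate a Gaussian N(mu, var) through ReLU via linearization at the mean.

    The ReLU is replaced by its first-order expansion at mu:
        relu(x) ~= relu(mu) + 1[mu > 0] * (x - mu),
    so the output mean is relu(mu) and the variance is scaled by the
    squared local slope (1 if mu > 0, else 0).
    """
    mu, var = np.asarray(mu, dtype=float), np.asarray(var, dtype=float)
    slope = (mu > 0).astype(float)
    return np.maximum(mu, 0.0), slope ** 2 * var
```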
arXiv Detail & Related papers (2024-02-13T09:40:19Z) - Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap [5.0401589279256065]
We revisit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
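As a rough illustration only: the generic recipe of combining a fitted Q/V model with clipped (truncated) per-decision importance ratios is sketched below. The paper's TDR estimators are tailored to the weak-overlap regime and need not coincide with this sketch; all function names and signatures are hypothetical.

```python
import numpy as np

def truncated_dr_value(trajectories, q_hat, v_hat, pi_e, pi_b, gamma=0.99, clip=10.0):
    """Generic truncated doubly robust off-policy value estimate.

    trajectories: list of trajectories, each a list of (s, a, r) tuples.
    q_hat(s, a), v_hat(s): fitted value models for the evaluation policy.
    pi_e(a, s), pi_b(a, s): action probabilities under evaluation / behavior policy.
    Cumulative importance ratios are clipped at `clip` to control variance
    when distributional overlap is weak.
    """
    estimates = []
    for traj in trajectories:
        rho = 1.0    # cumulative importance ratio up to step t-1
        total = 0.0
        for t, (s, a, r) in enumerate(traj):
            total += (gamma ** t) * rho * v_hat(s)           # model term
            rho = min(rho * pi_e(a, s) / pi_b(a, s), clip)   # update and truncate ratio
            total += (gamma ** t) * rho * (r - q_hat(s, a))  # importance-weighted correction
        estimates.append(total)
    return float(np.mean(estimates))
```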
arXiv Detail & Related papers (2024-02-13T03:55:56Z) - Correcting discount-factor mismatch in on-policy policy gradient methods [2.9005223064604078]
We introduce a novel distribution correction to account for the discounted stationary distribution.
Our algorithm consistently matches or exceeds the original performance on several OpenAI gym and DeepMind suite benchmarks.
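The mismatch in question is the commonly dropped gamma^t factor that ties the policy gradient to the discounted stationary distribution; the paper's correction is more general than this, but a minimal REINFORCE-style illustration of the factor itself (illustrative names, not the authors' code) is:

```python
import numpy as np

def reinforce_gradients(trajectory, grad_log_pi, gamma=0.99, correct_discount=True):
    """Per-step policy gradient contributions for one on-policy trajectory.

    trajectory: list of (s, a, r) tuples; grad_log_pi(s, a) returns the score vector.
    With correct_discount=True each step is additionally weighted by gamma**t,
    matching the discounted stationary distribution; many implementations omit
    this factor, which is the mismatch being corrected.
    """
    # discounted returns-to-go
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    grads = []
    for t, ((s, a, _), g_t) in enumerate(zip(trajectory, returns)):
        weight = gamma ** t if correct_discount else 1.0
        grads.append(weight * g_t * grad_log_pi(s, a))
    return grads
```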
arXiv Detail & Related papers (2023-06-23T04:10:58Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms [67.67377846416106]
We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes.
We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules which are contractive in the space of distributions of functions.
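For background on the contraction claim, recall the classical sup-norm contraction of the Bellman optimality operator; the paper lifts this kind of statement to distributions over value functions, which the display below does not capture:

\[
(\mathcal{T}Q)(s,a) = \mathbb{E}\big[r(s,a) + \gamma \max_{a'} Q(s',a')\big],
\qquad
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma\,\|Q_1 - Q_2\|_\infty .
\]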
arXiv Detail & Related papers (2020-03-27T05:13:29Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
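Once such a ratio is available, using it is straightforward; the sketch below shows only the reweighting step with a self-normalized average (estimating the ratio itself, which is GenDICE's actual contribution, is not shown, and all names are illustrative).

```python
import numpy as np

def ratio_weighted_average_reward(states, actions, rewards, tau):
    """Reweight off-policy samples with a stationary distribution ratio estimate.

    tau(s, a) approximates d_pi(s, a) / d_data(s, a), the ratio of the target
    policy's stationary state-action distribution to the data distribution.
    The self-normalized estimate below then approximates the target policy's
    long-run average reward.
    """
    w = np.array([tau(s, a) for s, a in zip(states, actions)])
    return float(np.sum(w * np.asarray(rewards, dtype=float)) / np.sum(w))
```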
arXiv Detail & Related papers (2020-02-21T00:27:52Z) - Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)