Offline Constrained Reinforcement Learning under Partial Data Coverage
- URL: http://arxiv.org/abs/2505.17506v1
- Date: Fri, 23 May 2025 06:00:01 GMT
- Title: Offline Constrained Reinforcement Learning under Partial Data Coverage
- Authors: Kihyuk Hong, Ambuj Tewari,
- Abstract summary: We study offline constrained reinforcement learning (RL) with general function approximation.<n>We propose an oracle-efficient primal-dual algorithm based on a linear programming (LP) formulation.
- Score: 18.449996575976993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study offline constrained reinforcement learning (RL) with general function approximation. We aim to learn a policy from a pre-collected dataset that maximizes the expected discounted cumulative reward for a primary reward signal while ensuring that expected discounted returns for multiple auxiliary reward signals are above predefined thresholds. Existing algorithms either require fully exploratory data, are computationally inefficient, or depend on an additional auxiliary function classes to obtain an $\epsilon$-optimal policy with sample complexity $O(\epsilon^{-2})$. In this paper, we propose an oracle-efficient primal-dual algorithm based on a linear programming (LP) formulation, achieving $O(\epsilon^{-2})$ sample complexity under partial data coverage. By introducing a realizability assumption, our approach ensures that all saddle points of the Lagrangian are optimal, removing the need for regularization that complicated prior analyses. Through Lagrangian decomposition, our method extracts policies without requiring knowledge of the data-generating distribution, enhancing practical applicability.
Related papers
- Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits [58.63897489864948]
Reinforcement learning with outcome-based feedback faces a fundamental challenge.<n>How do we assign credit to the right actions?<n>This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.
arXiv Detail & Related papers (2025-05-26T17:44:08Z) - A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs [18.449996575976993]
We propose a primal dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting.
Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of $O(epsilon-2)$ with partial data coverage assumption.
We extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals.
arXiv Detail & Related papers (2024-02-07T00:33:11Z) - Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z) - Offline Reinforcement Learning with Additional Covering Distributions [0.0]
We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation.
We show that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weak realizable function classes.
arXiv Detail & Related papers (2023-05-22T03:31:03Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $tilde O( d2varepsilon-2)$ episodes to find an $varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear
Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning
with Linear Function Approximation [16.871660060209674]
We study the problem of deployment efficient reinforcement learning (RL) with linear function approximation under the emphreward-free exploration setting.
We propose a new algorithm that collects at most $widetildeO(fracd2H5epsilon2)$ trajectories within $H$ deployments to identify $epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the underlineProxy variable underlinePessimistic underlinePolicy underlineOptimization (textttP3O) algorithm.
textttP3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $proptooperatornamepolylog(K)$ times.
In contrast to existing approaches that update the policy for at least $Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.