Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
- URL: http://arxiv.org/abs/2405.17796v1
- Date: Tue, 28 May 2024 03:47:41 GMT
- Title: Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
- Authors: Jian Qian, Haichen Hu, David Simchi-Levi
- Abstract summary: We introduce a reduction from CMDPs to offline density estimation under the realizability assumption.
A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs.
- Score: 12.847844923530577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2021), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon H (also known as CMDP with H layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class M containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only O(H log T) calls to an offline density estimation algorithm (or oracle) across all T rounds of interaction. This number can be further reduced to O(H log log T) if T is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
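To make the O(H log T) oracle-call count concrete, the following is a minimal, hypothetical sketch (not the authors' actual algorithm) of how a doubling epoch schedule that refits one offline density-estimation oracle per layer at each epoch boundary stays within H * O(log T) oracle calls. The callables `density_oracle` and `act_with_model` and the per-layer data layout are illustrative assumptions; the paper's layerwise exploration-exploitation tradeoff is only gestured at in a comment.

```python
import math

def run_cmdp_reduction(T, H, density_oracle, act_with_model):
    """Hypothetical sketch: refit an offline density-estimation oracle once
    per layer at each doubling-epoch boundary, so the total number of oracle
    calls is at most H * O(log T)."""
    oracle_calls = 0
    data = [[] for _ in range(H)]   # per-layer interaction logs (assumed layout)
    models = [None] * H             # current fitted model per layer
    t, epoch = 1, 0
    while t <= T:
        # Epoch m covers rounds [2^m, 2^(m+1) - 1], so there are O(log T) epochs.
        epoch_end = min(T, 2 ** (epoch + 1) - 1)
        for h in range(H):          # one oracle call per layer per epoch
            models[h] = density_oracle(data[h])
            oracle_calls += 1
        while t <= epoch_end:
            # Act for one round: exploitation of the fitted models mixed with
            # layer-dependent exploration (the layerwise tradeoff; details omitted).
            trajectory = act_with_model(models, t)   # assumed to return H transitions
            for h, transition in enumerate(trajectory):
                data[h].append(transition)
            t += 1
        epoch += 1
    assert oracle_calls <= H * (math.floor(math.log2(T)) + 1)
    return oracle_calls
```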
Related papers
- A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE) [3.1775609005777024]
Large language models are computationally expensive due to their deep structures.
We propose SPADE, a novel decoding method that aligns intermediate layer representations with the output layer.
We create a hybrid early-exit algorithm that monitors confidence levels and stops inference at intermediate layers while using SPADE to generate high-quality outputs.
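The summary above only sketches the mechanism; the snippet below is a hypothetical illustration of confidence-monitored early exit, not the SPADE implementation. The `spade_head` callable (mapping an intermediate representation to a token-probability dict) and the threshold value are assumptions.

```python
def early_exit_generate(hidden_states_per_layer, spade_head, threshold=0.9):
    """Hypothetical sketch: walk the layers, map each intermediate
    representation to a vocabulary distribution with an aligned head, and
    stop as soon as the top token clears a confidence threshold."""
    for layer_idx, hidden in enumerate(hidden_states_per_layer):
        probs = spade_head(hidden)            # token -> probability
        top_token = max(probs, key=probs.get)
        if probs[top_token] >= threshold:     # confident enough: exit early
            return top_token, layer_idx
    # No intermediate layer was confident: fall back to the final layer.
    final_probs = spade_head(hidden_states_per_layer[-1])
    return max(final_probs, key=final_probs.get), len(hidden_states_per_layer) - 1
```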
arXiv Detail & Related papers (2025-07-23T15:49:03Z) - Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [24.299769025346368]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP).
We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee.
Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z) - Learning for Cross-Layer Resource Allocation in MEC-Aided Cell-Free Networks [71.30914500714262]
Cross-layer resource allocation over mobile edge computing (MEC)-aided cell-free networks can fully exploit the transmission and computing resources to improve the data rate.
Joint subcarrier allocation and beamforming optimization are investigated for the MEC-aided cell-free network from the perspective of deep learning.
arXiv Detail & Related papers (2024-12-21T10:18:55Z) - Near-Optimal Learning and Planning in Separated Latent MDPs [70.88315649628251]
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs)
In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs.
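A minimal sketch of this interaction protocol, assuming a gym-style `reset`/`step` interface for each component MDP; the class and its names are illustrative, not taken from the paper.

```python
import random

class LatentMDPEnv:
    """Hypothetical sketch: at the start of each epoch one of the component
    MDPs is drawn from an unknown mixture, and the learner interacts with it
    without ever observing which component was drawn."""
    def __init__(self, component_mdps, mixture_weights):
        self.components = component_mdps   # list of MDP simulators
        self.weights = mixture_weights     # hidden from the learner
        self.current = None

    def start_epoch(self):
        # Latent draw; the index is never revealed to the learner.
        self.current = random.choices(self.components, weights=self.weights)[0]
        return self.current.reset()

    def step(self, action):
        # Observations and rewards come from the hidden component only.
        return self.current.step(action)
```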
arXiv Detail & Related papers (2024-06-12T06:41:47Z) - Offline RL via Feature-Occupancy Gradient Ascent [9.983014605039658]
We study offline Reinforcement Learning in large infinite-horizon discounted Markov Decision Processes (MDPs)
We develop a new algorithm that performs a form of gradient ascent in the space of feature occupancies.
We show that the resulting simple algorithm satisfies strong computational and sample complexity guarantees.
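As a rough illustration only (this summary does not spell out the paper's objective, constraints, or projection step), a hypothetical sketch of gradient ascent in feature-occupancy space with a linear reward fit might look like the following; the simplex-style projection is a placeholder for the real feasibility constraints.

```python
import numpy as np

def feature_occupancy_ascent(phi, rewards, steps=100, lr=0.1):
    """Hypothetical sketch: fit a linear reward model r ~ phi @ theta from
    offline data, then ascend the linear objective theta^T lam over a
    feature-occupancy-style vector lam, renormalizing after each step."""
    theta, _, _, _ = np.linalg.lstsq(phi, rewards, rcond=None)
    lam = np.ones(phi.shape[1]) / phi.shape[1]
    for _ in range(steps):
        lam = lam + lr * theta              # gradient of theta^T lam is theta
        lam = np.maximum(lam, 0.0)          # crude stand-in for the true projection
        lam = lam / max(lam.sum(), 1e-12)
    return lam
```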
arXiv Detail & Related papers (2024-05-22T15:39:05Z) - Q-SHED: Distributed Optimization at the Edge via Hessian Eigenvectors Quantization [5.404315085380945]
Newton-type (NT) methods have been advocated as enablers of robust convergence rates in distributed optimization (DO) problems.
We propose Q-SHED, an original NT algorithm for DO featuring a novel bit-allocation scheme based on incremental Hessian eigenvectors quantization.
We show that Q-SHED can reduce by up to 60% the number of communication rounds required for convergence.
arXiv Detail & Related papers (2023-05-18T10:15:03Z) - PAC Statistical Model Checking of Mean Payoff in Discrete- and Continuous-Time MDP [0.34410212782758043]
We provide the first algorithm to compute mean payoff probably approximately correctly in unknown MDPs.
We do not require any knowledge of the state space, only a lower bound on the minimum transition probability.
In addition to providing probably approximately correct (PAC) bounds for our algorithm, we also demonstrate its practical nature by running experiments on standard benchmarks.
arXiv Detail & Related papers (2022-06-03T09:13:27Z) - Semi-Markov Offline Reinforcement Learning for Healthcare [57.15307499843254]
We introduce three offline RL algorithms, namely, SDQN, SDDQN, and SBCQ.
We experimentally demonstrate that only these algorithms learn the optimal policy in variable-time environments.
We apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention.
arXiv Detail & Related papers (2022-03-17T14:51:21Z) - DESTRESS: Computation-Optimal and Communication-Efficient Decentralized Nonconvex Finite-Sum Optimization [43.31016937305845]
Internet-of-things, networked sensing, autonomous systems and federated learning call for decentralized algorithms for finite-sum optimization.
We develop DEcentralized STochastic REcurSive methodS (DESTRESS) for nonconvex finite-sum optimization.
Detailed theoretical and numerical comparisons show that DESTRESS improves upon prior decentralized algorithms.
arXiv Detail & Related papers (2021-10-04T03:17:41Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
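A minimal sketch of the early-termination idea stated above, assuming a gym-style environment interface; the wrapper and the `constraint_violated` predicate are illustrative, and the paper's construction and its context-model-based solver go beyond this.

```python
class EarlyTerminatedEnv:
    """Hypothetical sketch: end the episode as soon as the safety constraint
    is violated, so a standard unconstrained RL algorithm can be run on the
    wrapped environment."""
    def __init__(self, env, constraint_violated):
        self.env = env
        self.constraint_violated = constraint_violated

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.constraint_violated(obs, info):
            done = True   # terminate early instead of accumulating violations
        return obs, reward, done, info
```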
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - STRIDE along Spectrahedral Vertices for Solving Large-Scale Rank-One Semidefinite Relaxations [27.353023427198806]
We consider solving high-order semidefinite programming (SDP) relaxations of nonconvex polynomial optimization problems (POPs).
Existing approaches, which solve the SDP independently from the POP, either cannot scale to large problems or suffer from slow convergence due to the typical degeneracy of such SDPs.
We propose a new algorithmic framework, STRIDE, that moves along spectrahedral vertices of the SDP.
arXiv Detail & Related papers (2021-05-28T18:07:16Z) - RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret problem for reinforcement learning in latent Markov Decision Processes (LMDP)
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Exploration-Exploitation in Constrained MDPs [79.23623305214275]
We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs)
While learning in an unknown CMDP, an agent should trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge to maximize the reward.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
arXiv Detail & Related papers (2020-03-04T17:03:56Z)