Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning
- URL: http://arxiv.org/abs/2409.14557v3
- Date: Wed, 05 Feb 2025 15:49:21 GMT
- Title: Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning
- Authors: Jia Wan, Sean R. Sinclair, Devavrat Shah, Martin J. Wainwright
- Abstract summary: We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs) in which the state space is partitioned into exogenous and endogenous components.
Exogenous states evolve independently of the agent's actions, while endogenous states evolve deterministically based on both state components and actions.
Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing.
- Score: 44.17068570786194
- Abstract: We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs) where the state space is partitioned into exogenous and endogenous components. Exogenous states evolve stochastically, independent of the agent's actions, while endogenous states evolve deterministically based on both state components and actions. Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing. Our first result is structural, establishing a representational equivalence between the classes of discrete MDPs, Exo-MDPs, and discrete linear mixture MDPs. Specifically, any discrete MDP can be represented as an Exo-MDP, and the transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing that Exo-MDPs are instances of linear mixture MDPs. For unobserved exogenous states, we prove a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$, with $d$ as the size of the exogenous state space, and establish nearly-matching lower bounds. Our findings demonstrate how Exo-MDPs decouple sample complexity from action and endogenous state sizes, and we validate our theoretical insights with experiments on inventory control.
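The exogenous/endogenous split described in the abstract can be made concrete with a small inventory-control example. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the Poisson demand model, cost parameters, capacity, and class name are illustrative choices. Demand plays the role of the exogenous state (stochastic, unaffected by the order decision), while inventory is the endogenous state and updates deterministically from the current inventory, the realized demand, and the order.

```python
import numpy as np

# Minimal Exo-MDP sketch using inventory control (illustrative assumptions:
# Poisson demand, capacity 20, unit holding/stockout costs, class name).
class InventoryExoMDP:
    def __init__(self, capacity=20, demand_rate=5.0,
                 hold_cost=1.0, stockout_cost=4.0, seed=0):
        self.capacity = capacity
        self.demand_rate = demand_rate
        self.hold_cost = hold_cost
        self.stockout_cost = stockout_cost
        self.rng = np.random.default_rng(seed)
        self.inventory = 0  # endogenous state

    def step(self, order_qty):
        # Exogenous state: demand evolves stochastically, independent of the action.
        demand = self.rng.poisson(self.demand_rate)
        # Endogenous state: inventory updates deterministically given
        # (inventory, demand, order_qty).
        stock = min(self.inventory + order_qty, self.capacity)
        next_inventory = max(stock - demand, 0)
        unmet = max(demand - stock, 0)
        reward = -(self.hold_cost * next_inventory + self.stockout_cost * unmet)
        self.inventory = next_inventory
        return next_inventory, demand, reward

env = InventoryExoMDP()
for t in range(3):
    inv, demand, r = env.step(order_qty=6)
    print(t, inv, demand, r)
```

In this framing all of the stochasticity enters through the exogenous demand, which is the structure the abstract exploits to decouple sample complexity from the action and endogenous state sizes.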
Related papers
- A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model [5.238591085233903]
This paper presents a theoretical investigation that stems from the surprising finding that the certainty-equivalence method (CEM) may indeed be viewed as an application of the trajectory tree method (TTM).
We obtain improvements in the sample-complexity upper bounds for CEM for both non-stationary and stationary MDPs.
We also show a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs.
arXiv Detail & Related papers (2025-01-05T20:37:34Z) - Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory [87.62730694973696]
STEEL is the first provably sample-efficient algorithm for learning the controllable dynamics of an Exogenous Block Markov Decision Process from a single trajectory.
We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems.
arXiv Detail & Related papers (2024-10-03T21:57:21Z) - Reinforcement Learning with Exogenous States and Rewards [15.18610763024837]
Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal.
This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into two processes (a minimal sketch of this decomposition appears after the list below).
arXiv Detail & Related papers (2023-03-22T23:37:28Z) - Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
arXiv Detail & Related papers (2022-10-10T21:11:55Z) - MMD-ReID: A Simple but Effective Solution for Visible-Thermal Person ReID [20.08880264104061]
We propose a simple but effective framework, MMD-ReID, that reduces the modality gap by an explicit discrepancy reduction constraint.
We conduct extensive experiments to demonstrate both qualitatively and quantitatively the effectiveness of MMD-ReID.
The proposed framework significantly outperforms the state-of-the-art methods on SYSU-MM01 and RegDB datasets.
arXiv Detail & Related papers (2021-11-09T11:33:32Z) - Provable RL with Exogenous Distractors via Multistep Inverse Dynamics [85.52408288789164]
Real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera.
Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations.
However, such approaches can fail in the presence of temporally correlated noise in the observations.
arXiv Detail & Related papers (2021-10-17T15:21:27Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower-bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower-bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Plannable Approximations to MDP Homomorphisms: Equivariance under Actions [72.30921397899684]
We introduce a contrastive loss function that enforces action equivariance on the learned representations (a minimal sketch of such a loss appears after this list).
We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process.
We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP.
arXiv Detail & Related papers (2020-02-27T08:29:10Z)
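For the entry "Reinforcement Learning with Exogenous States and Rewards" above, the additive reward decomposition can be illustrated with a short sketch. This is a hedged illustration only: the state layout, reward terms, and numbers below are assumptions, not taken from that paper.

```python
# Sketch of an additively decomposed reward: an endogenous part that depends on
# the controllable state and action, plus an exogenous part that does not
# depend on the action (illustrative functional forms, not the paper's).
def reward(s_endo, s_exo, action):
    r_endo = -abs(s_endo - action)   # endogenous: controllable-state/action term
    r_exo = -0.1 * s_exo ** 2        # exogenous: action-independent term
    return r_endo + r_exo

# Because r_exo depends only on the exogenous state, whose evolution is
# unaffected by the agent's actions, its expected cumulative value is the same
# under every policy. Optimizing the endogenous reward alone therefore targets
# the same optimal policy while removing uncontrolled variance from the signal.
for a in range(3):
    print(a, reward(s_endo=1.0, s_exo=2.0, action=a))
```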
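For the entry "Plannable Approximations to MDP Homomorphisms" above, a contrastive action-equivariance loss can be sketched as follows. This is a minimal sketch in the spirit of that paper, not its exact formulation: the margin, distance choices, and module names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Contrastive transition-consistency loss: the predicted latent next state
# should match the encoded true next state (positive term), while staying at
# least `margin` away from an unrelated "negative" state (hinge term).
def equivariance_loss(encoder, transition, s, a, s_next, s_negative, margin=1.0):
    z = encoder(s)                # latent state
    z_pred = transition(z, a)     # predicted latent next state
    z_next = encoder(s_next)      # encoded true next state
    z_neg = encoder(s_negative)   # encoded unrelated state (negative sample)

    positive = F.mse_loss(z_pred, z_next)
    # Hinge keeps negatives at least `margin` apart in latent space, which
    # prevents the encoder from collapsing all states to a single point.
    negative = F.relu(margin - (z_pred - z_neg).pow(2).sum(-1).sqrt()).mean()
    return positive + negative
```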
This list is automatically generated from the titles and abstracts of the papers in this site.