Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning
- URL: http://arxiv.org/abs/2409.14557v2
- Date: Mon, 14 Oct 2024 23:46:09 GMT
- Title: Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning
- Authors: Jia Wan, Sean R. Sinclair, Devavrat Shah, Martin J. Wainwright
- Abstract summary: We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs.
Exo-MDPs provide a natural model for various applications, including inventory control, portfolio management, power systems, and ride-sharing.
- Score: 44.17068570786194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs. They are characterized by a partition of the state space into two components: the exogenous states evolve stochastically in a manner not affected by the agent's actions, whereas the endogenous states can be affected by actions and evolve according to deterministic dynamics involving both the endogenous and exogenous states. Exo-MDPs provide a natural model for various applications, including inventory control, portfolio management, power systems, and ride-sharing, among others. While seemingly restrictive, our first result establishes that any discrete MDP can be represented as an Exo-MDP. The underlying argument reveals how transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing that Exo-MDPs are instances of linear mixture MDPs and thereby establishing a representational equivalence between discrete MDPs, Exo-MDPs, and linear mixture MDPs. The connection between Exo-MDPs and linear mixture MDPs leads to algorithms that are near sample-optimal, with regret guarantees scaling with the (effective) size of the exogenous state space $d$, independent of the sizes of the endogenous state and action spaces, even when the exogenous state is {\em unobserved}. When the exogenous state is unobserved, we establish a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$, where $d$ is the (effective) dimension of the unobserved exogenous state. We also establish a matching regret lower bound of $\Omega(H^{3/2}d\sqrt{K})$ for non-stationary Exo-MDPs and a lower bound of $\Omega(Hd\sqrt{K})$ for stationary Exo-MDPs. We complement our theoretical findings with an experimental study on inventory control problems.
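To make the Exo-MDP structure concrete, here is a minimal Python sketch of the inventory-control application mentioned in the abstract: demand plays the role of the exogenous state (stochastic, unaffected by the agent's orders), while on-hand inventory is the endogenous state and follows a deterministic update given the order and the realized demand. The demand distribution, capacity, and cost parameters below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

CAPACITY = 20                              # maximum inventory the warehouse can hold
DEMAND_SUPPORT = np.arange(6)              # exogenous state space of (effective) size d = 6
DEMAND_PROBS = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])   # unknown to the learner

def sample_exogenous():
    """Exogenous demand: stochastic, independent of actions and of the inventory level."""
    return rng.choice(DEMAND_SUPPORT, p=DEMAND_PROBS)

def endogenous_step(inventory, order, demand):
    """Deterministic endogenous update: the next inventory is a function of
    (endogenous state, action, exogenous state) only, as in the Exo-MDP definition."""
    stocked = min(inventory + order, CAPACITY)
    sold = min(stocked, demand)
    return stocked - sold, sold

def reward(sold, order, holding, price=4.0, cost=2.0, hold_cost=0.1):
    """Revenue minus ordering and holding costs (illustrative parameters)."""
    return price * sold - cost * order - hold_cost * holding

# One trajectory of horizon H under a naive base-stock policy (order up to 10 units).
H, inventory, total = 5, 0, 0.0
for t in range(H):
    order = max(0, 10 - inventory)         # action depends only on the endogenous state
    demand = sample_exogenous()            # exogenous state; may never be observed
    inventory, sold = endogenous_step(inventory, order, demand)
    total += reward(sold, order, inventory)
print(f"return over {H} steps: {total:.2f}")
```

Note that the policy above only uses inventory levels and rewards, never the demand itself; this is the unobserved-exogenous-state regime covered by the $O(H^{3/2}d\sqrt{K})$ bound.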
Related papers
- A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model [5.238591085233903]
This paper presents a theoretical investigation that stems from the surprising finding that the certainty-equivalence method (CEM) may be viewed as an application of the trajectory tree method (TTM).
We obtain improvements in the sample-complexity upper bounds for CEM for both non-stationary and stationary MDPs.
We also show a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax optimality of our upper bound for non-stationary MDPs.
arXiv Detail & Related papers (2025-01-05T20:37:34Z) - Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory [87.62730694973696]
STEEL is the first provably sample-efficient algorithm for learning the controllable dynamics of an Exogenous Block Markov Decision Process from a single trajectory.
We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems.
arXiv Detail & Related papers (2024-10-03T21:57:21Z) - Reinforcement Learning with Exogenous States and Rewards [15.18610763024837]
Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal.
This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into two processes: an exogenous Markov reward process and an endogenous MDP (see the sketch below).
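As a quick numerical illustration of why this decomposition helps (a hedged sketch with assumed toy dynamics, not the construction from the paper above): when the exogenous chain evolves independently of actions and the reward splits additively as r_exo(e) + r_end(x, a), the exogenous part only shifts action values by an action-independent amount, so planning with the endogenous reward alone recovers the optimal policy. For brevity the sketch also assumes the endogenous transitions do not depend on the exogenous state.

```python
import numpy as np

rng = np.random.default_rng(1)
nE, nX, nA, gamma = 3, 4, 2, 0.9   # exogenous states, endogenous states, actions

# Exogenous chain: evolves independently of actions and of the endogenous state.
P_exo = rng.dirichlet(np.ones(nE), size=nE)            # transition e -> e'
r_exo = rng.normal(size=nE)                            # action-independent reward

# Endogenous MDP (kept independent of e here purely to keep the sketch short).
P_end = rng.dirichlet(np.ones(nX), size=(nX, nA))      # transition (x, a) -> x'
r_end = rng.normal(size=(nX, nA))

def greedy_policy(P, r, iters=500):
    """Standard value iteration; P has shape (S, A, S'), r has shape (S, A)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return (r + gamma * P @ V).argmax(axis=1)

# Endogenous-only policy: ignore the exogenous state and reward entirely.
pi_end = greedy_policy(P_end, r_end)

# Full MDP over (e, x) with additive reward r_exo(e) + r_end(x, a).
S = nE * nX
P_full = np.zeros((S, nA, S))
r_full = np.zeros((S, nA))
for e in range(nE):
    for x in range(nX):
        s = e * nX + x
        r_full[s] = r_exo[e] + r_end[x]
        for e2 in range(nE):
            for x2 in range(nX):
                P_full[s, :, e2 * nX + x2] = P_exo[e, e2] * P_end[x, :, x2]

pi_full = greedy_policy(P_full, r_full).reshape(nE, nX)

# The exogenous reward shifts Q-values by an action-independent amount, so the
# greedy policy of the full MDP matches the endogenous-only policy for every e.
print("policies agree for every exogenous state:",
      all((pi_full[e] == pi_end).all() for e in range(nE)))
```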
arXiv Detail & Related papers (2023-03-22T23:37:28Z) - Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
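For readers unfamiliar with particle beliefs, the sketch below shows a basic bootstrap particle filter on an assumed toy hidden Markov model (not the paper's continuous POMDP setting): the weighted-sample belief tracks the exact Bayes filter, and the gap between the two is the approximation error that this line of work quantifies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy discrete HMM standing in for a POMDP belief-tracking problem (assumed example).
nS, nO, N = 5, 3, 1000                       # states, observations, particles
T = rng.dirichlet(np.ones(nS), size=nS)      # transition matrix T[s, s']
Z = rng.dirichlet(np.ones(nO), size=nS)      # observation model Z[s, o]

def particle_belief_update(particles, obs):
    """Bootstrap particle filter step: propagate, weight by likelihood, resample.
    The resulting empirical distribution approximates the exact Bayes belief."""
    propagated = np.array([rng.choice(nS, p=T[s]) for s in particles])
    weights = Z[propagated, obs]
    weights /= weights.sum()
    return rng.choice(propagated, size=len(particles), p=weights)

def exact_belief_update(belief, obs):
    """Exact Bayes filter, for comparison with the particle approximation."""
    predicted = belief @ T
    posterior = predicted * Z[:, obs]
    return posterior / posterior.sum()

belief = np.full(nS, 1.0 / nS)
particles = rng.choice(nS, size=N, p=belief)
for obs in rng.integers(nO, size=10):        # a random observation sequence
    belief = exact_belief_update(belief, obs)
    particles = particle_belief_update(particles, obs)

empirical = np.bincount(particles, minlength=nS) / N
print("exact belief:    ", np.round(belief, 3))
print("particle belief: ", np.round(empirical, 3))
```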
arXiv Detail & Related papers (2022-10-10T21:11:55Z) - MMD-ReID: A Simple but Effective Solution for Visible-Thermal Person ReID [20.08880264104061]
We propose a simple but effective framework, MMD-ReID, that reduces the modality gap by an explicit discrepancy reduction constraint.
We conduct extensive experiments to demonstrate both qualitatively and quantitatively the effectiveness of MMD-ReID.
The proposed framework significantly outperforms the state-of-the-art methods on SYSU-MM01 and RegDB datasets.
arXiv Detail & Related papers (2021-11-09T11:33:32Z) - Provable RL with Exogenous Distractors via Multistep Inverse Dynamics [85.52408288789164]
Real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera.
Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations.
However, such approaches can fail in the presence of temporally correlated noise in the observations.
arXiv Detail & Related papers (2021-10-17T15:21:27Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Plannable Approximations to MDP Homomorphisms: Equivariance under Actions [72.30921397899684]
We introduce a contrastive loss function that enforces action equivariance on the learned representations.
We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process.
We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP.
arXiv Detail & Related papers (2020-02-27T08:29:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.