Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?
- URL: http://arxiv.org/abs/2601.20694v1
- Date: Wed, 28 Jan 2026 15:23:50 GMT
- Title: Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?
- Authors: Hao Liang, Jiayu Cheng, Sean R. Sinclair, Yali Du
- Abstract summary: Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from inputs that evolve independently of the learner's actions. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs.
- Score: 10.117197604524465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner's actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves $\widetilde{O}(H^2|\Xi|\sqrt{K})$ regret. For large, continuous endogenous state spaces, we introduce LSVI-PE, a simple linear-approximation method whose regret is polynomial in the feature dimension, the size of the exogenous state space, and the horizon, independent of the endogenous state and action spaces. Our analysis introduces two new tools: counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to have accurate value estimates without optimism. Experiments on synthetic and resource-management tasks show that PEL consistently outperforms baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.
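To make the exploitation-only idea concrete, here is a minimal sketch of a greedy least-squares value iteration loop with no optimism bonus, in the spirit of LSVI-PE. Everything below (function names, the ridge-regression form, the data layout) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def fit_greedy_q(data, phi, actions, H, d, lam=1.0):
    """Exploitation-only LSVI sketch: ridge-regress Q_h on features,
    back up from h = H-1 down to h = 0, and add no exploration bonus."""
    w = [np.zeros(d) for _ in range(H + 1)]      # w[H] stays zero
    for h in reversed(range(H)):
        A = lam * np.eye(d)
        b = np.zeros(d)
        for (s, a, r, s_next) in data[h]:        # transitions observed at step h
            # greedy backup: max over actions of phi(s', a') . w_{h+1}
            v_next = max(phi(s_next, ap) @ w[h + 1] for ap in actions)
            x = phi(s, a)
            A += np.outer(x, x)
            b += x * (r + v_next)
        w[h] = np.linalg.solve(A, b)
    return w  # the greedy policy plays argmax_a phi(s, a) @ w[h]
```

A PEL-style learner would simply refit after each episode and act greedily on the result; the paper's claim is that in Exo-MDPs this alone yields sublinear regret.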
Related papers
- Greedy Is Enough: Sparse Action Discovery in Agentic LLMs [11.62669179647184]
Empirical evidence suggests that only a small subset of actions meaningfully influences performance in a given deployment. Motivated by this observation, we study a contextual linear reward model in which action influence is governed by a structured sparsity assumption. Our results identify sparse action discovery as a fundamental principle underlying large-action decision-making.
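As a rough illustration (not the paper's algorithm), one assumed way to recover a sparse set of influential actions under a contextual linear reward is an L1-penalized regression over per-action coefficient blocks:

```python
import numpy as np
from sklearn.linear_model import Lasso

def discover_sparse_actions(contexts, actions, rewards, n_actions, tol=1e-6):
    """Illustrative sketch: encode the chosen action as a one-hot block
    interacted with the context, run a lasso regression, and keep the
    actions whose coefficient blocks are non-zero."""
    n, d = contexts.shape
    X = np.zeros((n, n_actions * d))
    for i, (c, a) in enumerate(zip(contexts, actions)):
        X[i, a * d:(a + 1) * d] = c                  # block for chosen action
    model = Lasso(alpha=0.1).fit(X, rewards)
    coef = model.coef_.reshape(n_actions, d)
    return np.where(np.abs(coef).max(axis=1) > tol)[0]  # influential actions
```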
arXiv Detail & Related papers (2026-01-13T07:15:32Z)
- Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning [44.17068570786194]
We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs). Exogenous states evolve independently of the agent's actions, while endogenous states evolve deterministically based on both state components and actions. Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing.
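That structural split can be stated in a few lines of code; the demand process and update rule below are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def exo_mdp_step(endo, exo, action, f):
    """One Exo-MDP transition: the exogenous state (e.g. demand) is sampled
    independently of the action, while the endogenous state (e.g. inventory)
    moves deterministically via f."""
    exo_next = rng.poisson(lam=3.0)          # assumed exogenous demand process
    endo_next = f(endo, exo, action)         # deterministic endogenous update
    return endo_next, exo_next

# Example: inventory control, order `order` units, sell min(stock, demand).
inventory_f = lambda stock, demand, order: max(stock - demand, 0) + order
```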
arXiv Detail & Related papers (2024-09-22T18:45:38Z)
- Submodular Reinforcement Learning [38.40138241424851]
In reinforcement learning (RL), rewards of states are typically considered additive and, following the Markov assumption, independent of states visited previously.
In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously.
We propose submodular RL (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions, which capture diminishing returns.
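A toy example (ours, not the paper's) of a submodular, history-dependent reward is set coverage, where the marginal value of a state shrinks as similar states accumulate:

```python
def coverage_reward(trajectory, cover):
    """Toy submodular reward: each state covers a set of cells, and the
    return is the size of the union, so revisiting similar states adds
    less and less (diminishing returns)."""
    covered = set()
    total = 0.0
    for s in trajectory:
        gain = len(cover(s) - covered)   # marginal gain shrinks over time
        covered |= cover(s)
        total += gain
    return total

# Example: each state covers three cells; the repeated visit adds nothing.
cover = lambda s: set(range(s, s + 3))
print(coverage_reward([0, 1, 1], cover))  # 3 + 1 + 0 = 4
```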
arXiv Detail & Related papers (2023-07-25T09:46:02Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while balancing exploration and exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
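Schematically, and only following the abstract's description (the paper's exact objective may differ), the fused objective can be read as a single maximization over hypotheses:

```python
import numpy as np

def mex_select(hypotheses, value_of, nll_of, eta=1.0):
    """Schematic of a maximize-to-explore style objective: pick the
    hypothesis (model/value candidate) maximizing estimated optimal value
    minus a fitting loss, so exploration and exploitation are traded off
    by one maximization rather than an explicit bonus."""
    scores = [value_of(f) - eta * nll_of(f) for f in hypotheses]
    return hypotheses[int(np.argmax(scores))]
```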
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part I [57.29427648134142]
We study the task of learning state representations from potentially high-dimensional observations. We pursue a cost-driven approach, where a dynamic model in some latent state space is learned by predicting the costs without predicting the observations or actions.
arXiv Detail & Related papers (2022-12-30T01:42:04Z)
- Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation [69.1524391595912]
Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks.
This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration.
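For reference, the epsilon-greedy rule in question is the standard one: act uniformly at random with probability epsilon, greedily otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """Myopic exploration: with probability epsilon pick a uniformly random
    action, otherwise act greedily w.r.t. the current Q-estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```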
arXiv Detail & Related papers (2022-06-19T14:44:40Z)
- Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z)
- Provable RL with Exogenous Distractors via Multistep Inverse Dynamics [85.52408288789164]
Real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera.
Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations.
However, such approaches can fail in the presence of temporally correlated noise in the observations.
arXiv Detail & Related papers (2021-10-17T15:21:27Z)
- Learning to Stop with Surprisingly Few Samples [17.46537996825982]
We consider a discounted infinite horizon optimal stopping problem.
If the underlying distribution is known a priori, the solution of this problem is obtained via dynamic programming.
When information on this distribution is lacking, a natural (though naive) approach is "explore-then-exploit".
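A minimal sketch of that explore-then-exploit baseline, assuming i.i.d. offers and a known discount factor (both assumptions for illustration): estimate the offer distribution from an exploration phase, then solve the empirical fixed point $V = \mathbb{E}[\max(X, \gamma V)]$ and stop at the first offer above $\gamma V$.

```python
import numpy as np

def explore_then_exploit_threshold(samples, gamma=0.95, iters=200):
    """Naive explore-then-exploit for discounted optimal stopping: use the
    exploration samples as an empirical offer distribution, iterate the
    contraction V <- mean(max(X, gamma * V)), and accept offers above
    gamma * V. (Illustrative; the paper shows far fewer samples suffice.)"""
    x = np.asarray(samples, dtype=float)
    v = x.mean()
    for _ in range(iters):                     # fixed-point iteration
        v = np.maximum(x, gamma * v).mean()
    return gamma * v                           # stopping threshold
```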
arXiv Detail & Related papers (2021-02-19T16:51:07Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how a more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
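Roughly, the key estimator combines a kernel-smoothed average with an uncertainty bonus; the Gaussian kernel and bonus form below are assumptions for illustration, not Kernel-UCBVI's exact rule:

```python
import numpy as np

def kernel_ucb_reward(s, S, R, bandwidth=0.5, beta=1.0):
    """Sketch of a kernel-based optimistic reward estimate: a
    Nadaraya-Watson average of observed rewards near s, plus a bonus
    that shrinks with the effective number of nearby samples."""
    w = np.exp(-np.sum((S - s) ** 2, axis=1) / (2 * bandwidth ** 2))
    n_eff = w.sum() + 1e-8                      # effective local sample count
    return (w @ R) / n_eff + beta / np.sqrt(n_eff)
```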
arXiv Detail & Related papers (2020-04-12T12:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.