$κ$-Explorer: A Unified Framework for Active Model Estimation in MDPs
- URL: http://arxiv.org/abs/2602.20404v1
- Date: Mon, 23 Feb 2026 22:56:32 GMT
- Title: $κ$-Explorer: A Unified Framework for Active Model Estimation in MDPs
- Authors: Xihe Gu, Urbashi Mitra, Tara Javidi
- Abstract summary: We introduce a parameterized family of objective functions $U_κ$ that explicitly incorporate intrinsic estimation complexity and visitation frequency. We propose $κ$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. Experiments on benchmark MDPs demonstrate that $κ$-Explorer provides superior performance compared to existing exploration strategies.
- Score: 20.944349513772067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In tabular Markov decision processes (MDPs) with perfect state observability, each trajectory provides active samples from the transition distributions conditioned on state-action pairs. Consequently, accurate model estimation depends on how the exploration policy allocates visitation frequencies in accordance with the intrinsic complexity of each transition distribution. Building on recent work on coverage-based exploration, we introduce a parameterized family of decomposable and concave objective functions $U_κ$ that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Moreover, the curvature $κ$ provides a unified treatment of various global objectives, such as the average-case and worst-case estimation error objectives. Using the closed-form characterization of the gradient of $U_κ$, we propose $κ$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. The diminishing-returns structure of $U_κ$ naturally prioritizes underexplored and high-variance transitions, while preserving smoothness properties that enable efficient optimization. We establish tight regret guarantees for $κ$-Explorer and further introduce a fully online and computationally efficient surrogate algorithm for practical use. Experiments on benchmark MDPs demonstrate that $κ$-Explorer provides superior performance compared to existing exploration strategies.
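The abstract fixes the algorithmic skeleton (a decomposable concave objective with a closed-form gradient, optimized Frank-Wolfe-style over the occupancy polytope) without giving pseudocode. The sketch below is a minimal illustration under stated assumptions: the per-pair utility $c_{s,a}\,\lambda_{s,a}^{1-κ}/(1-κ)$ is a guessed member of a $U_κ$-style diminishing-returns family, not the paper's exact objective, and `greedy_occupancy` is a hypothetical stand-in for the linear-maximization oracle (finite-horizon planning with the gradient as reward).

```python
import numpy as np

def kappa_gradient(lam, complexity, kappa, eps=1e-8):
    """Gradient of an assumed decomposable concave objective
    U_kappa(lam) = sum_{s,a} c_{s,a} * lam_{s,a}^{1-kappa} / (1 - kappa).
    This functional form is an illustrative guess, not the paper's exact U_kappa."""
    return complexity * np.power(lam + eps, -kappa)

def greedy_occupancy(P, reward, horizon):
    """Linear-maximization oracle: occupancy measure of the policy that is greedy
    for `reward`, via finite-horizon value iteration. P: (S, A, S), reward: (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    policies = []
    for _ in range(horizon):
        Q = reward + P @ V            # (S, A): one-step lookahead values
        policies.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    # Roll the greedy policy forward from a uniform start to get its occupancy.
    occ = np.zeros((S, A))
    d = np.full(S, 1.0 / S)
    for pi in reversed(policies):     # policies were built backward in time
        occ[np.arange(S), pi] += d
        d = np.einsum("s,st->t", d, P[np.arange(S), pi])
    return occ / horizon

def frank_wolfe_explore(P, complexity, kappa, horizon, iters=50):
    """Frank-Wolfe over occupancy measures: a sketch of the kappa-Explorer idea."""
    S, A, _ = P.shape
    lam = np.full((S, A), 1.0 / (S * A))
    for t in range(1, iters + 1):
        g = kappa_gradient(lam, complexity, kappa)
        target = greedy_occupancy(P, g, horizon)  # argmax <g, lam'> over the polytope
        step = 2.0 / (t + 2)                      # standard FW step size
        lam = (1 - step) * lam + step * target
    return lam
```

The diminishing-returns shape of the assumed utility is what makes the gradient large at underexplored or high-complexity pairs, which is the prioritization behavior the abstract describes.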
Related papers
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
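As background for the $f$-divergence view (this is the standard definition, not the $f$-PO algorithm itself): a discrete $f$-divergence is $D_f(p \| q) = \sum_x q(x) f(p(x)/q(x))$ for a convex generator $f$ with $f(1) = 0$, and different generators recover familiar divergences.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

# KL divergence: f(t) = t*log(t); total variation: f(t) = |t - 1| / 2.
p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
kl = f_divergence(p, q, lambda t: t * np.log(t))
tv = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))
```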
arXiv Detail & Related papers (2024-10-29T02:11:45Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation [53.17668583030862]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation.
We propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP).
We show that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*))\sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and the log-covering number of the hypothesis class, respectively.
arXiv Detail & Related papers (2024-04-19T06:24:22Z) - Scalable Online Exploration via Coverability [45.66375686120087]
Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation.
We introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and satisfies three fundamental desiderata.
$L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability.
arXiv Detail & Related papers (2024-03-11T10:14:06Z) - Minimax Optimal Online Imitation Learning via Replay Estimation [47.83919594113314]
We introduce a technique of replay estimation to reduce this empirical variance.
We show that our approach achieves the optimal $\widetilde{O}\left(\min(H^{3/2}/N,\, H/\sqrt{N})\right)$ dependency.
arXiv Detail & Related papers (2022-05-30T19:29:56Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem under the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
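The summary does not specify the update rule; a generic stochastic momentum (moving-average gradient) step of the kind such analyses study might look like the following, purely as an illustration and not the paper's exact method.

```python
import numpy as np

def momentum_step(w, m, grad, lr=0.01, beta=0.9):
    """One heavy-ball-style update: m tracks a moving average of stochastic
    gradients; this averaging is the mechanism typically credited with the
    improved O(1/eps^4) complexity in this line of work (illustrative only)."""
    m = beta * m + (1 - beta) * grad
    w = w - lr * m
    return w, m
```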
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Low-rank State-action Value-function Approximation [11.026561518386025]
Several problems with high-dimensional state spaces admit value functions that are well-approximated by an intrinsic low-rank structure.
This paper proposes different algorithms to estimate a low-rank factorization of the $Q(s, a)$ matrix.
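One standard way to impose such structure, shown below as a hedged baseline rather than the paper's specific estimators, is a truncated SVD of the state-action value matrix.

```python
import numpy as np

def low_rank_q(Q, rank):
    """Best rank-r approximation of the |S| x |A| matrix Q(s, a) via truncated
    SVD (Eckart-Young). The cited paper proposes several estimation algorithms;
    this is only the simplest baseline factorization."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```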
arXiv Detail & Related papers (2021-04-18T10:31:39Z) - Using Distance Correlation for Efficient Bayesian Optimization [0.0]
We propose a BO scheme named BDC, which integrates BO with a statistical measure of association of two random variables called Distance Correlation. BDC balances exploration and exploitation automatically, and requires no manual hyperparameter tuning. We evaluate BDC on a range of benchmark tests and observe that it performs on par with popular BO methods.
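Distance correlation itself is a standard statistic (Székely et al.); how BDC wires it into the acquisition loop is in the paper. A minimal implementation for two 1-D samples:

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation of two 1-D samples. Returns a value in
    [0, 1]; it is 0 (asymptotically) iff x and y are independent."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max(float((A * B).mean()), 0.0)
    denom = np.sqrt(float((A * A).mean()) * float((B * B).mean()))
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```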
arXiv Detail & Related papers (2021-02-17T19:37:35Z) - Bayesian Optimization of Risk Measures [7.799648230758491]
We consider Bayesian optimization of objective functions of the form $\rho[F(x, W)]$, where $F$ is a black-box expensive-to-evaluate function.
We propose a family of novel Bayesian optimization algorithms that exploit the structure of the objective function to substantially improve sampling efficiency.
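Here $\rho$ is a risk measure such as VaR or CVaR over the environmental randomness $W$. A plain Monte Carlo estimate of $\mathrm{CVaR}_\alpha[F(x, W)]$, ignoring the paper's structure-exploiting acquisition functions, looks like this; `F` and the Gaussian choice for $W$ are placeholder assumptions.

```python
import numpy as np

def cvar_estimate(F, x, alpha=0.95, n=10_000, rng=None):
    """Monte Carlo CVaR_alpha[F(x, W)]: the mean of the worst (1 - alpha)
    tail of F(x, W) over draws of W (here assumed standard normal)."""
    rng = np.random.default_rng(rng)
    w = rng.standard_normal(n)
    values = np.sort(F(x, w))                 # ascending; losses, so worst = largest
    tail = values[int(np.ceil(alpha * n)):]
    return float(tail.mean())

# Example black box: F(x, W) = (x - 1)^2 + x * W
print(cvar_estimate(lambda x, w: (x - 1) ** 2 + x * w, x=0.5))
```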
arXiv Detail & Related papers (2020-07-10T18:20:46Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Efficient Rollout Strategies for Bayesian Optimization [15.050692645517998]
Most acquisition functions are myopic, meaning that they only consider the impact of the next function evaluation.
We show that a combination of quasi-Monte Carlo, common random numbers, and control variates significantly reduces the computational burden of rollout.
We then formulate a policy-search based approach that removes the need to optimize the rollout acquisition function.
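A minimal sketch of the common-random-numbers idea in this context (QMC and control variates layer on top similarly): when comparing the rollout values of two candidate points, reusing the same random draws makes the shared noise cancel in the difference. The `value_fn` interface below is a hypothetical stand-in for a rollout of the acquisition policy.

```python
import numpy as np

def compare_rollouts(value_fn, x1, x2, n=1_000, seed=0):
    """Estimate E[value_fn(x1, Z)] - E[value_fn(x2, Z)] with common random
    numbers: the same draws Z are used for both candidates, so noise shared
    between them cancels in the difference, shrinking estimator variance."""
    z = np.random.default_rng(seed).standard_normal(n)
    return float(np.mean(value_fn(x1, z) - value_fn(x2, z)))
```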
arXiv Detail & Related papers (2020-02-24T20:54:08Z)