Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence
- URL: http://arxiv.org/abs/2603.03523v1
- Date: Tue, 03 Mar 2026 21:15:22 GMT
- Title: Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence
- Authors: Shengbo Wang,
- Abstract summary: We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces. We propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs.
- Score: 10.189658648290257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
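The abstract does not spell out the update rules, so the following is only a minimal sketch of the weight-based representation it describes: one signed atom per visited state-action pair, kernel integration to reconstruct the action-value estimate, and $O(n)$ memory and $O(n)$ work per iteration. The Gaussian kernel, the TD-style weight update, and the omission of the coupled stationary-distribution estimate are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


class QMeasureSketch:
    """Hypothetical weight-based kernel Q-estimate (illustration only)."""

    def __init__(self, actions, gamma=0.95, bandwidth=0.5, lr=0.1):
        self.actions = actions   # finite action set; states are continuous vectors
        self.gamma = gamma       # discount factor
        self.h = bandwidth       # kernel bandwidth (controls the smoothing bias)
        self.lr = lr             # stochastic-approximation step size
        self.atoms = []          # visited (state, action) pairs, one atom each
        self.weights = []        # signed weights of the empirical Q-measure

    def _kernel(self, sa):
        # Gaussian kernel between the query pair and every stored atom: O(n) work.
        diffs = np.linalg.norm(np.asarray(self.atoms) - sa, axis=1)
        return np.exp(-0.5 * (diffs / self.h) ** 2)

    def q_value(self, state, action):
        # Reconstruct Q(s, a) by integrating the kernel against the signed measure.
        if not self.atoms:
            return 0.0
        sa = np.concatenate([state, [action]])
        return float(self._kernel(sa) @ np.asarray(self.weights))

    def update(self, state, action, reward, next_state):
        # One assumed TD-style stochastic-approximation step on the measure.
        target = reward + self.gamma * max(
            self.q_value(next_state, a) for a in self.actions
        )
        td_error = target - self.q_value(state, action)
        self.atoms.append(np.concatenate([state, [action]]))
        self.weights.append(self.lr * td_error)  # new signed atom at (s_t, a_t)
```

In this sketch the bandwidth plays the role of the smoothing parameter from the abstract: the induced Q-function can at best converge to the fixed point of a kernel-smoothed Bellman operator, and the gap to the optimal $Q^*$ is controlled by the bandwidth.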
Related papers
- Stabilizing Fixed-Point Iteration for Markov Chain Poisson Equations [49.702772230127465]
We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution.
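For context, the object being solved for is the Markov chain Poisson equation, written here in its textbook form as an assumption rather than a quotation from the paper: $(I - P)u = f - (\pi^{\top} f)\mathbf{1}$ for a cost vector $f$ and stationary distribution $\pi$, with the plain fixed-point iteration $u_{k+1} = P u_k + f - (\pi^{\top} f)\mathbf{1}$. Because $P$ has peripheral eigenvalues on the unit circle, this iteration is not a strict contraction on all of $\mathbb{R}^n$, which is what passing to the quotient $\mathbb{R}^n/\mathcal{K}(P)$ repairs.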
arXiv Detail & Related papers (2026-01-31T02:57:01Z) - Reinforcement Learning with Function Approximation for Non-Markov Processes [2.0136462287587675]
We study reinforcement learning methods with linear function approximation under non-Markov state and cost processes. We show that the algorithm converges under suitable ergodicity conditions on the underlying non-Markov processes. We derive explicit error bounds for the limits of the resulting learning algorithms.
arXiv Detail & Related papers (2026-01-01T00:56:18Z) - Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation [5.638124543342179]
We present the first robust temporal-difference learning algorithm with linear function approximation. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts.
arXiv Detail & Related papers (2025-10-02T07:01:41Z) - Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning [55.197497603087065]
We analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximation. We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains.
arXiv Detail & Related papers (2025-02-19T15:33:55Z) - Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [32.74649239695449]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP). We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
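A hedged sketch of the sampled-maximization idea described in this entry; the tabular setting, natural-log subset size, and update rule below are illustrative assumptions rather than the paper's StochDQN/StochDDQN implementations.

```python
import math
import random

# Approximate the argmax over n discrete actions using a random subset of
# roughly O(log(n)) candidates instead of a full scan.
def stochastic_max_action(q_row, rng=random):
    n = len(q_row)
    k = max(1, math.ceil(math.log(n)))   # subset size ~ log(n)
    candidates = rng.sample(range(n), k)
    return max(candidates, key=lambda a: q_row[a])

# One tabular Q-learning step whose bootstrap maximum is restricted to the
# sampled action subset, so the per-step cost in the action dimension is
# O(log(n)) rather than O(n).
def stochastic_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    a_star = stochastic_max_action(Q[s_next])
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_star] - Q[s, a])
```

The same sampled maximization can also drive action selection for the behavior policy, matching the summary's point that the approach is used for both value-function updates and action selection.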
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observable Markov decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
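A minimal sketch in the spirit of this entry, assuming a generic kernel LSTD formulation: the value function is represented as $V(s) = \sum_i \alpha_i k(s, s_i)$ over observed states, and the coefficients solve a regularized Bellman fixed-point system. The Gaussian kernel and ridge term are illustrative choices, not the paper's exact estimator, whose error bounds depend on the eigenvalues of the associated kernel operator.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between row-stacked state batches X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def kernel_lstd(states, next_states, rewards, gamma=0.95, reg=1e-3):
    """Fit V(s) ~ k(s, states) @ alpha from observed (s, r, s') transitions."""
    K = gaussian_kernel(states, states)            # k(s_i, s_j)
    K_next = gaussian_kernel(next_states, states)  # k(s'_i, s_j)
    # Bellman fixed point V(s_i) = r_i + gamma * V(s'_i) with V = K @ alpha,
    # regularized to keep the linear system well conditioned.
    A = K - gamma * K_next + reg * np.eye(len(states))
    alpha = np.linalg.solve(A, rewards)
    return lambda s: float(gaussian_kernel(np.atleast_2d(s), states) @ alpha)
```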
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)