A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model
- URL: http://arxiv.org/abs/2507.22854v1
- Date: Wed, 30 Jul 2025 17:24:23 GMT
- Title: A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model
- Authors: Andris Ambainis, Joao F. Doriguello, Debbie Lim
- Abstract summary: We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a "simulator". We show that it is possible to avoid several paradigms from RL, like "optimism in the face of uncertainty" and "posterior sampling", and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works.
- Score: 0.40964539027092906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a "simulator". By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like "optimism in the face of uncertainty" and "posterior sampling" and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend logarithmically on the number of time steps $T$, thus breaking the $O(\sqrt{T})$ classical barrier. This matches the time dependence of the prior quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), but with improved dependence on other parameters like state space size $S$ and action space size $A$. For infinite-horizon MDPs, our classical and quantum bounds still maintain the $O(\sqrt{T})$ dependence but with better $S$ and $A$ factors. Nonetheless, we propose a novel measure of regret for infinite-horizon MDPs with respect to which our quantum algorithms have $\operatorname{poly}\log{T}$ regret, exponentially better compared to classical algorithms. Finally, we generalise all of our results to compact state spaces.
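To make the hybrid access model concrete, here is a minimal illustrative sketch (not the paper's algorithm): a tabular agent that mostly interacts online but, in periodic generative phases, queries a simulator gen(s, a) -> (s', r) at state-action pairs of its choosing, refits an empirical model, and recomputes a policy by backward induction. All interfaces and parameter names (env_reset, env_step, gen, n_gen) are assumptions for illustration.

```python
import numpy as np

def hybrid_learner(env_reset, env_step, gen, S, A, H, n_episodes, n_gen):
    P = np.full((S, A, S), 1.0 / S)        # crude initial transition model
    R = np.zeros((S, A))                   # estimated mean rewards
    policy = np.zeros((H, S), dtype=int)
    for ep in range(n_episodes):
        if ep % 10 == 0:                   # periodic generative phase ("a bit of freedom")
            for s in range(S):
                for a in range(A):
                    draws = [gen(s, a) for _ in range(n_gen)]
                    P[s, a] = np.bincount([d[0] for d in draws], minlength=S) / n_gen
                    R[s, a] = np.mean([d[1] for d in draws])
            V = np.zeros(S)
            for h in reversed(range(H)):   # backward induction on the empirical MDP
                Q = R + P @ V              # Q[s, a] = R[s, a] + sum_{s'} P[s, a, s'] V[s']
                policy[h] = Q.argmax(axis=1)
                V = Q.max(axis=1)
        s = env_reset()                    # then deploy the policy online for an episode
        for h in range(H):
            s, _ = env_step(s, policy[h, s])
    return policy
```

Recomputing the policy directly from the refitted model, rather than through optimism bonuses or posterior sampling, mirrors the "compute and use optimal policies directly" idea described in the abstract.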
Related papers
- Sum-of-Squares inspired Quantum Metaheuristic for Polynomial Optimization with the Hadamard Test and Approximate Amplitude Constraints [76.53316706600717]
The recently proposed quantum algorithm of arXiv:2206.14999 is based on semidefinite programming (SDP).
We generalize the SDP-inspired quantum algorithm to sum-of-squares.
Our results show that our algorithm is suitable for large problems and approximates the best known classical solutions.
arXiv Detail & Related papers (2024-08-14T19:04:13Z)
- Solving The Travelling Salesman Problem Using A Single Qubit [0.0]
The travelling salesman problem (TSP) is a popular NP-hard combinatorial optimization problem. We present an algorithm that solves an arbitrary TSP using a single qubit by invoking the principle of quantum parallelism. The underlying framework of our algorithm is a quantum version of the classical Brachistochrone approach.
arXiv Detail & Related papers (2024-07-24T12:06:37Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms; a toy sketch of the posterior-sampling loop follows this entry.
arXiv Detail & Related papers (2023-06-05T03:57:16Z)
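As context for the Bayesian-learning entry above, here is a toy sketch of posterior (Thompson) sampling for a tabular MDP with known rewards: sample a plausible transition model from a Dirichlet posterior, solve it by value iteration, and act greedily for one episode. Fixed-length episodes stand in for the paper's dynamically-sized episodes, and all names (env_step, ep_len) are illustrative assumptions.

```python
import numpy as np

def thompson_episode(counts, R, env_step, s0, ep_len, gamma, rng):
    S, A, _ = counts.shape
    # Sample a plausible MDP from the Dirichlet posterior over each row P(. | s, a).
    P_sample = np.array([[rng.dirichlet(counts[s, a] + 1.0) for a in range(A)]
                         for s in range(S)])
    # Solve the sampled MDP by value iteration, then act greedily for one episode.
    V = np.zeros(S)
    for _ in range(200):
        V = (R + gamma * P_sample @ V).max(axis=1)
    policy = (R + gamma * P_sample @ V).argmax(axis=1)
    s = s0
    for _ in range(ep_len):
        a = policy[s]
        s_next = env_step(s, a)          # environment returns the next state
        counts[s, a, s_next] += 1.0      # posterior update from the observed transition
        s = s_next
    return counts
```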
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret [23.418957451727255]
We propose a novel UCRL-style algorithm for quantum reinforcement learning (RL).
We establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes.
Specifically, we develop a quantum algorithm based on value-targeted regression (VTR) for linear mixture MDPs with $d$-dimensional linear representation.
arXiv Detail & Related papers (2023-02-21T16:23:11Z)
- Quantum Algorithms for Reinforcement Learning with a Generative Model [16.168901236223117]
Reinforcement learning studies how an agent should interact with an environment to maximize its cumulative reward.
We design quantum algorithms that approximate an optimal policy ($\pi^*$), the optimal value function ($v^*$), and the optimal $Q$-function ($q^*$).
We show that our quantum algorithm for computing $q^*$ is optimal by proving a matching quantum lower bound; a classical plug-in baseline is sketched after this entry for comparison.
arXiv Detail & Related papers (2021-12-15T19:51:49Z)
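The entry above designs quantum algorithms for $\pi^*$, $v^*$, and $q^*$ under a generative model. For reference, this is a sketch of the standard classical plug-in baseline such results are measured against: draw $N$ generative samples per state-action pair and run $Q$-value iteration on the empirical model. Names (gen, N, iters) are illustrative assumptions.

```python
import numpy as np

def plug_in_q_star(gen, S, A, N, gamma, iters):
    P_hat = np.zeros((S, A, S))
    R_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next, r = gen(s, a)        # one generative sample of (next state, reward)
                P_hat[s, a, s_next] += 1.0 / N
                R_hat[s, a] += r / N
    Q = np.zeros((S, A))
    for _ in range(iters):                   # Q-value iteration on the empirical MDP
        Q = R_hat + gamma * P_hat @ Q.max(axis=1)
    return Q                                 # approximates q* as N and iters grow
```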
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings; a simplified sketch of the update follows this entry.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
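A heavily simplified sketch of the reference-advantage idea named in the entry above, assuming a discounted tabular setting: the temporal-difference target is split into a fresh, low-variance advantage term $V(s') - V_{\mathrm{ref}}(s')$ and an averaged reference term. The UCB/LCB bonuses, learning-rate schedule, and early-settling rule of the actual algorithm are omitted, so this only conveys the variance-reduction structure.

```python
import numpy as np

def ref_advantage_update(Q, V, V_ref, mu_ref, n, s, a, r, s_next, gamma):
    n[s, a] += 1
    eta = 1.0 / n[s, a]                   # simplified step size
    # Running mean of the reference value at observed next states (averaged term).
    mu_ref[s, a] += eta * (V_ref[s_next] - mu_ref[s, a])
    # Variance-reduced target: fresh advantage sample + averaged reference term.
    target = r + gamma * (V[s_next] - V_ref[s_next]) + gamma * mu_ref[s, a]
    Q[s, a] += eta * (target - Q[s, a])
    V[s] = Q[s].max()
    return Q, V, mu_ref, n
```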
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- More Practical and Adaptive Algorithms for Online Quantum State Learning [11.836183463815653]
In this paper, we develop algorithms to advance the online learning of quantum states.
First, we show that the Regularized Follow-the-Leader (RFTL) method with Tsallis-2 entropy can achieve an $O(\sqrt{MT})$ total loss with perfect hindsight; a toy RFTL sketch follows this entry.
Second, we propose a parameter-free algorithm based on a classical adjusting learning rate schedule.
arXiv Detail & Related papers (2020-06-01T15:17:55Z)
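As a rough illustration of the RFTL entry above: with the Tsallis-2 regularizer $\|X\|_F^2$, the RFTL step over density matrices reduces to a Euclidean projection of a lazy gradient step onto the set of unit-trace PSD matrices, computed by projecting the eigenvalues onto the probability simplex. The loss/gradient interface below is an assumption for illustration.

```python
import numpy as np

def project_to_density(X):
    """Euclidean projection of a Hermitian matrix onto {rho : rho >= 0, Tr rho = 1}."""
    w, U = np.linalg.eigh(X)
    u = np.sort(w)[::-1]                          # eigenvalues, descending
    css = np.cumsum(u)
    # Standard simplex-projection threshold over the sorted eigenvalues.
    k = np.nonzero(u + (1 - css) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (1 - css[k]) / (k + 1)
    w_proj = np.maximum(w + theta, 0)
    return (U * w_proj) @ U.conj().T              # U diag(w_proj) U^dagger

def rftl_tsallis2(grads, dim, eta=0.1):
    """Lazy RFTL with ||X||_F^2 regularizer: project -(eta/2) * (sum of past gradients)."""
    G = np.zeros((dim, dim), dtype=complex)
    for g in grads:                               # g: Hermitian gradient of the round's loss
        G += g
        yield project_to_density(-(eta / 2) * G)  # next density-matrix iterate
```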
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret; a toy master loop is sketched below.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
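Finally, a toy sketch of the meta-algorithm structure from the model-selection entry above: treat each base algorithm as an arm of an EXP3-style master. The paper's actual master is more careful (smoothing and importance weighting tailored to the base guarantees); play_round is a hypothetical interface in which a base acts, observes a reward in $[0, 1]$, and updates itself.

```python
import numpy as np

def meta_select(base_algs, rounds, eta, rng):
    M = len(base_algs)
    weights = np.ones(M)
    total = 0.0
    for _ in range(rounds):
        p = (1 - eta) * weights / weights.sum() + eta / M   # EXP3 exploration mix
        i = rng.choice(M, p=p)                              # pick a base algorithm
        reward = base_algs[i].play_round()                  # base i acts, observes, self-updates
        total += reward
        weights[i] *= np.exp(eta * reward / (M * p[i]))     # importance-weighted update
    return total
```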
This list is automatically generated from the titles and abstracts of the papers on this site.