Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning
- URL: http://arxiv.org/abs/2601.13642v1
- Date: Tue, 20 Jan 2026 06:21:54 GMT
- Title: Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning
- Authors: Yuchen Jiao, Jiin Woo, Gen Li, Gauri Joshi, Yuejie Chi
- Abstract summary: We study a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption. We prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required.
- Score: 51.820005667020354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.
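The paper itself does not include code, but the single-agent algorithm the abstract describes can be illustrated concretely. Below is a minimal sketch of one standard form of average-reward (relative) Q-learning under a generative model; the function names, the reference-state anchoring choice, and the constant step size `eta` are illustrative assumptions, not the authors' exact update rule or parameter schedule.

```python
import numpy as np

def avg_reward_q_learning(sample_next_state, reward, n_states, n_actions,
                          n_iters=100_000, eta=0.1, ref_state=0):
    """Synchronous average-reward (relative) Q-learning sketch.

    sample_next_state(s, a) draws s' ~ P(. | s, a) from a generative model,
    and reward[s, a] is the mean reward. Subtracting a reference value each
    step keeps Q tracking the bias function h* (up to an additive constant)
    instead of growing by the optimal gain every iteration.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        for s in range(n_states):
            for a in range(n_actions):
                s_next = sample_next_state(s, a)
                # Reference anchoring: f(Q) = max_a Q(ref_state, a).
                ref = Q[ref_state].max()
                target = reward[s, a] + Q[s_next].max() - ref
                Q[s, a] += eta * (target - Q[s, a])
    return Q, Q.argmax(axis=1)  # Q estimate and its greedy policy
```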
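For the federated claim, the headline is an $M$-fold reduction in per-agent samples with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds. The sketch below shows the generic local-update-plus-periodic-averaging pattern such results rely on; the `sample(s, a)` interface and the round and step counts are hypothetical, and the paper's actual protocol may differ.

```python
import numpy as np

def federated_q_learning(agents, n_states, n_actions,
                         n_rounds=100, local_steps=500, eta=0.05):
    """Federated wrapper sketch: M agents run local relative Q-learning
    updates from their own generative-model samples; a server averages
    the Q-tables once per communication round.
    """
    Q = np.zeros((n_states, n_actions))   # model broadcast at each round
    for _ in range(n_rounds):
        local_Qs = []
        for sample in agents:             # in practice these run in parallel
            Q_m = Q.copy()
            for _ in range(local_steps):
                s = np.random.randint(n_states)
                a = np.random.randint(n_actions)
                r, s_next = sample(s, a)  # hypothetical (reward, next state)
                ref = Q_m[0].max()        # reference-state anchoring
                target = r + Q_m[s_next].max() - ref
                Q_m[s, a] += eta * (target - Q_m[s, a])
            local_Qs.append(Q_m)
        Q = np.mean(local_Qs, axis=0)     # one communication round
    return Q
```

The averaging step is where the collaboration gain comes from: pooling $M$ independent local estimates reduces the stochastic error roughly $M$-fold per round, which is the intuition behind the $\frac{1}{M}$ factor in the per-agent sample complexity, while infrequent synchronization keeps the communication cost low.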
Related papers
- A finite time analysis of distributed Q-learning [6.663174194579773]
Multi-agent reinforcement learning (MARL) has witnessed a remarkable surge in interest, fueled by the empirical success of single-agent reinforcement learning (RL). In this study, we consider a distributed Q-learning scenario, wherein a number of agents cooperatively solve a sequential decision-making problem without access to the central reward function, which is the average of the local rewards.
arXiv Detail & Related papers (2024-05-23T00:52:38Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
For weakly communicating MDPs, we establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2024-03-18T04:52:11Z) - Optimal Sample Complexity for Average Reward Markov Decision Processes [1.0445957451908694]
We develop an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A| t_{\text{mix}}^2 \epsilon^{-2})$.
This marks the first algorithm and analysis to reach the literature's lower bound.
arXiv Detail & Related papers (2023-10-13T03:08:59Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes, the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}\left(H \varepsilon^{-3} \ln \frac{1}{\delta}\right)$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+\frac{t_{\mathsf{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)