Multi-Agent Off-Policy TD Learning: Finite-Time Analysis with
Near-Optimal Sample Complexity and Communication Complexity
- URL: http://arxiv.org/abs/2103.13147v1
- Date: Wed, 24 Mar 2021 12:48:08 GMT
- Title: Multi-Agent Off-Policy TD Learning: Finite-Time Analysis with
Near-Optimal Sample Complexity and Communication Complexity
- Authors: Ziyi Chen, Yi Zhou, Rongrong Chen
- Abstract summary: We develop two decentralized TD with correction (TDC) algorithms for multi-agent off-policy TD learning.
Our algorithms preserve full privacy of the actions, policies and rewards of the agents, and adopt mini-batch sampling to reduce the sampling variance and communication frequency.
- Score: 13.100926925535578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The finite-time convergence of off-policy TD learning has been
comprehensively studied recently. However, such a type of convergence has not
been well established for off-policy TD learning in the multi-agent setting,
which covers broader applications and is fundamentally more challenging. This
work develops two decentralized TD with correction (TDC) algorithms for
multi-agent off-policy TD learning under Markovian sampling. In particular, our
algorithms preserve full privacy of the actions, policies and rewards of the
agents, and adopt mini-batch sampling to reduce the sampling variance and
communication frequency. Under Markovian sampling and linear function
approximation, we prove that the finite-time sample complexity of both
algorithms for achieving an $\epsilon$-accurate solution is in the order of
$\mathcal{O}(\epsilon^{-1}\ln \epsilon^{-1})$, matching the near-optimal sample
complexity of centralized TD(0) and TDC. Importantly, the communication
complexity of our algorithms is in the order of $\mathcal{O}(\ln
\epsilon^{-1})$, which is significantly lower than the communication complexity
$\mathcal{O}(\epsilon^{-1}\ln \epsilon^{-1})$ of the existing decentralized
TD(0). Experiments corroborate our theoretical findings.
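To make the algorithmic ingredients above concrete, here is a minimal single-agent sketch (not the paper's exact decentralized algorithms) of one mini-batch TDC update with linear function approximation and importance weighting, plus a hypothetical neighbor-averaging step of the kind a decentralized variant would run once per mini-batch; the helper names, the mixing matrix W, and the step-size choices are assumptions for illustration.

```python
import numpy as np

def tdc_minibatch_update(theta, w, batch, gamma, alpha, beta):
    """One mini-batch TDC (TD with gradient correction) update.

    theta : main parameter of the linear value estimate V(s) = phi(s) @ theta
    w     : auxiliary TDC parameter
    batch : iterable of (phi, phi_next, reward, rho), where rho is the
            importance weight pi(a|s) / b(a|s) of the sampled action
    """
    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for phi, phi_next, r, rho in batch:
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        # The correction term gamma * (w @ phi) * phi_next is what
        # distinguishes TDC from plain off-policy TD(0).
        g_theta += rho * (delta * phi - gamma * (w @ phi) * phi_next)
        g_w += rho * (delta - w @ phi) * phi
    n = len(batch)
    theta = theta + alpha * g_theta / n   # main-parameter step
    w = w + beta * g_w / n                # auxiliary-parameter step
    return theta, w

def consensus_average(local_thetas, W):
    """Hypothetical gossip step: each agent averages its parameter with its
    neighbors' using a doubly stochastic mixing matrix W.
    local_thetas has shape (num_agents, d); W has shape (num_agents, num_agents).
    """
    return W @ local_thetas
```

Averaging once per mini-batch rather than once per sample is what keeps communication infrequent: roughly, with mini-batches of size on the order of $\epsilon^{-1}$, only $\mathcal{O}(\ln \epsilon^{-1})$ communication rounds are needed, which is consistent with the sample and communication complexities stated above.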
Related papers
- Fully First-Order Methods for Decentralized Bilevel Optimization [17.20330936572045]
This paper focuses on decentralized stochastic bilevel optimization (DSBO), where agents only communicate with their neighbors.
We propose Decentralized Gradient Descent and Ascent with Gradient Tracking (DSGDA-GT), a novel algorithm that only requires first-order oracles that are much cheaper than second-order oracles widely adopted in existing works.
arXiv Detail & Related papers (2024-10-25T06:11:43Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- PRECISION: Decentralized Constrained Min-Max Learning with Low Communication and Sample Complexities [25.153506493249854]
We present an adaptive multi-agent learning technique for min-max optimization problems.
We also propose an algorithm called PRECISION that reduces the number of iterations required.
arXiv Detail & Related papers (2023-03-05T00:26:10Z)
- Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
- Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic [4.94128206910124]
We introduce a fully decentralized Actor-Critic (AC) algorithm, where actor, critic, and global reward estimator are updated in an alternating manner.
We show that our algorithm has sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ under Markovian sampling.
We also provide a local action privacy-preserving version of our algorithm and its analysis.
arXiv Detail & Related papers (2022-06-12T13:14:14Z)
- Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis [27.21581944906418]
Actor-critic (AC) algorithms have been widely adopted in decentralized multi-agent systems.
We develop two decentralized AC and natural AC (NAC) algorithms that are private as well as sample- and communication-efficient.
arXiv Detail & Related papers (2021-09-08T15:02:21Z)
- Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms [58.57004511121862]
This paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling.
We show that AC and NAC attain an order-wise performance improvement over PG and NPG in the infinite-horizon setting, due to the incorporation of the critic.
arXiv Detail & Related papers (2020-04-27T17:11:06Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms: policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$ (an illustrative TD-AMSGrad sketch follows this entry).
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
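For the TD-AMSGrad result above, the following is an illustrative sketch (assumed details, not the cited paper's exact algorithm) of a single TD(0) step whose semi-gradient direction is passed through an AMSGrad-style adaptive step size; the function name and hyperparameter defaults are hypothetical.

```python
import numpy as np

def td_amsgrad_step(theta, m, v, v_hat, phi, phi_next, r,
                    gamma=0.99, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One TD(0) semi-gradient step with an AMSGrad-type adaptive step size.

    theta       : linear value-function parameter, V(s) = phi(s) @ theta
    m, v, v_hat : AMSGrad state (first moment, second moment, running max of v)
    """
    delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
    g = -delta * phi                                     # descent direction (negative semi-gradient)
    m = beta1 * m + (1 - beta1) * g                      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g                  # second-moment estimate
    v_hat = np.maximum(v_hat, v)                         # AMSGrad keeps a monotone max of v
    theta = theta - alpha * m / (np.sqrt(v_hat) + eps)   # adaptive parameter update
    return theta, m, v, v_hat
```

Replacing v_hat with v recovers a plain Adam-style TD step; the monotone maximum is the AMSGrad modification that convergence analyses of this kind typically rely on.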