Scaling Distributed Multi-task Reinforcement Learning with Experience
Sharing
- URL: http://arxiv.org/abs/2307.05834v1
- Date: Tue, 11 Jul 2023 22:58:53 GMT
- Title: Scaling Distributed Multi-task Reinforcement Learning with Experience
Sharing
- Authors: Sanae Amani, Khushbu Pahwa, Vladimir Braverman, Lin F. Yang
- Abstract summary: DARPA launched the ShELL program, which aims to explore how experience sharing can benefit distributed lifelong learning agents.
We conduct both theoretical and empirical research on distributed multi-task reinforcement learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks.
We propose an algorithm called DistMT-LSVI, in which the agents first identify the tasks and then exchange information through a central server to derive $\epsilon$-optimal policies for all $M$ tasks.
- Score: 38.883540444516605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, DARPA launched the ShELL program, which aims to explore how
experience sharing can benefit distributed lifelong learning agents in adapting
to new challenges. In this paper, we address this issue by conducting both
theoretical and empirical research on distributed multi-task reinforcement
learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks
without prior knowledge of their identities. We approach the problem by
formulating it as linearly parameterized contextual Markov decision processes
(MDPs), where each task is represented by a context that specifies the
transition dynamics and rewards. To tackle this problem, we propose an
algorithm called DistMT-LSVI. First, the agents identify the tasks, and then
they exchange information through a central server to derive $\epsilon$-optimal
policies for the tasks. Our research demonstrates that to achieve
$\epsilon$-optimal policies for all $M$ tasks, a single agent using DistMT-LSVI
needs to run a total number of episodes that is at most
$\tilde{\mathcal{O}}({d^3H^6(\epsilon^{-2}+c_{\rm sep}^{-2})}\cdot M/N)$, where
$c_{\rm sep}>0$ is a constant representing task separability, $H$ is the
horizon of each episode, and $d$ is the feature dimension of the dynamics and
rewards. Notably, DistMT-LSVI improves the sample complexity of non-distributed
settings by a factor of $1/N$, as each agent independently learns
$\epsilon$-optimal policies for all $M$ tasks using
$\tilde{\mathcal{O}}(d^3H^6M\epsilon^{-2})$ episodes. Additionally, we provide
numerical experiments conducted on OpenAI Gym Atari environments that validate
our theoretical findings.
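As a rough illustration of the protocol described above (task identification followed by information exchange through a central server), the sketch below shows how per-task least-squares statistics could be pooled when the $M$ tasks are split across $N$ agents. It is a minimal schematic under our own assumptions: the helper names (CentralServer, run_episode, the task-assignment rule) are hypothetical and not the authors' DistMT-LSVI implementation, and a real LSVI procedure would build optimistic Q-estimates backward over the horizon rather than a single ridge regression.
```python
import numpy as np

# Illustrative sketch only: hypothetical helpers, not the authors' DistMT-LSVI code.
d, H, M, N = 8, 5, 6, 3            # feature dim, horizon, number of tasks, number of agents
rng = np.random.default_rng(0)

class CentralServer:
    """Pools per-task least-squares statistics uploaded by the agents."""
    def __init__(self, n_tasks, dim):
        self.cov = [np.eye(dim) for _ in range(n_tasks)]    # regularized Gram matrices
        self.mom = [np.zeros(dim) for _ in range(n_tasks)]  # feature-weighted regression targets

    def upload(self, task_id, features, targets):
        # Sufficient statistics add up, so the order of agent uploads does not matter.
        self.cov[task_id] += features.T @ features
        self.mom[task_id] += features.T @ targets

    def broadcast(self, task_id):
        # Ridge-regression estimate of the task's value parameters, shared with every agent.
        return np.linalg.solve(self.cov[task_id], self.mom[task_id])

def run_episode(task_id, horizon, dim):
    """Stand-in for real environment interaction: per-step features and regression targets."""
    phi = rng.normal(size=(horizon, dim))
    y = phi @ rng.normal(size=dim) + 0.1 * rng.normal(size=horizon)
    return phi, y

server = CentralServer(M, d)
for agent in range(N):
    # Each agent covers roughly M/N of the tasks, which is where the 1/N saving comes from.
    my_tasks = [t for t in range(M) if t % N == agent]
    for task in my_tasks:
        for _ in range(4):                      # per-task episode budget for this agent
            phi, y = run_episode(task, H, d)
            server.upload(task, phi, y)

theta_hat = [server.broadcast(t) for t in range(M)]   # one parameter estimate per task
print([round(float(np.linalg.norm(th)), 2) for th in theta_hat])
```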
Related papers
- Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task
Adaptation [45.90925587972781]
Large Language Models (LLMs) can solve a variety of tasks, such as text summarization and mathematical question answering.
Due to high computational costs, the current trend is to use prompt instruction tuning to better adjust monolithic, pretrained LLMs for new -- but often individual -- downstream tasks.
MoPs can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios.
arXiv Detail & Related papers (2023-10-04T14:11:12Z) - Improved Active Multi-Task Representation Learning via Lasso [44.607652031235716]
In this paper, we show the dominance of the L1-regularized-relevance-based ($\nu^1$-based) strategy by giving a lower bound for the $\nu^2$-based strategy.
We also characterize the potential of our $\nu^1$-based strategy in sample-cost-sensitive settings.
arXiv Detail & Related papers (2023-06-05T03:08:29Z) - Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left(\frac{k n_x H}{N_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$.
arXiv Detail & Related papers (2022-12-01T00:14:35Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Joint Representation Training in Sequential Tasks with Shared Structure [40.1056491921582]
We introduce the Shared-MatrixRL algorithm for the setting of Multitask MatrixRL.
We show that the regret on the $P$ tasks can be improved from $O(PHd\sqrt{NH})$ to $O((Hd\sqrt{rP} + HP\sqrt{rd})\sqrt{NH})$ over $N$ episodes of horizon $H$ (a numeric sanity check of these two bounds appears after this list).
arXiv Detail & Related papers (2022-06-24T18:10:00Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses of the statistical benefit of multitask representation learning in linear Markov Decision Processes (MDPs).
We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy that is $\tilde{O}\big(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2 \kappa d}{NT}+\frac{\kappa d}{n}}\big)$ suboptimal.
arXiv Detail & Related papers (2021-06-15T11:21:06Z) - Multitask Online Mirror Descent [35.93027027759005]
We analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks.
Our extensions of Online Gradient Descent and Exponentiated Gradient, two important instances of OMD, are shown to enjoy closed-form updates.
arXiv Detail & Related papers (2021-06-04T10:14:57Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\,C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z) - Task-agnostic Exploration in Reinforcement Learning [35.403304641170386]
We present an efficient task-agnostic reinforcement learning algorithm, UCBZero.
It finds $\epsilon$-optimal policies for $N$ arbitrary tasks after at most $\tilde{O}(\log(N)H^5SA/\epsilon^2)$ exploration episodes.
We also provide an $\Omega(\log(N)H^2SA/\epsilon^2)$ lower bound, showing that the $\log$ dependency on $N$ is unavoidable.
arXiv Detail & Related papers (2020-06-16T20:23:41Z)
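For the Shared-MatrixRL entry above, a quick numeric check clarifies when the shared-representation regret bound beats the independent-task one. The snippet below merely evaluates the leading terms quoted in that summary, with arbitrarily chosen values for $P$, $H$, $d$, $r$, and $N$ and with all constants and log factors dropped; it is an illustration, not a claim from the paper.
```python
import math

# Leading terms of the two regret bounds quoted above, constants and log factors dropped.
# Parameter values are arbitrary and only serve to illustrate the comparison.
P, H, d, r, N = 50, 10, 100, 5, 10_000   # tasks, horizon, ambient dim, shared rank, episodes

independent = P * H * d * math.sqrt(N * H)                                # O(P H d sqrt(NH))
shared = (H * d * math.sqrt(r * P) + H * P * math.sqrt(r * d)) * math.sqrt(N * H)

print(f"independent ~ {independent:.3e}, shared ~ {shared:.3e}")
# When r is much smaller than both d and P, the shared-representation bound is smaller.
```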