Multitask Online Mirror Descent
- URL: http://arxiv.org/abs/2106.02393v1
- Date: Fri, 4 Jun 2021 10:14:57 GMT
- Title: Multitask Online Mirror Descent
- Authors: Nicolò Cesa-Bianchi, Pierre Laforgue, Andrea Paudice, Massimiliano Pontil
- Abstract summary: We analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks.
Our extensions of Online Gradient Descent and Exponentiated Gradient, two important instances of OMD, are shown to enjoy closed-form updates.
- Score: 35.93027027759005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce and analyze MT-OMD, a multitask generalization of Online Mirror
Descent (OMD) which operates by sharing updates between tasks. We prove that
the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\sqrt{T}$, where
$\sigma^2$ is the task variance according to the geometry induced by the
regularizer, $N$ is the number of tasks, and $T$ is the time horizon. Whenever
tasks are similar, that is, $\sigma^2 \le 1$, this improves upon the
$\sqrt{NT}$ bound obtained by running independent OMDs on each task. Our
multitask extensions of Online Gradient Descent and Exponentiated Gradient, two
important instances of OMD, are shown to enjoy closed-form updates, making them
easy to use in practice. Finally, we provide numerical experiments on four
real-world datasets which support our theoretical findings.
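To make the idea of sharing updates between tasks concrete, below is a minimal, illustrative sketch of a multitask online gradient descent in which each gradient step is partially broadcast to all tasks. The uniform sharing coefficient and averaging scheme here are assumptions for illustration only; they are not the paper's exact regularizer-induced MT-OMD construction.

```python
import numpy as np

def mt_ogd_step(W, task, grad, eta=0.1, sharing=0.3):
    """One multitask online gradient descent step (illustrative sketch).

    W       : (N, d) array, one weight vector per task
    task    : index of the task active at this round
    grad    : gradient of the active task's loss at W[task]
    sharing : in [0, 1]; 0 recovers N independent OGD runs, while
              larger values broadcast more of each update to all tasks
              (assumption: the paper couples tasks through the
              regularizer, not through this simple convex combination)
    """
    N = W.shape[0]
    update = eta * grad
    W = W.copy()
    # The active task keeps a (1 - sharing) fraction of its step...
    W[task] -= (1 - sharing) * update
    # ...and a sharing/N fraction is distributed to every task,
    # so the active task receives (1 - sharing + sharing/N) in total.
    W -= (sharing / N) * update
    return W

# Toy run: two similar tasks with squared loss against close targets.
W = np.zeros((2, 3))
targets = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
for t in range(200):
    i = t % 2                      # tasks arrive in round-robin order
    g = W[i] - targets[i]          # gradient of 0.5 * ||w - target||^2
    W = mt_ogd_step(W, i, g)
```

With `sharing=0` this degenerates to independent per-task OGD, the baseline whose $\sqrt{NT}$ regret the abstract compares against; when the tasks are similar, the shared fraction of each update also moves the inactive tasks toward their own optima.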
Related papers
- Regret Analysis of Multi-task Representation Learning for Linear-Quadratic Adaptive Control [3.7603027627883363]
We analyze the regret of multi-task representation learning for linear-quadratic control.
We demonstrate that for settings where exploration is "benign", the regret of any agent after $T$ timesteps scales as $\tilde{O}(\sqrt{T/H})$.
In settings with "difficult" exploration, the regret scales as $\tilde{O}(\sqrt{d_u d_\theta}\sqrt{T} + T^{3/4}/H^{1/5})$, where $d_\theta$
arXiv Detail & Related papers (2024-07-08T09:41:42Z) - Metalearning with Very Few Samples Per Task [19.78398372660794]
We consider a binary classification setting where tasks are related by a shared representation.
Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task.
Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning.
arXiv Detail & Related papers (2023-12-21T16:06:44Z) - Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z) - Scaling Distributed Multi-task Reinforcement Learning with Experience
Sharing [38.883540444516605]
DARPA launched the ShELL program, which aims to explore how experience sharing can benefit distributed lifelong learning agents.
We conduct both theoretical and empirical research on distributed multi-task reinforcement learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks.
We propose an algorithm called DistMT-LSVI, where each agent independently learns $\epsilon$-optimal policies for all $M$ tasks.
arXiv Detail & Related papers (2023-07-11T22:58:53Z) - FAMO: Fast Adaptive Multitask Optimization [48.59232177073481]
We introduce Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way.
Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques.
arXiv Detail & Related papers (2023-06-06T15:39:54Z) - Simplifying and Understanding State Space Models with Diagonal Linear
RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
arXiv Detail & Related papers (2022-12-01T18:53:06Z) - Joint Representation Training in Sequential Tasks with Shared Structure [40.1056491921582]
We introduce the Shared-MatrixRL algorithm for the setting of Multitask MatrixRL.
We show the regret on the $P$ tasks can be improved from $O(PHd\sqrt{NH})$ to $O((Hd\sqrt{rP} + HP\sqrt{rd})\sqrt{NH})$ over $N$ episodes of horizon $H$.
arXiv Detail & Related papers (2022-06-24T18:10:00Z) - On Steering Multi-Annotations per Sample for Multi-Task Learning [79.98259057711044]
The study of multi-task learning has drawn great attention from the community.
Despite the remarkable progress, the challenge of optimally learning different tasks simultaneously remains to be explored.
Previous works attempt to modify the gradients from different tasks. Yet these methods give a subjective assumption of the relationship between tasks, and the modified gradient may be less accurate.
In this paper, we introduce Task Allocation (STA), a mechanism that addresses this issue by randomly allocating each sample a subset of tasks.
For further progress, we propose Interleaved Task Allocation (ISTA) to iteratively allocate all
arXiv Detail & Related papers (2022-03-06T11:57:18Z) - Multi-task Representation Learning with Stochastic Linear Bandits [29.8208189270894]
We study the problem of transfer-learning in the setting of linear bandit tasks.
We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting.
arXiv Detail & Related papers (2022-02-21T09:26:34Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\,C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantees about the quality of the information presented and is not responsible for any consequences of its use.