Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity
- URL: http://arxiv.org/abs/2409.06091v1
- Date: Mon, 9 Sep 2024 21:59:27 GMT
- Title: Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity
- Authors: Dongyue Li, Aneesh Sharma, Hongyang R. Zhang
- Abstract summary: Grad-TAG can estimate task affinities without repeatedly training on data from various task combinations.
We show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
- Score: 16.643892206707854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeated training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression that predicts labels for the task combination. We prove that the linearized model approximates the loss whenever the gradient-based approximation is accurate, and we verify this empirically on several large models. Given the estimated task affinity, we then design a semi-definite program that clusters similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% of the true affinities while requiring only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% of the true affinities using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
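To make the linearization concrete, here is a minimal sketch of the gradient-projection-plus-logistic-regression step, assuming per-example gradients of the base model are available as rows of a matrix; the function names, projection dimension, and use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Grad-TAG's linearization: approximate the loss of a task
# combination by fitting a logistic regression on low-dimensional random
# projections of per-example gradients of the shared "base" model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

def project(grads: np.ndarray, dim: int = 128) -> np.ndarray:
    """Random projection of per-example gradients (n_examples x n_params)."""
    proj = rng.normal(size=(grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ proj

def estimate_loss(grads: np.ndarray, labels: np.ndarray, subset: np.ndarray) -> float:
    """Estimate the loss the base model would reach if trained on the task
    subset, without retraining it, via the linearized (logistic) model."""
    feats = project(grads[subset])
    clf = LogisticRegression(max_iter=1000).fit(feats, labels[subset])
    return log_loss(labels[subset], clf.predict_proba(feats))
```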
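The clustering stage can be written schematically as a semi-definite relaxation. The program below is a plausible reading of "maximizing the average density of clusters" (the size bound s is an assumed hyperparameter), not a verbatim transcription of the paper's formulation.

```latex
% A: estimated task-affinity matrix; Y: relaxed 0/1 co-membership matrix.
% Rounding Y after solving recovers the task groups.
\begin{aligned}
\max_{Y \in \mathbb{R}^{n \times n}} \quad & \langle A,\, Y \rangle \\
\text{s.t.} \quad & Y \succeq 0, \qquad Y_{ii} = 1, \qquad Y_{ij} \ge 0, \\
& \textstyle\sum_{j} Y_{ij} \le s \quad \text{for every task } i.
\end{aligned}
```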
Related papers
- ATM: Improving Model Merging by Alternating Tuning and Merging [16.12778778313037]
We motivate the effectiveness of task vectors by linking them to multi-task gradients.
In a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting.
We show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient.
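The single-epoch claim can be made explicit. Under the simplifying assumption of one full-batch gradient step of size \eta per task from the shared initialization \theta_0, each task vector is a scaled negative gradient, so summing task vectors performs one multi-task gradient step:

```latex
\tau_k = \theta_k^{\mathrm{ft}} - \theta_0 = -\eta\, \nabla L_k(\theta_0)
\quad \Longrightarrow \quad
\theta_0 + \sum_k \tau_k = \theta_0 - \eta \sum_k \nabla L_k(\theta_0).
```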
arXiv Detail & Related papers (2024-11-05T12:42:42Z)
- Decoupling Weighing and Selecting for Integrating Multiple Graph Pre-training Tasks [58.65410800008769]
This paper proposes Weigh And Select (WAS), a novel instance-level framework for integrating multiple graph pre-training tasks.
It first adaptively learns an optimal combination of tasks for each instance from a given task pool, based on which a customized instance-level task weighing strategy is learned.
Experiments on 16 graph datasets across node-level and graph-level downstream tasks have demonstrated that WAS can achieve comparable performance to other leading counterparts.
arXiv Detail & Related papers (2024-03-03T05:29:49Z)
- Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning [39.4348419684885]
Multi-task learning (MTL) aims at learning a single model that solves several tasks efficiently.
We introduce a novel gradient aggregation approach using Bayesian inference.
We empirically demonstrate the benefits of our approach in a variety of datasets.
arXiv Detail & Related papers (2024-02-06T14:00:43Z)
- Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z)
- Boosting Multitask Learning on Graphs through Higher-Order Task Affinities [17.70434437597516]
Predicting node labels on a given graph is a widely studied problem with many applications, including community detection and molecular graph prediction.
This paper considers predicting multiple node labeling functions on graphs simultaneously and revisits this problem from a multitask learning perspective.
We develop an algorithm to cluster tasks into groups based on a higher-order task affinity measure.
arXiv Detail & Related papers (2023-06-24T15:53:38Z)
- Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
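A minimal sketch of the one-step matching idea follows, assuming generic feature tensors stand in for the graph inputs and cosine distance compares the gradients; both are illustrative choices, not necessarily the paper's.

```python
# Sketch of one-step gradient matching: update the synthetic data so that a
# freshly initialized network produces the same gradients on it as on the
# real data; the network weights themselves are never trained.
import torch
import torch.nn.functional as F

def one_step_gradient_match(model, real_x, real_y, syn_x, syn_y,
                            lr: float = 0.01, iters: int = 200):
    # syn_x must be a leaf tensor with requires_grad=True.
    opt = torch.optim.Adam([syn_x], lr=lr)
    params = list(model.parameters())
    # Gradients on real data are fixed, since the model is never updated.
    g_real = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), params)
    for _ in range(iters):
        g_syn = torch.autograd.grad(
            F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)
        # Cosine distance between the two gradient sets, layer by layer.
        loss = sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
                   for a, b in zip(g_real, g_syn))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return syn_x.detach()
```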
arXiv Detail & Related papers (2022-06-15T18:20:01Z)
- Conflict-Averse Gradient Descent for Multi-task Learning [56.379937772617]
A major challenge in optimizing a multi-task model is the conflicting gradients.
We introduce Conflict-Averse Gradient descent (CAGrad), which minimizes the average loss function.
CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss.
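Schematically, CAGrad picks an update direction close to the average gradient g_0 while protecting the worst-off task; up to notation, its objective is:

```latex
d^{\star} = \arg\max_{d}\ \min_{i}\ \langle g_i,\, d \rangle
\quad \text{s.t.} \quad \lVert d - g_0 \rVert \le c\, \lVert g_0 \rVert,
\qquad \theta \leftarrow \theta - \alpha\, d^{\star},
```

where g_i are the per-task gradients and c in [0, 1) trades off conflict aversion against plain gradient averaging.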
arXiv Detail & Related papers (2021-10-26T22:03:51Z)
- EGL++: Extending Expected Gradient Length to Active Learning for Human Pose Estimation [2.0305676256390934]
State-of-the-art human pose estimation models rely on large quantities of labelled data for robust performance.
EGL++ is a novel algorithm that extends expected gradient length to tasks where discrete labels are not available.
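For reference, classic expected gradient length scores an unlabeled example by the expected gradient norm under the model's own predictive distribution; EGL++ replaces the sum over discrete labels with proxies suited to pose estimation. The sketch below shows only the classic discrete-label version, with illustrative names and a batch-of-one assumption.

```python
# Sketch of classic expected gradient length (EGL) for a classifier.
import torch
import torch.nn.functional as F

def expected_gradient_length(model, x: torch.Tensor) -> float:
    """Expected norm of the loss gradient over the model's own predictive
    distribution; higher scores mark more informative unlabeled examples."""
    params = list(model.parameters())
    probs = F.softmax(model(x), dim=-1).squeeze(0)  # x is a batch of one
    score = 0.0
    for y, p in enumerate(probs):
        loss = F.cross_entropy(model(x), torch.tensor([y]))
        grads = torch.autograd.grad(loss, params)
        score += p.item() * torch.cat([g.flatten() for g in grads]).norm().item()
    return score
```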
arXiv Detail & Related papers (2021-04-19T17:56:59Z)
- Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions so that they learn in their task-specific domains while staying close to each other.
This facilitates cross-fertilization in which data collected across different domains help improving the learning performance at each other task.
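One standard way to write this coupling (a sketch consistent with the summary, not necessarily the paper's exact objective) is a proximity penalty tying each task's parameters \theta_i to a shared center \bar\theta:

```latex
\min_{\{\theta_i\},\, \bar\theta}\ \sum_{i=1}^{T} L_i(\theta_i)
\;+\; \frac{\lambda}{2} \sum_{i=1}^{T} \lVert \theta_i - \bar\theta \rVert^2 .
```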
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
- Adaptive Task Sampling for Meta-Learning [79.61146834134459]
The key idea of meta-learning for few-shot classification is to mimic the few-shot situations faced at test time.
We propose an adaptive task sampling method to improve the generalization performance.
arXiv Detail & Related papers (2020-07-17T03:15:53Z)
- Conditional Channel Gated Networks for Task-Aware Continual Learning [44.894710899300435]
Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems.
We introduce a novel framework to tackle this problem with conditional computation.
We validate our proposal on four continual learning datasets.
arXiv Detail & Related papers (2020-03-31T19:35:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.