Scalable Bilevel Loss Balancing for Multi-Task Learning
- URL: http://arxiv.org/abs/2502.08585v1
- Date: Wed, 12 Feb 2025 17:18:14 GMT
- Title: Scalable Bilevel Loss Balancing for Multi-Task Learning
- Authors: Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji
- Abstract summary: Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. We propose BiLB4MTL, a simple and scalable loss balancing approach for MTL. BiLB4MTL achieves state-of-the-art performance in both accuracy and efficiency.
- Score: 30.689399230097667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose BiLB4MTL, a simple and scalable loss balancing approach for MTL, formulated from a novel bilevel optimization perspective. Our method incorporates three key components: (i) an initial loss normalization, (ii) a bilevel loss-balancing formulation, and (iii) a scalable first-order algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that BiLB4MTL guarantees convergence not only to a stationary point of the bilevel loss balancing problem but also to an $\epsilon$-accurate Pareto stationary point for all $K$ loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate that BiLB4MTL achieves state-of-the-art performance in both accuracy and efficiency. Code is available at https://github.com/OptMN-Lab/-BiLB4MTL.
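To make the three components concrete, here is a minimal, hedged sketch of what an $\mathcal{O}(1)$ loss-balancing training step can look like in PyTorch. The initial-loss normalization mirrors component (i) of the abstract; the softmax weighting below is only an illustrative stand-in for the paper's bilevel loss-balancing update, and every function and argument name is an assumption rather than the released code's API (see the linked repository for the actual implementation).

```python
import torch

def balanced_mtl_step(task_losses_fn, model, batch, init_losses, temperature=1.0):
    """One illustrative O(1) loss-balancing step (not the BiLB4MTL update itself).

    `task_losses_fn(model, batch)` is assumed to return a list of K scalar task
    losses; `init_losses` holds each task's loss at iteration 0 so current losses
    can be normalized onto a comparable scale.
    """
    losses = task_losses_fn(model, batch)                          # K scalar tensors
    norm = torch.stack([l / l0 for l, l0 in zip(losses, init_losses)])
    # Weights are computed from detached normalized losses: O(1) extra time and
    # memory, and no per-task backward pass is needed.
    weights = torch.softmax(norm.detach() / temperature, dim=0)
    total = (weights * norm).sum()
    total.backward()                                               # single backward pass
    return total.item(), weights.tolist()
```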
Related papers
- Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization [49.2338910653152]
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning. We propose $\mathbf{\texttt{DHO}}$ -- a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings.
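For intuition, a dual-head student of the kind the abstract hints at might pair a supervised head with a distillation head on a shared backbone. The sketch below is an assumption-laden illustration: the backbone, head layout, and the CE/KL loss mix are my choices, not necessarily DHO's.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Illustrative two-head student for semi-supervised KD from a VLM."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.sup_head = nn.Linear(feat_dim, num_classes)   # trained on labeled data
        self.kd_head = nn.Linear(feat_dim, num_classes)    # distilled from the VLM teacher

    def forward(self, x):
        z = self.backbone(x)
        return self.sup_head(z), self.kd_head(z)

def dual_head_loss(sup_logits, kd_logits, labels, teacher_probs, tau=2.0, alpha=0.5):
    """Mix a supervised CE term with a temperature-scaled KD term (generic recipe)."""
    ce = F.cross_entropy(sup_logits, labels)
    kd = F.kl_div(F.log_softmax(kd_logits / tau, dim=-1),
                  teacher_probs, reduction="batchmean") * tau ** 2
    return (1 - alpha) * ce + alpha * kd
```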
arXiv Detail & Related papers (2025-05-12T15:39:51Z) - Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a $\textbf{MA-OSMA}$ algorithm to transform the discrete submodular problem into a continuous optimization. We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Our algorithms significantly improve the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models [23.50705152648991]
Multi-task learning (MTL) benefits the fine-tuning of large language models (LLMs).
Existing MTL strategies for LLMs often fall short by either being computationally intensive or failing to ensure simultaneous task convergence.
This paper presents CoBa, a new MTL approach designed to effectively manage task convergence balance with minimal computational overhead.
arXiv Detail & Related papers (2024-10-09T10:20:32Z) - MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of gradient norm.
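One common way to formalize generalized $\ell$-smoothness is given below; this is stated as an assumption for illustration, and the paper's exact definition may differ in its local radius and constants.

```latex
% Generalized $\ell$-smoothness, one common formalization:
% for a non-decreasing $\ell : [0,\infty) \to (0,\infty)$,
\[
  \|\nabla f(x) - \nabla f(y)\| \;\le\; \ell\big(\|\nabla f(x)\|\big)\,\|x - y\|
  \qquad \text{for all } x, y \text{ with } \|x - y\| \text{ sufficiently small,}
\]
% which reduces to standard $L$-smoothness when $\ell(\cdot) \equiv L$ and covers
% $(L_0, L_1)$-smoothness when $\ell(u) = L_0 + L_1 u$.
```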
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A 3D registration approach can process more than ten million ($10^7$) point pairs with over $99\%$ random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
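As a rough illustration of the loss family named in the title (not TEAR's exact objective or solver), a truncated entry-wise absolute residual can be computed as follows.

```python
import numpy as np

def truncated_abs_residual_loss(residuals, c):
    """Sum of min(|r_i|, c) over residual entries.

    Entries whose absolute residual exceeds the truncation level c contribute
    only c, so gross outliers cannot dominate the objective. Generic sketch,
    not TEAR's exact formulation.
    """
    r = np.asarray(residuals, dtype=float)
    return np.minimum(np.abs(r), float(c)).sum()
```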
arXiv Detail & Related papers (2024-04-01T04:43:39Z) - A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z) - Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMs [65.42104819071444]
Multitask learning (MTL) leverages task-relatedness to enhance performance.
We employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices.
We propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least squares support vector machines (LSSVMs).
arXiv Detail & Related papers (2023-08-30T14:28:26Z) - Distributed Extra-gradient with Optimal Complexity and Communication
Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
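For reference, the classical (uncompressed) extra-gradient iteration that Q-GenX builds on looks as follows; the quantized, multi-worker communication that is the paper's contribution is not shown.

```python
import numpy as np

def extragradient(F, x0, step, iters):
    """Classical extra-gradient for a monotone operator F: R^d -> R^d.

    Each iteration takes a look-ahead (extrapolation) step and then updates
    using the operator evaluated at the look-ahead point. Q-GenX layers
    compressed communication on top of updates of this kind.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_half = x - step * F(x)        # extrapolation step
        x = x - step * F(x_half)        # update with the look-ahead operator value
    return x
```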
arXiv Detail & Related papers (2023-08-17T21:15:04Z) - Improvable Gap Balancing for Multi-Task Learning [15.582333026781004]
We propose two novel improvable gap balancing (IGB) algorithms for multi-task learning (MTL).
One takes a simple heuristic approach, and the other (for the first time) deploys deep reinforcement learning for MTL.
Our IGB algorithms lead to the best results in MTL via loss balancing and achieve further improvements when combined with gradient balancing.
arXiv Detail & Related papers (2023-07-28T09:26:03Z) - FAMO: Fast Adaptive Multitask Optimization [48.59232177073481]
We introduce Fast Adaptive Multitask Optimization (FAMO), a dynamic weighting method that decreases task losses in a balanced way.
Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques.
arXiv Detail & Related papers (2023-06-06T15:39:54Z) - Independent Component Alignment for Multi-Task Learning [2.5234156040689237]
In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly.
We propose using the condition number of a linear system of gradients as a stability criterion for MTL optimization.
We present Aligned-MTL, a novel MTL optimization approach based on the proposed criterion.
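A minimal sketch of the stability criterion itself (the alignment procedure that Aligned-MTL builds on top of it is not shown; the helper below and its inputs are assumptions for illustration):

```python
import torch

def gradient_condition_number(task_grads):
    """Condition number of the stacked task-gradient matrix.

    `task_grads` is assumed to be a list of K flattened per-task gradient
    vectors of equal length. A large ratio between the largest and smallest
    singular values signals an ill-conditioned, unstable gradient system,
    which is the kind of criterion the abstract describes.
    """
    G = torch.stack(task_grads)                    # K x d gradient matrix
    s = torch.linalg.svdvals(G)                    # singular values, descending
    return (s[0] / s[-1].clamp(min=1e-12)).item()
```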
arXiv Detail & Related papers (2023-05-30T12:56:36Z) - Blockwise Stochastic Variance-Reduced Methods with Parallel Speedup for
Multi-Block Bilevel Optimization [43.74656748515853]
Non-stationary multi-block bilevel optimization problems involve $m \gg 1$ lower-level problems and have important applications in machine learning.
We aim to achieve three properties for our algorithm: (a) matching the state-of-the-art complexity of standard BO problems with a single block; (b) achieving parallel speedup by sampling $I$ samples for each sampled block per iteration; and (c) avoiding the computation of the inverse of a high-dimensional Hessian matrix estimator.
arXiv Detail & Related papers (2023-05-30T04:10:11Z) - Probabilistic Contrastive Learning for Domain Adaptation [42.33633916857581]
Contrastive learning has shown impressive success in enhancing feature discriminability for various visual tasks in a self-supervised manner.
The standard contrastive paradigm (features + $\ell_2$ normalization) has limited benefits when applied in domain adaptation.
We propose Probabilistic Contrastive Learning (PCL), which moves beyond the standard paradigm by removing $\ell_2$ normalization and replacing the features with probabilities.
PCL can guide the probability distribution towards a one-hot configuration, thus minimizing the discrepancy between features and class weights.
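To make the "probabilities instead of $\ell_2$-normalized features" idea concrete, here is a hedged InfoNCE-style sketch; PCL's actual objective and pairing scheme may differ, and every name below is illustrative.

```python
import torch
import torch.nn.functional as F

def probabilistic_contrastive_loss(logits_q, logits_k, temperature=0.1):
    """InfoNCE-style loss computed on class probabilities, not raw features.

    Softmax probabilities replace l2-normalized features, so pulling matched
    pairs together also pushes their distributions toward sharp, near one-hot
    configurations, in the spirit of the abstract. The i-th query is treated
    as matching the i-th key.
    """
    p_q = F.softmax(logits_q, dim=-1)
    p_k = F.softmax(logits_k, dim=-1)
    sim = (p_q @ p_k.t()) / temperature                  # pairwise distribution similarities
    targets = torch.arange(p_q.size(0), device=p_q.device)
    return F.cross_entropy(sim, targets)
```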
arXiv Detail & Related papers (2021-11-11T02:08:07Z) - SLAW: Scaled Loss Approximate Weighting for Efficient Multi-Task
Learning [0.0]
Multi-task learning (MTL) is a subfield of machine learning with important applications.
The best MTL optimization methods require individually computing the gradient of each task's loss function.
We propose Scaled Loss Approximate Weighting (SLAW), a method for multi-task optimization that matches the performance of the best existing methods while being much more efficient.
arXiv Detail & Related papers (2021-09-16T20:58:40Z) - BiAdam: Fast Adaptive Bilevel Optimization Methods [104.96004056928474]
Bilevel optimization has attracted increased interest in machine learning due to its many applications.
We provide a useful analysis framework for both the constrained and unconstrained optimization.
arXiv Detail & Related papers (2021-06-21T20:16:40Z) - Sample Efficient Linear Meta-Learning by Alternating Minimization [74.40553081646995]
We study a simple alternating minimization method (MLLAM) which alternately learns the low-dimensional subspace and the regressors.
We show that for a constant subspace dimension, MLLAM obtains nearly-optimal estimation error, despite requiring only $\Omega(\log d)$ samples per task.
We propose a novel task subset selection scheme that ensures the same strong statistical guarantee as MLLAM.
arXiv Detail & Related papers (2021-05-18T06:46:48Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in its iterations caused by inexact solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.