MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models
- URL: http://arxiv.org/abs/2502.11513v1
- Date: Mon, 17 Feb 2025 07:28:52 GMT
- Title: MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models
- Authors: Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
- Abstract summary: We present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization.
MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters.
Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
- Score: 26.980104922985326
- License:
- Abstract: Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
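The two components named in the abstract, a weight-importance metric and a selective multi-task update mask applied on top of zeroth-order optimization, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the magnitude-based importance proxy, the toy two-task regression loss, and every hyperparameter are illustrative assumptions. It only shows the mechanics of restricting both the random perturbation and the resulting update of a two-point (SPSA-style) ZO step to masked parameters, so no backpropagation is ever taken.
```python
# Minimal sketch of a masked zeroth-order update, in the spirit of MaZO's idea of
# updating only "important" parameters. Illustrative only; not the paper's method.
import torch

def zo_loss(params, x, y):
    """Toy two-task loss: two regression heads sharing a single weight matrix."""
    pred = x @ params["w"]                     # (batch, 2): one column per task
    loss_a = (pred[:, 0] - y[:, 0]).pow(2).mean()
    loss_b = (pred[:, 1] - y[:, 1]).pow(2).mean()
    return loss_a + loss_b

def build_mask(params, keep_ratio=0.5):
    """Importance proxy: keep the top fraction of entries by absolute magnitude.
    (MaZO's actual importance metric is more involved; this is a stand-in.)"""
    masks = {}
    for name, p in params.items():
        flat = p.abs().flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        masks[name] = (p.abs() >= threshold).float()
    return masks

def masked_zo_step(params, masks, loss_fn, x, y, eps=1e-3, lr=1e-2):
    """One two-point (SPSA-style) zeroth-order step, restricted to masked entries."""
    z = {n: torch.randn_like(p) * masks[n] for n, p in params.items()}  # masked direction
    loss_plus = loss_fn({n: p + eps * z[n] for n, p in params.items()}, x, y)
    loss_minus = loss_fn({n: p - eps * z[n] for n, p in params.items()}, x, y)
    g = (loss_plus - loss_minus) / (2 * eps)   # scalar directional-derivative estimate
    # Only masked parameters move; no backward pass is needed anywhere.
    return {n: p - lr * g * z[n] for n, p in params.items()}

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(64, 8)
    y = x @ torch.randn(8, 2)                  # synthetic targets for two tasks
    params = {"w": torch.randn(8, 2)}
    masks = build_mask(params)
    for _ in range(500):
        params = masked_zo_step(params, masks, zo_loss, x, y)
    print("final loss:", zo_loss(params, x, y).item())
```
In a real LLM setting the same pattern would be applied per weight matrix, with the mask derived from a task-aware importance score rather than raw magnitude.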
Related papers
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning.
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z)
- No More Tuning: Prioritized Multi-Task Learning with Lagrangian Differential Multiplier Methods [10.725249826517734]
In web search, relevance is often prioritized over other metrics, such as click-through rates or user engagement.
Existing frameworks pay insufficient attention to the prioritization among different tasks.
We introduce a novel multi-task learning framework employing Lagrangian Differential Multiplier Methods for step-wise multi-task optimization.
arXiv Detail & Related papers (2024-12-16T18:58:28Z)
- Beyond Task Vectors: Selective Task Arithmetic Based on Importance Metrics [0.0]
This paper introduces Selective Task Arithmetic (STA), a training-free framework designed to enhance multi-task performance through task-specific parameter fusion; the underlying task-vector update is sketched after this entry.
Experimental results demonstrate that STA achieves superior multi-task performance across benchmarks and excellent performance in task forgetting.
arXiv Detail & Related papers (2024-11-25T06:59:16Z)
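For context, the standard task-arithmetic merge that STA builds on (the importance-based selection is specific to the paper and not reproduced here) combines T fine-tuned models into one multi-task model as
  \theta_{\text{merged}} = \theta_{\text{pre}} + \sum_{t=1}^{T} \lambda_t \,(\theta_t - \theta_{\text{pre}}),
where \theta_t is the model fine-tuned on task t and \lambda_t a scaling coefficient; STA further selects, within each task vector, which parameters actually get fused.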
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance.
LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
- Multi-Objective Optimization for Sparse Deep Multi-Task Learning [0.0]
We present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs); the standard form of this scalarization is sketched after this entry.
Our work aims to address the economic and ecological sustainability of DNN models, with a particular focus on deep multi-task models.
arXiv Detail & Related papers (2023-08-23T16:42:27Z)
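For reference, the textbook weighted Chebyshev scalarization (the paper's modified variant is not reproduced here) collapses m task losses f_i into a single objective
  \min_{\theta} \; \max_{i \in \{1,\dots,m\}} \; w_i \bigl| f_i(\theta) - z_i^{*} \bigr|,
with task weights w_i > 0 and a reference (ideal) point z^*; sweeping the weights recovers different trade-offs on the Pareto front.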
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- In Defense of the Unitary Scalarization for Deep Multi-Task Learning [121.76421174107463]
We present a theoretical analysis suggesting that many specialized multi-task optimizers can be interpreted as forms of regularization.
We show that, when coupled with standard regularization and stabilization techniques, unitary scalarization (written out after this entry) matches or improves upon the performance of complex multi-task optimizers.
arXiv Detail & Related papers (2022-01-11T18:44:17Z)
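For clarity, unitary scalarization here simply means minimizing the unweighted sum of per-task losses,
  \min_{\theta} \sum_{i=1}^{m} L_i(\theta),
with no task-specific weighting or gradient manipulation.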
- Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in the literature.
First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting previously learned ones (incremental learning).
Second, eliminating adverse interactions among tasks, which have been shown to significantly degrade single-task performance in a multi-task setup (task interference).
arXiv Detail & Related papers (2020-07-24T14:44:46Z)