Multitask Multilingual Model Adaptation with Featurized Low-Rank
Mixtures
- URL: http://arxiv.org/abs/2402.17934v1
- Date: Tue, 27 Feb 2024 23:12:45 GMT
- Title: Multitask Multilingual Model Adaptation with Featurized Low-Rank
Mixtures
- Authors: Chu-Cheng Lin and Xinyi Wang and Jonathan H. Clark and Han Lu and Yun
Zhu and Chenxi Whitehouse and Hongkun Yu
- Abstract summary: Featurized Low-rank Mixtures (FLix) is a novel PEFT method designed for effective multitask multilingual tuning.
FLix associates each unique dataset feature, such as the dataset's language or task, with its own low-rank weight update parameters.
Our experiments show that FLix leads to significant improvements across a variety of tasks in both supervised learning and zero-shot settings.
- Score: 46.250932555711486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting pretrained large language models (LLMs) to various downstream tasks
in tens or hundreds of human languages is computationally expensive.
Parameter-efficient fine-tuning (PEFT) significantly reduces the adaptation
cost by tuning only a small number of parameters. However, directly applying
PEFT methods such as LoRA (Hu et al., 2022) on diverse dataset mixtures could
lead to suboptimal performance due to limited parameter capacity and negative
interference among different datasets. In this work, we propose Featurized
Low-rank Mixtures (FLix), a novel PEFT method designed for effective multitask
multilingual tuning. FLix associates each unique dataset feature, such as the
dataset's language or task, with its own low-rank weight update parameters. By
composing feature-specific parameters for each dataset, FLix can accommodate
diverse dataset mixtures and generalize better to unseen datasets. Our
experiments show that FLix leads to significant improvements across a variety of
tasks in both supervised learning and zero-shot settings, using different
training data mixtures.
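The abstract only describes FLix at a high level. The sketch below illustrates one way a featurized low-rank mixture layer could look in code: every dataset feature (a language, a task) owns its own LoRA-style low-rank pair, and the pairs of all features active for the current dataset are composed on top of a frozen base weight. The class name, the feature keys, and the additive composition are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a FLix-style featurized low-rank layer (hypothetical names).
import torch
import torch.nn as nn

class FLixLinear(nn.Module):
    def __init__(self, base: nn.Linear, features, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weights stay frozen
        d_out, d_in = base.weight.shape
        # One low-rank (A, B) pair per dataset feature, e.g. a language or a task.
        self.A = nn.ParameterDict(
            {f: nn.Parameter(0.01 * torch.randn(rank, d_in)) for f in features})
        self.B = nn.ParameterDict(
            {f: nn.Parameter(torch.zeros(d_out, rank)) for f in features})

    def forward(self, x, active_features):
        out = self.base(x)
        # Compose the updates of every feature describing the current dataset,
        # assumed here to be a simple sum of the feature-specific deltas.
        for f in active_features:
            out = out + x @ self.A[f].T @ self.B[f].T
        return out

layer = FLixLinear(nn.Linear(512, 512),
                   features=["lang_sw", "lang_de", "task_qa", "task_summ"])
y = layer(torch.randn(2, 512), active_features=["lang_sw", "task_qa"])
```

Because the B matrices start at zero, the composed update is initially a no-op, mirroring common LoRA initialization; an unseen dataset can reuse any subset of the trained feature parameters, which matches the zero-shot generalization described in the abstract.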
Related papers
- Rethinking Data: Towards Better Performing Domain-Specific Small Language Models [0.0]
This paper presents an approach to fine-tuning a small Language Model (LM).
We achieve this by improving data quality at each stage of the LM training pipeline.
We improve the model's generalization ability by merging models fine-tuned with different parameters on different data subsets.
arXiv Detail & Related papers (2025-03-03T12:19:12Z)
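The entry above mentions merging models fine-tuned on different data subsets but does not describe the merging procedure. The snippet below shows only the common baseline of uniform weight averaging over checkpoints, as an illustration; the file names are hypothetical.

```python
# Generic checkpoint-averaging sketch; uniform averaging is an illustrative
# baseline, not necessarily the merging method used in the paper above.
import torch

def average_checkpoints(state_dicts):
    """Average state_dicts from models that share the same architecture."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Hypothetical usage:
# merged = average_checkpoints([torch.load(p) for p in ["ft_subset_a.pt", "ft_subset_b.pt"]])
```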
- Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding [2.379669478864599]
Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes.
We propose Swift Cross-Dataset Pruning (SCDP), which uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance.
Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales.
arXiv Detail & Related papers (2025-01-05T03:52:04Z)
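The SCDP entry above names its two ingredients, TF-IDF features and a geometric median, without giving the full procedure. One plausible reading, sketched below, is to embed every example with TF-IDF, estimate the geometric median with Weiszfeld iterations, and rank examples by distance to it; how the ranking is turned into a pruned subset is left open here.

```python
# Hypothetical sketch: TF-IDF embeddings plus a geometric median for sample scoring.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def geometric_median(X, iters=100, eps=1e-8):
    """Approximate the geometric median of the rows of X with Weiszfeld's algorithm."""
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1)
        w = 1.0 / np.maximum(d, eps)
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

texts = ["translate this sentence", "answer the question", "summarize the article"]
X = TfidfVectorizer().fit_transform(texts).toarray()
median = geometric_median(X)
scores = np.linalg.norm(X - median, axis=1)   # distance of each sample to the median
ranked = np.argsort(scores)                   # ranking from which a subset is selected
```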
- SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [30.03925858123481]
We propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional next-token prediction (NTP) paradigm.
Based on training dynamics, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process.
This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks.
arXiv Detail & Related papers (2024-10-07T17:52:21Z)
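The SFTMix entry above argues that examples of different confidence should play distinct roles and that a Mixup-style recipe exploits this; the exact recipe is not in the summary. The sketch below is a generic stand-in: score each example by model confidence, pair confident with unconfident examples, and draw Beta-distributed weights for interpolating their representations and losses.

```python
# Generic confidence-paired Mixup sketch; the pairing rule and Beta prior are
# illustrative assumptions, not SFTMix's published recipe.
import torch

def mixup_pairs(confidence, alpha=0.2):
    """Pair low-confidence with high-confidence examples and sample mixing weights."""
    idx = torch.argsort(confidence).tolist()        # low -> high confidence
    pairs = list(zip(idx, idx[::-1]))               # least confident with most confident
    lam = torch.distributions.Beta(alpha, alpha).sample((len(pairs),))
    return pairs, lam

# The mixed training signal would then interpolate hidden states and losses, e.g.
# mixed_h = lam * h[i] + (1 - lam) * h[j] and loss = lam * loss_i + (1 - lam) * loss_j.
confidence = torch.tensor([0.1, 0.9, 0.4, 0.7])     # e.g. mean token log-prob per example
pairs, lam = mixup_pairs(confidence)
```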
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
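The framework entry above lists two steps: unifying heterogeneous feedback into a single supervision format usable by SFT or RLHF, and then selecting a high-quality, diverse subset. The snippet below sketches only the first step with made-up record schemas; the paper's actual schema and selection criteria are not given in the summary.

```python
# Hypothetical sketch: normalizing mixed feedback records into one preference format.
def to_preference(record):
    """Map rating- or comparison-style feedback to a (prompt, chosen, rejected) triple."""
    if record["type"] == "comparison":               # already pairwise
        return {"prompt": record["prompt"],
                "chosen": record["preferred"],
                "rejected": record["other"]}
    if record["type"] == "rating":                   # keep only clear wins
        best = max(record["responses"], key=lambda r: r["score"])
        worst = min(record["responses"], key=lambda r: r["score"])
        if best["score"] > worst["score"]:
            return {"prompt": record["prompt"],
                    "chosen": best["text"], "rejected": worst["text"]}
    return None                                      # drop ambiguous records

raw_feedback = [
    {"type": "comparison", "prompt": "Explain LoRA.", "preferred": "...", "other": "..."},
    {"type": "rating", "prompt": "Summarize the report.", "responses": [
        {"text": "Concise summary", "score": 5}, {"text": "Off-topic reply", "score": 2}]},
]
unified = [p for p in map(to_preference, raw_feedback) if p is not None]
```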
- Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes [53.4856038354195]
Pre-trained large language models (LLMs) need fine-tuning to improve their responsiveness to natural language instructions.
FedKSeed employs zeroth-order optimization with a finite set of random seeds.
It significantly reduces transmission requirements between the server and clients to just a few random seeds.
arXiv Detail & Related papers (2023-12-11T13:03:21Z)
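The entry above says FedKSeed uses zeroth-order optimization over a finite pool of random seeds, so that the server and clients exchange little more than seeds. The toy example below illustrates the general seed-plus-scalar idea on a quadratic loss: a direction is regenerated from a seed, a scalar directional-derivative estimate is computed, and the update can be replayed anywhere from just the (seed, scalar) pair. It is a conceptual sketch, not FedKSeed's actual algorithm.

```python
# Toy zeroth-order update driven by (seed, scalar) pairs -- illustrative only.
import numpy as np

def loss(w):                        # stand-in for a client's local training loss
    return float(np.sum((w - 1.0) ** 2))

def client_step(w, seed, eps=1e-3):
    """Estimate the directional derivative along a seed-generated random direction."""
    z = np.random.default_rng(seed).standard_normal(w.shape)
    return (loss(w + eps * z) - loss(w - eps * z)) / (2 * eps)   # scalar to transmit

def apply_step(w, seed, grad_scalar, lr=0.05):
    """Replay the update from just the seed and the scalar estimate."""
    z = np.random.default_rng(seed).standard_normal(w.shape)
    return w - lr * grad_scalar * z

w = np.zeros(4)
for seed in [11, 42, 7, 99] * 50:   # finite pool of candidate seeds, reused across steps
    w = apply_step(w, seed, client_step(w, seed))
```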
- SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models [28.764782216513037]
Federated Learning (FL) can benefit from the distributed and private data of FL edge clients for fine-tuning.
We propose a method called SLoRA, which overcomes the key limitations of LoRA in highly heterogeneous data scenarios.
Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning.
arXiv Detail & Related papers (2023-08-12T10:33:57Z)
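The SLoRA entry above does not detail how it copes with heterogeneous client data, so the snippet below only shows the plain baseline it improves on: clients train local LoRA adapters and a server combines them with FedAvg-style weighted averaging. SLoRA's specific remedy is not represented here.

```python
# Baseline sketch: FedAvg over per-client LoRA adapters (not SLoRA's actual method).
import torch

def fedavg_lora(client_adapters, client_sizes):
    """Average per-client LoRA state_dicts, weighted by local dataset size."""
    total = float(sum(client_sizes))
    merged = {}
    for key in client_adapters[0]:
        merged[key] = sum((n / total) * sd[key].float()
                          for sd, n in zip(client_adapters, client_sizes))
    return merged

# Example with two clients holding rank-4 adapters for a 512x512 layer.
make_adapter = lambda: {"lora_A": torch.randn(4, 512), "lora_B": torch.zeros(512, 4)}
merged = fedavg_lora([make_adapter(), make_adapter()], client_sizes=[1200, 300])
```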
- AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning [112.97430455461097]
We propose a general PEFT method that tunes a mixture of adaptation modules introduced in each Transformer layer while keeping most of the PLM weights frozen.
By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.
arXiv Detail & Related papers (2022-10-31T16:23:36Z)
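The AdaMix entry above says each Transformer layer carries a mixture of adaptation modules while the pretrained weights stay frozen. The sketch below is one plausible reading: several bottleneck adapters per layer, one picked at random per training step and averaged at inference. The routing and merging choices are assumptions, not AdaMix's exact recipe.

```python
# Mixture-of-adapters sketch with assumed stochastic routing and inference-time averaging.
import random
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    def __init__(self, d_model=512, bottleneck=16, num_adapters=4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, d_model))
            for _ in range(num_adapters))

    def forward(self, h):
        if self.training:                            # route to one adapter per step
            return h + random.choice(self.adapters)(h)
        # At inference, average the adapter outputs (one possible merging choice).
        return h + torch.stack([a(h) for a in self.adapters]).mean(dim=0)

mix = AdapterMixture()
out = mix(torch.randn(2, 10, 512))                   # (batch, seq_len, d_model)
```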
- Attentional Mixtures of Soft Prompt Tuning for Parameter-efficient Multi-task Knowledge Sharing [53.399742232323895]
ATTEMPT is a new modular, multi-task, and parameter-efficient language model (LM) tuning approach.
It combines knowledge transferred across different tasks via a mixture of soft prompts while keeping the original LM unchanged.
It is parameter-efficient (e.g., updates 1,600 times fewer parameters than fine-tuning) and enables multi-task learning and flexible extensions.
arXiv Detail & Related papers (2022-05-24T10:48:33Z)
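The ATTEMPT entry above describes mixing soft prompts learned on different tasks while the underlying LM stays unchanged. The sketch below shows a generic attention-weighted mixture of pretrained prompt embeddings prepended to the input sequence; the scoring function and shapes are assumptions for illustration.

```python
# Generic attentional mixture of source-task soft prompts (illustrative only).
import torch
import torch.nn as nn

class PromptMixture(nn.Module):
    def __init__(self, source_prompts, d_model=512):
        super().__init__()
        # source_prompts: (num_prompts, prompt_len, d_model), e.g. soft prompts
        # previously trained on several source tasks.
        self.prompts = nn.Parameter(source_prompts)
        self.query = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model) embeddings
        q = self.query(x.mean(dim=1))            # one query vector per example
        keys = self.prompts.mean(dim=1)          # one key per source prompt
        attn = torch.softmax(q @ keys.T, dim=-1)              # (batch, num_prompts)
        mixed = torch.einsum("bp,pld->bld", attn, self.prompts)
        return torch.cat([mixed, x], dim=1)      # prepend the instance-specific prompt

mix = PromptMixture(torch.randn(3, 20, 512))
inputs_with_prompt = mix(torch.randn(2, 16, 512))
```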
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate that PST performs on par with or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
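The PST entry above says only that the number of trainable parameters is reduced during sparse-aware training. The snippet below is therefore a loose illustration of one cheap route to sparsity scores: combine a data-free magnitude term with a small trainable low-rank term and mask the lowest-scoring weights. It should not be read as PST's published formulation.

```python
# Loose illustration of low-rank importance scoring for sparse fine-tuning
# (assumption-based sketch, not PST's published formulation).
import torch
import torch.nn as nn

class LowRankImportance(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8, sparsity: float = 0.5):
        super().__init__()
        d_out, d_in = weight.shape
        self.register_buffer("weight", weight)                   # frozen pretrained weight
        self.U = nn.Parameter(0.01 * torch.randn(d_out, rank))   # small trainable
        self.V = nn.Parameter(0.01 * torch.randn(rank, d_in))    # low-rank factors
        self.sparsity = sparsity

    def masked_weight(self):
        score = self.weight.abs() + self.U @ self.V   # data-free term + learned term
        k = int(self.sparsity * score.numel())
        threshold = score.flatten().kthvalue(k).values
        return self.weight * (score > threshold)      # zero out the lowest-scoring weights

layer = LowRankImportance(torch.randn(256, 256))
w_sparse = layer.masked_weight()
```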
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.