Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets
- URL: http://arxiv.org/abs/2505.21930v1
- Date: Wed, 28 May 2025 03:27:08 GMT
- Title: Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets
- Authors: Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang
- Abstract summary: Existing methods, such as quantized LoRA, are efficient when adapting to a single dataset. We propose an ensemble of multiple smaller adapters instead of a single adapter per task. Our approach provides up to $10\%$ higher average test accuracy over QLoRA, with only $9\%$ more FLOPs.
- Score: 17.79010397902909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper develops an ensemble method for fine-tuning a language model to multiple datasets. Existing methods, such as quantized LoRA (QLoRA), are efficient when adapting to a single dataset. When training on multiple datasets of different tasks, a common setup in practice, it remains unclear how to design an efficient adaptation for fine-tuning language models. We propose to use an ensemble of multiple smaller adapters instead of a single adapter per task. We design an efficient algorithm that partitions $n$ datasets into $m$ groups, where $m$ is typically much smaller than $n$ in practice, and train one adapter for each group before taking a weighted combination to form the ensemble. The algorithm leverages a first-order approximation property of low-rank adaptation to quickly obtain the fine-tuning performances of dataset combinations since methods like LoRA stay close to the base model. Hence, we use the gradients of the base model to estimate its behavior during fine-tuning. Empirically, this approximation holds with less than $1\%$ error on models with up to $34$ billion parameters, leading to an estimation of true fine-tuning performances under $5\%$ error while speeding up computation compared to base fine-tuning by $10^5$ times. When applied to fine-tune Llama and GPT models on ten text classification tasks, our approach provides up to $10\%$ higher average test accuracy over QLoRA, with only $9\%$ more FLOPs. On a Llama model with $34$ billion parameters, an ensemble of QLoRA increases test accuracy by $3\%$ compared to QLoRA, with only $8\%$ more FLOPs.
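The first-order property the abstract relies on can be illustrated with a toy sketch: because LoRA-style updates stay close to the base model, the loss after fine-tuning on a group of datasets can be estimated from base-model gradients alone via a Taylor expansion. The quadratic surrogate losses and all names below (`taylor_estimate`, `true_finetune`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy setup: each "dataset" induces a quadratic surrogate loss around a
# target parameter vector near the base model (LoRA-like proximity).
rng = np.random.default_rng(0)
dim, n_datasets = 8, 4
base_params = rng.normal(size=dim)
targets = [base_params + 0.1 * rng.normal(size=dim) for _ in range(n_datasets)]

def loss(params, i):
    return 0.5 * float(np.sum((params - targets[i]) ** 2))

def grad(params, i):
    return params - targets[i]

def taylor_estimate(group, lr=0.05, steps=10):
    # Estimate the fine-tuned loss on a dataset group using only gradients
    # at the base model: L(theta0 + d) ~= L(theta0) + g_i . d.
    g = sum(grad(base_params, i) for i in group) / len(group)
    delta = -lr * steps * g  # accumulated first-order update direction
    return float(np.mean([loss(base_params, i) + grad(base_params, i) @ delta
                          for i in group]))

def true_finetune(group, lr=0.05, steps=10):
    # Reference: actually run gradient descent on the combined group.
    p = base_params.copy()
    for _ in range(steps):
        p -= lr * sum(grad(p, i) for i in group) / len(group)
    return float(np.mean([loss(p, i) for i in group]))
```

Under these toy quadratics the gradient-only estimate tracks the true fine-tuned loss closely, mirroring the paper's observation that the approximation error stays small as long as the adapted model remains near the base model.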
Related papers
- Complexity-aware fine-tuning [2.0393477576774752]
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy.
arXiv Detail & Related papers (2025-06-26T13:13:24Z) - Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning [16.99490636203893]
We present Ravan, an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity. Experiments on vision and language benchmarks show that Ravan improves test accuracy by 2-8% over prior parameter-efficient baselines.
arXiv Detail & Related papers (2025-06-05T20:28:02Z) - WeightLoRA: Keep Only Necessary Adapters [79.89637596855]
Low-rank adaptation (LoRA) adds trainable adapters to selected layers. We propose a novel method, WeightLoRA, which overcomes this issue by adaptive selection of the most critical LoRA heads. We conduct experiments on a series of competitive benchmarks with DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches.
arXiv Detail & Related papers (2025-06-03T10:33:16Z) - Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
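The core decoding-time idea can be sketched in a few lines: take next-token predictions from several models, each aligned to a different objective, and select the next token from their weighted combination. The interface below (`logit_fns`, toy `helpful`/`harmless` scorers over a 5-token vocabulary) is an assumption for illustration, not the paper's algorithm or guarantees.

```python
import numpy as np

def combined_next_token(logit_fns, weights, context):
    # Greedy pick from a weighted linear combination of per-model predictions.
    logits = sum(w * f(context) for w, f in zip(weights, logit_fns))
    return int(np.argmax(logits))

# Two toy "aligned models" over a 5-token vocabulary.
def helpful(ctx):   # prefers token 2
    return np.array([0.0, 0.1, 2.0, 0.1, 0.0])

def harmless(ctx):  # prefers token 4
    return np.array([0.0, 0.1, 0.0, 0.1, 2.0])
```

Shifting the weights trades one objective off against the other at decoding time, with no retraining: weights `[0.9, 0.1]` select the helpful model's preferred token, while `[0.1, 0.9]` select the harmless one's.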
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
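The two ideas in that entry compose naturally: check a cache first, and on a miss route the query to a cheap or expensive model based on a predicted difficulty score. The class below is a toy sketch of that composition (LRU cache plus threshold router); the names and the difficulty-score interface are assumptions, not the paper's algorithm.

```python
from collections import OrderedDict

class CachedMultiplexer:
    """Toy query pipeline: LRU cache in front of a two-model multiplexer."""

    def __init__(self, small_model, large_model, difficulty, capacity=128):
        self.small, self.large = small_model, large_model
        self.difficulty = difficulty          # query -> score in [0, inf)
        self.cache = OrderedDict()
        self.capacity = capacity
        self.calls = {"small": 0, "large": 0, "hits": 0}

    def query(self, q, threshold=0.5):
        if q in self.cache:
            self.cache.move_to_end(q)         # refresh LRU recency
            self.calls["hits"] += 1
            return self.cache[q]
        if self.difficulty(q) < threshold:    # easy query -> cheap model
            self.calls["small"] += 1
            ans = self.small(q)
        else:                                 # hard query -> large model
            self.calls["large"] += 1
            ans = self.large(q)
        self.cache[q] = ans
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return ans
```

Repeated queries are then served from the cache, and only hard misses pay the large-model cost.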
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - PL-$k$NN: A Parameterless Nearest Neighbors Classifier [0.24499092754102875]
The $k$-Nearest Neighbors is one of the most effective and straightforward models employed in numerous problems.
This paper proposes a $k$-Nearest Neighbors classifier that bypasses the need to define the value of $k$.
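One common way to avoid hand-picking $k$ is to select it from the training data itself, e.g. by leave-one-out accuracy. The sketch below uses that simple stand-in; it is NOT the PL-$k$NN procedure, which defines its own parameterless decision rule.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    # Plain majority-vote k-NN on Euclidean distance.
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    return int(np.argmax(np.bincount(y_train[idx])))

def auto_k_predict(X_train, y_train, x, k_grid=(1, 3, 5, 7)):
    # Pick k by leave-one-out accuracy on the training set -- a generic
    # data-driven choice, not the paper's parameterless rule.
    best_k, best_acc = k_grid[0], -1.0
    for k in k_grid:
        correct = 0
        for i in range(len(X_train)):
            mask = np.arange(len(X_train)) != i
            correct += knn_predict(X_train[mask], y_train[mask],
                                   X_train[i], k) == y_train[i]
        acc = correct / len(X_train)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return knn_predict(X_train, y_train, x, best_k)
```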
arXiv Detail & Related papers (2022-09-26T12:52:45Z) - Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air learning via the method of federated learning.
Compared to the state-of-the-art scheme, the proposed scheduling algorithms have drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z) - Efficient and robust high-dimensional sparse logistic regression via nonlinear primal-dual hybrid gradient algorithms [0.0]
We propose an iterative algorithm that provably computes a solution to a logistic regression problem regularized by an elastic net penalty.
This result improves on the known complexity bound of $O(\min(m^2 n, m n^2)\log(1/\epsilon))$ for first-order optimization methods.
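For reference, the elastic-net-regularized logistic objective that entry targets can be solved by a standard proximal-gradient baseline, sketched below. This is a generic first-order method (soft-thresholding handles the $\ell_1$ term), not the paper's nonlinear primal-dual hybrid gradient algorithm; the function name and defaults are illustrative.

```python
import numpy as np

def prox_elastic_net_logreg(X, y, lam1=0.1, lam2=0.1, lr=0.1, iters=500):
    """Proximal gradient descent for logistic regression with an
    elastic-net penalty lam1*||w||_1 + (lam2/2)*||w||_2^2.
    Labels y are expected in {0, 1}."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        z = X @ w
        # Gradient of the smooth part: logistic loss + l2 term.
        g = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n + lam2 * w
        w = w - lr * g
        # Proximal step for the l1 term: soft-thresholding.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)
    return w
```

On data where only one feature is informative, the $\ell_1$ prox drives the noise coordinates toward zero while the informative weight survives.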
arXiv Detail & Related papers (2021-11-30T14:16:48Z) - Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - Learning-to-Rank with Partitioned Preference: Fast Estimation for the Plackett-Luce Model [24.923231199480433]
Given $N$ items with $M$ partitions, calculating the likelihood of data with partitioned preference under the PL model has a time complexity of $O(N+S!)$.
We propose an efficient numerical integration approach for calculating the likelihood and its gradients with a time complexity $O(N+S^3)$.
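As background for that entry, the standard Plackett-Luce likelihood of a full ranking is a product of sequential softmax choices over the items not yet placed. The sketch below computes this for a total order only; the partitioned-preference case the paper addresses (sets of tied items) is what makes the exact likelihood factorial and motivates their numerical-integration approach.

```python
import math

def pl_log_likelihood(scores, ranking):
    """Log-likelihood of a full ranking under the Plackett-Luce model:
    at each position, the next item is a softmax draw from the items
    that have not been placed yet."""
    remaining = list(ranking)
    ll = 0.0
    for item in ranking:
        denom = sum(math.exp(scores[j]) for j in remaining)
        ll += scores[item] - math.log(denom)
        remaining.remove(item)
    return ll
```

With uniform scores every ranking of three items has probability $1/3!$, and raising one item's score makes rankings that place it earlier more likely.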
arXiv Detail & Related papers (2020-06-09T06:11:21Z) - Sparse Regression at Scale: Branch-and-Bound rooted in First-Order Optimization [6.037383467521294]
We present a new exact MIP framework for $\ell_0$ regularized regression.
Our framework can scale to $p \sim 10^7$, achieving speedups of at least $5000\times$.
We open source the implementation through our toolkit L0BnB.
arXiv Detail & Related papers (2020-04-13T18:45:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.