Related papers: Efficient Model Development through Fine-tuning Transfer

URL: http://arxiv.org/abs/2503.20110v2
Date: Thu, 06 Nov 2025 05:26:34 GMT
Title: Efficient Model Development through Fine-tuning Transfer
Authors: Pin-Jie Lin, Rishab Balasubramanian, Fengyuan Liu, Nikhil Kandpal, Tu Vu
Abstract summary: We show that transferring diff vectors can significantly improve the performance of the target base model. We demonstrate performance gains on multilingual tasks, with 4.7% and 15.5% improvements on Global MMLU for Malagasy and Turkish. Our experiments suggest that fine-tuning transfer is most effective when source and target models lie in a linearly connected region of parameter space.
Score: 13.244979249153872
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern LLMs struggle with efficient updates, as each new pretrained model version requires repeating expensive alignment processes. This challenge also applies to domain- or language-specific models, where fine-tuning on specialized data must be redone for every new base model release. In this paper, we explore the transfer of fine-tuning updates between model versions. Specifically, we derive the diff vector (representing the weight changes from fine-tuning) from one source model version and apply it to the base model of a different target version. Through empirical evaluations on various open-weight model versions, we show that transferring diff vectors can significantly improve the performance of the target base model. For example, transferring the fine-tuning updates from Llama 3.0 8B improves Llama 3.1 8B by 46.9% on IFEval and 15.7% on LiveCodeBench without additional training, even surpassing Llama 3.1 8B Instruct. Furthermore, we demonstrate performance gains on multilingual tasks, with 4.7% and 15.5% improvements on Global MMLU for Malagasy and Turkish, respectively. We observe that these merged models provide stronger initializations for further fine-tuning. Lastly, our controlled experiments suggest that fine-tuning transfer is most effective when source and target models lie in a linearly connected region of parameter space, and we provide a theoretical analysis of our method. Taken together, fine-tuning transfer offers a cost-efficient and practical strategy for continuous LLM development. Our code is available at github.com/pjlintw/finetuning-transfer.
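The diff-vector transfer described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation (their code is at the repository named above). It assumes the source and target models share parameter names and shapes, and it uses plain Python lists in place of real tensors. The names `Weights`, `diff_vector`, `apply_diff`, and all numeric values are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def diff_vector(finetuned: Weights, base: Weights) -> Weights:
    """Weight change introduced by fine-tuning: finetuned - base, per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_diff(target_base: Weights, diff: Weights) -> Weights:
    """Add a source diff vector to a target base model with matching shapes."""
    merged: Weights = {}
    for name, values in target_base.items():
        if name not in diff or len(diff[name]) != len(values):
            raise ValueError(f"parameter {name!r} missing or mismatched in diff")
        merged[name] = [v + d for v, d in zip(values, diff[name])]
    return merged


# Made-up weights for a source version (base and fine-tuned) and a target base.
source_base = {"w": [1.0, 2.0], "b": [0.5]}
source_finetuned = {"w": [1.5, 1.0], "b": [0.75]}
target_base = {"w": [2.0, 4.0], "b": [1.0]}

delta = diff_vector(source_finetuned, source_base)
target_merged = apply_diff(target_base, delta)
print(target_merged)
```

No training happens in this step: the merged weights are the target base weights plus the source fine-tuning update, and the abstract notes that such merged models can also serve as an initialization for further fine-tuning.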

Related papers

Ghost in the Transformer: Tracing LLM Lineage with SVD-Fingerprint [40.05267673405802]
Large Language Models (LLMs) have rapidly advanced and are widely adopted across diverse fields. Many developers choose to fine-tune or modify existing open-source models. Some falsely claim original training despite clear derivation from public models. This raises pressing concerns about intellectual property protection.
arXiv Detail & Related papers (2025-11-09T13:57:59Z)
POME: Post Optimization Model Edit via Muon-style Projection [74.73326657229347]
Post-Optimization Model Edit (POME) enhances the performance of fine-tuned large language models. It applies a muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights. As a simple post-processing step, POME is completely decoupled from the training pipeline.
arXiv Detail & Related papers (2025-10-08T04:20:11Z)
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models [49.911784762244814]
TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models (DLMs). We derive a series of state-of-the-art diffusion language models, namely TraDo. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-08T17:58:06Z)
Approximating Language Model Training Data from Weights [70.08614275061689]
We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents.
arXiv Detail & Related papers (2025-06-18T15:26:43Z)
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer [17.463052541838504]
Fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate interference when merging model parameters across tasks. We introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models.
arXiv Detail & Related papers (2025-05-24T14:27:20Z)
Shadow-FT: Tuning Instruct Model via Training on Paired Base Model [67.20706292627106]
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. We propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance.
arXiv Detail & Related papers (2025-05-19T05:16:21Z)
Extrapolation Merging: Keep Improving With Extrapolation and Merging [14.786100203787194]
Large Language Models (LLMs) require instruction fine-tuning to perform different downstream tasks. Model merging aims to enhance performance by combining the parameters of different models. We propose Extrapolation Merging, a paradigm that can continue improving model performance without requiring extra computational resources or data.
arXiv Detail & Related papers (2025-03-05T14:28:22Z)
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains [114.76612918465948]
Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. We propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models.
arXiv Detail & Related papers (2025-01-10T04:35:46Z)
Self-Data Distillation for Recovering Quality in Pruned Large Language Models [1.5665059604715017]
One-shot pruning results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting. In this work, we utilize self-data distilled fine-tuning to address these challenges.
arXiv Detail & Related papers (2024-10-13T19:53:40Z)
MUSCLE: A Model Update Strategy for Compatible LLM Evolution [29.032461144831053]
Large Language Models (LLMs) are regularly updated to enhance performance. Instance-level degradation (instance regression) of performance from one model version to the next can interfere with a user's mental model of the capabilities of a particular language model. We propose a training strategy to minimize the extent of instance regression in model updates.
arXiv Detail & Related papers (2024-07-12T17:12:48Z)
EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods. EMR-Merging is tuning-free: it requires neither data nor additional training, yet still delivers impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
Fisher Mask Nodes for Language Model Merging [0.0]
We introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning. Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging at a fraction of the computational cost.
arXiv Detail & Related papers (2024-03-14T21:52:26Z)
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking [53.66999416757543]
We study how fine-tuning affects the internal mechanisms implemented in language models. Fine-tuning enhances, rather than alters, the mechanistic operation of the model.
arXiv Detail & Related papers (2024-02-22T18:59:24Z)
A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network. We then prune the redundant blocks of the model while maintaining network performance. Finally, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics [7.58472343957521]
We show that training dynamics are highly transferable across model sizes and pre-training methods. We propose a novel fine-tuning approach: Fine-Tuning by transFerring Training dynamics (FTFT).
arXiv Detail & Related papers (2023-10-10T12:53:48Z)
How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization [13.440645736306267]
This paper develops an algorithm for model-based reinforcement learning. It unifies model shift and model bias and then formulates a fine-tuning process. It achieves state-of-the-art performance on several challenging benchmark tasks.
arXiv Detail & Related papers (2023-09-22T07:27:32Z)
Towards Efficient Task-Driven Model Reprogramming with Foundation Models [52.411508216448716]
Vision foundation models exhibit impressive power, benefiting from the extremely large model capacity and broad training data. However, in practice, downstream scenarios may only support a small model due to the limited computational resources or efficiency considerations. This brings a critical challenge for the real-world application of foundation models: one has to transfer the knowledge of a foundation model to the downstream task.
arXiv Detail & Related papers (2023-04-05T07:28:33Z)
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model. Vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive. We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.