The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models
- URL: http://arxiv.org/abs/2512.22337v1
- Date: Fri, 26 Dec 2025 18:55:42 GMT
- Title: The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models
- Authors: Matthew Riemer, Erik Miehling, Miao Liu, Djallel Bouneffouf, Murray Campbell
- Abstract summary: LoRA-based supervised fine-tuning can catastrophically degrade model capabilities. Small tweaks to the training procedure with very little overhead can virtually eliminate the problem.
- Score: 17.1510128169152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although parameter-efficient fine-tuning methods such as LoRA modify only a small subset of parameters, they can have a significant impact on the model. Our instruction-tuning experiments show that LoRA-based supervised fine-tuning can catastrophically degrade model capabilities, even when trained on very small datasets for relatively few steps. That said, we demonstrate that while the most straightforward approach (likely the one most used in practice) fails spectacularly, small tweaks to the training procedure with very little overhead can virtually eliminate the problem. In particular, we consider a regularized approximate replay approach that penalizes KL divergence with respect to the initial model and interleaves data for next-token prediction from an open-access corpus that is different from, yet similar to, the one used in pre-training. When applied to Qwen instruction-tuned models, we find that this recipe preserves general knowledge in the model without hindering plasticity on new tasks, at the cost of a modest amount of computational overhead.
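To make the recipe concrete, below is a minimal PyTorch sketch of one training step combining the three ingredients the abstract names: the supervised fine-tuning loss, a KL penalty against the frozen initial model, and an interleaved next-token-prediction batch from an open-access corpus. The function name, loss weighting, and hyperparameters (`kl_weight`, `replay_weight`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def regularized_replay_step(model, ref_model, sft_batch, replay_batch,
                            optimizer, kl_weight=0.1, replay_weight=1.0):
    """One step of regularized approximate replay (illustrative sketch).

    `ref_model` is a frozen copy of the initial model; batches are assumed
    to be Hugging Face-style dicts with `input_ids` and `labels`.
    """
    optimizer.zero_grad()

    # Standard supervised fine-tuning loss on the new task.
    out = model(**sft_batch)
    sft_loss = out.loss

    # KL divergence between the current and the frozen initial model,
    # computed on the same fine-tuning tokens.
    with torch.no_grad():
        ref_logits = ref_model(**sft_batch).logits
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )

    # Approximate replay: next-token prediction on an open-access corpus
    # similar to (but not the same as) the pre-training data.
    replay_loss = model(**replay_batch).loss

    loss = sft_loss + kl_weight * kl + replay_weight * replay_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

The only overhead relative to plain SFT is the frozen model's forward pass and the extra replay batch, which matches the abstract's claim of modest computational cost.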
Related papers
- Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading [0.5825410941577593]
Fine-tuning methods have historically required large-scale compute clusters inaccessible to most users.
New closed-model approaches, such as OpenAI's fine-tuning service, promise results with as few as 100 examples.
We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading.
arXiv Detail & Related papers (2025-08-06T03:52:55Z)
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
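The summary leaves the mechanism implicit: SaRA fine-tunes the parameters with the smallest absolute values, which the paper treats as temporarily ineffective. A toy PyTorch sketch of that masking idea follows; the `keep_ratio` threshold rule is an assumption, and the actual method further adds a low-rank constraint and progressive parameter adjustment.

```python
import torch

def make_sparse_masks(model, keep_ratio=0.05):
    """Mark the smallest-magnitude weights as trainable (illustrative only)."""
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int(keep_ratio * p.numel()))
        # Magnitude below which a parameter counts as "ineffective".
        thresh = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() <= thresh).float()
    return masks

def mask_gradients(model, masks):
    """Zero the gradients of all but the masked parameters; call this after
    loss.backward() and before optimizer.step()."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name])
```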
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning [21.693779973263172]
In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align).
Our method aims to bolster the model's generalizability by preserving the consistency of spurious features.
Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements.
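As a rough illustration of "preserving the consistency of spurious features", one can penalize divergence between the fine-tuned and frozen encoders' distributions over a fixed set of spurious-feature prototypes (e.g., prompt-template embeddings). Everything below (the prototype set, temperature, and loss form) is an assumption for illustration, not FD-Align's exact objective.

```python
import torch
import torch.nn.functional as F

def spurious_consistency_loss(feat_new, feat_old, spurious_protos, tau=0.07):
    """KL between the fine-tuned and frozen encoders' softmax similarities
    to fixed 'spurious' prototypes (illustrative sketch)."""
    logp_new = F.log_softmax(feat_new @ spurious_protos.T / tau, dim=-1)
    with torch.no_grad():
        p_old = F.softmax(feat_old @ spurious_protos.T / tau, dim=-1)
    return F.kl_div(logp_new, p_old, reduction="batchmean")
```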
arXiv Detail & Related papers (2023-10-23T17:12:01Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
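The up-scaling case admits a compact sketch: add the behavioral delta of a small fine-tuned model (relative to its small pre-trained base) to a large pre-trained model's log-probabilities at decode time. This is a sketch of the idea as summarized above, not the authors' implementation.

```python
import torch

def eft_upscaled_logprobs(base_large_logits, ft_small_logits, base_small_logits):
    """Emulate fine-tuning the large model by combining its base distribution
    with the small model's fine-tuning delta (illustrative sketch)."""
    logp_large = torch.log_softmax(base_large_logits, dim=-1)
    delta = (torch.log_softmax(ft_small_logits, dim=-1)
             - torch.log_softmax(base_small_logits, dim=-1))
    # Renormalize the combined scores into a proper distribution.
    return torch.log_softmax(logp_large + delta, dim=-1)
```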
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- PELA: Learning Parameter-Efficient Models with Low-Rank Approximation [16.9278983497498]
We propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage.
This allows for direct and efficient utilization of the low-rank model for downstream fine-tuning tasks.
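The low-rank step itself can be illustrated with a truncated SVD that factors each weight matrix into two thin matrices; PELA's actual contribution, the intermediate pre-training stage that recovers accuracy after this compression, is not shown. A generic sketch:

```python
import torch

def low_rank_factorize(weight, rank):
    """Compress a weight matrix W (out x in) into factors A (out x rank) and
    B (rank x in) via truncated SVD, so W @ x becomes A @ (B @ x)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vh[:rank, :]
    return A, B
```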
arXiv Detail & Related papers (2023-10-16T07:17:33Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
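A minimal sketch of the described mechanism: embed a batch of point-to-model residuals, mix them with a one-step transformer layer, and emit per-point weights that can bias the next hypothesis sample. Dimensions, the state update, and the weight head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """One-step attention over residuals updating a per-point state
    (illustrative sketch; initialize `state` to zeros of shape (B, N, dim))."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.embed = nn.Linear(1, dim)
        self.mix = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)
        self.to_weight = nn.Linear(dim, 1)

    def forward(self, residuals, state):
        # residuals: (B, N) distances of each point to the current model.
        h = self.embed(residuals.unsqueeze(-1)) + state
        state = self.mix(h)                       # share consensus information
        weights = torch.sigmoid(self.to_weight(state)).squeeze(-1)
        return weights, state                     # weights bias the next sample
```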
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- Taming Small-sample Bias in Low-budget Active Learning [20.900107811622803]
Firth bias reduction can provably reduce the bias during the model training process but might hinder learning if its coefficient is not adaptive to the learning progress.
We propose curriculum Firth bias reduction (CHAIN) that can automatically adjust the coefficient to be adaptive to the training process.
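For softmax classifiers, Firth bias reduction is often implemented as cross-entropy plus a penalty proportional to the cross-entropy between the uniform distribution and the predictions. Below is a sketch with a stand-in coefficient schedule; CHAIN's actual adaptive rule is more sophisticated than the linear decay assumed here.

```python
import torch.nn.functional as F

def firth_penalized_loss(logits, targets, lam):
    """Cross-entropy plus a Firth-style penalty (illustrative sketch)."""
    ce = F.cross_entropy(logits, targets)
    # Mean of -log p over classes == cross-entropy to uniform (up to a constant).
    firth = -F.log_softmax(logits, dim=-1).mean()
    return ce + lam * firth

def coefficient_schedule(step, total_steps, lam_max=1.0):
    """Linearly decaying stand-in for CHAIN's adaptive coefficient (assumption)."""
    return lam_max * (1.0 - step / total_steps)
```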
arXiv Detail & Related papers (2023-06-19T16:42:11Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm where a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
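The PEFT method proposed in that paper, (IA)^3, trains only small vectors that elementwise rescale keys, values, and intermediate feed-forward activations, each initialized to ones. A minimal module capturing the idea (a sketch, not the reference implementation):

```python
import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    """(IA)^3-style learned rescaling vector, initialized to ones so the
    model's behavior is unchanged at the start of training."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # x: (..., dim), e.g. attention keys/values or FFN activations.
        return x * self.scale
```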
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
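One member of the family the paper unifies is extrapolated SGD: gradients are evaluated at a point extrapolated beyond the current iterate, and the update is applied at the original iterate. The sketch below assumes the simple rule x_hat = x + beta * (x - x_prev); the paper's unified scheme covers further variants.

```python
import torch

def extrapolated_sgd_step(params, prev_params, compute_loss, lr=0.1, beta=0.5):
    """One extrapolated-SGD step (illustrative sketch). `compute_loss` should
    zero gradients, run the forward pass, and call backward()."""
    current = [p.detach().clone() for p in params]
    with torch.no_grad():
        for p, q in zip(params, prev_params):
            p.add_(beta * (p - q))        # move to the extrapolated point x_hat
    loss = compute_loss()                 # populates p.grad at x_hat
    with torch.no_grad():
        for p, x, q in zip(params, current, prev_params):
            q.copy_(x)                    # current iterate becomes new x_prev
            p.copy_(x - lr * p.grad)      # apply the update at the original x
    return loss
```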
arXiv Detail & Related papers (2020-06-10T08:22:41Z)