Related papers: Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

URL: http://arxiv.org/abs/2502.06042v1
Date: Sun, 09 Feb 2025 21:44:27 GMT
Title: Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
Authors: Louis Bethune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, Pierre Ablin
Abstract summary: Finetuning a pretrained model to perform unsupervised prediction on data from a target domain presents two challenges. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
Score: 37.65064631532493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
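As an illustration of what "injecting 1% of pretraining data in the finetuning data mixture" could look like in practice, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the mixture is built, so per-example random sampling is an assumption, and the names `mixed_stream`, `take`, and `p_pretrain` are hypothetical.

```python
import random
from itertools import islice
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mixed_stream(
    target: Sequence[T],
    pretrain: Sequence[T],
    p_pretrain: float = 0.01,
    seed: int = 0,
) -> Iterator[T]:
    """Yield an endless stream of training examples.

    Each example is drawn from `pretrain` with probability `p_pretrain`
    and from `target` otherwise. Per-example sampling is an assumption
    made for this sketch, not a detail taken from the paper.
    """
    rng = random.Random(seed)
    while True:
        # Pick the source first, then a uniformly random example from it.
        source = pretrain if rng.random() < p_pretrain else target
        yield rng.choice(source)


def take(stream: Iterator[T], n: int) -> List[T]:
    """Collect the first `n` examples of a stream into a list."""
    return list(islice(stream, n))
```

Calling `take(mixed_stream(target_docs, pretrain_docs, p_pretrain=0.01), batch_size)` would then give a batch in which roughly one example in a hundred comes from the pretraining set; `target_docs` and `pretrain_docs` stand for any non-empty sequences of examples.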

Related papers

Early Stopping Against Label Noise Without Validation Data [54.27621957395026]
We propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model. We show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels.
arXiv Detail & Related papers (2025-02-11T13:40:15Z)
Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting [15.251425165987987]
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities. We propose a sample weighting scheme for the fine-tuning data based on the pre-trained model's losses. We empirically demonstrate the efficacy of our method on both language and vision tasks.
arXiv Detail & Related papers (2025-02-05T00:49:59Z)
How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning [2.3759432635713895]
We attack both pre-trained and fine-tuned code language models to investigate the extent of data extractability. Fine-tuning requires fewer resources and is increasingly used by both small and large entities for its effectiveness on specialized data. Data carriers and licensing information are the most likely data to be memorized from pre-trained and fine-tuned models, while the latter is the most likely to be forgotten after fine-tuning.
arXiv Detail & Related papers (2025-01-29T09:17:30Z)
The interplay between domain specialization and model size: a case study in the legal domain [8.653321928148547]
We investigate the interplay between domain and model size during continual pre-training under compute-constrained scenarios. Our goal is to identify a compute-efficient training regime for this scenario. As model size increases, the compute-effectiveness gap between specialized and general models widens.
arXiv Detail & Related papers (2025-01-03T19:28:53Z)
Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
Automated Data Augmentation for Few-Shot Time Series Forecasting: A Reinforcement Learning Approach Guided by a Model Zoo [34.40047933452929]
We present a pilot study on using reinforcement learning (RL) for time series data augmentation. Our method, ReAugment, tackles three critical questions: which parts of the training set should be augmented, how the augmentation should be performed, and what advantages RL brings to the process.
arXiv Detail & Related papers (2024-09-10T07:34:19Z)
Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
TMI! Finetuned Models Leak Private Information from their Pretraining Data [5.150344987657356]
We propose a new membership-inference threat model where the adversary only has access to the finetuned model. We evaluate TMI on both vision and natural language tasks across multiple transfer learning settings. An open-source implementation of TMI can be found on GitHub.
arXiv Detail & Related papers (2023-06-01T22:29:28Z)
Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers [71.70205894168039]
We consider instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model. We propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information.
arXiv Detail & Related papers (2023-01-27T07:53:50Z)
Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
How to Learn when Data Gradually Reacts to Your Model [10.074466859579571]
We propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects. Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T22:05:26Z)
