Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting
- URL: http://arxiv.org/abs/2502.02797v1
- Date: Wed, 05 Feb 2025 00:49:59 GMT
- Title: Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting
- Authors: Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, Sujay Sanghavi
- Abstract summary: Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities.
We propose a sample weighting scheme for the fine-tuning data based on the pre-trained model's losses.
We empirically demonstrate the efficacy of our method on both language and vision tasks.
- Abstract: Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8\%$ drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4\%$ more accuracy on the pre-training datasets. Our code is publicly available at https://github.com/sanyalsunny111/FLOW_finetuning .
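A minimal PyTorch sketch of the weighting idea from the abstract, for a classification setting: compute per-sample losses under the frozen pre-trained model, turn them into weights that favor low-loss (easy) samples, and use the weighted loss for fine-tuning. The softmax-of-negative-losses form and the `temperature` knob are illustrative assumptions, not necessarily the authors' exact scheme; see the linked repository for the official FLOW implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_weights(pretrained_model, inputs, targets, temperature=1.0):
    """Per-sample weights from the frozen pre-trained model's losses:
    low loss (easy sample) -> large weight, high loss (hard) -> small weight."""
    losses = F.cross_entropy(pretrained_model(inputs), targets, reduction="none")
    return torch.softmax(-losses / temperature, dim=0)

def weighted_finetune_step(model, pretrained_model, optimizer, inputs, targets):
    weights = sample_weights(pretrained_model, inputs, targets)
    per_sample = F.cross_entropy(model(inputs), targets, reduction="none")
    loss = (weights * per_sample).sum()  # upweights easy samples, limiting drift
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```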
Related papers
- Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection [37.65064631532493]
Finetuning a pretrained model to perform unsupervised prediction on data from a target domain presents two challenges: forgetting the pretraining distribution and overfitting to the target domain.
We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting.
A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
arXiv Detail & Related papers (2025-02-09T21:44:27Z)
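The 1% injection takeaway above is a concrete recipe, sketched minimally below. The function name and list-based dataset interface are illustrative assumptions; the paper's exact mixing procedure may differ.

```python
import random

def build_mixture(finetune_data, pretrain_data, inject_frac=0.01, seed=0):
    """Fine-tuning set with a small fraction (e.g. 1%) of pretraining
    samples mixed in, to guard against forgetting the pretraining set."""
    rng = random.Random(seed)
    n_inject = max(1, int(inject_frac * len(finetune_data)))
    mixture = list(finetune_data) + rng.sample(list(pretrain_data), n_inject)
    rng.shuffle(mixture)
    return mixture
```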
- Reducing Bias in Pre-trained Models by Tuning while Penalizing Change [8.862970622361747]
Deep models trained on large amounts of data often incorporate implicit biases present in the training data.
New data is often expensive and hard to come by in areas such as autonomous driving or medical decision-making.
We present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias.
arXiv Detail & Related papers (2024-04-18T16:12:38Z)
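A generic form of change penalization adds a quadratic penalty on the drift from the pre-trained weights to the adaptation loss. The sketch below shows that generic idea only; the paper's specific penalty and how it targets the detected bias are not reproduced here.

```python
import torch

def change_penalty(model, pretrained_params, strength=1e-3):
    """Quadratic penalty on how far the current weights have drifted from
    the pre-trained ones (generic change penalization)."""
    drift = sum((p - pretrained_params[n]).pow(2).sum()
                for n, p in model.named_parameters())
    return strength * drift

# Snapshot once before tuning, then add the penalty to the task loss:
#   pretrained_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   loss = task_loss + change_penalty(model, pretrained_params)
```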
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first to comprehensively analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning [55.467047686093025]
A common approach to alleviate such forgetting is to rehearse samples from prior tasks during fine-tuning.
We propose a sampling scheme, mix-cd, that prioritizes rehearsal of "collateral damage" samples.
Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings.
arXiv Detail & Related papers (2024-02-12T22:32:12Z)
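A rough sketch of the "collateral damage" selection criterion above: flag prior-task samples the model answered correctly before fine-tuning but gets wrong now, and prioritize them for rehearsal. This shows only the criterion; mix-cd's full sampling scheme is more involved.

```python
import torch

@torch.no_grad()
def collateral_damage_mask(current_model, pretrained_model, inputs, targets):
    """True for prior-task samples the pre-trained model classified
    correctly but the fine-tuned model now gets wrong."""
    was_correct = pretrained_model(inputs).argmax(dim=-1) == targets
    now_wrong = current_model(inputs).argmax(dim=-1) != targets
    return was_correct & now_wrong  # rehearsal candidates
```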
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay data from experienced tasks while learning new ones.
However, storing such data is often impractical due to memory constraints or data privacy concerns.
As an alternative, data-free replay methods synthesize samples by inverting the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-trained models have achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models, early dropout: apply dropout only during the initial phase of training and turn it off afterwards.
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
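Early dropout, as described above, is simple to sketch: enable dropout for the first phase of training only, then zero it out. The cutoff epoch and dropout rate below are illustrative placeholders, not the paper's tuned values.

```python
import torch.nn as nn

def set_dropout(model, p):
    """Set the rate of every nn.Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

def train_with_early_dropout(model, loader, optimizer, loss_fn,
                             epochs=100, dropout_epochs=10, p=0.1):
    for epoch in range(epochs):
        # Dropout only during the initial phase, off afterwards.
        set_dropout(model, p if epoch < dropout_epochs else 0.0)
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()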
- Two-Stage Fine-Tuning: A Novel Strategy for Learning Class-Imbalanced Data [11.66734752179563]
Classification on long-tailed distributed data is a challenging problem.
Learning tail classes is especially challenging when fine-tuning a pretrained model on a downstream task.
We propose a two-stage fine-tuning strategy: first fine-tune the final layer of the pretrained model with a class-balanced reweighting loss, and then perform standard fine-tuning.
arXiv Detail & Related papers (2022-07-22T03:39:51Z)
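A minimal sketch of the two stages described above, assuming a backbone/head split and using the "effective number" class-balanced weights of Cui et al. (2019) as one plausible choice of reweighting loss; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(all_targets, num_classes, beta=0.999):
    """'Effective number' class-balanced weights: rare classes get more weight."""
    counts = torch.bincount(all_targets, minlength=num_classes).clamp(min=1)
    weights = (1.0 - beta) / (1.0 - beta ** counts.float())
    return weights / weights.sum() * num_classes

def two_stage_finetune(backbone, head, loader, all_targets, num_classes,
                       epochs=(5, 5)):
    w = class_balanced_weights(all_targets, num_classes)
    # Stage 1: freeze the backbone, tune the final layer with the
    # class-balanced reweighting loss.
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD(head.parameters(), lr=1e-2)
    for _ in range(epochs[0]):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(head(backbone(x)), y, weight=w).backward()
            opt.step()
    # Stage 2: unfreeze everything and run standard fine-tuning.
    for p in backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD([*backbone.parameters(), *head.parameters()], lr=1e-3)
    for _ in range(epochs[1]):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(head(backbone(x)), y).backward()
            opt.step()
```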
- Delving into Sample Loss Curve to Embrace Noisy and Imbalanced Data [17.7825114228313]
Corrupted labels and class imbalance are commonly encountered in real-world training data.
Existing approaches alleviate these issues by adopting a sample re-weighting strategy.
However, samples with corrupted labels and samples from tail classes often co-exist in the training data.
arXiv Detail & Related papers (2021-12-30T09:20:07Z)