Influence-driven Curriculum Learning for Pre-training on Limited Data
- URL: http://arxiv.org/abs/2508.15475v2
- Date: Fri, 26 Sep 2025 09:45:06 GMT
- Title: Influence-driven Curriculum Learning for Pre-training on Limited Data
- Authors: Loris Schoenegger, Lukas Thoma, Terra Blevins, Benjamin Roth
- Abstract summary: We investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points on benchmarks.
- Score: 8.8896707993459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points on benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
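The listing gives no implementation details, but the ordering step described in the abstract (scoring each pre-training example by its estimated training data influence and presenting examples in that order) can be sketched as follows. The `influence_score` callable is a placeholder assumption; the paper's actual influence estimator is not specified here.

```python
from typing import Callable, List, Sequence

def build_curriculum(
    examples: Sequence[str],
    influence_score: Callable[[str], float],
    easy_to_hard: bool = True,
) -> List[str]:
    """Order pre-training examples by an influence-based score.

    `influence_score` is assumed to estimate how strongly a single training
    example affects the model's output (its training data influence); lower
    influence is treated here as "easier". This sketches only the ordering
    step, not the paper's full pipeline.
    """
    scored = [(influence_score(ex), ex) for ex in examples]
    scored.sort(key=lambda pair: pair[0], reverse=not easy_to_hard)
    return [ex for _, ex in scored]

# The returned list is fed to the trainer in this fixed order
# instead of reshuffling the corpus every epoch.
```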
Related papers
- Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding [53.63482987410292]
We present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks.
arXiv Detail & Related papers (2025-07-13T19:36:17Z)
- Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning [23.900888224619]
We show that curriculum learning consistently improves convergence in early and mid-training phases. We identify compression ratio, lexical diversity, and readability as effective difficulty signals across settings.
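A rough sketch of how such surface-level difficulty signals might be computed per document, using only the standard library; the exact definitions in the paper may differ:

```python
import re
import zlib

def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text scores higher."""
    raw = text.encode("utf-8")
    return len(raw) / max(1, len(zlib.compress(raw)))

def lexical_diversity(text: str) -> float:
    """Type-token ratio: distinct words divided by total words."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return len(set(tokens)) / max(1, len(tokens))

def mean_sentence_length(text: str) -> float:
    """Average words per sentence, a crude stand-in for a readability score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text)
    return len(words) / max(1, len(sentences))
```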
arXiv Detail & Related papers (2025-06-12T21:06:57Z)
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
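The summary does not state which statistics are used, but difficulty levels derived from training dynamics are commonly computed from each example's gold-label confidence across epochs (as in dataset cartography); a minimal sketch under that assumption, with illustrative thresholds:

```python
import numpy as np

def difficulty_levels(gold_probs_per_epoch: np.ndarray) -> list:
    """Bucket examples into easy / medium / hard from training dynamics.

    gold_probs_per_epoch: shape (n_epochs, n_examples), each example's
    gold-label probability recorded at the end of every training epoch.
    The 0.75 / 0.4 thresholds are illustrative, not the paper's.
    """
    confidence = gold_probs_per_epoch.mean(axis=0)  # mean confidence per example
    levels = []
    for c in confidence:
        if c > 0.75:
            levels.append("easy")
        elif c > 0.4:
            levels.append("medium")
        else:
            levels.append("hard")
    return levels
```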
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training [79.96741042766524]
We reformulate the training curriculum as a soft-selection function.
We show that controlling how much of a natural image's content is exposed can be readily achieved by modulating the intensity of data augmentation.
The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective.
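One possible reading of this augmentation-based curriculum, sketched as a simple schedule that starts with weak augmentation and ramps up over training; the schedule and bounds here are assumptions, not the paper's exact recipe:

```python
def augmentation_magnitude(step: int, total_steps: int,
                           min_mag: float = 0.1, max_mag: float = 1.0) -> float:
    """Linearly ramp data-augmentation intensity over training.

    Early steps use weak augmentation (most of the original image content is
    visible); later steps use stronger augmentation. Plug the returned value
    into whatever augmentation policy the training pipeline uses.
    """
    frac = min(1.0, step / max(1, total_steps))
    return min_mag + frac * (max_mag - min_mag)
```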
arXiv Detail & Related papers (2024-05-14T17:00:43Z)
- Unlearning Traces the Influential Training Data of Language Models [31.33791825286853]
This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance.
We propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets.
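A hedged PyTorch sketch of the UnTrac-Inv loop as described in the summary: unlearn a test set by gradient ascent on its loss, then measure how each training dataset's metric changes. The model is assumed to follow the HuggingFace convention of returning a `.loss`, and the actual unlearning objective in the paper may differ.

```python
import copy
import torch

def untrac_inv(model, test_batches, train_eval_fns, lr=1e-5, steps=10):
    """Attribute influence by unlearning a test set, then re-evaluating.

    test_batches: iterable of (input_ids, labels) tensor pairs for the test set.
    train_eval_fns: dict mapping training-dataset name -> callable returning a
        scalar metric for a given model.
    Returns the change in each training dataset's metric after unlearning.
    """
    before = {name: fn(model) for name, fn in train_eval_fns.items()}

    unlearned = copy.deepcopy(model)
    optimizer = torch.optim.SGD(unlearned.parameters(), lr=lr)
    for _ in range(steps):
        for input_ids, labels in test_batches:
            loss = unlearned(input_ids=input_ids, labels=labels).loss
            (-loss).backward()  # gradient ascent on the test loss = unlearning
            optimizer.step()
            optimizer.zero_grad()

    after = {name: fn(unlearned) for name, fn in train_eval_fns.items()}
    return {name: after[name] - before[name] for name in before}
```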
arXiv Detail & Related papers (2024-01-26T23:17:31Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
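The up-scaling composition described above can be sketched as simple log-probability arithmetic: add the small model's fine-tuning shift to the large base model's distribution. This is a sketch of the general idea, not the authors' exact decoding code.

```python
import torch

def upscaled_logits(logits_large_base: torch.Tensor,
                    logits_small_ft: torch.Tensor,
                    logits_small_base: torch.Tensor) -> torch.Tensor:
    """Emulate fine-tuning a large model by ensembling it with small models.

    The fine-tuning shift of the small model (fine-tuned minus base log-probs)
    is added to the large base model's log-probs; apply a softmax to the
    result before sampling the next token.
    """
    delta = (torch.log_softmax(logits_small_ft, dim=-1)
             - torch.log_softmax(logits_small_base, dim=-1))
    return torch.log_softmax(logits_large_base, dim=-1) + delta
```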
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- On the Impact of Hard Adversarial Instances on Overfitting in Adversarial Training [70.82725772926949]
Adversarial training is a popular method to robustify models against adversarial attacks. In this work, we investigate overfitting in adversarial training from the perspective of training instances. We show that the decay in generalization performance of adversarial training is a result of fitting hard adversarial instances.
arXiv Detail & Related papers (2021-12-14T12:19:24Z)
- Token-wise Curriculum Learning for Neural Machine Translation [94.93133801641707]
Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from the training data at the early training stage.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
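One way to realize a token-wise curriculum, sketched below: only a growing prefix of each target sequence contributes to the loss early in training. The linear schedule is an assumption for illustration, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def prefix_curriculum_loss(logits: torch.Tensor, targets: torch.Tensor,
                           step: int, total_steps: int,
                           pad_id: int = 0) -> torch.Tensor:
    """Token-wise curriculum: only a prefix of each target counts early on.

    logits:  (batch, seq_len, vocab) decoder outputs.
    targets: (batch, seq_len) gold token ids.
    The fraction of visible target tokens grows linearly with `step` until
    the full sequence contributes to the loss.
    """
    batch, seq_len, vocab = logits.shape
    frac = min(1.0, (step + 1) / max(1, total_steps))
    visible = max(1, int(frac * seq_len))  # number of target tokens kept

    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[:, :visible] = True
    mask &= targets.ne(pad_id)  # never train on padding

    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                           reduction="none").reshape(batch, seq_len)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```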
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
- Training Data Leakage Analysis in Language Models [6.843491191969066]
We introduce a methodology for identifying user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics to quantify user-level data leakage by measuring a model's ability to reproduce unique sentence fragments from the training data.
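A rough illustration of the underlying measurement: count n-gram fragments that are unique to one user's training documents and reappear in model output. The paper's metrics are more involved; `n` and the whitespace tokenization here are arbitrary choices.

```python
def leaked_fragments(generated: str, user_docs: list, other_docs: list,
                     n: int = 8) -> set:
    """N-grams in model output that appear in one user's training documents
    but nowhere else in the corpus -- a rough proxy for user-level leakage."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    user_grams = set().union(*(ngrams(d) for d in user_docs)) if user_docs else set()
    other_grams = set().union(*(ngrams(d) for d in other_docs)) if other_docs else set()
    return ngrams(generated) & (user_grams - other_grams)
```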
arXiv Detail & Related papers (2021-01-14T00:57:32Z)
- When Do Curricula Work? [26.072472732516335]
Ordered learning has been suggested as an improvement over standard i.i.d. training.
We conduct experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum.
We find that curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula.
arXiv Detail & Related papers (2020-12-05T19:41:30Z)
- Dynamic Data Selection for Curriculum Learning via Ability Estimation [6.255759848576057]
We propose replacing heuristic difficulty measures with learned difficulty parameters.
We also propose Dynamic Data Selection for Curriculum Learning via Ability Estimation.
We show that models using learned difficulty and/or ability outperform data-based curriculum learning models on the GLUE classification tasks.
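A minimal sketch of the dynamic selection idea under an IRT-style reading: train only on examples whose learned difficulty does not exceed the model's current ability estimate. The `margin` and the shared latent scale are assumptions, not the paper's exact criterion.

```python
def select_batch(difficulties: list, ability: float, margin: float = 0.5) -> list:
    """Dynamic curriculum selection: keep the indices of examples whose learned
    difficulty does not exceed the model's current ability estimate by more
    than `margin` (difficulty and ability on the same latent scale)."""
    return [i for i, d in enumerate(difficulties) if d <= ability + margin]
```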
arXiv Detail & Related papers (2020-10-30T20:01:56Z)